How to Reduce Noise, Resolve Faster, and Automate More Often with PagerDuty
When we asked how technology leaders are feeling about increased pressure on digital services, they reported that, unsurprisingly, their investments in digital have grown. In fact, 72% are ramping up digital transformation efforts. Yet while the C-suite is interested in AIOps and automation to help their teams, it’s not always clear what their approach should be and how this technology can be applied to solve problems for their teams today.
PagerDuty’s AIOps solution offers organizations an easy-to-use, strategic lever for digital transformation by putting the action in actionable intelligence to keep teams productive and customers happy. In this blog, I’ll walk through where PagerDuty’s AIOps solution is now, the core problems that we help our customers tackle, and how our recently launched features make it easier than ever for customers to turn our unique intelligence into action, for fewer incidents and faster resolution.
If you’re thinking, “Hold on – I thought PagerDuty just did on-call…when did PagerDuty have an AIOps solution?” this blog will catch you up.
A quick overview of PagerDuty’s AIOps solution
Let’s start with a quick reminder of why people are turning to AIOps and automation in the first place: There’s too much system noise disrupting technical teams. There’s too much complexity, which slows down incident resolution. And there’s too much manual work causing employee burnout.
PagerDuty’s AIOps solution is made up of best-in-class incident response, Event Intelligence, and Rundeck. Our approach to AIOps addresses the core challenges by helping teams:
- Cut the noise to avoid alert fatigue. We have a feature set designed to reduce noise so responders aren’t bogged down by unnecessary alerts and can focus on the signal.
- Provide situational awareness for faster resolution. We have a feature set to surface and correlate relevant information to help responders orient to what’s happening in and around an incident so they can decide what to do next and drive to resolution.
- Automate safely, as much as possible. With Rundeck and native Event Intelligence features like newly launched Event Orchestration, we help employees lean on automation to take care of repetitive manual tasks so they can focus on work that matters.
Starting with tackling noise reduction
PagerDuty started by layering noise reduction on top of core incident response when we launched the first iterations of Event Intelligence a few years ago. We’re now delivering up to 98% noise reduction for customers, with a buffet of options based on how they want to tune their noise and what they’re comfortable with.
Sidebar: If you’re often thinking about how to tune the system to let in the relevant signals you want and keep everything else out, Leeor Engel, one of our engineering managers, goes into the nuance of this in this webinar.
Some teams are happy with time-based alert grouping, while others take the time to train Intelligent Alert Grouping so that machine learning finds the patterns and groups alerts for them. Still other teams are plagued by pesky flapping alerts and just want them eradicated. For this exact use case, Pause Incident Notifications offers a manual pause setting. We’ve also just launched Auto-Pause Incident Notifications in Early Access, which lets our machine learning quiet these alerts for you! Contact your account team for early access if you’re interested.
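To make the idea of time-based alert grouping concrete, here is a minimal Python sketch. It is an illustration only and not PagerDuty’s implementation: the `Alert` and `Incident` classes, the `group_by_time_window` function, and the five-minute window are all assumptions made up for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Alert:
    service: str
    summary: str
    received_at: datetime


@dataclass
class Incident:
    service: str
    alerts: list = field(default_factory=list)


def group_by_time_window(alerts, window=timedelta(minutes=5)):
    """Group alerts per service when they arrive within `window`
    of the first alert of that service's most recent incident.
    (Hypothetical heuristic, for illustration only.)"""
    incidents = []
    latest_by_service = {}
    for alert in sorted(alerts, key=lambda a: a.received_at):
        current = latest_by_service.get(alert.service)
        if current and alert.received_at - current.alerts[0].received_at <= window:
            # Still inside the window: attach to the existing incident.
            current.alerts.append(alert)
        else:
            # Window expired (or first alert for this service): new incident.
            current = Incident(service=alert.service, alerts=[alert])
            latest_by_service[alert.service] = current
            incidents.append(current)
    return incidents
```

With this sketch, two alerts on the same service two minutes apart land in one incident, while a third alert ten minutes after the first opens a new one, so responders are notified once per burst rather than once per alert.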
For teams that want even more granular control over their noise reduction settings, the new Event Orchestration feature can be configured to route events based on conditions and specific nested rulesets. This can help avoid even more unnecessary interruptions.
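As a rough picture of what condition-based routing with nested rulesets can look like, here is a small Python sketch. It does not use Event Orchestration’s actual rule syntax or API. The rule shape (`when`, `route`, `rules`), the `route_event` function, and the route names are hypothetical and exist only for this example.

```python
def route_event(event, rules, default="catch-all"):
    """Return the route of the first matching rule, descending into
    nested rules when a matching rule has any. (Illustrative only.)"""
    for rule in rules:
        if rule["when"](event):
            nested = rule.get("rules")
            if nested:
                # If no nested rule matches, fall back to the parent's route.
                return route_event(event, nested, default=rule.get("route", default))
            return rule.get("route", default)
    return default


# Hypothetical ruleset: drop staging noise, then split checkout events by severity.
RULES = [
    {"when": lambda e: e.get("source") == "staging", "route": "suppress"},
    {
        "when": lambda e: e.get("service") == "checkout",
        "route": "checkout-oncall",
        "rules": [
            {"when": lambda e: e.get("severity") == "critical",
             "route": "checkout-oncall"},
            {"when": lambda e: e.get("severity") in ("info", "warning"),
             "route": "checkout-low-urgency"},
        ],
    },
]
```

The design point is that a few nested rules can replace a long flat list: the outer condition narrows the scope once, and the inner rules only have to express what differs within that scope.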
Expanding to root cause analysis to help customers resolve faster
At the end of the day, to help our customers resolve incidents faster, we needed to build features that tackle root cause analysis, because troubleshooting is one of the most time-consuming parts of the incident response lifecycle.
Past Incidents and Related Incidents have been available for responders to look into how other teammates may have solved similar problems in the past. The acceleration in this area really started when PagerDuty began ingesting Change Events a year ago. Since then, we have continued to build out that feature set to help contextualize change in relation to incidents to help our users gain situational awareness when they’re in the heat of the incident. When you think about how 70% of all incidents have some kind of change as the root cause of the problem, keeping track of all Change Events and the context surrounding them (the who, what, and when of the change) makes it easier to choose the right next change to make.
The next evolution of this was introducing Change Correlation, which shows responders which recent Change Events are most relevant to an incident. This saves responders precious time during triage and helps them identify potential root causes. As of August, Change Events and Change Correlation are available on our mobile app so responders can triage incidents quickly and reduce time-to-resolution from wherever they are.
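To illustrate the general idea of ranking recent changes by relevance to an incident, here is a deliberately simple Python sketch. It is not how Change Correlation works internally. The `ChangeEvent` class, the `rank_changes` function, the 24-hour lookback, and the equal weighting of recency and service match are all assumptions chosen for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ChangeEvent:
    summary: str
    service: str
    timestamp: datetime


def rank_changes(changes, incident_service, incident_start,
                 lookback=timedelta(hours=24), top_n=3):
    """Rank changes made before the incident by a toy relevance score:
    half recency, half 'same service as the incident'."""
    scored = []
    for change in changes:
        age = incident_start - change.timestamp
        if age < timedelta(0) or age > lookback:
            continue  # Skip changes after the incident or outside the lookback.
        recency = 1.0 - age / lookback  # 1.0 = just before the incident
        same_service = 1.0 if change.service == incident_service else 0.0
        scored.append((0.5 * recency + 0.5 * same_service, change))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [change for _, change in scored[:top_n]]
```

Even a crude score like this shows why the who, what, and when of each change matter: without the timestamp and the affected service, there is nothing to rank on.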
It can often be tricky to truly pinpoint ‘the’ root cause. As systems have become more complex and more interwoven, it’s increasingly rare to be able to point to one single root cause. As a matter of fact, “root cause” doesn’t even necessarily mean that something at the “root” was wrong. It just points to the fact that the complexity of that particular state caused an issue. The whole process involves figuring out which of a number of potential threads to investigate, then digging deeper to see what responders need to fix upstream to resolve the incident. Now responders can leverage Probable Origin, a feature on the Incident Details page that lists likely origin points for the incident at hand, which the responder can use to decide where to look first. Combined with Recent Changes and Past/Related Incidents, Probable Origin and other features in this category are designed to provide helpful tips that guide responders to resolution faster so they can get back to their day jobs (or back to sleep).
On Rundeck and more automation everywhere
Many executives get excited about the idea of self-healing incidents through automation. But when specifically asked which kinds of incidents they’d like to self-remediate, you quickly get to every engineer’s favorite answer, “it depends”. These dependencies include the team’s overall maturity in adopting operational automation, how well understood a problem and its resolution might be, the impact of the automated process to be run, and the maturity of the software service itself. PagerDuty supports both human-triggered automated resolution for incidents still requiring human evaluation, and system-triggered automation for well-understood circumstances.
First announced at PagerDuty Summit, Rundeck Actions, which just became generally available, connects diagnostic and remediation automation into the PagerDuty incident response workflow. It provides a user experience for engineers to curate and publish automation to first responders, safely delegating automation that previously required escalation to more specialized engineers. Now, for situations that require human judgement, responders can safely run low-impact diagnostic commands on services implicated in an incident to help determine probable cause. They can also run corrective actions when engineers feel it’s appropriate to publish such repair automation to their first-line responders.
This work has been happening in parallel with the development of Event Orchestration, one of our newly launched features in Early Access. Event Orchestration is a powerful decision engine that introduces custom logic and nested rules to trigger actions, including automation actions using webhooks. This opens the door to fewer, more complex rules that guide enrichment, modification, and routing of events at scale to drive to the next best action. We’re already getting great feedback from customers who want to use this both before and after human mobilization: to quiet noise before it becomes an interruption, and to route or trigger actions that help drive to resolution once a human is required. Early next year, it will be possible to connect Event Orchestration with Rundeck Actions to trigger introspective diagnostics, and even corrective automation for well-understood problems.
The PagerDuty Difference
We’ve come a long way since we first launched Event Intelligence a few years ago with noise reduction capabilities. PagerDuty’s AIOps solution provides a truly differentiated offering with full end-to-end functionality from event ingest to incident resolution through built-in noise reduction, root cause analysis, and automation in a single, domain-agnostic platform.
I hope this blog outlined some of the ways that we’ve been continuously investing so that PagerDuty can help solve AIOps pain points right now:
- We help teams make better, data-driven decisions because our solution is easy to get started with and quick to deliver value, with no data scientists required. We do this by providing deep insights into services, responders, incidents, monitoring, and more, allowing teams to make better operational decisions without having to be experts on the platform. Right away, teams can benefit from the ML and data science algorithms we’ve developed with our unique dataset to enjoy reduced noise, faster root cause analysis, and more automation.
- We democratize the platform to deliver self-service operations with decentralized configuration tailored for distributed teams and hybrid operating models. Whether providing central IT teams with an easy button to trigger diagnostics and auto-remediation or “You Build It, You Own It” DevOps teams with a streamlined way to troubleshoot root cause, PagerDuty’s AIOps offering fits seamlessly into any tech stack with over 600 integration partners.
- We drive to the next best action throughout the incident response lifecycle, with built-in automation. We’re designed for critical work, whether that means using Event Orchestration to cut down on manual processing with fewer, smarter nested rules, surfacing probable cause and relevant changes right in line with incident details, or leveraging Rundeck to create fewer escalations and automate incident resolution.
To learn more about the PagerDuty AIOps offering and how it all comes together, I’d encourage you to watch this webinar where product specialist Heath Newburn talks about the power of putting the action in actionable intelligence. For Event Intelligence specifically, we’ve launched a lot of new features in the last year, so I’ll be walking through everything that’s come to the Event Intelligence product in a webinar on December 14th. You can register here, or just drop your questions below!