Using Event Orchestration to reduce noise and trigger next best action
We often hear from customers that they’re dealing with unmanageable levels of noise and complexity, which makes it harder to pinpoint root cause and get to resolution quickly. All this effort spent on sifting through noise, processing events, and gathering context results in a lot of wasted time.
That’s why we’ve launched Event Orchestration, which became generally available to our Event Intelligence and Digital Operations customers on Monday.
I sat down with Frank Emery, Senior Product Manager for Event Orchestration at PagerDuty, to get more background on the feature: why we built it, and how customer behavior and feedback shaped the way he steered its development.
Q: Tell me about the new feature Event Orchestration – what problem is it trying to solve?
A: When we looked across the PagerDuty platform, we saw that 20% of incidents are resolved in five minutes or less. A major, difficult incident doesn't get solved in five minutes. What this tells us is that incident response involves well-understood processes, like running diagnostic tests or restarting a server, that are necessary but manual. They take up a lot of teams' time, which in turn eats away at productivity and focus. These are exactly the use cases you could target with precise automation to shortcut those steps during incident response. In some cases, you could even start removing incidents from people's plates entirely. When you expand those use cases to cover repetitive tasks behind incidents with 15- or 30-minute resolution times, the potential savings in time, productivity, and focus grow even larger.
That was our north star: how can we help our customers use our platform to cut back on the time they spend on the manual, repetitive work teams always have to do whenever they get an incident? How can we build in automation that keeps those easier-to-handle events from reaching responders, so they can direct their time toward incidents that actually need their subject matter expertise?
With Event Orchestration, the question was: if we give our customers more flexibility in how they configure rules, and the ability to apply more automation up front, could we cover as many of these well-understood tasks as possible before teams even get notified?
Q: What exactly does Event Orchestration let you do and how’s it different from Event Rules?
A: What we've effectively done is take Event Rules and build out a decision engine that sits directly in the event ingestion pipeline. Event Orchestration introduces a new condition language that lets you build complex logic to trigger the next best action at scale. In some instances that action is suppression, in others it's routing, and some teams will want to trigger automation actions like automated diagnostics or auto-remediation as events are ingested in real time.
Defining orchestrations around conditions lets the machine identify specific situations and, based on what they look like, decide how to handle them. This opens the door for the decision engine to take care of some of these tasks before anyone even gets a notification, and it really starts to augment the incident response process for the human, if a human is needed in the first place.
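To make the idea of a decision engine concrete, here is a minimal Python sketch of condition-based rules evaluated in order, where the first matching rule picks the action and unmatched events fall through to a default. This is not PagerDuty's condition language or API: the event fields (`source`, `severity`, `summary`), the rule names, and the action labels are all hypothetical, chosen only to mirror the suppression, routing, and automation examples in the answer.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

# Hypothetical event shape for this sketch: a flat dict of fields.
Event = Dict[str, Any]

@dataclass
class Rule:
    """One hypothetical orchestration rule: a condition plus the action it triggers."""
    name: str
    condition: Callable[[Event], bool]
    action: str  # illustrative labels such as "suppress", "route", "run_diagnostics"
    params: Dict[str, Any] = field(default_factory=dict)

def orchestrate(event: Event, rules: List[Rule], default: str = "notify") -> Dict[str, Any]:
    """Evaluate rules in order; the first rule whose condition matches decides the action."""
    for rule in rules:
        if rule.condition(event):
            return {"rule": rule.name, "action": rule.action, **rule.params}
    # No rule matched: fall back to the default (here, notifying a responder).
    return {"rule": None, "action": default}

rules = [
    # Suppress purely informational events.
    Rule("drop-info", lambda e: e.get("severity") == "info", "suppress"),
    # Route serious database events to a (hypothetical) database team service.
    Rule("db-to-dba",
         lambda e: e.get("source", "").startswith("db-")
         and e.get("severity") in {"error", "critical"},
         "route", {"service": "database-team"}),
    # Kick off a (hypothetical) diagnostic job for disk-full events.
    Rule("disk-full", lambda e: "disk full" in e.get("summary", "").lower(),
         "run_diagnostics", {"job": "collect-disk-usage"}),
]

print(orchestrate({"source": "db-01", "severity": "critical", "summary": "replication lag"}, rules))
# {'rule': 'db-to-dba', 'action': 'route', 'service': 'database-team'}
```

Rule order matters in this first-match design: an informational event from a database host is suppressed by the first rule before the routing rule is ever considered.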
Q: What are the low-hanging-fruit use cases for someone considering Event Orchestration?
A: Off the bat, our customers are most likely to use this in one of two ways.
The first is noise reduction. Noise is a very, very common problem, which is no surprise when you think about all the tools people hook up to monitor their stacks and how they all send alerts. We have other features like deduplication and suppression, or ML options like Intelligent Alert Grouping, to help with this, but some of our customers want to get very precise, and that's where Event Rules, and Event Orchestration specifically, can help. With Event Orchestration, you can use precise rule conditions to set up any number of highly targeted situations in which noise is deflected, consolidated, or suppressed, so only the critical signals reach your teams.
The second is automation, and how sophisticated this gets depends on a team's operational maturity. There's huge potential to automate the earlier phases of incident response, because many of those steps are very repetitive.
Think about your noisiest service: how many of the incidents on that service require the same initial diagnostic steps? We hear it from engineers all the time: whenever there's an outage, they get called, and there are steps they have to take every time before anyone can start solving the problem. Typically these are things like running scripts and gathering information to establish the right context. They're all important, but they aren't unique to the incident, and you can't do anything with them until the results come back. Automated diagnostics is the perfect low-hanging fruit here: a natural place to start automating these well-known, repetitive tasks that so many scenarios require.
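As a rough illustration of suppression and consolidation, the Python sketch below drops events from a known-noisy source or below a severity threshold, then collapses repeats into a single alert with a count. The severity ranking, the `source`/`check` consolidation key, and the `reduce_noise` function are assumptions made for illustration; they are not how PagerDuty implements deduplication, suppression, or Event Orchestration.

```python
from typing import Any, Dict, Iterable, List, Set

# Hypothetical severity ordering used only for this sketch.
SEVERITY_RANK = {"info": 0, "warning": 1, "error": 2, "critical": 3}

def reduce_noise(events: Iterable[Dict[str, Any]],
                 suppress_sources: Set[str],
                 min_severity: str = "error") -> List[Dict[str, Any]]:
    """Suppress low-value events and consolidate repeats into one alert per key."""
    threshold = SEVERITY_RANK[min_severity]
    alerts: Dict[str, Dict[str, Any]] = {}  # dicts keep insertion order in Python 3.7+
    for event in events:
        # Suppress: drop events from known-noisy sources or below the severity threshold.
        if event["source"] in suppress_sources or SEVERITY_RANK[event["severity"]] < threshold:
            continue
        # Consolidate: events sharing the same source and check collapse into one alert.
        key = f'{event["source"]}:{event["check"]}'
        if key in alerts:
            alerts[key]["count"] += 1
        else:
            alerts[key] = {"key": key, "severity": event["severity"], "count": 1}
    return list(alerts.values())

events = [
    {"source": "web-1", "check": "cpu", "severity": "critical"},
    {"source": "web-1", "check": "cpu", "severity": "critical"},
    {"source": "web-1", "check": "disk", "severity": "warning"},
    {"source": "staging-1", "check": "cpu", "severity": "critical"},
]
print(reduce_noise(events, suppress_sources={"staging-1"}))
# [{'key': 'web-1:cpu', 'severity': 'critical', 'count': 2}]
```

Four incoming events become one alert here: the staging event and the warning are suppressed, and the two identical CPU events are consolidated.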
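A minimal Python sketch of automated diagnostics: run a set of well-known commands and collect their output so the results are ready before a responder opens the incident. The `run_diagnostics` helper and the stand-in command are hypothetical; in practice the commands would be whatever scripts your team already runs by hand, and attaching the results to an incident is left out of the sketch.

```python
import subprocess
import sys
from typing import Dict, List

def run_diagnostics(commands: Dict[str, List[str]], timeout: float = 30.0) -> Dict[str, str]:
    """Run each labeled diagnostic command and collect its exit code and output."""
    results: Dict[str, str] = {}
    for label, argv in commands.items():
        try:
            proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
            # Prefer stdout; fall back to stderr when the command printed nothing there.
            output = proc.stdout.strip() or proc.stderr.strip()
            results[label] = f"exit={proc.returncode} {output}"
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Missing binaries or hung commands are recorded rather than raised.
            results[label] = f"failed: {exc}"
    return results

# Hypothetical stand-in: the Python interpreter printing a line, in place of a
# real health-check or log-collection script.
diagnostics = {"health-check": [sys.executable, "-c", "print('diagnostic ok')"]}
print(run_diagnostics(diagnostics))
# {'health-check': 'exit=0 diagnostic ok'}
```

Recording failures instead of raising keeps one broken script from hiding the results of the others, which matters when the whole point is to have context waiting for the responder.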
We're also hosting a webinar on February 15th featuring Frank Emery, Senior Product Manager at PagerDuty, who will walk through an overview of the new feature, share common use cases we've seen in our Early Access program, and show a demo. Whether you're looking to level up your automation game, hoping to consolidate event management tools, or just trying to start using more complex event rules, you do not want to miss this session. Register here.