PagerDuty Blog

Training Intelligent Alert Grouping

Complex incidents are both exhausting and commonplace. By “complex,” I mean incidents that involve multiple, disparate notifications in your alert management platform. Perhaps these incidents are logically separated because the underlying systems or services were seen as less coupled than they turned out to be. Or perhaps the behavior the notifications draw attention to has multiple potential underlying causes, which makes it difficult to associate related incidents.

Our default behavior

The default behavior is to group alerts whose titles are textually similar. It’s important to understand that “textually similar” is not the same as how our minds might logically group common types of alerts. As an example, messages like “memory usage on host is high (>90%)” and “memory usage on host is high (>95%)” would likely be grouped because they differ by only one “word” (the percentage threshold). On the other hand, let’s say you had alert messages that read “memory usage high (>X%) on server $NAME in region $REGION.” These messages read similarly to our minds because they follow a pattern, but once the placeholders are filled in they contain too many distinct words to be grouped successfully by Intelligent Alert Grouping’s default. In the next post, I’ll thoroughly cover how to build titles that Intelligent Alert Grouping recognizes more readily by default. The goal of this paragraph is just to help you know your starting point.
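To make the “one word different” idea concrete, here is a minimal sketch that counts how many words differ between two titles. It is only an illustration of the intuition above, not the algorithm Intelligent Alert Grouping actually uses; the function name and the host and region values are hypothetical.

```python
def differing_words(title_a: str, title_b: str) -> int:
    """Count word positions at which two titles differ.

    Illustrative only: titles are split on whitespace and compared
    position by position. Extra words in the longer title each count
    as one difference.
    """
    words_a = title_a.split()
    words_b = title_b.split()
    mismatches = sum(1 for a, b in zip(words_a, words_b) if a != b)
    return mismatches + abs(len(words_a) - len(words_b))


# Only the threshold differs between these two titles.
near_identical = differing_words(
    "memory usage on host is high (>90%)",
    "memory usage on host is high (>95%)",
)

# Same pattern, but threshold, server name, and region all differ
# (the server and region values are made up for this example).
same_pattern = differing_words(
    "memory usage high (>91%) on server web-01 in region us-east-1",
    "memory usage high (>97%) on server db-14 in region eu-west-2",
)

print(near_identical, same_pattern)
```

The first pair differs in a single word, while the second pair differs in three even though a person would read both as the same kind of alert.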

Improve accuracy via merging

The goal of the default is to give you a place to start from. Once you start using Intelligent Alert Grouping regularly, it’s likely that you’ll need to tweak the behavior for your own environments. The first thing to remember is that the machine learning trains only on the title field. In our next post, I’ll describe more specifically how to improve your incident titles for use with this feature. The next thing to know is that Intelligent Alert Grouping uses merging to either reinforce or relearn patterns. To avoid pattern matching too aggressively, Intelligent Alert Grouping shifts its behavior only after 5-10 merges.

How to merge incidents

There are a couple of ways to merge incidents. The first is from the incident list: when you select one or more incidents in the PagerDuty UI, a “Merge Incidents” button appears.

This shows a list of active incidents:

I selected the top box next to “Status” to select all to merge:

Once the incidents are selected, a dialog box will appear and ask you to select which incident you want to merge into. In this case, I selected the most recent:

For clarity, I altered the title of the incident to show that the incidents were merged. The result looks like:

It’s important to know that when incidents are merged, the top-level incident remains unresolved and the merged incidents all resolve, which looks like this:

The other way to merge incidents is by opening the incident and selecting the “Merge With Another Incident” option from the “More” drop-down:

When you use this method, incidents will not be pre-populated in a drop-down, so you’ll need to know the incident number and click “Find Incident”:

For more information about merging incidents, please refer to our Support Documentation on this topic. It’s important to note that you cannot unmerge incidents at this time — merge with care!
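If you would rather script merges than click through the UI, the PagerDuty REST API also exposes a merge operation on incidents. The sketch below only builds the request and does not send it. The incident IDs, token, and email address are placeholders, and you should check the path, headers, and body shape against the current API reference before relying on them.

```python
import json
import urllib.request

API_BASE = "https://api.pagerduty.com"


def build_merge_request(target_id, source_ids, api_token, from_email):
    """Build (but do not send) a request merging source incidents into a target.

    Assumption: the body lists the incidents to fold into the target as
    incident references, and the From header carries the email address of
    the user performing the merge.
    """
    body = {
        "source_incidents": [
            {"id": source_id, "type": "incident_reference"}
            for source_id in source_ids
        ]
    }
    return urllib.request.Request(
        url=f"{API_BASE}/incidents/{target_id}/merge",
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Token token={api_token}",
            "Accept": "application/vnd.pagerduty+json;version=2",
            "Content-Type": "application/json",
            "From": from_email,
        },
    )


# Placeholder values; nothing is sent over the network here.
request = build_merge_request(
    target_id="PTARGET",
    source_ids=["PSOURCE1", "PSOURCE2"],
    api_token="YOUR_API_TOKEN",
    from_email="responder@example.com",
)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would perform the merge. Because merges cannot be undone, it is worth reviewing the target and source IDs before sending anything.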

If alerts need to be separated

There may be cases where you need to move alerts from incidents that have been incorrectly merged, either through grouping or the manual process above. The main constraint to be aware of here is that you cannot move the alerts back to their original source incident(s). The reason for this is that incidents are resolved (to close them) when merged, and you cannot move alerts into a resolved incident.

Instead, create a new incident and move the desired alerts there. You can create one manually using the blue “New Incident” button in the UI. For more information about how to do this, please take a look at our Support Documentation on creating and managing incidents.
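The same move can be sketched as an API payload. The helper below builds the body I would expect to send in a PUT to the alerts endpoint of the incident that currently holds the alerts, pointing each alert at the newly created incident. Treat the field names and type strings as assumptions to verify against the API reference; the alert and incident IDs are placeholders.

```python
import json


def build_move_alerts_payload(alert_ids, new_incident_id):
    """Build a body that reassigns each alert to a new, unresolved incident.

    Assumption: each entry names one alert and the incident it should
    belong to afterwards. The destination must be an open incident,
    since alerts cannot be moved into a resolved one.
    """
    return {
        "alerts": [
            {
                "id": alert_id,
                "type": "alert",
                "incident": {
                    "id": new_incident_id,
                    "type": "incident_reference",
                },
            }
            for alert_id in alert_ids
        ]
    }


# Placeholder IDs for illustration only.
payload = build_move_alerts_payload(["PALERT1", "PALERT2"], "PNEWINC")
print(json.dumps(payload, indent=2))
```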

Key takeaways and where to go from here

This post was a lot! What you need to remember:

  • Intelligent Alert Grouping uses the incident title field to determine which incidents to group
  • Merging incidents that should be grouped allows you to manually alter matching behavior
  • It takes 5-10 merges for Intelligent Alert Grouping to start to alter its default behavior
  • Be cautious when merging, as you cannot directly unmerge incidents. You’ll need to create a new incident and move alerts there as needed.

In this post, I mentioned that Intelligent Alert Grouping uses the title field to determine which incidents are grouped and which stay distinct. In our next post, I’ll explain how you can take advantage of that when you create your incident titles.

All posts in this series use the ei-architecture-series tag. Follow the tag to read the other posts in the series.