Intelligent Alert Grouping Series Summary
Co-authored by Chris Bonnell, PagerDuty Data Scientist VI
Welcome to our final post in our EI Architecture Series on Intelligent Alert Grouping. I hope you’ve enjoyed this series, and if you’d like to take a look at any of our prior posts, please use the ei-architecture-series tag. Let’s take a moment to recap everything we’ve learned.
The default behaviors for Intelligent Alert Grouping are based on abstracted patterns in incident management and on machine learning models. This means the tool can make a lot of educated guesses, so to speak, about your implementation, but it may not generate perfect matches in every environment. To compensate, you can improve grouping behavior through merging, alert titles, and service design.
Incidents are grouped via a process called merging in the PagerDuty application. In general, any incident can be merged with any other incident. Intelligent Alert Grouping in particular analyzes the Alert Title field to determine whether an individual alert should be merged into an existing incident or separated into a new one, as we reviewed in this post. If alerts are inappropriately merged into a common incident, you can separate them and move them where they belong. The machine learning model reinforces behavior with every iteration, so each time alerts stay where they are, are merged, or are moved, that outcome improves future grouping behavior.
Since Intelligent Alert Grouping bases the merge behavior on the Alert Title field, we covered the basics of alert titles with some general machine learning principles in an earlier post. There are three important takeaways here:
- Alert titles should benefit both humans and machine learning, with a skew toward machine learning since the rest of the incident details should be in the description.
- Remember that since machines cannot understand context, it is important to take advantage of what a computer can identify as “unique” vs. “common.”
- Since there are short character limits on what portion of the alert title will display in a push notification, put the human-oriented text earlier in the title rather than later.
To dig into how to implement these, please take a look at the machine learning portion of the post as well as the Introduction to Natural Language Processing for Text blog post on the Towards Data Science blog.
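To make the "unique" vs. "common" idea concrete, here is a minimal sketch that counts how many alert titles each token appears in. It is only an illustration of the general principle, not how Intelligent Alert Grouping is implemented; the function name, the 0.6 threshold, and the sample titles are all assumptions made for this example.

```python
from collections import Counter


def split_common_and_unique(titles, common_threshold=0.6):
    """Classify tokens by the share of titles they appear in.

    Hypothetical helper for illustration only. A token counts as
    "common" when it appears in at least `common_threshold` of the
    titles; every other token counts as "unique".
    """
    doc_freq = Counter()
    for title in titles:
        # Use a set so a token is counted once per title.
        doc_freq.update(set(title.lower().split()))

    total = len(titles)
    common = {tok for tok, n in doc_freq.items() if n / total >= common_threshold}
    unique = set(doc_freq) - common
    return common, unique


# Made-up sample titles, not real alert data.
titles = [
    "High CPU usage on web-01",
    "High CPU usage on web-02",
    "Disk space low on db-01",
    "High CPU usage on db-01",
]

common, unique = split_common_and_unique(titles)
print("common:", sorted(common))
print("unique:", sorted(unique))
```

In this sample, words such as "high", "cpu", and "on" show up in most titles, while host names like "web-01" show up rarely. Tokens in the second group are the ones a text-based model has the best chance of using to tell alerts apart.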
The last concept we introduced was service design. The general idea is that similar alerts on the same service are, by default, assumed to be more highly correlated than alerts on other services. There was quite a bit to say here, as how granular you make your service definitions really drives how you implement “service” in the PagerDuty application. As a general rule, if you’re unsure whether two “things” should be separate services, mirror the desired escalation path. If both are owned by the same team or people, then treating them as one service in the PagerDuty application will continue to honor that escalation, with the added benefit of having their alerts more highly correlated. If different teams are responsible for them, or if they are logically distinct in a way that means you don’t want their alerts more highly correlated, then define them as separate services. As for the owning teams, if you’d like to read more on best practices for service definition and ownership in general, please take a look at our Full Service Ownership Ops Guide.
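The rule of thumb above can be written down as a small decision function. This is a sketch for reasoning about your own service map, not anything built into the PagerDuty application; the `Component` type, the `should_share_service` name, and the example components are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """A hypothetical 'thing' you are deciding how to model."""
    name: str
    owning_team: str


def should_share_service(a, b, want_alerts_correlated=True):
    """Return True when two components fit the 'one service' rule of thumb.

    They share a service only if the same team owns both (so the
    escalation path is identical) and you actually want their alerts
    treated as more highly correlated.
    """
    same_owner = a.owning_team == b.owning_team
    return same_owner and want_alerts_correlated


# Made-up components for illustration.
api = Component("checkout-api", owning_team="payments")
worker = Component("checkout-worker", owning_team="payments")
search = Component("search-index", owning_team="discovery")

print(should_share_service(api, worker))         # same team
print(should_share_service(api, search))         # different teams
print(should_share_service(api, worker, False))  # same team, logically distinct
```

Passing `want_alerts_correlated=False` covers the case where one team owns both components but they are logically distinct enough that their alerts should not be grouped together.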
Where To Go From Here
And that’s that! Thank you so much for taking the time to learn more about how to fully utilize Intelligent Alert Grouping. If you’d like to reference these posts in the longer term, please bookmark the ei-architecture-series tag. For further discussion, please take a look at our Community Forums. For in-depth Q&A, please reach out to our support team.