After the Disaster: How to Learn from Historical Incident Management Data
Your high school history teacher no doubt delivered to you some variation on George Santayana’s famous remark that “those who cannot remember the past are condemned to repeat it.”
I’m pretty sure Santayana wasn’t thinking about incident management when he wrote that. But his wisdom still applies — and it’s worth heeding if you’re responsible for incident management.
True, the main purpose of incident management is to identify and resolve issues that affect your infrastructure, but your incident management operations shouldn’t stop there. Instead of just reacting to customer tickets, you should also mine the rich volumes of data that your alerting systems generate. The insights you gain will help you detect and prevent issues proactively and make your infrastructure more resilient going forward.
In this post, I’ll outline some strategies for working with historical incident management data, including how to collect and analyze data, and what to look for when working with this information.
Save and standardize your data
The first step in analyzing historical incident management data is finding a standardized way to collect and parse the information. This can be challenging since the amount and format of historical log data vary widely between different monitoring systems.
Some monitoring systems don’t provide much at all in the way of logged data that you can examine after the fact. For example, Pingdom is a great tool for real-time monitoring, but since it was designed to tell you what’s happening now, not what happened yesterday, it doesn’t provide much historical data on its own.
Other monitoring systems keep data for limited periods of time or store it in formats that are hard to work with. For instance, to analyze Snort data, you may need to sift through packet dumps. Unless Wireshark is your favorite way to spend a Friday evening, that’s a lot of work.
Moreover, if you have lots of monitoring systems in place, they probably dump data to a number of scattered locations. Some tools write logs to /var/log on local machines, where they’re hard to find and may be deleted by maintenance scripts. Others keep logs in the cloud for varying lengths of time — not ideal if you want to analyze all of your historical data at once.
For these reasons, in order to make the most of your incident management data, you should make sure to do two things:
- Send alerts and logs to a central collection point where they can be stored as long as you need them (rather than as long as the original monitoring system or local storage will support them).
- Convert data at your collection point to a standard format, then extract actionable insights and takeaways that can be reinvested into the infrastructure (with a process like incident postmortems).
PagerDuty takes things a step further: it lets you import data from these and other sources, converts it to a standardized format, and centralizes and cross-correlates it. Its visualizations surface patterns and trends that you can use to identify root causes and more.
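If you want a feel for what "convert to a standard format" means in practice, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the two raw payload shapes ("uptime-checker" and "ids"), the field names, and the severity mapping are hypothetical and do not reflect the real output of any particular monitoring tool.

```python
from datetime import datetime, timezone


def to_iso_utc(epoch_seconds):
    """Convert a Unix timestamp to an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


def normalize_uptime_check(raw):
    """Hypothetical uptime checker payload:
    {"check": "web-01", "state": "down", "ts": 1700000000}"""
    return {
        "source": "uptime-checker",
        "summary": f"{raw['check']} is {raw['state']}",
        "severity": "critical" if raw["state"] == "down" else "info",
        "timestamp": to_iso_utc(raw["ts"]),
    }


def normalize_ids_alert(raw):
    """Hypothetical intrusion detection payload:
    {"msg": "port scan", "priority": 1, "time": "2023-11-14T23:00:00+00:00"}"""
    severity = {1: "critical", 2: "warning"}.get(raw["priority"], "info")
    when = datetime.fromisoformat(raw["time"]).astimezone(timezone.utc)
    return {
        "source": "ids",
        "summary": raw["msg"],
        "severity": severity,
        "timestamp": when.isoformat(),
    }


# One normalizer per tool; adding a new tool means adding one entry here.
NORMALIZERS = {
    "uptime-checker": normalize_uptime_check,
    "ids": normalize_ids_alert,
}


def collect(raw_alerts):
    """Take (tool_name, payload) pairs and return normalized events,
    oldest first. All timestamps are UTC ISO strings, so sorting the
    strings sorts the events chronologically."""
    events = [NORMALIZERS[tool](payload) for tool, payload in raw_alerts]
    return sorted(events, key=lambda event: event["timestamp"])
```

Once every alert shares the same four fields, you can store, search, and chart alerts from different tools side by side without caring where they came from.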
View and analyze your data
Saving your data is only half the battle. The other challenge is how to view and analyze it.
In most cases, the simplest way to view your data is over a web-based interface. Ideally, it’ll feature a sophisticated search that you can use to find specific events from your logs, monitor the current status of incidents, and so on. That’s why being able to filter and search across your entire infrastructure with normalized fields is so helpful.
While the web interface may be good for finding small-scale trends or tracing the history of a specific type of incident, to get the bigger picture you need, well, pictures. Tables and lists of alerts don’t help you understand system-wide trends. Visualizations based on your incident management data, like the kind PagerDuty includes in reports, help you to interpret information on a large scale.
Last but not least — especially if you’re into analyzing data programmatically — are APIs that let you export your log data as needed. The PagerDuty API makes it easy to collect and export log data in whatever format you need (and the Events API v2 also automatically normalizes all that data into a common format).
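As a rough illustration of programmatic export, the Python sketch below pages through historical incidents. It reflects my understanding of the PagerDuty REST API v2 (the `/incidents` endpoint, the `Token token=` authorization header, the `since`, `until`, `limit`, and `offset` parameters, and the `more` flag in the response). Treat those details as assumptions and check them against the current API reference before relying on them. The function names are my own.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_BASE = "https://api.pagerduty.com"


def build_incidents_request(api_token, since, until, limit=100, offset=0):
    """Build (but do not send) a GET request for one page of incidents.

    `since` and `until` are ISO 8601 date strings, e.g. "2024-01-01".
    """
    query = urlencode(
        {"since": since, "until": until, "limit": limit, "offset": offset}
    )
    return Request(
        f"{API_BASE}/incidents?{query}",
        headers={
            "Authorization": f"Token token={api_token}",
            "Accept": "application/vnd.pagerduty+json;version=2",
        },
    )


def fetch_incidents(api_token, since, until, page_size=100):
    """Yield incidents one at a time, following offset-based pagination.

    Assumes each response body contains an "incidents" list and a
    boolean "more" flag indicating whether another page exists.
    """
    offset = 0
    while True:
        request = build_incidents_request(
            api_token, since, until, page_size, offset
        )
        with urlopen(request) as response:
            page = json.load(response)
        yield from page.get("incidents", [])
        if not page.get("more"):
            break
        offset += page_size
```

Calling `list(fetch_incidents(token, "2024-01-01", "2024-02-01"))` would then give you a month of incidents as plain dictionaries, ready to write to CSV or load into whatever analysis tool you prefer.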
What to look for
Once your data is collected and ready to analyze, what should you be looking for? Your exact needs will vary according to the type of infrastructure you’re monitoring, of course, but some general points of information to heed include:
- The frequency at which incidents are occurring. If this number changes over time, you’ll want to know why.
- Mean time to acknowledge (MTTA) and mean time to resolve (MTTR) incidents. By keeping track of these numbers, you’ll know how effectively your team is handling its incident management responsibilities.
- Who on your team is doing the most to handle alerts? Knowing this not only allows you to reward team members for their hard work, but also helps you determine whether your alerts are being distributed properly and going to the right people. For example, if one admin is receiving more than their fair share of alerts, you should tweak things so they don’t become overwhelmed. That leads to alert fatigue, and no one wants that.
- Which monitoring systems are generating the most alerts? If you amalgamate the alerts from your various monitoring systems into a single logging location, as I suggested above, you can also identify which systems are giving you the most information. You’ll be able to see if a system is underperforming or generating too much noise, and tune your alerting thresholds as needed.
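The four points above can all be computed from the same list of incident records. Here is a minimal Python sketch, assuming each incident is a dictionary with `created_at`, `acknowledged_at`, `resolved_at`, `responder`, and `source` fields. Those field names are hypothetical, so adapt them to whatever your normalized format uses.

```python
from collections import Counter
from datetime import datetime
from statistics import mean


def minutes_between(start, end):
    """Minutes elapsed between two ISO 8601 timestamp strings."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 60


def iso_week(timestamp):
    """Label a timestamp with its ISO year and week, e.g. '2024-W01'."""
    year, week, _ = datetime.fromisoformat(timestamp).isocalendar()
    return f"{year}-W{week:02d}"


def summarize(incidents):
    """Compute incident frequency, MTTA, MTTR, and per-responder and
    per-source counts. Incidents that were never acknowledged or
    resolved are left out of the corresponding average."""
    time_to_ack = [
        minutes_between(i["created_at"], i["acknowledged_at"])
        for i in incidents
        if i.get("acknowledged_at")
    ]
    time_to_resolve = [
        minutes_between(i["created_at"], i["resolved_at"])
        for i in incidents
        if i.get("resolved_at")
    ]
    return {
        "incidents_per_week": Counter(iso_week(i["created_at"]) for i in incidents),
        "mtta_minutes": mean(time_to_ack) if time_to_ack else None,
        "mttr_minutes": mean(time_to_resolve) if time_to_resolve else None,
        "by_responder": Counter(i["responder"] for i in incidents),
        "by_source": Counter(i["source"] for i in incidents),
    }
```

A sudden jump in `incidents_per_week`, a responder who dominates `by_responder`, or a source that dwarfs the others in `by_source` are all prompts to dig deeper. Note that this sketch measures MTTR from the moment the incident was created. Some teams measure it from acknowledgment instead, so pick one definition and stick with it.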
If you follow these tips, you won’t be left repeating history by facing the same types of incidents over and over again. Instead, you’ll be able to identify the big-picture trends, which will help you to find ways to make your infrastructure more effective overall.
And that’s how incident management can really pay off. Remember another oft-quoted maxim: “An ounce of prevention is worth a pound of cure.” Incident response is the cure, but a continuous feedback loop built on historical incident management data is what enables prevention.