(This blog post is inspired by the talk that I will be giving at DevOps Talks Conference Melbourne and DevOps Talks Conference Auckland. Hope to...)
By Matt Stratton
March 4, 2019
Your high school history teacher no doubt delivered to you some variation on George Santayana’s famous remark that “those who cannot remember the past are condemned to repeat it.”
I’m pretty sure Santayana wasn’t thinking about incident management when he wrote that. But his wisdom still applies — and it’s worth heeding if you’re responsible for incident management.
True, the main purpose of incident management is to identify and resolve issues that affect your infrastructure, but your incident management operations shouldn’t stop there. Instead of just reacting to customer tickets, you should also use the rich volumes of data that your alerting systems generate to detect and prevent issues proactively, and to gain insights that will help you make your infrastructure more resilient going forward.
In this post, I’ll outline some strategies for working with historical incident management data, including how to collect and analyze data, and what to look for when working with this information.
The first step in analyzing historical incident management data is finding a standardized way to collect and parse the information. This can be challenging since the amount and format of historical log data vary widely from one monitoring system to another.
Some monitoring systems don’t provide much at all in the way of logged data that you can examine after the fact. For example, Pingdom is a great tool for real-time monitoring, but since it was designed to tell you what’s happening now, not what happened yesterday, it doesn’t provide much historical data on its own.
Other monitoring systems keep data for limited periods of time or store it in formats that are hard to work with. For instance, to analyze Snort data, you may need to sift through packet dumps. Unless Wireshark is your favorite way to spend a Friday evening, that’s a lot of work.
Moreover, if you have lots of monitoring systems in place, they probably dump data to a number of scattered locations. Some tools write logs to /var/log on local machines, where they’re hard to find and may be deleted by maintenance scripts. Others keep logs in the cloud for varying lengths of time — not ideal if you want to analyze all of your historical data at once.
For these reasons, in order to make the most of your incident management data, you should make sure to do two things:
PagerDuty takes things a step further by allowing you to import data from these and other sources, converting it to a standardized format, and centralizing and cross-correlating it, with visualizations that surface patterns and trends and can help you identify root causes.
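To make the idea of a standardized format concrete, here is a minimal sketch in Python. It is not how PagerDuty does it internally; the `Incident` schema, the two payload shapes, and the helper names `from_uptime_check` and `from_syslog_line` are all hypothetical, invented purely for illustration. The point is only that each source gets its own small adapter, and everything downstream works with one shape.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Incident:
    """Common shape that every source is mapped into (hypothetical schema)."""
    source: str
    service: str
    severity: str
    opened_at: datetime
    summary: str


def from_uptime_check(raw: dict) -> Incident:
    # Hypothetical uptime-monitor payload: epoch seconds plus a host name.
    return Incident(
        source="uptime",
        service=raw["hostname"],
        severity="critical" if raw["state"] == "down" else "info",
        opened_at=datetime.fromtimestamp(raw["ts"], tz=timezone.utc),
        summary=raw["description"],
    )


def from_syslog_line(line: str, service: str) -> Incident:
    # Hypothetical log format: "<ISO-8601 timestamp> <LEVEL> <message>".
    stamp, level, message = line.split(" ", 2)
    return Incident(
        source="syslog",
        service=service,
        severity=level.lower(),
        opened_at=datetime.fromisoformat(stamp),
        summary=message,
    )


# Once normalized, records from different tools can be sorted and compared together.
incidents = sorted(
    [
        from_uptime_check({
            "hostname": "web-1",
            "state": "down",
            "ts": 1551657600,
            "description": "HTTP check failed",
        }),
        from_syslog_line(
            "2019-03-03T23:58:00+00:00 ERROR disk usage above 95%",
            service="db-1",
        ),
    ],
    key=lambda incident: incident.opened_at,
)
```

With every record in the same shape, questions like “what happened on db-1 in the two minutes before web-1 went down?” become a simple filter rather than a hunt through several tools.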
Saving your data is only half the battle. The other challenge is how to view and analyze it.
In most cases, the simplest way to view your data is over a web-based interface. Ideally, it’ll feature a sophisticated search that you can use to find specific events from your logs, monitor the current status of incidents, and so on. That’s why being able to filter and search across your entire infrastructure with normalized fields is so helpful.
While the web interface may be good for finding small-scale trends or tracing the history of a specific type of incident, to get the bigger picture you need, well, pictures. Tables and lists of alerts don’t help you understand system-wide trends. Visualizations based on your incident management data, like the kind PagerDuty includes in reports, help you to interpret information on a large scale.
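If you want to see what sits underneath a trend chart, the sketch below rolls a list of incidents up into two numbers you could plot: incidents per service per week, and average minutes to resolve per service. The records are hand-written sample data, and the function names `weekly_counts` and `mean_minutes_to_resolve` are my own; treat this as an illustration of the aggregation step, not as a description of how PagerDuty’s reports are built.

```python
from collections import Counter, defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical sample records: (service, opened, resolved) with ISO-8601 timestamps.
RECORDS = [
    ("checkout", "2019-02-18T09:00:00", "2019-02-18T09:30:00"),
    ("checkout", "2019-02-20T14:00:00", "2019-02-20T15:30:00"),
    ("search", "2019-02-21T03:00:00", "2019-02-21T03:20:00"),
    ("checkout", "2019-02-26T10:00:00", "2019-02-26T10:45:00"),
]


def weekly_counts(records):
    """Count incidents per (ISO year, ISO week, service)."""
    counts = Counter()
    for service, opened, _resolved in records:
        year, week, _weekday = datetime.fromisoformat(opened).isocalendar()
        counts[(year, week, service)] += 1
    return counts


def mean_minutes_to_resolve(records):
    """Average time from open to resolve, in minutes, per service."""
    durations = defaultdict(list)
    for service, opened, resolved in records:
        delta = datetime.fromisoformat(resolved) - datetime.fromisoformat(opened)
        durations[service].append(delta.total_seconds() / 60)
    return {service: mean(values) for service, values in durations.items()}


if __name__ == "__main__":
    for (year, week, service), count in sorted(weekly_counts(RECORDS).items()):
        print(f"{year}-W{week:02d} {service}: {count}")
    print(mean_minutes_to_resolve(RECORDS))
```

Feeding these two dictionaries into whatever charting tool you already use is usually enough to show which services are getting noisier and which ones take longest to fix.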
Last but not least — especially if you’re into analyzing data programmatically — are APIs that let you export your log data as needed. The PagerDuty API makes it easy to collect and export log data in whatever format you need (and the Events API v2 also automatically normalizes all that data into a common format).
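As a rough sketch of what a programmatic export could look like, the Python below builds a request for a time window of incidents and flattens the decoded response into CSV. It assumes the PagerDuty REST API v2 incidents endpoint, token-style authorization, and the response field names shown in the code; check all of these against the current API reference before relying on them. The helpers `build_request` and `incidents_to_csv` are hypothetical names of mine, and the API key is a placeholder.

```python
import csv
import io
import urllib.parse
import urllib.request

# Assumed REST API v2 endpoint for listing incidents.
API_URL = "https://api.pagerduty.com/incidents"


def build_request(api_key: str, since: str, until: str, offset: int = 0) -> urllib.request.Request:
    """Build (but do not send) a paginated request for incidents in a time window."""
    query = urllib.parse.urlencode(
        {"since": since, "until": until, "limit": 100, "offset": offset}
    )
    return urllib.request.Request(
        f"{API_URL}?{query}",
        headers={
            "Authorization": f"Token token={api_key}",  # placeholder key
            "Accept": "application/vnd.pagerduty+json;version=2",
        },
    )


def incidents_to_csv(payload: dict) -> str:
    """Flatten the 'incidents' array of a decoded JSON response into CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "created_at", "status", "urgency", "service", "title"])
    for incident in payload.get("incidents", []):
        writer.writerow([
            incident.get("id", ""),
            incident.get("created_at", ""),
            incident.get("status", ""),
            incident.get("urgency", ""),
            (incident.get("service") or {}).get("summary", ""),
            incident.get("title", ""),
        ])
    return out.getvalue()
```

To actually fetch data, you would pass the request to `urllib.request.urlopen`, decode the body with `json.load`, and keep increasing `offset` until the response stops indicating that more pages are available. Keeping the network call separate from the flattening step also makes the export easy to test against a saved response.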
Once your data is ready for analysis, what should you be looking for? Your exact needs will vary according to the type of infrastructure you’re monitoring, of course, but some general points of information to heed include:
If you follow these tips, you won’t be left repeating history by facing the same types of incidents over and over again. Instead, you’ll be able to identify the big-picture trends, which will help you to find ways to make your infrastructure more effective overall.
And that’s how incident management can really pay off. Remember another oft-quoted maxim — “An ounce of prevention is worth a pound of cure.” Incident response is the cure, but creating a continuous feedback loop with historical incident management data is the best practice that enables prevention.