Your high school history teacher no doubt delivered to you some variation on George Santayana’s famous remark that “those who cannot remember the past are condemned to repeat it.”
I’m pretty sure Santayana wasn’t thinking about incident management when he wrote that. But his wisdom still applies — and it’s worth heeding if you’re responsible for incident management.
True, the main purpose of incident management is to identify and resolve issues that affect your infrastructure, but your incident management operations shouldn’t stop there. Instead of just reacting to customer tickets, you should also take advantage of the rich volumes of data that your alerting systems generate. Used well, that data helps you proactively detect and prevent issues, and it yields insights that will make your infrastructure more resilient going forward.
In this post, I’ll outline some strategies for working with historical incident management data, including how to collect and analyze data, and what to look for when working with this information.
The first step in analyzing historical incident management data is finding a standardized way to collect and parse the information. This can be challenging since the amount and format of historical log data varies widely between different monitoring systems.
Some monitoring systems don’t provide much at all in the way of logged data that you can examine after the fact. For example, Pingdom is a great tool for real-time monitoring, but since it was designed to tell you what’s happening now, not what happened yesterday, it doesn’t provide much historical data on its own.
Other monitoring systems keep data for limited periods of time or store it in formats that are hard to work with. For instance, to analyze Snort data, you may need to sift through packet dumps. Unless Wireshark is your favorite way to spend a Friday evening, that’s a lot of work.
Moreover, if you have lots of monitoring systems in place, they probably dump data to a number of scattered locations. Some tools write logs to /var/log on local machines, where they’re hard to find and may be deleted by maintenance scripts. Others keep logs in the cloud for varying lengths of time — not ideal if you want to analyze all of your historical data at once.
For these reasons, in order to make the most of your incident management data, you should make sure to do two things:
Tools like Logstash, Splunk and Papertrail can be helpful here. They assist in collecting data from siloed locations and directing it to a central storage point.
PagerDuty takes things a step further by allowing you to import data from these and other sources and converting it to a standardized format. It then centralizes and cross-correlates the data, with visualizations that reveal patterns and trends and can help you identify root causes.
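To make the idea of a standardized format concrete, here is a minimal sketch of normalizing records from two differently shaped sources into one common structure. The `Incident` fields, the uptime-check payload, and the log line layout are all hypothetical, chosen only for illustration; real tools will emit different shapes.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Incident:
    """Common shape for an incident record, whatever tool produced it."""
    source: str
    service: str
    severity: str
    started_at: datetime  # always stored in UTC


def from_uptime_check(record: dict) -> Incident:
    # Hypothetical uptime-monitor payload: epoch seconds plus a boolean "down" flag.
    return Incident(
        source="uptime",
        service=record["check_name"],
        severity="critical" if record["down"] else "info",
        started_at=datetime.fromtimestamp(record["ts"], tz=timezone.utc),
    )


def from_syslog_line(line: str) -> Incident:
    # Hypothetical log layout: "<ISO-8601 timestamp with offset> <service> <LEVEL> <message>".
    timestamp, service, level, _message = line.split(" ", 3)
    return Incident(
        source="syslog",
        service=service,
        severity=level.lower(),
        started_at=datetime.fromisoformat(timestamp).astimezone(timezone.utc),
    )


incidents = [
    from_uptime_check({"check_name": "checkout", "down": True, "ts": 1572858900}),
    from_syslog_line("2019-11-04T09:20:00+00:00 checkout ERROR upstream timeout"),
]
```

Once every record shares the same fields and the same time zone, searching, sorting, and correlating across sources becomes a matter of ordinary list operations.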
Saving your data is only half the battle. The other challenge is how to view and analyze it.
In most cases, the simplest way to view your data is over a web-based interface. Ideally, it’ll feature a sophisticated search that you can use to find specific events from your logs, monitor the current status of incidents, and so on. That’s why being able to filter and search across your entire infrastructure with normalized fields is so helpful.
While the web interface may be good for finding small-scale trends or tracing the history of a specific type of incident, to get the bigger picture you need, well, pictures. Tables and lists of alerts don’t help you understand system-wide trends. Visualizations based on your incident management data, like the kind PagerDuty includes in reports, help you to interpret information on a large scale.
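The aggregation behind a trend chart can be sketched in a few lines. This example assumes incidents are available as `(started_at, service)` pairs, which is an illustrative shape rather than any particular tool's export format, and it buckets them by ISO week so that recurring problems stand out.

```python
from collections import Counter
from datetime import datetime


def weekly_counts(incidents):
    """Count incidents per (ISO year, ISO week, service) bucket."""
    counts = Counter()
    for started_at, service in incidents:
        iso_year, iso_week, _weekday = started_at.isocalendar()
        counts[(iso_year, iso_week, service)] += 1
    return counts


# Illustrative sample data, not real incident history.
sample = [
    (datetime(2019, 11, 4, 9, 15), "checkout"),
    (datetime(2019, 11, 6, 22, 0), "checkout"),
    (datetime(2019, 11, 12, 3, 30), "search"),
]

for (year, week, service), total in sorted(weekly_counts(sample).items()):
    print(f"{year}-W{week:02d} {service:<10} {'#' * total}")
```

Feeding the same counts into a plotting library gives a proper chart; the text bars here are only meant to show that the grouping, not the rendering, is where the insight comes from.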
Last but not least — especially if you’re into analyzing data programmatically — are APIs that let you export your log data as needed. The PagerDuty API makes it easy to collect and export log data in whatever format you need (and the Events API v2 also automatically normalizes all that data into a common format).
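As a rough sketch of what a programmatic export involves, the code below pages through results and writes selected fields to CSV. It assumes offset-based pagination in which each page is a dictionary with an `incidents` list and a `more` flag. `fetch_page` is a hypothetical callable standing in for the real HTTP request, and the field names are illustrative, so consult the PagerDuty API reference for the actual endpoint, parameters, and authentication.

```python
import csv
import io


def export_incidents(fetch_page, page_size=100):
    """Collect every incident by following offset-based pagination."""
    incidents, offset = [], 0
    while True:
        page = fetch_page(offset=offset, limit=page_size)
        incidents.extend(page["incidents"])
        if not page.get("more"):
            return incidents
        offset += page_size


def to_csv(incidents, fields=("id", "created_at", "status")):
    """Render the selected fields as CSV text, ignoring any other keys."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(incidents)
    return buffer.getvalue()
```

Passing the request function in as an argument keeps the paging logic easy to test with canned pages before pointing it at a live account.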
Once your data is ready for analysis, what should you be looking for? Your exact needs will vary according to the type of infrastructure you’re monitoring, of course, but some general points of information to heed include:
If you follow these tips, you won’t be left repeating history by facing the same types of incidents over and over again. Instead, you’ll be able to identify the big-picture trends, which will help you to find ways to make your infrastructure more effective overall.
And that’s how incident management can really pay off. Remember another oft-quoted maxim — “An ounce of prevention is worth a pound of cure.” Incident response is the cure, but creating a continuous feedback loop with historical incident management data is the best practice that enables prevention.