2017 was a year of many major outages—some took down the Internet for hours while others disrupted business workflows and communication at companies large and small. Any way you slice it, these outages likely resulted in a lot of time devoted to postmortems.
I want to reflect a bit on why we write postmortems and suggest some things for authors to think about when writing them. I think there’s room for all of us to improve at gathering the information we need to plan proactive fixes before services catch fire.
Our incident response training docs put it this way: “Effective post-mortem[s] allow us to learn quickly from our mistakes and improve our services and processes for everyone.” The key takeaway for me is that organizations should use postmortems to capture what they learned from an incident. In other words:
I think these two points are what people generally mean when they talk about “Root Cause Analysis and Causal Factors,” “What Went Well,” and “What Didn’t Go Well” in postmortems.
That’s not what I want to talk about here, though.
I think there’s another layer we get out of the postmortem process itself that hasn’t usually been part of the discussion: communicating about your service’s long-term stability.
For example, in one major incident, postmortems of minor incidents in the same service leading up to it highlighted nothing of concern—until the big incident happened. After it was resolved, the major incident postmortem looked at the “Role of Previous Incidents” and found that all identified immediate and P1 follow-ups were completed or canceled due to changing plans or new information (it’s easy and okay to de-prioritize or not do something if it looks like a single occurrence).
Between the minor incidents and the big one, there certainly was work going on with regard to that particular platform, but I don’t think anyone would say that the service was in good health! The postmortems for the incidents during this period focused on the immediate issues of each incident; they didn’t capture the health of the service as a whole. As humans, we’re bad at remembering things, so it’s important to look at broader trends to see whether an issue is recurring. I think there’s an opportunity to level up our processes by devoting more attention to this when writing a postmortem report.
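As a rough illustration of looking at broader trends rather than single incidents, here is a minimal Python sketch that counts incidents per service over a trailing window and flags services that keep showing up. Everything in it is an assumption made for illustration: the `INCIDENTS` records, the service names, the `recurring_services` helper, and the 90-day window and threshold of three are hypothetical, not PagerDuty data or a PagerDuty API.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical incident records: (service name, date of incident).
# Illustrative values only, not real data.
INCIDENTS = [
    ("billing", date(2017, 9, 4)),
    ("billing", date(2017, 10, 12)),
    ("search", date(2017, 10, 20)),
    ("billing", date(2017, 11, 2)),
]


def recurring_services(incidents, as_of, window_days=90, threshold=3):
    """Return services with at least `threshold` incidents in the trailing window."""
    cutoff = as_of - timedelta(days=window_days)
    # Count only incidents that fall inside [cutoff, as_of].
    counts = Counter(svc for svc, day in incidents if cutoff <= day <= as_of)
    return sorted(svc for svc, n in counts.items() if n >= threshold)


print(recurring_services(INCIDENTS, as_of=date(2017, 11, 15)))
```

A count like this is no substitute for the team's judgment about a service's health, but it can prompt the question of whether a "single occurrence" is really a single occurrence.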
At PagerDuty, our engineering teams own their services, so we have opinions about the ongoing stability of those services. When a major incident involves a service, it forces us to revisit our judgment of its stability and ask whether our opinion about its long-term health has changed because of the incident. If it has, we re-evaluate our plans to determine whether we need to prioritize large-scope work to improve that service. For a postmortem report, the crucial thing to remember is that the action items we choose not to do are as important to capture as the ones we decide to do.
When looking over postmortem action items, we found that they tend to be very fine-grained and tightly scoped: upgrade this library, add this monitor, and so on. The guidance that floats around for action item timelines reinforces this. But it’s also important to communicate beyond that, because needs for large-scope remedial improvements that are spotted early are much easier to work into teams’ roadmaps. I think engineering teams, since they’re the people closest to the services, often have a lot of internal knowledge and good instincts about service health, but don’t always have a good way to share them or to highlight issues that need larger work. Including this information in postmortem reports is an opportunity to be more transparent about these looming vulnerabilities.
The postmortem report is not just for the team conducting it and owning the service—the team prepares the report and conducts the postmortem investigation, but the final report itself is for the whole organization. A good report captures the risks of our current services, and will help Product and Engineering to more proactively prioritize work on services.
Someone from outside your team should be able to read your postmortem report and answer these five questions:
*Bonus question: Was there a previous incident that showed early signs pointing to this one?
I’d expect the answers to these questions usually to serve as introductory text to the “Action Items” the team intends to take, but sometimes “What Went Well” or “What Didn’t Go Well” will be a more appropriate place.
Additionally, if there are divergent views within the team preparing the report about the questions, that is also something to capture! Uncertainty is a valuable signal.
There are also some things to clarify about what we think we are accomplishing with the action items we are taking.
Ask yourselves, are we:
Learning more from and communicating better with postmortems will help you improve services and reduce the number and severity of incidents you encounter. We all want fewer major incidents and more sleep, and we can have that if we make sure we’re learning all we can from the incidents we do have.
Be sure to check out our Postmortem Handbook in which we share lessons learned from the trenches and how you can conduct better postmortems. Or dive directly into the product and try our streamlined postmortem process where you can create incident reports with a single click. Sign up for a free trial to get started!