March 26, 2019
2017 was a year of many major outages—some took down the Internet for hours while others disrupted business workflows and communication at companies large and small. Any way you slice it, these outages likely resulted in a lot of time devoted to postmortems.
I want to reflect a bit on why we write postmortems and suggest some things for authors to think about when writing them. I think there’s room for all of us to improve at gathering the information we need to plan proactive fixes before services catch fire.
Our incident response training docs put it this way: “Effective post-mortem[s] allow us to learn quickly from our mistakes and improve our services and processes for everyone.” The key takeaway for me is that organizations should use postmortems to capture what they learned from an incident. In other words:
I think these two points are what people generally mean when they talk about “Root Cause Analysis and Causal Factors,” “What Went Well,” and “What Didn’t Go Well” in postmortems.
That’s not what I want to talk about here though.
I think there’s another layer we get out of the postmortem process itself that hasn’t usually been part of the discussion: communicating about your service’s long-term stability.
For example, in the lead-up to one major incident, the postmortems of minor incidents in the same service highlighted nothing of concern. After the major incident was resolved, its postmortem looked at the “Role of Previous Incidents” and found that all identified immediate and P1 follow-ups had been completed or canceled due to changing plans or new information (it’s easy and okay to de-prioritize or not do something if it looks like a single occurrence).
Between the minor incidents and the big one, there certainly was work going on with regard to that particular platform, but I don’t think anyone would say that the service was in good health! The postmortems for the incidents during this period focused on the immediate issues of each incident; they didn’t capture the health of the service as a whole. As humans, we’re bad at remembering things, so it’s important to look at broader trends to see whether there is a recurring issue. I think there’s an opportunity to level up our processes by devoting more attention here when writing a postmortem report.
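One way to look at broader trends is to count incidents per service over a rolling window and flag any service that crosses a threshold. The following is a minimal sketch, not a description of how PagerDuty does this: the record layout, the service names, the sample dates, the 90-day window, and the threshold of three are all assumptions made for illustration.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical incident records: (service name, date, severity).
# The layout and the sample values are illustrative assumptions, not real data.
INCIDENTS = [
    ("billing-api", date(2018, 1, 3), "minor"),
    ("billing-api", date(2018, 1, 19), "minor"),
    ("search", date(2018, 1, 22), "minor"),
    ("billing-api", date(2018, 2, 7), "minor"),
]


def recurring_services(incidents, as_of, window_days=90, threshold=3):
    """Return services with at least `threshold` incidents in the window ending at `as_of`."""
    cutoff = as_of - timedelta(days=window_days)
    counts = Counter(
        service
        for service, when, _severity in incidents
        if cutoff <= when <= as_of
    )
    return {service: n for service, n in counts.items() if n >= threshold}


# Three minor incidents in one service within 90 days is worth a second look.
print(recurring_services(INCIDENTS, as_of=date(2018, 2, 28)))
```

A count like this does not replace judgment about a service’s health, but it can prompt the author of a postmortem to ask whether the current incident is really a single occurrence.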
At PagerDuty, engineering teams own their services, so we have opinions about the ongoing stability of our teams’ services. When a major incident involves a service, it forces us to revisit our judgment of that service’s stability and ask whether the incident has changed our opinion of its long-term health. If it has, we re-evaluate our plans to determine whether we need to prioritize large-scope work to improve that service. For a postmortem report, the crucial thing to remember is that the things we choose not to do are as important to capture as the action items we decide to do.
When looking over postmortem action items, we found that they tend to be very fine-grained and tightly scoped: upgrade this library, add this monitor, and so on. The guidance that floats around for action item timelines reinforces this. But it’s also important to communicate beyond that, because needs for large-scope remedial improvements that are spotted early are much easier to work into teams’ roadmaps. I think engineering teams, since they’re the people closest to services, often have a lot of internal knowledge and good instincts about the health of services, but don’t always have a good way to share them and to highlight issues that need larger work. Including this information in postmortem reports is an opportunity to be more transparent about these looming vulnerabilities.
The postmortem report is not just for the team conducting it and owning the service—the team prepares the report and conducts the postmortem investigation, but the final report itself is for the whole organization. A good report captures the risks of our current services, and will help Product and Engineering to more proactively prioritize work on services.
Someone from outside your team should be able to read your postmortem report and answer these five questions:
*Bonus question: Was there a previous incident that showed early signs pointing to this one?
I’d expect the answers to usually serve as introductory text to the “Action Items” the team intends to take, but sometimes “What Went Well” or “What Didn’t Go Well” will be a more appropriate place for them.
Additionally, if there are divergent views within the team preparing the report about the questions, that is also something to capture! Uncertainty is a valuable signal.
There are also some things to clarify about what we think we are accomplishing with the action items we are taking.
Ask yourselves, are we:
Learning more from and communicating better with postmortems will help you improve services and reduce the number and severity of incidents you encounter. We all want fewer major incidents and more sleep, and we can have that if we make sure we’re learning all we can from the incidents we do have.
Be sure to check out our Postmortem Handbook in which we share lessons learned from the trenches and how you can conduct better postmortems. Or dive directly into the product and try our streamlined postmortem process where you can create incident reports with a single click. Sign up for a free trial to get started!