Using Postmortems to Understand Service Reliability
2017 was a year of many major outages—some took down the Internet for hours while others disrupted business workflows and communication at companies large and small. Any way you slice it, these outages likely resulted in a lot of time devoted to postmortems.
I want to reflect a bit on why we write postmortems and suggest some things for authors to think about when writing them. I think there’s room for all of us to improve when it comes to gathering information to better plan pro-active fixes before services catch fire.
Why Do We Conduct Postmortems?
Our incident response training docs put it this way: “Effective post-mortem[s] allow us to learn quickly from our mistakes and improve our services and processes for everyone.” The key takeaway for me is that organizations should use postmortems to capture what they learned from an incident. In other words:
- Postmortems are an exercise to learn the specifics of why an incident happened and what needs to be done to prevent this incident in the future.
- Organizations should try and learn how effective their incident response process is and what areas can be improved.
I think these two points are what are generally talked about when people talk about “Root Cause Analysis and Causal Factors,” and “What Went Well” and “What Didn’t Go Well” in postmortems.
That’s not what I want to talk about here though.
I think there’s another layer we get out of the postmortem process itself that hasn’t usually been part of the discussion: communicating about your service’s long-term stability.
For example, in one major incident, postmortems of minor incidents in the same service leading up to it highlighted nothing of concern—until the big incident happened. After it was resolved, the major incident postmortem looked at the “Role of Previous Incidents” and found that all identified immediate and P1 follow-ups were completed or canceled due to changing plans or new information (it’s easy and okay to de-prioritize or not do something if it looks like a single occurrence).
During the time of the minor incidents up until the big incident, there certainly was work going on with regards that particular platform, but I don’t think that anyone would say that the service was in good health! The postmortems for the incidents during this period focused on the immediate issues of the incident—they didn’t capture the health of the service as a whole. As humans, we’re bad at remembering things, so it’s important to look at broader trends to see if there is a recurring issue or not. I think there’s opportunity to level up processes by devoting more attention here when writing a postmortem report.
At PagerDuty, we’re service-owning engineering teams, so we have opinions about the ongoing stability of our teams’ services. When a major incident occurs involving a service, it forces us to think about our judgment of the stability, and whether our opinion about the long-term health has changed because of the incident. If it has, we then re-evaluate our plans to determine whether we need to prioritize large-scope work to improve that service. For a postmortem report, the crucially important thing to remember is that the things we choose not to do as action items are as important to capture as the action items we decide to do.
When looking over postmortem action items, we found that they tend to be very fine-grained and tightly scoped—upgrade this library, add this monitor, and so on. The guidance that floats around for action items timelines reinforces this. But it’s also important to communicate beyond that—needs for large-scoped remedial improvements that are spotted early on are much easier to work into the roadmaps of teams. I think engineering teams, since they’re the people closest to services, often have a lot of internal knowledge and good instincts about the health of services, but don’t always have a good way to share them and to highlight issues that need larger work. By including this information in postmortem reports, it’s an opportunity to be more transparent about these looming vulnerabilities.
The postmortem report is not just for the team conducting it and owning the service—the team prepares the report and conducts the postmortem investigation, but the final report itself is for the whole organization. A good report captures the risks of our current services, and will help Product and Engineering to more proactively prioritize work on services.
Five Questions to Answer During a Postmortem (None of Which Are “Why”)
Someone from outside your team should be able to read your postmortem report and answer these five questions:
- How did we view the health of the service involved prior to the incident?
- Did this incident teach us something that should change our views about this service’s health?
- Was this an isolated and specific bug—a failure in a class of problem we anticipated—or did it uncover a class of issue we did not architecturally anticipate in the service?
- Do we think an incident akin to this one will happen again if we don’t take larger systemic action beyond the action items captured here?
- Will this class of issue get worse/more likely to happen as we continue to grow and scale the use of the service?
*Bonus question: Was there a previous incident that showed early signs pointing to this one?
I’d expect these usually to be used as introductory text to the “Action Items” the team intends to take, but sometimes “What Went Well” or “What Didn’t Go Well” will be more appropriate.
Additionally, if there are divergent views within the team preparing the report about the questions, that is also something to capture! Uncertainty is a valuable signal.
There are also some things to clarify about what we think we are accomplishing with the action items we are taking.
Ask yourselves, are we:
- Dealing with a specific issue immediately in a narrow, targeted way?
- Taking action to eliminate what we see as an entire class of potential issues?
- Not taking action, because larger efforts are already underway and will rapidly obsolete a targeted fix? (If so, those larger efforts should be called out!)
- Not taking significant action because we don’t think it’s justified?
Learning more from and communicating better with postmortems will help you improve services and reduce the number and severity of incidents you encounter. We all want fewer major incidents and more sleep, and we can have that if we make sure we’re learning all we can from the incidents we do have.
Be sure to check out our Postmortem Handbook in which we share lessons learned from the trenches and how you can conduct better postmortems. Or dive directly into the product and try our streamlined postmortem process where you can create incident reports with a single click. Sign up for a free trial to get started!