March 26, 2019
Your team had been fighting this major incident for hours, but your investigation was hitting one dead end after another. Finally, you managed to isolate the problem and your graphs started to improve. When all systems went back to normal, everyone let out a collective sigh of relief, shut down the response call, and went back to bed, never to think of this incident again.
Or so you thought.
There’s actually one more thing your team needed to do before moving on: perform a postmortem. Why? Postmortems are important because they help establish a culture of continuous improvement.
Without a postmortem, you and your team miss out on the opportunity to learn what you’re doing right, where you could improve, and most importantly, how to avoid making the same mistakes again and again. A well-designed, blameless postmortem will help your team improve their infrastructure and incident response process.
We’re excited to announce we’ve launched a comprehensive guide on how to conduct effective postmortems. No other resource (that we’ve found) covers the nuances of culture change, the details of how to perform an in-depth analysis, and the unique skills required to facilitate a calm and engaging conversation about failure. We explain why these concepts are important, describe the challenges associated with implementing them, and offer actionable instructions for conducting blameless postmortems.
If you are not yet doing postmortems, this guide will equip you with the knowledge and strategies needed to introduce a new process to your organization. For those of you with some experience doing postmortems, you will learn how to combat the natural tendency to blame, how to pursue new lines of inquiry for deeper incident analysis, how to make better use of the postmortem meeting, and how to improve your existing process in other ways.
During incident response, the team is 100 percent focused on restoring service. They cannot, and should not, spend time and mental energy on working out how to do something optimally or on a deep dive into what caused the incident. That’s why postmortems are essential: they provide a peacetime opportunity to reflect once the issue is no longer impacting users. The postmortem process drives focus, instills a culture of learning, and identifies opportunities for improvement that would otherwise be lost.
The incident postmortem goes by many names. You may know it as:
At its core, the postmortem is a document that describes in detail the situational factors that led to the incident, the steps taken to respond to it, and the planned work expected to prevent it from happening again. The postmortem process also includes a meeting to discuss the outcome of the analysis, followed by sharing those learnings with the broader organization and your customers.
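To make the shape of that document concrete, here is a minimal sketch of how its three parts could be modeled. The class and field names are our own illustration for this post, not a PagerDuty schema or API, and a real postmortem will carry far more detail than a few lists of strings.

```python
from dataclasses import dataclass, field


@dataclass
class Postmortem:
    """Hypothetical model of the three parts of a postmortem document."""

    title: str
    # Situational factors that led to the incident.
    contributing_factors: list[str] = field(default_factory=list)
    # Steps taken to respond to the incident.
    response_steps: list[str] = field(default_factory=list)
    # Planned work expected to prevent a recurrence.
    follow_up_actions: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Return the names of sections that are still empty."""
        sections = {
            "contributing_factors": self.contributing_factors,
            "response_steps": self.response_steps,
            "follow_up_actions": self.follow_up_actions,
        }
        return [name for name, items in sections.items() if not items]
```

A check like `missing_sections()` is one simple way to notice that a draft is not ready for the postmortem meeting yet.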
After resolving a major incident, you and your team should start thinking about the postmortem while the incident is still fresh in your minds. At PagerDuty, we complete postmortems within five days of every major incident. Just as resolving the incident becomes top priority when it occurs, completing the postmortem is prioritized over planned work. Postponing the postmortem delays key learnings that can prevent the incident from recurring.
As IT professionals, we understand that failure happens in complex systems—it’s unavoidable. And how we respond to failure when it occurs matters. An impulse to blame and punish individuals for causing incidents has the unintended effect of disincentivizing the knowledge-sharing required to prevent future incidents. Engineers will hesitate to speak up when incidents occur for fear of being blamed. This silence exacerbates the impact of incidents by increasing overall mean time to acknowledge and mean time to resolve.
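For readers who have not worked with these two metrics, the sketch below shows one common way to compute them. It assumes each incident records when it was triggered, acknowledged, and resolved, and it measures both metrics from the trigger time. The `Incident` class and its field names are hypothetical, chosen only for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""

    triggered_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime


def mean_time_to_acknowledge(incidents: list[Incident]) -> timedelta:
    """Average delay between an incident triggering and a responder acknowledging it."""
    seconds = mean(
        (i.acknowledged_at - i.triggered_at).total_seconds() for i in incidents
    )
    return timedelta(seconds=seconds)


def mean_time_to_resolve(incidents: list[Incident]) -> timedelta:
    """Average delay between an incident triggering and its resolution."""
    seconds = mean(
        (i.resolved_at - i.triggered_at).total_seconds() for i in incidents
    )
    return timedelta(seconds=seconds)
```

Both functions raise `statistics.StatisticsError` on an empty list, which is a reasonable signal that there is nothing to average. If engineers delay speaking up, `acknowledged_at` and `resolved_at` both drift later, and both averages grow.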
For the postmortem process to result in system improvements and learning, we must treat human error as a symptom of a systemic problem, not the cause itself. In complex systems of software development, a variety of conditions interact to lead to failure. The goal of the postmortem is to understand what systemic factors led to the incident and identify actions that can prevent this kind of failure from recurring.
A blameless postmortem stays focused on how a mistake was made instead of who made it. Many leading organizations, such as Etsy (a pioneer of blameless postmortems), rely on this approach to give postmortems the right tone. By eliminating the fear of punishment, it empowers engineers to give truly objective accounts of what happened.
It’s easy to agree that we want a culture of continuous improvement, but it’s difficult to practice the blamelessness required for learning. The inherently surprising nature of failure naturally leads humans to react in ways that interfere with our understanding of it. When processing information, the human mind unconsciously takes shortcuts to optimize for timeliness over accuracy, sometimes resulting in incorrect conclusions. In our guide, we detail many cognitive biases that interfere with postmortem analysis and strategies to overcome them.
The next time you encounter a major incident, remember your response is not done until the postmortem is done. Though major incident response is sometimes painful, it’s also an incredible opportunity to learn and make lasting improvements to your systems and processes.