What is an Incident Post-Mortem?
A post-mortem (or postmortem) is a process intended to help you learn from past incidents. It typically involves an analysis or discussion soon after an event has taken place.
As your systems scale and become more complex, failure is inevitable, assessment and remediation become more involved and time-consuming, and repeating the same mistakes becomes increasingly painful. Not having data when you need it is expensive.
The good news is that most organizations already have some kind of post-mortem process in place to assess what happened once service has been restored. Arguably, the resolution of an issue isn’t truly complete until the team has fully documented and reflected on it.
However, conducting a post-mortem can be a highly time-consuming task — teams often spend hours on each post-mortem trying to piece together the chronology of events from different sources of information.
Streamlining the post-mortem process helps your team get the most from the time it invests: spending less time conducting each post-mortem while extracting more useful lessons is a faster path to operational maturity. In fact, the true value of post-mortems comes from helping institutionalize a positive culture of frequent, iterative improvement.
NOTE: Organizations may refer to the post-mortem process in slightly different ways. Other terms we’ve heard in the industry include:
- Learning Review
- After-Action Review
- Incident Review
- Incident Report
- Post-Incident Review
- Root Cause Analysis (or RCA)
Streamline the post-mortem process
The specifics around conducting post-mortems vary from organization to organization. Regardless of the process, the primary purpose of post-mortems should be learning, whether it’s about the systems being managed, the process being followed, or how the organization executes during a crisis. Additional goals, including identification and implementation of system or process improvements, may be realized depending on the process followed.
In general, an effective post-mortem report tells a story. Incident post-mortem reports should include the following:
- A high-level summary of what happened: Which services and customers were affected? How long and severe was the issue? Who was involved in the response? How did we ultimately fix the problem?
- A root cause analysis: What were the origins of failure? Why do we think this happened?
- Steps taken to diagnose, assess, and resolve: What actions were taken? Which were effective? Which were detrimental?
- A timeline of significant activity: Centralize key activities from chat conversations, incident details, and more.
- Learnings and next steps: What went well? What didn’t go well? How do we prevent this issue from happening again?
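To make the report structure above more concrete, here is a minimal Python sketch of how a team might represent a post-mortem report and centralize a timeline from several sources. It is illustrative only: the class names, field names, source labels, and sample events are all hypothetical and do not reflect any particular tool's data model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from itertools import chain


@dataclass(frozen=True)
class TimelineEvent:
    timestamp: datetime
    source: str       # hypothetical label, e.g. "chat" or "alerts"
    description: str


@dataclass
class PostMortemReport:
    # One field per report section listed above (names are assumptions).
    summary: str
    root_cause: str
    response_steps: list[str]
    timeline: list[TimelineEvent]
    learnings: list[str] = field(default_factory=list)

    def duration(self) -> timedelta:
        """Time between the first and last events; assumes a sorted timeline."""
        if not self.timeline:
            return timedelta(0)
        return self.timeline[-1].timestamp - self.timeline[0].timestamp


def merge_timelines(*sources: list[TimelineEvent]) -> list[TimelineEvent]:
    """Combine events from every source into one chronological list."""
    return sorted(chain.from_iterable(sources), key=lambda e: e.timestamp)


# Made-up sample events, for illustration only.
start = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)
chat = [
    TimelineEvent(start + timedelta(minutes=5), "chat", "On-call engineer acknowledges the page"),
]
alerts = [
    TimelineEvent(start, "alerts", "Error-rate alert fires"),
    TimelineEvent(start + timedelta(minutes=30), "alerts", "Alert resolves"),
]

report = PostMortemReport(
    summary="Elevated error rate on a hypothetical checkout service",
    root_cause="To be determined during the review",
    response_steps=["Rolled back the most recent deploy"],
    timeline=merge_timelines(chat, alerts),
)

for event in report.timeline:
    print(f"{event.timestamp:%H:%M} [{event.source}] {event.description}")
print("Duration:", report.duration())
```

Sorting on the timestamp alone keeps the sketch short; a real process would also need to reconcile time zones and clock differences between sources before the merged chronology can be trusted.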
Why do post-mortems?
During incident response, the team is 100% focused on restoring service. They cannot, and should not, spend time and mental energy thinking about how to do something more optimally or performing a deep dive into the root cause of an outage. That’s why post-mortems are essential: they provide a peacetime opportunity to reflect once the issue is no longer impacting users’ experiences. The post-mortem process drives focus, instills a culture of learning, and identifies opportunities for improvement that otherwise would be completely lost.
By explicitly dedicating time to discussing and documenting lessons learned while the incident is still fresh in their minds, the team can focus on the right thing at the right time. The team does not sacrifice its ability to respond quickly in the midst of the fire, nor does it lose the opportunity to collaboratively understand how to improve its infrastructure and processes across every step of the response.
Post-mortems matter because learning together establishes the right culture around failing forward, with iterative and continuous improvement.
The blameless post-mortem
A blameless post-mortem is critical for understanding failures because it focuses on how a mistake was made instead of who made it. “You ignore the ‘this person did that’ part,” explains PagerDuty Engineering Manager Arup Chakrabarti. “What matters most is the customer impact, and that’s what you focus on.” Many leading organizations, such as Etsy, a pioneer of blameless post-mortems, use this approach to ensure post-mortems have the right tone, empowering engineers to give truly objective accounts of what happened by eliminating the fear of punishment.
Some argue that a truly blameless post-mortem may not be possible because humans are hardwired for blame. They advocate “blame-aware” post-mortems, in which teams acknowledge the instinct to blame but focus their attention on actionable takeaways instead.
Whichever terminology resonates with your team, the key point is that post-mortem discussions should be safe spaces in which teams can be completely honest and oriented around improving for the future instead of blaming others for the past.
Best practices and more
PagerDuty offers a completely free post-mortem handbook that shares industry best practices and includes a post-mortem template. Use it to formalize your own post-mortem process and make it as easy as possible for your team to respond to issues. Even better, post-mortems are now part of the PagerDuty platform: sign up for a free 14-day trial and streamline the entire post-mortem process with automated timeline building, collaborative editing, actionable insights, and more.