What is an Incident Postmortem?
A postmortem (or post-mortem) is a process intended to help you learn from past incidents. It typically involves an analysis or discussion soon after an event has taken place.
As your systems scale and become more complex, failure becomes inevitable, assessment and remediation become more involved and time-consuming, and repeating the same mistakes becomes increasingly painful. Not having data when you need it is expensive.
The good news is, most organizations do have some kind of a postmortem process in place to assess what happened once a service has been restored. Arguably, any resolution of an issue isn’t truly complete until a team has fully documented and reflected on it.
However, conducting a postmortem can be highly time-consuming: teams often spend hours on each one trying to piece together the chronology of events from different sources of information.
Streamlining the postmortem process is key to helping your team get the most from their postmortem time investment: spending less time conducting the postmortem, while extracting more effective learnings, is a faster path to increased operational maturity. In fact, the true value of postmortems comes from helping institutionalize a positive culture around frequent and iterative improvement.
Organizations may refer to the postmortem process in slightly different ways:
- Learning Review
- After-Action Review
- Incident Review
- Incident Report
- Post-Incident Review
- Root Cause Analysis (or RCA)
Streamline the postmortem process
The specifics around conducting postmortems vary from organization to organization. Regardless of the process, the primary purpose of postmortems should be learning, whether it’s about the systems being managed, the process being followed, or how the organization executes during a crisis. Additional goals, including identification and implementation of system or process improvements, may be realized depending on the process followed.
In general, an effective postmortem report tells a story. Incident postmortem reports should include the following:
- A high-level summary of what happened
Which services and customers were affected? How long and severe was the issue? Who was involved in the response? How did we ultimately fix the problem?
- A root cause analysis
What were the origins of failure? Why do we think this happened?
- Steps taken to diagnose, assess, and resolve
What actions were taken? Which were effective? Which were detrimental?
- A timeline of significant activity
Centralize key activities from chat conversations, incident details, and more.
- Learnings and next steps
What went well? What didn’t go well? How do we prevent this issue from happening again?
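The report outline above can be sketched as a small data structure. This is a minimal illustration in Python, not a PagerDuty API: the names `TimelineEvent`, `PostmortemReport`, and `build_timeline`, along with the source labels and sample events, are hypothetical and chosen only for this example. It shows one way to hold the five report sections and to merge activity from several sources (for instance chat and an incident log) into a single chronological timeline.

```python
from dataclasses import dataclass, field
from datetime import datetime
from heapq import merge


@dataclass(frozen=True)
class TimelineEvent:
    """One significant activity during the incident."""
    timestamp: datetime
    source: str        # hypothetical label, e.g. "chat" or "incident-log"
    description: str


@dataclass
class PostmortemReport:
    """Fields mirror the report sections listed above."""
    summary: str                                   # high-level summary
    root_cause: str                                # root cause analysis
    resolution_steps: list[str] = field(default_factory=list)
    timeline: list[TimelineEvent] = field(default_factory=list)
    learnings: list[str] = field(default_factory=list)
    next_steps: list[str] = field(default_factory=list)


def build_timeline(*sources: list[TimelineEvent]) -> list[TimelineEvent]:
    """Merge events from several sources into one chronological list."""
    # Sort each source on its own, then merge the sorted streams by timestamp.
    ordered = [sorted(events, key=lambda e: e.timestamp) for events in sources]
    return list(merge(*ordered, key=lambda e: e.timestamp))


# Made-up sample data, for illustration only.
chat = [
    TimelineEvent(datetime(2024, 1, 1, 10, 5), "chat", "Responder suspects bad deploy"),
    TimelineEvent(datetime(2024, 1, 1, 10, 20), "chat", "Rollback confirmed healthy"),
]
incident_log = [
    TimelineEvent(datetime(2024, 1, 1, 10, 0), "incident-log", "Incident triggered"),
    TimelineEvent(datetime(2024, 1, 1, 10, 12), "incident-log", "Rollback started"),
]

report = PostmortemReport(
    summary="Checkout errors for a subset of customers",
    root_cause="Faulty configuration shipped in a deploy",
    timeline=build_timeline(chat, incident_log),
)

for event in report.timeline:
    print(event.timestamp.isoformat(), event.source, event.description)
```

Keeping the timeline as structured events rather than free text means new sources can be added later without rewriting the report by hand.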
Why do postmortems?
During incident response, the team is 100% focused on restoring service. They cannot, and should not, spend time and mental energy thinking about how to do something more optimally or performing a deep dive into the root cause of an outage. That’s why postmortems are essential: they provide a peacetime opportunity to reflect once the issue is no longer impacting users’ experiences. The postmortem process drives focus, instills a culture of learning, and identifies opportunities for improvement that otherwise would be completely lost.
By forcing the team to explicitly dedicate time towards discussing and documenting lessons learned, while the incident is still fresh in their minds, the team is able to prioritize their focus on the right thing at the right time. The team does not sacrifice its ability to respond quickly in the midst of the fire, nor does it lose the opportunity to collaboratively understand how to improve its infrastructure and processes across every step of the response.
Postmortems matter because learning together establishes the right culture around failing forward, with iterative and continuous improvement.
The blameless postmortem
A blameless postmortem is critical for understanding failures: it focuses on how a mistake was made instead of who made it. “You ignore the ‘this person did that’ part,” explains PagerDuty Engineering Manager Arup Chakrabarti. “What matters most is the customer impact, and that’s what you focus on.” Many leading organizations, such as Etsy, a pioneer of blameless postmortems, use this approach to set the right tone and to empower engineers to give truly objective accounts of what happened by eliminating the fear of punishment.
Some argue that a truly blameless postmortem may not be possible because humans are hardwired for blame. They advocate “blame-aware” postmortems, in which teams acknowledge the instinct to blame but focus their attention on actionable takeaways instead.
Whichever terminology resonates with your team, the key point is that postmortem discussions should be safe spaces in which teams can be completely honest and oriented around improving for the future instead of blaming others for the past.
Best practices and more
PagerDuty offers a completely free postmortem handbook that shares industry best practices and includes a postmortem template. Use it to help you formalize your own postmortem process to make it as easy as possible for your team to respond to issues. Even better, postmortems are now part of the PagerDuty platform: sign up for a free 14-day trial and streamline the entire postmortem process with automated timeline building, collaborative editing, actionable insights, and more.