Postmortem Documentation Guide

Incidents and outages are a reality in the tech world, and they continue to be a costly problem. According to recent data, customer-impacting incidents increased by 43% over the last year and cost nearly $800,000. Although these moments can be challenging for both customers and teams, they also present valuable opportunities for growth: the actions taken during and after an incident can make all the difference in strengthening systems and preventing future issues.

This is where a well-crafted postmortem becomes an essential tool, but only if the team makes time afterward to analyze what happened, how it was resolved, and how to move forward. In this resource, we’ll walk through the process of creating an effective postmortem to help teams learn, adapt, and continuously improve their incident management.

What is a postmortem?

A postmortem is a structured process following an incident. Sometimes called an incident postmortem or post-incident review, it includes a detailed review of what occurred, why it happened, how it was resolved and what to do to prevent recurring incidents.

Postmortems can help teams understand what’s going well, what could be improved, and how to avoid making the same mistakes. A thorough postmortem with detailed documentation helps teams learn from mistakes and improve systems and processes.

Postmortems are essential after an incident and help team members reflect and identify opportunities for improvement.

Project postmortem questions

Postmortems aren’t just beneficial following an incident; they can also be valuable learning tools after completing a project. This type of postmortem gives project teams an opportunity to assess what went well and what can be improved.

Here are some questions to ask during a project postmortem:

What were the goals of the project? Were the goals achieved?
What were the wins/successful parts of the project?
How well did the team collaborate? Were there any blockers (e.g., communication or timelines)?
Were the requirements clearly defined for team members?
Did the team feel they had adequate resources to complete the project?
What issues did the team encounter?
Did the project adhere to the budget and timeline?
What were the biggest learnings or takeaways from this project? How can these learnings be applied to future projects?
What skill gaps did this project uncover that must be addressed?

How to write a postmortem

Creating a postmortem report is crucial for documenting incidents, identifying contributing factors, and establishing actions that prevent recurrence and support continuous improvement.

The postmortem documentation must be detailed and include a summary of the incident, timeline, root causes, impact, and action items to help teams determine the cause of an incident and what they can do to prevent incidents in the future. Below are the must-have sections and details to include in a postmortem.
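As a rough illustration, the must-have parts named above can be turned into a small completeness check for a draft. This is a minimal sketch, not a prescribed tool: the section names are taken from the paragraph above, while the `missing_sections` helper and the dictionary layout of the draft are assumptions made for the example.

```python
# Sections named in the text as required in postmortem documentation.
REQUIRED_SECTIONS = ("summary", "timeline", "root causes", "impact", "action items")


def missing_sections(draft):
    """Return the required sections that are absent or empty in a draft.

    `draft` is assumed to map section names to their text (hypothetical layout).
    """
    return [name for name in REQUIRED_SECTIONS if not draft.get(name, "").strip()]


# Hypothetical draft with two sections filled in and one left blank.
draft = {
    "summary": "Checkout requests failed for a subset of users.",
    "timeline": "10:00 impact began; 11:30 impact ended.",
    "impact": "   ",
}
print(missing_sections(draft))
```

Running a check like this before the postmortem meeting makes it easy to see which parts still need an owner.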

Overview

Give a high-level summary of the incident, including the root causes, timeline, and impact. Briefly describe what happened, specify which teams were involved (e.g., IT, DevOps, Support), and provide a snapshot of the major milestones in the incident (e.g., detection, containment, resolution). Include a brief overview of the incident’s business and user impact to help stakeholders quickly understand the scope.

What happened?

Provide a short description of what occurred during the incident.

Include details on:

Which parts of the infrastructure were impacted, and which specific services or functions were disrupted?
What user-facing effects occurred, such as slowdowns, access issues, or feature unavailability?
Who was involved in the response?
How was the incident resolved?

Root causes

In this section, list any conditions that may have contributed to the incident/issue.

Describe all contributing factors that may have led to the incident, such as recent changes in code, system load, or configuration errors.
Note if the team attempted interim solutions or escalated the issue internally before the root cause was discovered.

Be sure to note any actions that made the situation worse. Filling in this section can help team members learn from the mistakes that contributed to the incident.

Resolution

How was the problem resolved? What actions were taken?

Document short-term fixes and the permanent solution as separate points. Include any workarounds or manual interventions used while resolving the issue. Link to specific runbooks, guides, or procedures so responders have a reference in case the incident recurs.

Impact

What happened as a result of the incident? Be very detailed in this section and include any numbers or other particulars.

This section should include:

Timeline: Outline the major milestones from discovery to resolution, including the duration of each phase.
Issue discovery: Describe when and how the issue was first detected (e.g., internal monitoring, customer reports, or automated alerts).
Severity: Assign a severity level to the incident, detailing why it was categorized in this way and any specific criteria used in the assessment.
Customer impact: Document the number of customers affected and the duration of their impact. Specify the type of impact customers experienced (e.g., service disruptions, slowed performance) and any variations across user segments.
Impact on internal teams and partners: Describe how the incident affected internal teams and partners. Include any delays, resource reallocations, or additional workloads.
Final impact analysis (technical and business KPIs): Use KPIs to quantify the technical and business impact. For example, this might include uptime metrics, service-level agreement (SLA) breaches, lost revenue, or user churn.
Resolution: Summarize the steps taken to resolve the issue, including any interim fixes and the final solution based on root cause analysis. Link to more detailed technical documentation if applicable.
Potential future impact: Assess the likelihood of similar incidents occurring in the future and discuss any potential long-term implications for the system or business.
Learnings and opportunities for continuous improvement: Reflect on what the team learned from this incident and outline specific areas for improvement in processes, tools, or team dynamics. Note any follow-up actions that can be taken to prevent recurrence.

Use metrics that matter to both technical teams and business stakeholders to quantify the impact of the incident. These include metrics like event submission or delayed processing.

Additional metrics to include:

Time to detect (TTD): Helps teams understand how fast monitoring and alert systems flagged the incident.
Time to respond (TTR): Measures how quickly the team began investigating and allocating resources.
Time to resolve (also abbreviated TTR, so spell it out where the two could be confused): Measures how efficiently the team diagnosed and fixed the problem.
System availability/uptime: The amount of time a system is operational, indicating system reliability.
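The metrics above reduce to simple differences between timestamps. The following is a minimal sketch of that arithmetic, under stated assumptions: the function names and the sample timestamps are hypothetical, time to resolve is measured from the start of impact (some teams measure it from detection instead), and availability is computed over a single reporting period.

```python
from datetime import datetime, timedelta


def incident_metrics(impact_began, detected, response_began, resolved):
    """Derive the duration metrics from four timeline timestamps."""
    return {
        "time_to_detect": detected - impact_began,
        "time_to_respond": response_began - detected,
        # Assumption: resolution time is measured from the start of impact.
        "time_to_resolve": resolved - impact_began,
    }


def availability_percent(period, downtime):
    """Share of the reporting period during which the system was operational."""
    return 100.0 * (1 - downtime / period)


# Hypothetical incident used only to illustrate the arithmetic.
metrics = incident_metrics(
    impact_began=datetime(2024, 1, 15, 10, 0),
    detected=datetime(2024, 1, 15, 10, 12),
    response_began=datetime(2024, 1, 15, 10, 20),
    resolved=datetime(2024, 1, 15, 11, 30),
)
for name, value in metrics.items():
    print(f"{name}: {value}")

# Treat the whole incident as downtime within a 30-day reporting period.
uptime = availability_percent(timedelta(days=30), metrics["time_to_resolve"])
print(f"availability over 30 days: {uptime:.3f}%")
```

Whichever definitions a team picks, stating them in the postmortem keeps the numbers comparable from one incident to the next.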

Timeline

The timeline should document how/when the incident occurred. The timeline should contain only facts and not focus on evaluating or analyzing what happened.

Tips for creating the timeline:

Start the timeline before the incident happened and work forward to the resolution.
Review the incident log in Slack or another team communication tool and find the decisions and actions that occurred during the response.
Add information from monitoring logs and deployments for the affected services. Include any changes to the incident status.
Note when customers first started reporting the issue and when it was discovered by internal users. What was the gap between the two?
For each item in the timeline, cite a metric or page showing where the data came from, such as a log search, tweet, or monitoring graph.

Times to include:

Time the impact began
When the team was notified
Time of any significant actions
Time the impact ended
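One way to assemble such a timeline is to collect entries from each source and sort them chronologically, keeping the data source next to every entry. The sketch below assumes a hypothetical `TimelineEntry` record and invented sample events; it illustrates the approach and does not represent any particular tool's format.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEntry:
    at: datetime   # when it happened
    event: str     # what happened, stated as fact without analysis
    source: str    # where the data came from


def build_timeline(*sources):
    """Merge entries from several sources into one chronological list."""
    merged = [entry for source in sources for entry in source]
    return sorted(merged, key=lambda entry: entry.at)


# Hypothetical entries gathered from a chat log and a monitoring tool.
chat_log = [
    TimelineEntry(datetime(2024, 1, 15, 10, 12), "Team was notified", "chat log"),
    TimelineEntry(datetime(2024, 1, 15, 11, 5), "Rollback started", "chat log"),
]
monitoring = [
    TimelineEntry(datetime(2024, 1, 15, 11, 30), "Impact ended", "monitoring graph"),
    TimelineEntry(datetime(2024, 1, 15, 10, 0), "Impact began", "monitoring graph"),
]

timeline = build_timeline(chat_log, monitoring)
for entry in timeline:
    print(f"{entry.at:%Y-%m-%d %H:%M} | {entry.event} (source: {entry.source})")
```

Keeping the source on every entry makes it easier for readers to verify each fact later.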

Responders

Describe the role team members played in resolving the incident.

Who documented the issue? Who else was involved? Specify the roles and responsibilities of each responder, like on-call engineer, communications manager, and technical support, and their key contributions to incident management.

Highlight any specific actions taken by individuals that were crucial in resolving the incident.

Evaluation

Assess the response process to understand strengths, areas for growth, and systemic contributing factors.

Analyze the contributing factors. Look beyond the immediate incident to identify a combination of contributing factors (organizational, human, technical).
Avoid blaming individuals. Blameless postmortems help teams move forward, find resolutions, and identify opportunities for improvement. Anonymize mistakes and recognize that people act without knowing how things will turn out.
Review monitoring data around the incident, including any unusual patterns, and ensure monitoring tools are in place to prevent future issues.
Ask critical questions. Consider if the issue is part of a trend, if it reflects anticipated or unexpected problems, and if past decisions contributed to it.
Reflect on what the team learned from this incident, focusing on identifying patterns across similar incidents. These insights can guide technical and organizational improvements, especially in designing automated responses. Teams should use these learnings to inform auto-remediation strategies, enabling the system to detect and respond to recurring issues without manual intervention.
Analyze how collaboration, communication, and review processes affected the incident, with the aim of improving future responses.

Include what went well and what didn’t. Doing so helps the team apply these learnings to prevent future incidents.

Next steps and action items

After recounting all the details, what’s next?

Identify actions to prevent recurrence and reduce the likelihood or impact of similar issues.

Include items such as:

Prioritized action steps. Assign priority levels (e.g., high, medium, low) to action items based on urgency and potential impact. Also, assign a team or individual responsible for each action item to ensure accountability.
Any fixes required to prevent the issue from happening again.
Consider improvements in monitoring, alerting, and incident response to detect issues and minimize the impact.
Any preparedness tasks that could improve the detection and mitigation of a similar issue.
Fixes for process or workflow issues identified in the postmortem, including internal emails and public status page updates.
Any improvement to the incident response process.
Specify deadlines. Include target dates for the completion of each action item.

Log all follow-up actions in a task management tool, labeling tickets with severity and date for easy tracking.

All action items should be actionable, specific, and time-bounded to ensure they lead to meaningful improvements and prevent similar incidents in the future.

Actionable: Start with a clear, directive verb to guide the responsible person on what needs to be done. Use precise verbs like “implement” or “document” rather than vague phrases.
Specific: Define the scope of each action item in detail to prevent misinterpretation or ambiguity. Specify which systems, processes, or teams the action pertains to and outline any key steps that need to be taken.
Time-bounded: Set clear deadlines or target completion dates to avoid open-ended tasks.
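These three criteria, together with the priority and owner mentioned earlier, can be checked mechanically before action items are logged. The following is a minimal sketch under stated assumptions: the `ActionItem` fields, the `review_action_item` helper, and the list of vague opening words are all hypothetical and would need tuning for a real team.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical examples of vague openers; adjust to the team's own vocabulary.
VAGUE_OPENERS = {"look", "consider", "think", "try"}
PRIORITIES = {"high", "medium", "low"}


@dataclass
class ActionItem:
    title: str
    owner: str
    priority: str
    scope: str            # systems, processes, or teams the item pertains to
    due: Optional[date]   # target completion date


def review_action_item(item):
    """Return a list of problems; an empty list means the item passes."""
    problems = []
    words = item.title.split()
    if not words or words[0].lower() in VAGUE_OPENERS:
        problems.append("title should start with a clear, directive verb")
    if not item.scope.strip():
        problems.append("scope should name the systems, processes, or teams involved")
    if item.due is None:
        problems.append("missing target completion date")
    if not item.owner.strip():
        problems.append("missing owner")
    if item.priority not in PRIORITIES:
        problems.append("priority should be high, medium, or low")
    return problems


# Hypothetical action items for illustration.
good = ActionItem(
    title="Implement an alert for queue depth on the ingestion service",
    owner="Platform team",
    priority="high",
    scope="Ingestion service monitoring",
    due=date(2024, 2, 1),
)
vague = ActionItem(title="Look into alerts", owner="", priority="urgent", scope="", due=None)

print(review_action_item(good))
print(review_action_item(vague))
```

A check like this only catches missing structure; whether an item is genuinely useful still needs a human reviewer.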

Messaging

It’s also important to include guidelines for follow-up messaging for employees and public-facing messaging for customers.

Internal email

Following the post-incident review, send an internal email to relevant employees. This email should include a short paragraph summarizing the incident and a link to the full postmortem document.

Think about which employees need to receive this message. For a major outage or company-wide incident, it may be appropriate to notify all employees to maintain transparency. For more isolated incidents, limit distribution to the specific teams or departments that were directly affected.

Consider whether the message should be tailored for different audiences. Even after a widespread incident, teams that were directly impacted or will be involved in post-incident follow-up may need specific instructions besides what’s shared with the larger company. Customizing the communication can help ensure that each group understands next steps and their role in preventing future occurrences.

External message

The external message will appear on the website or status page for partners and customers. Decide what to tell them, including what happened and what actions were taken. Respectfully acknowledge the issue and empathize with impacted customers.

Just like the internal communication, this messaging may vary based on the severity of the incident and who was involved.

It can also be helpful to start from a framework rather than from scratch. Here’s a helpful postmortem template.

Tips for writing a postmortem

We’ve covered the basic framework for creating a postmortem. Here are some best practices to follow.

Dos

Be thorough and detailed, documenting events and outcomes as accurately as possible. Be truthful and describe events and how they happened.
Discuss solutions, not just problems. In other words, separate what happened from how to fix it.
Write actionable follow-up tasks.
Define language and terminology. Some postmortem meeting attendees or readers may be newcomers, so ensure everyone is familiar with any terminology or acronyms.

Don’ts

Blame people or teams. Naming and shaming someone will only upset them rather than create an opportunity to learn and improve.
Change details or events to save face. Postmortems are only effective if they include accurate data.
Blame human error. There are often several contributing factors that cause an incident. Identify the underlying causes.

Who is responsible for creating the postmortem?

The Incident Commander, usually a member of the IT or DevOps team, should select one of the responders as the owner of the postmortem. Although the owner can collaborate with other responders to create the postmortem, the owner is ultimately responsible for ensuring it gets done.

The postmortem owner is responsible for:

Investigating the incident to find out what happened.
Creating the postmortem document and keeping it up to date with the latest information.
Updating the public-facing website page with relevant information.
Scheduling the postmortem meeting within a certain number of days depending on the severity of the incident (within three calendar days for a Sev-1 and five business days for a Sev-2).
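The scheduling windows in the last item mix calendar days and business days, which is easy to miscount by hand. Below is a minimal sketch of the date arithmetic, under stated assumptions: the window is counted from the day the incident is resolved (the text does not say which day starts the clock), business days are Monday through Friday, public holidays are ignored, and the function names are hypothetical.

```python
from datetime import date, timedelta


def add_business_days(start, days):
    """Return the date `days` business days after `start` (Mon-Fri only)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday is 0, Friday is 4
            remaining -= 1
    return current


def postmortem_meeting_deadline(resolved_on, severity):
    """Latest meeting date: 3 calendar days for Sev-1, 5 business days for Sev-2."""
    if severity == 1:
        return resolved_on + timedelta(days=3)
    if severity == 2:
        return add_business_days(resolved_on, 5)
    raise ValueError("only Sev-1 and Sev-2 windows are defined in this guide")


# Hypothetical incident resolved on Thursday, 18 January 2024.
resolved_on = date(2024, 1, 18)
sev1_deadline = postmortem_meeting_deadline(resolved_on, 1)
sev2_deadline = postmortem_meeting_deadline(resolved_on, 2)
print(sev1_deadline.isoformat(), sev2_deadline.isoformat())
```

Note that a Sev-1 deadline counted in calendar days can land on a weekend, so a team may prefer to schedule the meeting on the last working day before it.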

A successful postmortem is a strategic tool for ongoing growth, resilience, and improvement. By carefully analyzing each aspect of an incident, from root causes to resolution and action items, teams can turn challenges into opportunities for enhancing systems and processes.

Discover how PagerDuty supports teams in navigating incidents with confidence and building resilient operations. Start your free 14-day trial today.
