Quick Tips: How to Post Mortem Every Incident

by David Hayes December 17, 2015 | 3 min read

The Case for Post Morteming Every Incident

A post mortem is a process for investigating an incident to figure out what went wrong and what can be learned from it. We’ve written before about why you should not only post mortem major incidents, but publish the results as well. Post mortems shouldn’t be reserved for major incidents, though. As a general rule, we recommend that you follow up on every incident, especially if it woke someone up. Every incident is an opportunity to learn as a team and improve your product, and the follow-up doesn’t always need to be a heavyweight process.

Tips for Making it Easy

Here are some tips for making it quick and easy:

Establish a threshold for what gets a full team post mortem. At PagerDuty, the team looks at all Sev1s, Sev2s and anywhere a process broke down. Everything else is checked by a single person.
Batch up minor incidents and look at a week’s worth at a time (tip: PagerDuty’s analytics functionality is great for this). The end-of-shift handoff is probably the best time to do it.
The goal is to prioritize your various resolution efforts, not to assign blame.
The outcomes can be simple, like the following examples:
- Adjusting the alerting threshold on that particular monitoring tool. (In my experience, this one is underapplied.)
- Adding a new filter in PagerDuty via email filters, support hours, or our new Event Enrichment Platform beta.
- Counting repeat, low-urgency incidents. Most problems aren’t blockers, but you should still track how often they happen so they can be prioritized and addressed when you have the bandwidth.
- Tweaking the routing of a particular notification.
- Automatically scheduling a maintenance window, if all else fails. (I personally don’t recommend this solution, but it’s a popular use of our API.)
- Updating the runbook (and linking it in the service description so responders see it).
Track rough estimates of how disruptive each shift’s incidents are for your team. Has it been getting better or worse over the last few shifts? Do your incidents follow a power law (one large incident, many small ones) or are you always putting out medium-sized fires?
Include all available raw supporting material (logs, chat transcripts, etc.) in your Reason for Outage (RFO) document as appendices.
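
To make a few of these tips concrete, here is a minimal Python sketch of the triage threshold, the repeat-incident count, and a rough per-shift disruption estimate. It is only an illustration: the `Incident` record, its field names, and the helper functions are hypothetical and do not reflect PagerDuty's API or data model, and "minutes to resolve" is just one assumed stand-in for how disruptive an incident was.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; fields are illustrative only."""
    title: str
    severity: int             # 1 = Sev1, 2 = Sev2, and so on
    process_broke_down: bool  # True if a process failed during response
    shift: str                # label for the on-call shift
    minutes_to_resolve: int   # assumed proxy for disruption


def needs_full_post_mortem(incident):
    """Sev1s, Sev2s, and any process breakdown go to the whole team."""
    return incident.severity <= 2 or incident.process_broke_down


def split_for_review(incidents):
    """Return (team_review, single_reviewer_batch) for a set of incidents."""
    full = [i for i in incidents if needs_full_post_mortem(i)]
    minor = [i for i in incidents if not needs_full_post_mortem(i)]
    return full, minor


def repeat_counts(incidents):
    """Count how often each incident title recurs, most frequent first."""
    return Counter(i.title for i in incidents).most_common()


def disruption_by_shift(incidents):
    """Sum minutes to resolve per shift as a rough disruption estimate."""
    totals = {}
    for i in incidents:
        totals[i.shift] = totals.get(i.shift, 0) + i.minutes_to_resolve
    return totals
```

Run against a week's worth of incidents at handoff, `split_for_review` tells you what needs the whole team, `repeat_counts` on the minor batch shows which low-urgency problems keep coming back, and comparing `disruption_by_shift` across shifts gives a rough sense of whether things are getting better or worse.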

Post Mortems Make Your Product Better

If the thought of doing a post mortem for every incident is exhausting, then it’s even more important to do. With these tips, it’s an easy way to make your team more efficient at addressing outages big and small. It will also allow your team to build a library of documentation, which will help you with onboarding, training, and understanding how to build a better product in general.
