The Case for Post Morteming Every Incident
A post mortem is a process for investigating an incident to figure out what went wrong and what can be learned from it. We’ve written before about why you should not only run post mortems on major incidents, but publish them as well. But post mortems shouldn’t be reserved for major incidents. As a general rule, we recommend that you follow up on every incident, especially if it woke someone up. Every incident is an opportunity to learn as a team and improve your product. But there’s no reason the follow-up always needs to be a heavyweight process.
Tips for Making it Easy
Here are some tips for making it quick and easy:
- Establish a threshold for what gets a full team post mortem. At PagerDuty, the team looks at all Sev1s, Sev2s and anywhere a process broke down. Everything else is checked by a single person.
- Batch up minor incidents and look at a week’s worth at a time (tip: PagerDuty’s analytics functionality is great for this). The end-of-shift handoff is probably the best time to do it.
- The goal is to prioritize your various resolution efforts, not to assign blame.
- The outcomes can be simple, like the following examples:
  - Adjusting the alerting threshold on that particular monitoring tool. (In my experience, this one is underapplied.)
  - Adding a new filter in PagerDuty via email filters, support hours, or our new Event Enrichment Platform beta.
  - Counting repeat, low-urgency incidents. Most problems aren’t blockers, but you should still track how often they happen so they can be prioritized and addressed when you have the bandwidth.
  - Tweaking the routing of a particular notification.
  - Automatically scheduling a maintenance window, if all else fails. (I personally don’t recommend this solution, but it’s a popular use of our API.)
  - Updating the runbook (and linking it in the service description so responders see it).
- Track some rough estimates of how disruptive each shift’s incidents are for your team. Has it been getting better or worse over the last few shifts? Do your incidents follow a power law (one large incident, many small ones) or are you always putting out medium-sized fires?
- Include all available raw supporting material (logs, chat transcripts, etc.) in your Reason for Outage (RFO) document as appendices.
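Two of the tips above, counting repeat incidents and tracking how disruptive each shift was, are easy to do with a few lines of script. Here is a minimal sketch in Python. The `Incident` record, its field names, and the sample data are all hypothetical; this is not PagerDuty's data model or API, just one way to tally a week's worth of incidents you have exported yourself. The "largest incident share" is a rough heuristic for spotting a one-big-many-small pattern, not a statistical test for a power law.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; fields are illustrative only."""
    title: str
    shift: str                # label for the on-call shift, e.g. "week-1"
    minutes_to_resolve: int


def repeat_incidents(incidents, min_count=2):
    """Return (title, count) pairs seen at least min_count times, most frequent first."""
    counts = Counter(i.title for i in incidents)
    return [(title, n) for title, n in counts.most_common() if n >= min_count]


def disruption_by_shift(incidents):
    """Return total minutes spent resolving incidents, keyed by shift."""
    totals = {}
    for i in incidents:
        totals[i.shift] = totals.get(i.shift, 0) + i.minutes_to_resolve
    return totals


def largest_incident_share(incidents):
    """Return the fraction of total resolution time taken by the longest incident."""
    total = sum(i.minutes_to_resolve for i in incidents)
    if total == 0:
        return 0.0
    return max(i.minutes_to_resolve for i in incidents) / total


# Made-up sample data for illustration.
incidents = [
    Incident("disk usage above 90%", "week-1", 10),
    Incident("disk usage above 90%", "week-1", 15),
    Incident("API latency spike", "week-1", 120),
    Incident("disk usage above 90%", "week-2", 5),
    Incident("cert expiry warning", "week-2", 20),
]

print(repeat_incidents(incidents))       # candidates for a threshold or runbook fix
print(disruption_by_shift(incidents))    # is it getting better or worse?
print(round(largest_incident_share(incidents), 2))
```

A share close to 1 suggests one large incident dominated the period; a low share with a high total points to a steady stream of medium-sized fires.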
Post Mortems Make Your Product Better
If the thought of doing a post mortem for every incident is exhausting, then it’s even more important to do. And with these tips, it’s an easy way to make your team more efficient at addressing outages big and small. It will also allow your team to build a library of documentation, which will help you with onboarding, training, and understanding how to build a better product in general.