March 26, 2019
What do you do after you’ve experienced an incident and performed a post-mortem (or postmortem)? That may seem like a simple question, or even a non-question; after all, it’s easy to think of the post-mortem as the last step in handling an incident.
But it’s not. In many ways, what you do with an incident post-mortem can be as important as the post-mortem itself. Below, I explain why and offer tips on what to do after the post-mortem is complete.
Before we take a closer look at that question, however, we need to look at an even more basic question: What is the function of a post-mortem, and what should it contain?
An incident post-mortem serves the following basic functions:
To support these basic functions, a post-mortem should include a record of the incident, the response, and its resolution. It should also include an analysis of the root cause of the incident, a description of the scope of the incident and its effects, and any appropriate recommendations for resolving the root problem, improving the response process, and/or mitigating the impacts of future incidents.
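As a minimal sketch, the contents listed above could be captured in a structured record so that empty sections are easy to spot before the document is circulated. The class name `PostMortem`, its field names, the `missing_sections` helper, and the sample incident text are all illustrative assumptions, not part of any PagerDuty template or API.

```python
from dataclasses import dataclass, field, fields


@dataclass
class PostMortem:
    """Hypothetical container mirroring the sections described above."""
    incident_record: str = ""      # what happened, in order
    response: str = ""             # what the responders did
    resolution: str = ""           # how the incident was resolved
    root_cause_analysis: str = ""  # analysis of the underlying cause
    scope_and_effects: str = ""    # who and what was affected
    recommendations: list[str] = field(default_factory=list)  # follow-up actions


def missing_sections(pm: PostMortem) -> list[str]:
    """Return the names of sections that are still empty."""
    return [f.name for f in fields(pm) if not getattr(pm, f.name)]


# Invented example data, for illustration only.
draft = PostMortem(
    incident_record="Checkout API returned errors for 40 minutes.",
    response="On-call engineer rolled back the latest deploy.",
    resolution="Rollback restored service.",
)
print(missing_sections(draft))
# ['root_cause_analysis', 'scope_and_effects', 'recommendations']
```

A check like this only confirms that each section has been filled in; whether the analysis is honest and useful still depends on the people writing and reviewing it.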
It is important to note that a post-mortem should not become a vehicle for blame, or for settling scores in corporate or organizational politics. If necessary, set up a separate process (e.g., an informal or moderated discussion within the department) for discussing personnel-related issues, so that blame is channeled away from the post-mortem itself.
The post-mortem should, however, include an honest discussion of any technical or organizational problems which may have contributed to the incident, or which became apparent during the response. The emphasis should be on improvements in the technology or the response process, rather than the deficiencies of individuals or teams, or of their work.
Not all incidents require a post-mortem. Minor operational issues, incidents with a well-understood cause and a simple resolution, and incidents which are easily contained with no downtime or loss of data may not need a post-mortem.
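One way to make this triage explicit is a small rule that flags incidents matching the low-impact cases described above. This is only a sketch: the `IncidentSummary` fields and the `may_skip_postmortem` function are hypothetical names, and the rule deliberately returns "may skip" rather than "must skip", since the final decision remains a judgment call.

```python
from dataclasses import dataclass


@dataclass
class IncidentSummary:
    """Hypothetical triage summary; every field is an assumption."""
    minor_operational_issue: bool
    cause_well_understood: bool
    simple_resolution: bool
    easily_contained: bool
    caused_downtime: bool
    caused_data_loss: bool


def may_skip_postmortem(incident: IncidentSummary) -> bool:
    """True if the incident matches one of the low-impact cases above."""
    well_understood = incident.cause_well_understood and incident.simple_resolution
    contained_without_damage = (
        incident.easily_contained
        and not incident.caused_downtime
        and not incident.caused_data_loss
    )
    return (
        incident.minor_operational_issue
        or well_understood
        or contained_without_damage
    )


# Invented example: an outage with an unknown cause and real downtime.
outage = IncidentSummary(
    minor_operational_issue=False,
    cause_well_understood=False,
    simple_resolution=False,
    easily_contained=False,
    caused_downtime=True,
    caused_data_loss=False,
)
print(may_skip_postmortem(outage))  # False
```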
Here are a few examples of situations for which a post-mortem is necessary:
In order for a post-mortem to be of value, it needs to be read and understood by the people who are responsible for analyzing, resolving, and preventing any of the long-term problems which it describes.
This may mean, for example, that teams or departments with a stake in the problem or its resolution should be required to read the post-mortem and to meet as soon as possible to agree on next steps. The actual process for circulating post-mortems, and for ensuring that they are read and lead to action items, will of course depend on the structure and the managerial philosophy of your organization.
There are three key areas to look at when writing or reading an incident post-mortem: the root cause, the response process, and the damage caused together with any damage control.
A post-mortem should always contain a description of the root cause, even if it is known and trivial. If it is non-trivial, the description should include an analysis of the cause and, if possible, should identify the actual root of the problem precisely and state whether it needs to be fixed. If the specific root cause cannot be precisely identified, any information which may lead to its future identification should be included.
If, for example, during the course of the incident’s resolution, it becomes apparent that the problem originated in a module which contains a large amount of legacy code, it is important to include that fact in the root cause analysis, even if it is not possible at the time of the post-mortem to identify the root cause below the level of the module itself. The mere fact of identifying legacy code in connection with an incident may be of value not only in the resolution of the incident but also in later surveys identifying code which needs to be replaced.
The post-mortem should include a full technical description of the response process. It should also include a description and analysis of the relative success or failure of that process. This should be done without pointing the finger of blame at anyone, but it should clearly indicate any apparent failures or weaknesses in the response process, or in the way that the response was carried out. These can include the division of responsibilities among response team members, communication within the response team or between the response team and other stakeholders across the business, and problems with specific response procedures.
Failures of the response process can range from technical to organizational. They can include such simple things as failing to tell affected departments or users that a system or application was unavailable while the problem was being resolved. If two team members performed the same task without coordinating, or if nobody performed a required task and the resolution was delayed as a result, this should be noted in the post-mortem as an indication of potential problems in team organization or communication.
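If the response is recorded as a simple log of who did what, duplicated and unperformed tasks can be surfaced mechanically when the post-mortem is written. The sketch below assumes a hypothetical log format of `(responder, task)` pairs and a hypothetical list of required tasks; the function name `coordination_gaps` and the sample data are invented for illustration. It can only show that a task was done twice or not at all, not whether the people involved had coordinated.

```python
from collections import defaultdict


def coordination_gaps(required_tasks, task_log):
    """Find tasks done by more than one person and required tasks nobody did.

    required_tasks: iterable of task names the response called for.
    task_log: iterable of (responder, task) pairs recorded during the response.
    """
    performers = defaultdict(set)
    for responder, task in task_log:
        performers[task].add(responder)

    # Tasks that more than one responder worked on.
    duplicated = {
        task: sorted(people)
        for task, people in performers.items()
        if len(people) > 1
    }
    # Required tasks that never appear in the log.
    unperformed = [task for task in required_tasks if task not in performers]
    return duplicated, unperformed


# Invented example data, for illustration only.
required = ["restart database", "notify support", "update status page"]
log = [
    ("alice", "restart database"),
    ("bob", "restart database"),
    ("alice", "notify support"),
]
duplicated, unperformed = coordination_gaps(required, log)
print(duplicated)   # {'restart database': ['alice', 'bob']}
print(unperformed)  # ['update status page']
```

Findings like these belong in the post-mortem as observations about team organization and communication, not as criticism of the individuals named in the log.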
The post-mortem should include a clear and accurate description of the extent of any damage caused by the incident, including loss of data, loss of productivity, and interruptions in user access. It is equally important to include a description and analysis of any actions taken to limit or remedy this damage. Damage control should be considered as a separate process from technical incident resolution. Depending on the type of incident, the type of damage, and the organization’s structure, it may be a customer service responsibility or require action items for other departments in the business.
Damage control actions should be part of the post-mortem, since they may directly or indirectly affect how similar incidents are handled in the future. If, for example, an outage results in the shutdown of an airline flight reservation system, it may be necessary to give priority to putting into place an alternate system for handling reservations during downtime.
The key to getting the most out of post-mortems lies in understanding that they are roadmaps for improvement of your application, your infrastructure, and your response process. Each post-mortem has the potential to improve the way that your system operates and the way that you handle incidents. Rather than treating post-mortems as an embarrassment or an indication of some kind of failure, you should treat this valuable opportunity to reflect as gold.
PagerDuty offers a completely free post-mortem handbook that shares industry best practices and includes a post-mortem template. Use it to help you formalize your own post-mortem process to make it as easy as possible for your team to respond to issues. Even better, post-mortems are part of the PagerDuty platform. Sign up for a free 14-day trial and streamline the entire post-mortem process with automated timeline building, collaborative editing, actionable insights, and more!