What do you do after you’ve experienced an incident and performed a post-mortem (or, postmortem)? That may seem like a simple question, or even a non-question; after all, it’s easy to think of the post-mortem as the last step in handling an incident.
But it’s not. In many ways, what you do with an incident post-mortem can be as important as the post-mortem itself. Below, I explain why and offer tips on what to do after the post-mortem is complete.
Before we take a closer look at that question, however, we need to look at an even more basic question: What is the function of a post-mortem, and what should it contain?
An incident post-mortem serves the following basic functions:
To support these basic functions, a post-mortem should include a record of the incident, the response, and its resolution. It should also include an analysis of the root cause of the incident, a description of the scope of the incident and its effects, and any appropriate recommendations for resolving the root problem, improving the response process, and/or mitigating the impacts of future incidents.
It is important to note that a post-mortem should not become a vehicle for blame, or for settling scores in corporate or organizational politics. If necessary, set up a separate process (e.g., an informal or moderated discussion within the department) for discussing personnel-related issues, as a way of channeling blame away from the post-mortem itself.
The post-mortem should, however, include an honest discussion of any technical or organizational problems which may have contributed to the incident, or which became apparent during the response. The emphasis should be on improvements in the technology or the response process, rather than the deficiencies of individuals or teams, or of their work.
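As a rough illustration, the contents described above could be captured in a simple record type. This is only a sketch: the class name `PostMortem`, its field names, the `missing_sections` helper, and the sample incident text are all hypothetical and are not taken from any PagerDuty template or API. Note that the record has no field for naming individuals, in keeping with the blameless approach.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PostMortem:
    """Hypothetical post-mortem record; all field names are assumptions."""
    incident_summary: str      # record of what happened
    response_summary: str      # record of how the team responded
    resolution_summary: str    # record of how the incident was resolved
    root_cause_analysis: str   # analysis of the root cause
    scope_and_effects: str     # extent of the incident and its impact
    recommendations: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Return the names of required text sections that are still empty."""
        required = [
            "incident_summary",
            "response_summary",
            "resolution_summary",
            "root_cause_analysis",
            "scope_and_effects",
        ]
        return [name for name in required if not getattr(self, name).strip()]


# Made-up example draft with two sections not yet written.
draft = PostMortem(
    incident_summary="Checkout API returned errors for 40 minutes.",
    response_summary="On-call engineer rolled back the latest deploy.",
    resolution_summary="",
    root_cause_analysis="",
    scope_and_effects="Customers could not complete purchases.",
)
print(draft.missing_sections())
# prints ['resolution_summary', 'root_cause_analysis']
```

A check like `missing_sections` is one way to keep an incomplete draft from being circulated before the root cause and resolution have been written up.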
Not all incidents require a post-mortem. Minor operational issues, incidents with a well-understood cause and a simple resolution, and incidents which are easily contained with no downtime or loss of data may not need a post-mortem.
Here are a few examples of situations for which a post-mortem is necessary:
In order for a post-mortem to be of value, it needs to be read and understood by the people who are responsible for analyzing, resolving, and preventing any of the long-term problems which it describes.
This may mean, for example, that teams or departments with a stake in the problem or its resolution should be required to read the post-mortem and engage in a discussion as soon as possible to determine appropriate next steps as a result. The actual process for circulating post-mortems and ensuring that they are read and lead to action items will, of course, depend on the structure and the managerial philosophy of your organization.
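There are three key areas to look at when writing or reading an incident post-mortem: the root cause, the response process, and the damage caused along with the steps taken to control it.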
A post-mortem should always contain a description of the root cause, even if it is known and trivial. If it is non-trivial, the description should include an analysis of the cause, with, if possible, a precise identification of the actual root of the problem and whether the root cause needs to be fixed. If the specific root cause cannot be precisely identified, any information which may lead to its future identification should be included.
If, for example, during the course of the incident’s resolution, it becomes apparent that the problem originated in a module which contains a large amount of legacy code, it is important to include that fact in the root cause analysis, even if it is not possible at the time of the post-mortem to identify the root cause below the level of the module itself. The mere fact of identifying legacy code in connection with an incident may be of value not only in the resolution of the incident but also in later surveys identifying code which needs to be replaced.
The post-mortem should include a full technical description of the response process. It should also include a description and analysis of the relative success or failure of that process. This should be done without pointing the finger of blame at anyone, but it should clearly indicate any apparent failures or weaknesses in the response process, or in the way that the response was carried out. This can include division of responsibilities among response team members, communication within the response team, or between the response team and other stakeholders across the business, and problems with specific response procedures.
Failures of the response process can range from technical to organizational. They can include such simple things as failing to tell affected departments or users that a system or application was unavailable while the problem was being resolved. If two team members performed the same task without coordination between them, or nobody performed a required task, leading to a delay in the resolution, that should be noted in the post-mortem as an indication of potential problems in team organization or communication.
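The coordination problems mentioned above (the same task done by two people, or a required task done by nobody) can be surfaced from a response log with a small script. The sketch below assumes a hypothetical log of `(person, task)` pairs and a hypothetical list of required tasks; the function name `find_coordination_gaps` and the sample data are illustrative only. It cannot tell whether duplicated work was coordinated, so its output is a list of candidates for discussion, not a verdict.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def find_coordination_gaps(
    required_tasks: Iterable[str],
    actions: Iterable[Tuple[str, str]],
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Return (tasks performed by more than one person, tasks nobody performed)."""
    performers: Dict[str, Set[str]] = defaultdict(set)
    for person, task in actions:
        performers[task].add(person)

    # Tasks that more than one responder worked on.
    duplicated = {
        task: sorted(people)
        for task, people in performers.items()
        if len(people) > 1
    }
    # Required tasks that never appear in the log.
    unperformed = [task for task in required_tasks if task not in performers]
    return duplicated, unperformed


# Made-up response log for illustration.
required = ["notify users", "restart service", "check backups"]
log = [
    ("alice", "restart service"),
    ("bob", "restart service"),
    ("alice", "check backups"),
]
duplicated, unperformed = find_coordination_gaps(required, log)
print(duplicated)   # prints {'restart service': ['alice', 'bob']}
print(unperformed)  # prints ['notify users']
```

The output is grouped by task, not by person, so the post-mortem discussion stays focused on the response process and away from individuals.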
The post-mortem should include a clear and accurate description of the extent of any damage caused by the incident, including loss of data, loss of productivity, and interruptions in user access. It is equally important to include a description and analysis of any actions taken to limit or remedy this damage. Damage control should be considered as a separate process from technical incident resolution. Depending on the type of incident, the type of damage, and the organization’s structure, it may be a customer service responsibility or require action items for other departments in the business.
Damage control actions should be part of the post-mortem, since they may directly or indirectly affect how similar incidents are handled in the future. If, for example, an outage results in the shutdown of an airline flight reservation system, it may be necessary to give priority to putting into place an alternate system for handling reservations during downtime.
The key to getting the most out of post-mortems lies in understanding that they are roadmaps for improvement of your application, your infrastructure, and your response process. Each post-mortem has the potential to improve the way that your system operates and the way that you handle incidents. Rather than treating post-mortems as an embarrassment or indication of some kind of failure, you should treat this valuable opportunity to reflect as gold.
PagerDuty offers a completely free post-mortem handbook that shares industry best practices and includes a post-mortem template. Use it to help you formalize your own post-mortem process to make it as easy as possible for your team to respond to issues. Even better, post-mortems are part of the PagerDuty platform: sign up for a free 14-day trial and streamline the entire post-mortem process with automated timeline building, collaborative editing, actionable insights, and more!
© 2009 - 2019