Your team had been fighting this major incident for hours, but your investigation was hitting one dead end after another. Finally, you managed to isolate the problem and your graphs started to improve. When all systems went back to normal, everyone let out a collective sigh of relief, shut down the response call, and went back to bed, never to think of this incident again.
Or so you thought.
There’s actually one more thing your team needed to do before moving on: perform a postmortem. Why? Postmortems are important because they help establish a culture of continuous improvement.
Without a postmortem, you and your team miss out on the opportunity to learn what you’re doing right, where you could improve, and most importantly, how to avoid making the same mistakes again and again. A well-designed, blameless postmortem will help your team improve their infrastructure and incident response process.
We’re excited to announce we’ve launched a comprehensive guide on how to conduct effective postmortems. No other resource (that we’ve found) covers the nuances of culture change, the details of how to perform an in-depth analysis, and the unique skills required to facilitate a calm and engaging conversation about failure. We explain why these concepts are important, describe the challenges associated with implementing them, and offer actionable instructions for conducting blameless postmortems.
If you are not yet doing postmortems, this guide will equip you with the knowledge and strategies needed to introduce a new process to your organization. If you already have some experience with postmortems, you will learn how to combat the natural tendency to blame, how to pursue new lines of inquiry for deeper incident analysis, how to make better use of the postmortem meeting, and other ways to improve your existing process.
During incident response, the team is 100 percent focused on restoring service. They cannot, and should not, be wasting time and mental energy on thinking about how to do something optimally or performing a deep dive on what caused the incident. That’s why postmortems are essential—they provide a peacetime opportunity to reflect once the issue is no longer impacting users. The postmortem process drives focus, instills a culture of learning, and identifies opportunities for improvement that otherwise would be lost.
The incident postmortem goes by many names.
At its core, the postmortem is a document that describes in detail the situational factors that led to the incident, the steps taken to respond to it, and the planned work expected to prevent it from happening again. The postmortem process also includes a meeting to discuss the outcome of the analysis, after which the team shares what it learned with the broader organization and your customers.
After resolving a major incident, you and your team should start thinking about the postmortem while the incident is still fresh in your minds. At PagerDuty, we complete postmortems within five days of every major incident. Just as resolving the incident becomes top priority when it occurs, completing the postmortem is prioritized over planned work. Postponing the postmortem delays key learnings that can prevent the incident from recurring.
As IT professionals, we understand that failure happens in complex systems—it’s unavoidable. And how we respond to failure when it occurs matters. An impulse to blame and punish individuals for causing incidents has the unintended effect of disincentivizing the knowledge-sharing required to prevent future incidents. Engineers will hesitate to speak up when incidents occur for fear of being blamed. This silence exacerbates the impact of incidents by increasing overall mean time to acknowledge and mean time to resolve.
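To make those two metrics concrete, here is a minimal sketch of how a team might compute mean time to acknowledge and mean time to resolve from its own incident timestamps. The incident data, the `mean_minutes` helper, and the choice to measure both metrics from the trigger time are assumptions for illustration only; they are not taken from the guide or from any PagerDuty API.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(pairs):
    """Average gap, in minutes, across (start, end) datetime pairs."""
    return mean((end - start).total_seconds() / 60 for start, end in pairs)


# Hypothetical incidents: (triggered, acknowledged, resolved)
incidents = [
    (datetime(2019, 2, 1, 2, 0), datetime(2019, 2, 1, 2, 4), datetime(2019, 2, 1, 3, 0)),
    (datetime(2019, 2, 9, 14, 0), datetime(2019, 2, 9, 14, 10), datetime(2019, 2, 9, 14, 40)),
]

# Assumption: both metrics are measured from the moment the incident triggered.
mtta = mean_minutes((triggered, acked) for triggered, acked, _ in incidents)
mttr = mean_minutes((triggered, resolved) for triggered, _, resolved in incidents)

print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
```

Tracking these numbers over time is one way to check whether engineers are speaking up sooner as a blameless culture takes hold.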
For the postmortem process to result in system improvements and learning, we must treat human error as a symptom of a systemic problem, not the cause itself. In complex systems of software development, a variety of conditions interact to lead to failure. The goal of the postmortem is to understand what systemic factors led to the incident and identify actions that can prevent this kind of failure from recurring.
A blameless postmortem stays focused on how a mistake was made instead of who made the mistake. Many leading organizations, including Etsy (a pioneer of blameless postmortems), rely on this approach to give postmortems the right tone. By eliminating the fear of punishment, it empowers engineers to give truly objective accounts of what happened.
It’s easy to agree that we want a culture of continuous improvement, but it’s difficult to practice the blamelessness required for learning. The inherently surprising nature of failure naturally leads humans to react in ways that interfere with our understanding of it. When processing information, the human mind unconsciously takes shortcuts to optimize for timeliness over accuracy, sometimes resulting in incorrect conclusions. In our guide, we detail many cognitive biases that interfere with postmortem analysis and strategies to overcome them.
The next time you encounter a major incident, remember your response is not done until the postmortem is done. Though major incident response is sometimes painful, it’s also an incredible opportunity to learn and make lasting improvements to your systems and processes.
Take a look at our new guide to read more about the steps involved in the postmortem process. We’d also love to hear your techniques for practicing blameless postmortems in our Community forums!