Turn any signal into insight and action. See how PagerDuty Digital Operations Management Platform integrates machine data and human intelligence to improve visibility and agility across organizations.
Connect insights to real-time action by aligning teams through the shared language of business impact.
Check out the latest products we’ve been working on—including event intelligence, machine learning, response automation, on-call, analytics, operations health management, integrations, and more.
Digital Operations Management arms organizations with the insights needed to turn data into opportunity across every operational use case, from DevOps, ITOps, Security, Support, and beyond.
Over 300 Integrations
Discover DevOps best practices with our library of webinars, whitepapers, reports, and much more.
Learn best practices and get support help with resources from our award-winning support team.
See how PagerDuty works with our live product demo — twice a week, every week.
Join live and on-demand webinars for product deep dives, industry trends, configuration training, and use case-specific best practices.
Interactive, simple-to-use API and technical documentation enables users to easily try updates and extend PagerDuty.
Engage with users and PagerDuty experts from our global community of 200k+ users. Become a member, connect, and share insights for success.
Get all your PagerDuty-related questions answered by exploring our in-depth support documentation and community forums.
PagerDuty is thrilled to be named a leader in G2Crowd’s Fall 2018 Grid Report for Incident Management. The ranking is based on high customer satisfaction...
PagerDuty helps organizations transform their digital operations. Learn more about PagerDuty's mission and what we do.
Meet our experienced and passionate executive team.
We are risk-taking innovators dedicated to delivering amazing products and delighting customers. Join us and do the best work of your career.
With the PagerDuty Foundation, we are committed to doing our part in giving back to the community.
Failure is inevitable in complex systems. While it’s tempting to find a single person to blame, according to Sidney Dekker, these failures are usually the results of broader design issues in our systems. The good news is that we can design systems to reduce the risk of human errors, but in order to do that, we need to look at the many factors that contribute to failure – both systemic and human. Blameless post mortems, where the goal isn’t to figure out who made a mistake but how the mistake was made, are a tool that can help. While running one is not an easy task, the effort is well worth it. Here, two engineering managers describe some of the challenges and share how they make blameless postmortems successful.
Start with the right mindset
The attitude you take to the discussion is critical and sets the tone for the entire conversation. “You ignore the ‘this person did that’ part,” explains PagerDuty Engineering Manager Arup Chakrabarti. “What matters most is the customer impact, and that’s what you focus on.”
Mike Panchenko, CTO at Opsmatic, says that the approach is based on the assumption that no one wants to make a mistake. “Everyone has to assume that everyone else comes to work to do a good job,” he says. “If someone’s done something bad, it’s not about their character or commitment, it’s just that computers are hard and often you just break stuff.”
Don’t fear failure
Because it’s going to happen. “One thing I always tell my team is that if they’re not screwing up now and then, they’re probably not moving fast enough,” says Chakrabarti. “What’s important is, you learn from your mistakes as quickly as possible, fix it quickly, and keep moving forward.”
Nip blaming in the bud
There are no shortcuts here. “You have to be very open about saying, ‘Hey, I will not tolerate person A blaming person B,” says Chakrabarti. “You have to call it out immediately, which is uncomfortable. But you have to do it, or else it gives whoever’s doing it a free pass.”
Panchenko agrees: “I’m a pretty direct guy, so when I see that going on, I immediately say ‘stop doing that.'”
That goes for inviting blame, too
“There’s a natural tendency of people to take blame,” says Panchenko. “But a lot of times, there’s the ‘last straw’ that breaks the system.” He describes a recent outage where a bunch of nodes were restarted due to a bug in an automation library. That bug was triggered by the re-appearance of a long-deprecated Chef recipe in the run list. The recipe, in turn, was added back to the runlist due to a misunderstanding about the purpose of a role file left around after a different migration/deprecation. The whole thing took over a month to develop. “Whoever was the next person to run that command was going to land on that mine,” he says, “and usually the person who makes the fatal keystroke expects to be blamed. Getting people to relax and accept the fact that the purpose of the post mortem isn’t to figure out who’s going to get fired for the outage is the biggest challenge for me.”
Handle ongoing performance issues later
It’s natural to be apprehensive about sharing things that didn’t go well when your job performance or credibility may be on the line. The trick is separating ongoing performance issues from “failures” that happen because of shortcomings in your processes or designs.
Panchenko pays attention to the kind of mistake that was made. “Once you see a failure of a certain kind, you should be adding monitoring or safeguards,” he says. “If you’re doing that, the main way someone’s going to be a bad apple is if they’re not following the process. So that’s what I look for: do we have a process in place to avoid the errors, and are the errors happening because the process is being circumvented, or does the process need to be improved?”
And sometimes, yes, you do need to fire people. “I have had scenarios where a single individual keeps making the same mistake, and you have to coach them and give them the opportunity to fix it,” says Chakrabarti. “But after enough time, you have to take that level of action.”
Get executive buy-in
Both Arup and Mike agree that successful blameless postmortems won’t work without backing from upper-level management. “You have to get top-down support for it,” says Chakrabarti, “and the reason I say that is that blameless postmortems require more work. It’s very easy to walk into a room and say ‘Dave did it, let’s just fire him and we’ve fixed the problem.'” Instead, though, you’re telling the executives that not only did someone on your team cause an expensive outage, but they’re going to be involved in fixing it too. “Almost any executive is going to be very concerned about that,” he says.
“The one thing that’s definitely true is that the tone has to be set at the top,” says Panchenko. “And the tone has to be larger than just postmortems.”
Have you led or participated in blameless post mortems? We’d love to hear more about your experiences – leave us comments below!
I recently had the privilege of spending a full day with a small group of our customers. The attendees were leaders in their development and...
Incident management is a key facet of supporting applications. When working on an application, we spend the vast majority of time on its release to...
600 Townsend St., #200
San Francisco, CA 94103
905 King Street West, Suite 600
Toronto, ON, M6K 3G9, Canada
1416 NW 46th St., St. 301
Seattle, WA 98107
5 Martin Place
1 Fore St,
London EC2Y 9DT
© 2009 - 2018