When something goes wrong, getting to the ‘what’ without worrying about the ‘who’ is critical for understanding failures. Two engineering managers share their strategies for running blameless postmortems.
Failure is inevitable in complex systems. While it’s tempting to find a single person to blame, according to Sidney Dekker, these failures are usually the result of broader design issues in our systems. The good news is that we can design systems to reduce the risk of human error, but to do that, we need to look at the many factors that contribute to failure – both systemic and human. Blameless postmortems, where the goal isn’t to figure out who made a mistake but how the mistake was made, are a tool that can help. Running one is not easy, but the effort is well worth it. Here, two engineering managers describe some of the challenges and share how they make blameless postmortems successful.
Start with the right mindset
The attitude you take to the discussion is critical and sets the tone for the entire conversation. “You ignore the ‘this person did that’ part,” explains PagerDuty Engineering Manager Arup Chakrabarti. “What matters most is the customer impact, and that’s what you focus on.”
Mike Panchenko, CTO at Opsmatic, says that the approach is based on the assumption that no one wants to make a mistake. “Everyone has to assume that everyone else comes to work to do a good job,” he says. “If someone’s done something bad, it’s not about their character or commitment, it’s just that computers are hard and often you just break stuff.”
Don’t fear failure
Because it’s going to happen. “One thing I always tell my team is that if they’re not screwing up now and then, they’re probably not moving fast enough,” says Chakrabarti. “What’s important is, you learn from your mistakes as quickly as possible, fix it quickly, and keep moving forward.”
Nip blaming in the bud
There are no shortcuts here. “You have to be very open about saying, ‘Hey, I will not tolerate person A blaming person B,’” says Chakrabarti. “You have to call it out immediately, which is uncomfortable. But you have to do it, or else it gives whoever’s doing it a free pass.”
Panchenko agrees: “I’m a pretty direct guy, so when I see that going on, I immediately say ‘stop doing that.’”
That goes for inviting blame, too
“There’s a natural tendency of people to take blame,” says Panchenko. “But a lot of times, there’s the ‘last straw’ that breaks the system.” He describes a recent outage where a bunch of nodes were restarted due to a bug in an automation library. That bug was triggered by the reappearance of a long-deprecated Chef recipe in the run list. The recipe, in turn, had been added back to the run list because of a misunderstanding about the purpose of a role file left behind after a different migration/deprecation. The whole thing took over a month to develop. “Whoever was the next person to run that command was going to land on that mine,” he says, “and usually the person who makes the fatal keystroke expects to be blamed. Getting people to relax and accept the fact that the purpose of the postmortem isn’t to figure out who’s going to get fired for the outage is the biggest challenge for me.”
Handle ongoing performance issues later
It’s natural to be apprehensive about sharing things that didn’t go well when your job performance or credibility may be on the line. The trick is separating ongoing performance issues from “failures” that happen because of shortcomings in your processes or designs.
Panchenko pays attention to the kind of mistake that was made. “Once you see a failure of a certain kind, you should be adding monitoring or safeguards,” he says. “If you’re doing that, the main way someone’s going to be a bad apple is if they’re not following the process. So that’s what I look for: do we have a process in place to avoid the errors, and are the errors happening because the process is being circumvented, or does the process need to be improved?”
And sometimes, yes, you do need to fire people. “I have had scenarios where a single individual keeps making the same mistake, and you have to coach them and give them the opportunity to fix it,” says Chakrabarti. “But after enough time, you have to take that level of action.”
Get executive buy-in
Chakrabarti and Panchenko agree that blameless postmortems won’t succeed without backing from upper-level management. “You have to get top-down support for it,” says Chakrabarti, “and the reason I say that is that blameless postmortems require more work. It’s very easy to walk into a room and say ‘Dave did it, let’s just fire him and we’ve fixed the problem.’” Instead, though, you’re telling the executives that not only did someone on your team cause an expensive outage, but they’re going to be involved in fixing it too. “Almost any executive is going to be very concerned about that,” he says.
“The one thing that’s definitely true is that the tone has to be set at the top,” says Panchenko. “And the tone has to be larger than just postmortems.”
Have you led or participated in blameless postmortems? We’d love to hear more about your experiences – leave us comments below!