Postmortems and More With J. Paul Reed
PagerDuty sat down with J. Paul Reed, a Senior Applied Resilience Engineer at Netflix, for an Ask Me Anything (AMA) to discuss best practices around postmortems.
Reed is a prominent speaker and advocate of DevOps and operations complexity, and has over 15 years of experience in release engineering. His background in tech, along with his previous work at companies like Mozilla and VMware, gives him a unique perspective on the inner workings of innovative organizations.
With questions and prompts submitted by the PagerDuty community, Reed covered topics ranging from blameless postmortems and the impact they have on SLAs, to the importance of follow-through and the advantages of maintenance over replacement. In this blog post, we will take a closer look at Reed’s answers on postmortem best practices and the steps you can take to conduct one successfully.
Keeping a Postmortem Blameless
A blameless postmortem focuses on how an incident was triggered instead of who caused it. A truly blameless postmortem enables team members to be honest about what happened without fear of being punished. When employees are given a safe space to honestly discuss an incident, they will also feel more comfortable brainstorming ways to resolve it and to prevent a similar incident from happening in the future.
But it’s not enough to just be blameless—it’s also important to be blame-aware. Being blame-aware means that we are aware of our biases and how they may impact our ability to view an incident impartially.
According to Reed, many people tend to think linearly, where one thing must be the direct cause of something else, and that is a direct cause of something else, and so on. That mode of thinking can be detrimental because, when it comes to complex programs and integrations, that isn’t always the case.
Biases tend to reinforce this linear thinking without considering the circumstances. But luckily, managers and higher-ups are always there to help, right? Well, yes, but they have unconscious biases of their own and, therefore, may default to linear thinking without knowing it.
Managers have the tricky task of remaining blame-aware as well as ensuring the postmortem environment remains blameless. They also need to correct an employee if they are acting on their biases and turn that incident into a teachable moment. This can be difficult, and the best way to handle moments like these—especially when working towards fostering a blameless environment—is to engender trust within a team so that they feel comfortable discussing incidents and personal mistakes.
A large amount of trust among teams fosters a sense of comfort and honesty that provides everyone a safe environment where they can fail and learn from those failures. Over time, it will encourage efficiency in workflows and reduce stress across projects throughout an organization.
Key Takeaway: To sustain a blameless, blame-aware environment, have the entire team work on building trust and practice recognizing their own biases.
Improvement and Teamwork
One of the main purposes of conducting a postmortem is to continuously improve existing processes and make them more efficient. This is especially important today, when many larger organizations run on a HybridOps model and want to both use and improve what they already have, in addition to implementing revamped run-state features.
Many teams want to take a “rip and replace” approach to systems because it seems easier. But while maintaining an existing system and implementing a new one at the same time may seem like a lot of busy work, it also has the added benefit of improving and enhancing a feature. Reed says that by continuing to operate within a certain system, employees build “tribal knowledge” surrounding it. Thus, when they run into incidents on call, they are better equipped to manage them, which results in faster solutions.
Key Takeaway: Postmortems are built around improvement and teamwork. If a company is constantly replacing its systems instead of maintaining them, it becomes difficult to build a common knowledge base surrounding the current run-state features.
Following Through on Follow-Ups
Follow-up tasks should be assigned during a postmortem to ensure that improvements are made after a postmortem takes place.
To achieve this, Reed recommends that, at the end of a postmortem, each person writes on a sticky note the top three follow-up tasks that they think are the most important. Once completed, the notes are compiled and the team votes to rank them based on importance and what’s likely to get done.
The team then takes the top five and focuses on completing only those follow-up actions. Once six weeks have passed, the team meets again to review which tasks were completed and when.
Key Takeaway: Completing all follow-up items after a postmortem can feel good and leave everyone with a sense of accomplishment, but it is not always possible or realistic to tackle all the tasks on everyone’s wishlist. Instead, it’s better to set small, realistic goals for the team and narrow down the postmortem follow-up actions.
Benefits of a Timely Postmortem
According to Reed, conducting a postmortem more than 72 hours after an incident makes the postmortem null and void. Cognitive biases tend to take hold after that amount of time, making it difficult to run a blameless postmortem and leading to mediocre data. Hindsight and recency bias make it especially difficult to conduct a successful postmortem after long stretches of time, as these biases cause you to forget what you were thinking at the moment the incident happened. Additionally, memory tends to fade with time, so too long a gap between an incident and its postmortem can lead to results based on unreliable narration.
Key Takeaway: Conduct a postmortem as soon as possible, ideally within 72 hours of an incident.
Blameless postmortems foster a culture of knowledge, understanding, and productivity. As Reed points out, postmortems are more than just meetings to discuss what went wrong; they are indicative of the environment in which a business operates.
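A team that wants to track the 72-hour window could compute the deadline when the incident is logged. The sketch below is a minimal illustration with hypothetical function names. It assumes the clock starts at the incident timestamp, which the AMA does not specify.

```python
from datetime import datetime, timedelta

# The 72-hour window Reed describes.
POSTMORTEM_WINDOW = timedelta(hours=72)


def postmortem_deadline(incident_at):
    """Latest time a postmortem can be held and still count as timely."""
    return incident_at + POSTMORTEM_WINDOW


def is_timely(incident_at, postmortem_at):
    """True if the postmortem falls within 72 hours of the incident."""
    return postmortem_at <= postmortem_deadline(incident_at)


incident = datetime(2024, 3, 1, 9, 0)
deadline = postmortem_deadline(incident)
```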