In a recent blog post, Managing a Tier Zero Service Doesn’t Have to Be Scary, PagerDuty’s SVP of Product Development Tim Armandpour discussed several important best practices that minimize chaos during incident resolution. According to Tim, in today’s always-on world, guaranteeing reliability by adopting better incident response processes is more important than ever before. It’s critical for teams to be able to answer questions such as, “How do I notify the right people when my system is down?”, “How do I drive down resolution times?”, “How do I surface the right data?” and “How can we collectively improve?”
He shared PagerDuty’s story of transformation, and how our engineering team began injecting failure into our own environment with Failure Fridays to improve system resiliency, get better at proactively detecting issues, and gain essential practice in efficiently acting on and resolving issues. He also outlined the two main goals of Failure Friday: 1) to understand common failure scenarios and establish best practices for when things go wrong, and 2) to foster collaboration by bringing disparate parts of our organization together to problem-solve, especially in the line of fire, using a controlled, intentional approach.
The post highlights key learnings from having introduced Failure Fridays, including:
- The team constantly varies its failure scenarios, testing different approaches to expose potential vulnerabilities. Teams responsible for the targeted services don’t know ahead of time when or how they’ll be attacked (just as in real life), so everyone must be prepared to shift into coordinated response mode at any given moment.
- The team conducts failure scenario testing not in a testing or pre-prod environment, but in the live production environment. While the failure testing is always constructed so that customers aren’t affected, practicing intentionally in production is key to truly becoming experts in real-life incident response. According to Tim, because reliability is such an important promise to our customers, “we practice as if our jobs depended on it.”
- When you do identify a vulnerability during live failure testing, don’t let it induce panic. Rather, these “gotchas” are an important opportunity to practice staying unfazed in the midst of fires, as well as to implement a fix that further improves the resilience of your infrastructure.
- At the end of each incident response, it’s essential to conduct a post-mortem so the team can learn and improve together. Post-mortems must be blameless and focused on actionable next steps for improvement.
Check out the entire post to learn more tried-and-true insights for practicing and getting better at incident response, so your team is prepared when the next inevitable failure strikes.
Read the entire post »