It’s already a couple of months into the new year, but a lot of us are still likely thinking about what we can improve on...
by Tiffany Chang
March 5, 2019
In a recent blog post, “Managing a Tier Zero Service Doesn’t Have to Be Scary,” PagerDuty’s SVP of Product Development Tim Armandpour discussed several best practices that minimize chaos during incident resolution. According to Tim, in today’s always-on world, adopting better incident response processes to guarantee reliability is more important than ever. It’s critical for teams to be able to answer questions such as “How do I notify the right people when my system’s down?”, “How do I drive down resolution times?”, and “How do I surface the right data, and how can we collectively improve?”
He shared PagerDuty’s story of transformation: how our engineering team began injecting failure into our own environment with Failure Fridays to improve system resiliency, get better at proactively detecting issues, and gain essential practice in efficiently acting on and resolving them. He also outlined the two main goals of Failure Friday: 1) to understand common failure scenarios and establish best practices for when things go wrong, and 2) to foster collaboration by bringing disparate parts of our organization together to solve problems, especially in the line of fire, using a controlled, intentional approach.
The post highlights key learnings from introducing Failure Fridays.
Check out the entire post to learn more tried-and-true insights for practicing and getting better at incident response, so your team is prepared when the next inevitable failure strikes.