by Dave Bresci
December 5, 2018
Recently, I was putting together training material for our upcoming track on “Owning Incident Response” at PagerDuty University, and I listened to the recordings of incident calls across many years of PagerDuty history. Several hours of hearing my coworkers at 2x speed prompted two observations: first, I should go find my copy of Christmas with the Chipmunks; and second, the evolution of our incident processes took time, effort, and focus. Any company, regardless of the size of their teams and infrastructure, can have a great incident response process, but it doesn’t happen by accident, and it doesn’t happen overnight.
Many years ago, PagerDuty internally used a simple but painful process: “page everyone with a generic alarm, and have everyone join a phone bridge.” This resulted in plenty of chaos, even with seasoned Ops people. Tasks were done without coordination, the scope of the customer impact was often unclear, and so on.
One of the very first things we chose to improve was the language used on the call when someone was providing information or making a request of someone else. By taking the time to build a shared vocabulary, with phrases like “Is there any strong objection?”, we began to shorten our incident responses and reduce the time customers were impacted.
The next large improvement came when we started using Incident Command System-styled roles. We agreed ahead of time on who would take care of the problem (subject matter experts) and who would handle the process of managing the incident itself (incident commanders and related roles).
This also let us scope down the initial response to only the engineers who needed to be on the call. Gone are the days of confusion and of people joining the bridge asking “what’s wrong?”. Along the way, we’ve come up with our own workarounds for anti-patterns in incident response, such as removing disruptive and non-contributing folks from the call, even if they’re the CEO.
So much of “Operations” or “Site Reliability” information is spread via tribal knowledge or oral storytelling. Getting to the point of having a well-prepared, comprehensive, and humane incident response process shouldn’t have to be that hard. Companies shouldn’t have to figure out each part of a great incident response process on their own, but to improve overall, everyone does have to make it an area of focus.