by George Miranda
June 20, 2019
The PagerDuty Incident Response Process is a detailed document that provides a framework for how to structure your incident response process. But sometimes it helps to understand how these seemingly abstract concepts play out during real-world scenarios. You can now hear an incident call recording that’s based on a real PagerDuty incident.
Due to the nature of incident response practices, the process guide we publish is filled with very explicit details regarding a variety of situations. That level of detail is wonderful when you find yourself picking apart the nuances of complicated scenarios. But for someone who has never participated in a technical outage, that depth may seem a bit overwhelming without some basic context to center around. What even is an incident call? What does it sound like and how do people interact?
Regardless of your level of experience with real-time incident response, having shared context to center around can be helpful for developing your own response processes within your teams. To help with that, we’ve decided to share an example of what our incident response calls are like at PagerDuty.
The recorded call is a reenactment of an actual major incident that occurred at PagerDuty in January 2017. Some names and identifying details have been changed in the interest of privacy, but the incident remains otherwise largely unredacted.
In the interest of brevity, some details have been altered or omitted from the reenactment. Although this incident went on for approximately 79 minutes, the call audio has been compressed to just over 26 minutes. When watching the video, you should take note of timestamps for transcribed commentary. The timestamps will give you an idea of how much time elapsed between developments. It’s not unusual for there to be silence during an incident call while responders work to resolve an issue.
For those newer to the incident response process, a few slides explaining what various responder roles do throughout the call have been added for additional context. This recording is meant to supplement, not replace, the Incident Response guide. Before practicing the skills demonstrated in this video or changing anything about your own existing processes, be sure to review the Incident Response guide in its entirety, as it provides a critical level of detail that’s not contained in this recording.
This incident was chosen for reenactment because of its complexity and the many different stages of an incident that were demonstrated. It required cross-functional collaboration from a number of different teams, had a problem that was difficult to diagnose, and contained common examples of necessary actions like paging responders who weren’t on call. Slight alterations were made to highlight some of these actions as they occurred.
PagerDuty’s Incident Response Training provides in-depth coverage about what the role of an Incident Commander (IC) entails, as well as a lot of guidance around how to manage an incident. Listen to how the IC creates space for responders to resolve the incident: the IC keeps the incident moving along, gains consensus before taking action, and adjusts course based on feedback.
The role of the Scribe is most clearly illustrated by the accompanying text in the video. A Scribe is not a stenographer. The role isn’t responsible for transcribing every single thing said during the call; rather, the Scribe’s job is to note important events that might be useful in the context of a postmortem. Watch how the Scribe captures relevant details that will be useful later.
The Deputy’s role is to help the Incident Commander stay focused on the incident by taking on any tasks that might create a distraction. In this incident, our experienced IC delegated tasks to the Deputy and also kept track of time for timeboxed tasks. However, it would not be unusual for a Deputy to offer to take some tasks off the IC’s plate or to act as a timekeeper.
The Communications Liaison provides both external and internal stakeholder updates. In the interest of brevity, the recorded incident focuses on how external customer communications are generated. In practice at PagerDuty, the Communications Liaison generates internal stakeholder notifications from within our product automatically. If your own incident response system doesn’t allow for that, the Communications Liaison would manage the process similarly to how external notifications are generated.
The incident that is the basis of this reenacted recording occurred on January 6, 2017. The impact resulted in zero notifications being delivered outside of our service level agreement (SLA). Customers were affected in three ways:
The incident postmortem is available on the PagerDuty Status Page. You will notice in the postmortem that the entire incident lasted about 80 minutes. If you examine timestamps in the recording video, you’ll see that the elapsed call time is only about 50 minutes. This is because the incident was detected and managed as a minor incident for approximately 30 minutes before it was escalated to a major incident, thereby requiring a larger coordinated response.
Most incidents simply don’t present an opportunity to demonstrate every single facet of the incident response system. Incidents are unpredictable and the response process is meant to equip you with the real-time tools you will need to help resolve an incident effectively. Rather than staging a work of fiction, we decided it was best to share an actual incident with as much transparency as possible.
This incident recording is not meant as a definitive guide, and it covers only some of the considerations you might face when dealing with a real incident. However, when used in tandem with our Incident Response Guide, it demonstrates how those possibly abstract principles play out in real-world scenarios. Refer to the guide for additional details, and refer to the recording to hear how the principles in the guide are applied.
As always, if you have questions about any of this and would like to discuss further, please reach out to us on the PagerDuty Community Forum. We’d love to hear from you!