Keeping PagerDuty Always On With Remote Incident Response
Earlier this month, many areas of the internet experienced a major incident caused by a router misconfiguration at a widely used service provider. This led to cascading service failures, causing widespread outages and disruptions for several well-known SaaS organizations.
When the outage occurred, our teams at PagerDuty immediately noticed a global spike in events and incidents. While it’s not unusual to see an increase in alerts or incidents within a couple of organizations, in this instance we saw a slew of customer events originating from multiple regions. This was cause for concern.
When we see an unusual increase in incident volume, we proactively spin up a Major Incident Response as a precaution so that we have all hands on deck to combat the issue. To notify our responders in a timely manner, we use the PagerDuty mobile app to instantly contact the necessary stakeholders, wherever they may be.
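To make the idea of an "unusual increase across multiple regions" concrete, here is a minimal sketch of one way such a check could work. It is not PagerDuty's actual detection logic: the function names, the three-standard-deviation threshold, and the two-region minimum are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def regions_spiking(history, current, threshold=3.0):
    """Return the regions whose current event count is far above baseline.

    history: dict of region -> list of recent per-interval event counts
    current: dict of region -> event count for the latest interval
    threshold: hypothetical multiplier on the baseline spread
    """
    spiking = []
    for region, counts in history.items():
        baseline = mean(counts)
        # stdev needs at least two samples; fall back to 0 otherwise.
        spread = stdev(counts) if len(counts) > 1 else 0.0
        # Floor the spread at 1 so a perfectly flat baseline is not oversensitive.
        limit = baseline + threshold * max(spread, 1.0)
        if current.get(region, 0) > limit:
            spiking.append(region)
    return sorted(spiking)


def should_open_major_incident(history, current, min_regions=2):
    """Open a precautionary response only when several regions spike at once."""
    return len(regions_spiking(history, current)) >= min_regions


history = {
    "us": [100, 110, 90, 105],
    "eu": [50, 55, 45, 52],
    "ap": [20, 22, 18, 21],
}
current = {"us": 400, "eu": 300, "ap": 21}
print(regions_spiking(history, current))
print(should_open_major_incident(history, current))
```

Requiring more than one region to spike mirrors the distinction above: a jump within a couple of organizations is routine, while a simultaneous jump across regions is the signal worth assembling responders for.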
Because this incident happened while we were all working remotely, we took to Slack and Zoom to coordinate the response. Using PagerDuty’s Slack integration, a fully remote team of incident commanders, subject-matter experts, stakeholders, and scribes in San Francisco, Toronto, and Atlanta was orchestrating a collaborative major incident response in less than three minutes.
Our incident commanders coordinated the response while customer support managed internal and external updates, subject-matter experts discussed the necessary steps to be taken, and scribes documented response progress and communication.
Fortunately for us, we quickly determined that our systems could handle the abrupt increase in incident traffic, and we spun down the call.
The Importance of Remote Incident Response
A major incident like this one, handled in a fully remote working environment, highlights how critical it is for a team to rapidly acknowledge, react, and respond to an incident regardless of location. At PagerDuty, a culture of distributed work and response has been ingrained in our processes since day one. In fact, if you take a look at our incident response documentation, you won’t find a single protocol that requires responders to be physically together during a response. With the PagerDuty platform, you can respond to and work incidents immediately, wherever you are.
We also rely on collaboration tools like Slack and Zoom to communicate in real time during an incident. In this particular instance, PagerDuty’s Slack integration became our central hub for incident status and stakeholder updates. Within Slack, our team members were able to notify key stakeholders, assign roles, and truly work the incident in a centralized, virtual location.
Slack also benefits our response process after an incident is resolved, because it supports our postmortem process. The scribe uses the Slack integration to document and log everything that happened during the response, for example, “incident commander approved wording for external status updates.” This is useful because everyone can see what happened: who responded, who didn’t, why things were escalated the way they were, and so on. This gives us a full picture of an incident and allows us to improve our processes so we can respond and resolve even faster when future incidents inevitably occur.
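As a rough illustration of the kind of record a scribe builds up, the sketch below models a response timeline as a list of timestamped notes. It does not use the PagerDuty or Slack APIs; `TimelineEntry`, `IncidentTimeline`, and their fields are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TimelineEntry:
    at: datetime   # when the event happened
    author: str    # who logged it (for example, the scribe)
    note: str      # what happened


@dataclass
class IncidentTimeline:
    entries: list = field(default_factory=list)

    def log(self, author, note, at=None):
        """Record a note, defaulting to the current UTC time."""
        entry = TimelineEntry(at or datetime.now(timezone.utc), author, note)
        self.entries.append(entry)
        return entry

    def render(self):
        """Return the timeline in chronological order, one line per entry."""
        ordered = sorted(self.entries, key=lambda e: e.at)
        return "\n".join(
            f"{e.at:%H:%M:%S} [{e.author}] {e.note}" for e in ordered
        )


timeline = IncidentTimeline()
timeline.log(
    "scribe",
    "incident commander approved wording for external status updates",
    at=datetime(2020, 1, 1, 10, 5, 0, tzinfo=timezone.utc),
)
timeline.log(
    "scribe",
    "major incident call opened",
    at=datetime(2020, 1, 1, 10, 2, 0, tzinfo=timezone.utc),
)
print(timeline.render())
```

The timestamps in the usage example are placeholders. Sorting at render time means notes entered out of order still produce a chronological record for the postmortem.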
Our culture of distributed engineering is what allows us to ensure PagerDuty is always on for our customers, no matter what. By using PagerDuty as the single source of truth alongside collaboration tools and well-defined practices, we’re able to respond effectively to incidents from virtually anywhere. You might think that going from in-office orchestration to a virtual response would be challenging, but with PagerDuty, it really is, for the most part, business as usual.