How We Use PagerDuty for Emergency Response

by Ryan Hoskin March 17, 2020 | 4 min read

PagerDuty is known as the platform for driving real-time work, and with the current global spread of COVID-19, many of our customers have been asking how we leverage PagerDuty internally to intelligently coordinate a response to emergency situations (such as this) as they arise.

PagerDuty customers primarily leverage our platform for coordinating an incident response process when technical issues happen, such as a bad deployment, network degradation or failed hardware. Many of our customers have also realized that this incident response process can be applied to other business-impacting incidents, and have begun to use PagerDuty for other use cases, such as working with high-profile customer support tickets, security issues, or even emergency situations like what we’re encountering with the COVID-19 outbreak.

As part of our crisis communication plan, we identified two scenarios where PagerDuty helps us respond to major incidents at PagerDuty more quickly:

  • Notify our Crisis Communications team that there is potentially an urgent issue that requires their attention.
  • Update our employees on the status of a major incident.

Scenario 1: Notifying Our Crisis Communications Team of an Urgent Issue

It’s critical that our Crisis Communications team is made aware of business-impacting incidents so that we can minimize internal disruption and keep external stakeholders informed. Our Crisis Communications team is made up of folks from our People Operations, Executive, Legal, Marketing, and Facilities teams.

Within PagerDuty, the configuration for notifying the Crisis Communications group is relatively straightforward.

  • We have a service and an escalation policy dedicated to this group. The service has an email address associated with it so that folks can easily trigger an incident via email (as well as through our mobile or web applications).
  • All members of the Crisis Communications team are set up on the first level of the escalation policy, and each has multiple notification rules set up to notify them immediately should an incident get triggered. Note: It’s important that all users have multiple notification rules that notify them immediately, both for redundancy and to ensure that they still get notified of the incident even when one of their peers acknowledges it before their notification arrives.
  • The service is also configured with a conference bridge, which helps facilitate getting the team together to resolve issues in real time via tools like Zoom.
  • Our Slack integration is used to keep stakeholders up-to-date in a private Slack channel.
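
As a rough illustration of the email trigger path described above, the sketch below builds the kind of message someone might send to the service’s email integration. The integration address, sender, and function name are all hypothetical, and the assumption that the subject line becomes the incident title should be checked against your own integration settings. The code only constructs the message; delivering it would go through whatever mail system you already use.

```python
from email.message import EmailMessage

# Hypothetical integration address (an assumption for this sketch). The real
# address is shown in the email integration settings of the PagerDuty service.
INTEGRATION_ADDRESS = "crisis-comms@example.pagerduty.com"


def build_trigger_email(sender: str, summary: str, details: str) -> EmailMessage:
    """Build an email intended to open an incident on an email-integrated service."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = INTEGRATION_ADDRESS
    # Assumption: the subject is used as the incident title.
    msg["Subject"] = summary
    msg.set_content(details)
    return msg


if __name__ == "__main__":
    message = build_trigger_email(
        sender="employee@example.com",
        summary="Possible urgent issue: office closure",
        details="Building management has asked us to close the office today.",
    )
    # Print the raw message so it can be inspected before sending.
    print(message)
```

Keeping message construction separate from delivery makes it easy to review the wording of an alert before anything reaches the Crisis Communications team.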

Scenario 2: Keeping Our Employees Up-to-Date on a Major Incident

For major events like the COVID-19 outbreak, it’s important to keep communicating with all of our employees as the situation rapidly evolves. Given that PagerDuty is a global company, we designed a configuration in which we can communicate with each region as needed. Below are some details on how it’s set up.

  • We set up two services for each region: one for connecting with leadership and one for communicating with all employees. We also have services set up to communicate with our executive and senior leadership teams.
  • Each service has an email integration so that incidents can be triggered via email, or through our web application or mobile app.
  • We have three levels of escalation for each region. The folks on these escalation levels are expected to facilitate and coordinate a response to each incident, similar to the Incident Commander’s role for a technical incident.
  • All employees in each region are set up on a team.
  • Each service is also configured to automatically run a response play when an incident is created. The response play adds the regional team as stakeholders (subscribers), so they are notified immediately whenever there is an update.
  • As the situation progresses, the incident owners will send out status updates, which will notify all employees (subscribers).
  • Once the incident is resolved, the incident owner will resolve the PagerDuty incident.
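
The flow in the list above can be modeled in a few lines of code. This is a toy sketch only, not PagerDuty’s API: the class, function names, and service naming scheme are invented for illustration, and the rule that resolved incidents send no further updates is an assumption of the sketch.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Incident:
    service: str
    subscribers: list[str] = field(default_factory=list)
    resolved: bool = False


def regional_service_names(region: str) -> list[str]:
    """Two services per region, as described above (the naming is made up)."""
    return [f"{region} - Leadership", f"{region} - All Employees"]


def trigger_incident(service: str, regional_team: list[str]) -> Incident:
    """Create an incident and mimic the response play that subscribes the team."""
    return Incident(service=service, subscribers=list(regional_team))


def send_status_update(incident: Incident, message: str) -> list[str]:
    """Return one notification line per subscriber."""
    if incident.resolved:
        # Assumption of this sketch: no updates go out after resolution.
        return []
    return [f"{person}: [{incident.service}] {message}" for person in incident.subscribers]


if __name__ == "__main__":
    emea_team = ["ana@example.com", "raj@example.com"]
    leadership, everyone = regional_service_names("EMEA")
    incident = trigger_incident(everyone, emea_team)
    for line in send_status_update(incident, "Offices closed until further notice"):
        print(line)
    incident.resolved = True  # the incident owner resolves the incident
```

The point of the model is the fan-out: one status update from the incident owner reaches every subscriber added by the response play, without anyone maintaining a separate distribution list.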

Through these two workflows, we can swiftly and efficiently get the right team on the issue and keep all affected parties up to date. Should you have any questions or need any help configuring your PagerDuty account to enable your team to respond to critical issues, please contact us at