How PagerDuty and Partner Rundeck Enable Business Continuity for Digital Operations

by Scott McAllister April 23, 2020 | 5 min read

At times like these, when the world has been forced to adapt and go almost entirely digital, it’s imperative that our systems and platforms stay up and operational at all times. We are going to great lengths to make sure that the hardware and software in our application stacks are reliable and responsive. Hardware is set up with redundant backups, and new code is tested and reviewed to make sure it doesn’t introduce bugs into the system. These preparations minimize the impact when we lose a portion of our digital infrastructure.

But what happens when we lose contact with our people?

The short answer: Losing people (temporarily or permanently) also means losing the tribal knowledge that really runs an enterprise.

The long answer: In digital operations, even with all the safeguarding, testing, and reviewing of hardware and software, incidents are inevitable. The need for real-time solutions for these issues has never been greater. To best respond to problems as they happen, we need to proactively prepare both as individuals and as teams. We need the right information for the services we are responsible for in the event of an incident. A well-orchestrated response requires a coordinated effort from the right people who can take action, and gathering those who have the right knowledge and system access requires planning and foresight.

Tools like PagerDuty allow your teams to stay informed and take action. But, as PagerDuty co-founder and CTO Alex Solomon mentioned in a recent episode of the Page It to the Limit podcast, it takes more than a product or platform to transform how your team handles incidents. “What I see over and over is that, yes, you can buy the platform. But the hard part is changing culture…and transforming the way people work, and that comes down to people and process,” he shared.

As part of a successful culture, you want to make sure that the appropriate people are scheduled to handle incidents. This requires planning to get the right balance of expertise and ensuring those experts are in a healthy rotation. You want your experts to be sharp and ready when an incident strikes. That means they also need enough downtime—in other words, they need times when they are not on call and not expected to respond.
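To make the idea of a healthy rotation concrete, here is a minimal sketch of a round-robin on-call schedule. It is an illustration only, not how PagerDuty schedules work internally; the function names `build_rotation` and `min_rest` and the responder names are hypothetical.

```python
from itertools import cycle, islice
from typing import Optional


def build_rotation(experts: list[str], shifts: int) -> list[str]:
    """Assign shifts round-robin so each expert is off call while others cover."""
    if not experts:
        raise ValueError("at least one expert is required")
    return list(islice(cycle(experts), shifts))


def min_rest(schedule: list[str], person: str) -> Optional[int]:
    """Return the fewest shifts off between two of this person's shifts.

    Returns None if the person has fewer than two shifts in the schedule.
    """
    positions = [i for i, who in enumerate(schedule) if who == person]
    gaps = [later - earlier - 1 for earlier, later in zip(positions, positions[1:])]
    return min(gaps) if gaps else None


# Hypothetical three-person team covering six shifts.
schedule = build_rotation(["ana", "ben", "chi"], shifts=6)
print(schedule)
print(min_rest(schedule, "ana"))
```

With three experts, each person gets two shifts off between turns. A one-person "rotation" would report zero shifts off, which is an easy signal that the schedule needs more coverage.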

All these precautions are put in place to keep our technology running in case of emergencies, but what about our people? Even with all the right planning, there may be times when our subject matter experts (SMEs) simply aren’t available. As we’ve seen recently, natural disasters, family emergencies, or even pandemics can make any of us suddenly unreachable. This type of risk is known as a risk to business continuity, as the folks at Rundeck describe in their new ebook, the “Guide to Business Continuity for Digital Operations.”

Rundeck, a platform for runbook automation, enables you to give anyone on your team self-service access to the operations capabilities that previously only your SMEs could perform. Think how much more comfortable you’d feel if part of your team’s emergency preparations was having your SMEs create automated runbooks for the tasks they commonly anticipate during incidents.

When thinking about business continuity risk, the top priority is capturing tribal knowledge so your business isn’t disrupted. With PagerDuty, you can use Event Intelligence, Response Plays, and Escalation Policies to capture how to spot and respond to issues (including modeling escalation options when people aren’t available). The dynamic Service Directory allows you to proactively gather all the necessary information about your services, in addition to Runbooks, so your teams can easily have access to all the knowledge needed during those critical moments of resolving an incident.
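As a rough illustration of modeling escalation options when people aren’t available, the sketch below walks an ordered list of escalation levels and picks the first reachable responder. It is a simplified stand-in, not the PagerDuty API: the classes `EscalationLevel` and `EscalationPolicy`, the method `first_available`, and all responder names are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EscalationLevel:
    """One tier of responders, tried in order (hypothetical model)."""
    responders: list[str]
    timeout_minutes: int = 15  # how long to wait before moving to the next tier


@dataclass
class EscalationPolicy:
    name: str
    levels: list[EscalationLevel] = field(default_factory=list)

    def first_available(self, unavailable: set[str]) -> Optional[tuple[int, str]]:
        """Return (level_index, responder) for the first reachable person, or None."""
        for index, level in enumerate(self.levels):
            for person in level.responders:
                if person not in unavailable:
                    return index, person
        return None


# Hypothetical policy for a single service.
policy = EscalationPolicy(
    name="checkout-service",
    levels=[
        EscalationLevel(responders=["primary-sme"]),
        EscalationLevel(responders=["secondary-sme", "team-lead"]),
        EscalationLevel(responders=["engineering-manager"]),
    ],
)

# Both SMEs are unreachable, so the incident lands on the team lead at level 1.
print(policy.first_available(unavailable={"primary-sme", "secondary-sme"}))
```

Writing the fallback order down ahead of time is the point: if `first_available` ever returns `None` for a plausible set of absences, the policy has a continuity gap worth closing before an incident exposes it.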

In Rundeck, you take those preparations one step further by capturing all of the procedures for maintaining, diagnosing, and repairing your environments and services. Put PagerDuty and Rundeck together, and you can continue to operate your digital business even when your expert workforce is disrupted or disoriented.

To help organizations make the shift to remote workforces, Rundeck is providing a comprehensive guide on how to ensure business continuity during uncertain times. The guide explains that the key to ensuring business continuity in digital operations boils down to three things: Automation Harness, Guardrails, and a Dynamic Infrastructure Map.

Rundeck’s runbook automation feature provides a vehicle for automating repeatable tasks, and PagerDuty can help with providing guardrails. When only the right people are notified of an incident—i.e., those who have the knowledge and access to act on the issue—and those people are given only the information they need, organizations can avoid scrambling to locate people and knowledge during a crisis. With proper preparation and planning, those decisions are made beforehand.

Another aspect of building a successful response team is aligning the technical service teams with the corresponding business services. Today’s infrastructure and software components are constantly in motion. By establishing and maintaining strong communication across teams and knowing where to find the “sources of truth,” you can keep all parties up to date. PagerDuty has provided an operations guide for Full-Service Ownership that can help your team align on these common principles.

Prepared teams are responsive teams. As you prepare your teams for the unthinkable, arm yourself with the knowledge found in the resources mentioned above. To learn more about how to prepare your teams (and your schedules) to handle incidents, take a look at our Incident Response operations guide. For help configuring your technical services to match up with the corresponding business services, see the Service Configuration guide. And, to find out how runbook automation can help you keep track of tribal knowledge and keep things running, check out Rundeck’s Business Continuity in Digital Operations guide.