The Future of Incident Response is Automated, Flexible, and Proactive
We know our customers rely on PagerDuty as the backbone of critical real-time operations, so we want to make sure each and every enhancement helps streamline incident response. How can we help our customers spend less time firefighting and more time innovating?
One of PagerDuty’s values is Champion the Customer – and we take this very seriously. When building and improving features, we aim to keep a pulse on what’s going on with our customers: what’s keeping them up at night? What do they need today? How have their circumstances changed recently? And how can we help them scale their goals for tomorrow?
I sat down with Dan McCall, VP of Product for Incident Response, to learn more about his philosophy for building on the legacy of PagerDuty’s best-in-class incident response solution. To hear about all the features that Dan’s team is building at PagerDuty, check out his session, “Incident Response Keynote: Automated, Flexible, Proactive.” Registration is easy: just click here.
Q: So Dan, are there any patterns that have emerged from speaking with customers? What’s top of mind?
I’m hearing customers talk a lot about maximizing efficiency, minimizing toil, and generally becoming more data-driven so that they can build resilience at scale. What’s interesting is that this is the case whether they’re just getting started on their DevOps journey or have been at it for years. This makes sense – complexity is increasing and incidents are happening more often across the board, but this affects customers differently. For some, just getting the right person at the right time is the goal, while others prioritize fine-tuning response to streamline ongoing processes and limit the impact on responder health.
But there’s one thing that I hear most, and it’s that while building resilience and scaling efficiency are challenging to solve in the best of times, everything has become a whole lot harder because of the “Great Resignation.” In fact, in our most recent customer survey, 64% of our respondents said that they’re experiencing increased turnover this year. It goes without saying that attrition puts added strain on teams – it takes resources to hire and onboard new people, and running understaffed can lead to a vicious cycle of even more manual toil and burnout. And this situation drives even more urgency for getting operations into a healthier, more mature state.
Q: What do you mean when you say operational maturity?
Operational maturity is about providing a better, more predictable experience for your teams, so you can address and get ahead of the underlying issues behind attrition and burnout, and put processes and behaviors in place that head off some of that potential turnover.
For those of you who might be newer to operational maturity: customers often ask us what ‘good’ looks like. To help organizations measure their operational maturity, we developed the Digital Operations Maturity Model. We created it after looking at teams and organizations across our platform and codifying the behaviors that we observed.
The model gives organizations a way to define operational maturity, identify where they fall on the spectrum, and understand where to focus their efforts to improve.
To take this a step further and make it even more tangible, our product analytics team mapped the maturity model to data from our platform. We see that reactive teams consistently experience higher turnover than preventative teams – just last quarter, the difference was more than 2x! When you think about that against the backdrop of the Great Resignation, it’s clearer than ever before that our products can make a big difference in helping our customers with their most pressing operational challenges. I’d highly recommend you check out this talk, “Getting from Reactive to Proactive (and Beyond!)”, from Scott Bastek and Tejere Oteri, which you can access by registering here.
Q: How does what you’ve been hearing from customers shape your vision for the future of our incident response solution?
When thinking about where we can steer our product to best help our customers achieve this transformation and level up their operational maturity, my team’s vision is to make incident response more:
- Automated to eliminate waste and inefficiency
- Flexible to address a multitude of unique business needs at scale
- Proactive to anticipate and prevent business disruption
And we’re going to do this while staying true to the core of what our customers know and love about PagerDuty.
Q: Automation can mean a lot of things to a lot of people – when you think about Automated Incident Response, what does it mean to you?
To me, Automated Incident Response is humans and machines working better together. To help illustrate this, I often think about the concept of Centaur Chess. The TL;DR version is: AI can beat a human at the game of chess, but a human paired with AI can beat pure AI.
Automation as the first line of defense empowers teams to balance critical workloads between humans and their machines, helping humans work smarter when they’re needed, and removing the burden when they’re not. There’s plenty in the incident response process that involves manual toil or well-understood tasks – our goal is to remove that unnecessary burden from your humans, so that the humans you have can stay focused and do better at their jobs.
One example of how we’re enabling this is by making it possible to call Automated Diagnostics right from your mobile app, so that your responder doesn’t have to manually run through a rote set of tasks associated with standard diagnostics when they get to their desk. With automation, the diagnostics have already run and the results are ready to go by the time your responder gets to the incident.
At their best, automation and AI can take care of things that your teams shouldn’t be doing in the first place. Helping people do less repetitive, manual work helps them stay more engaged, which reduces burnout and helps with attrition. More time to think and focus on how to innovate also means having the extra cycles you need to learn from incidents and improve processes to build the resilience that you want.
Q: PagerDuty has been actively investing in several acquisitions – how has this tied into your roadmap?
We’re thrilled to harness really strong partnerships with our most recent acquisitions, Rundeck in 2020 and Catalytic earlier this year, to spin out better experiences for our customers.
For Incident Response, we’ve been working with our colleagues from the Rundeck acquisition to take their product (now known as Process Automation) and embed Automation Actions deeply within our Incident Response experience – starting from ingest and Event Orchestration, to Mobile, and even our web experience.
First-line responders often find themselves actioning the same, recurring diagnostic steps when it comes to incident triage and remediation, which takes time away from high-value work, keeps specialists firefighting instead of innovating, and prolongs MTTR. So making it as simple and light as possible for teams to start leveraging automation in their incident response lifecycle was really important to us. With the ability to call Automated Diagnostics in any number of ways, teams can save time that they would’ve had to spend on rote, manual tasks. Instead, they can have the results ready by the time the responder gets to their desk.
With Catalytic, we’re taking a different approach. When an incident strikes, organizations typically have a checklist of important steps to run through, which are often manual and hard to remember, especially in the heat of the moment at 2 a.m.! Finding and remembering these steps can distract the response team from its main focus: resolving the incident. We’ve had lightweight response plays for a few years now and have been asked by customers for more ways to automate steps of their incident response processes with more flexibility, which is why we’re excited to introduce Incident Workflows.
Coming later this year, we will be upgrading our lightweight response plays into powerful Incident Workflows based on the new workflow engine from our Catalytic acquisition. These workflows will use “if-this-then-that” logic, which will make it effortless to configure a sequence of common incident actions (such as adding a responder, subscribing stakeholders, or starting a conference bridge) into an orchestrated response.
You can customize your Incident Workflows to reflect your organization’s unique processes for any number of use cases, for example based on incident priority, status, or urgency. And as you learn from an incident, you can then encode that learning back into your workflows to automate those repetitive and mundane tasks for the next time an incident occurs.
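To make the “if-this-then-that” idea a little more concrete, here is a minimal Python sketch of a workflow as a condition plus an ordered list of actions. It is purely illustrative and is not the Incident Workflows API: the `Incident` and `Workflow` classes, the field names, and the action helpers are all hypothetical names made up for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Incident:
    # Hypothetical incident record; the field names are illustrative only.
    title: str
    priority: str  # e.g. "P1" through "P5"
    urgency: str   # e.g. "high" or "low"
    log: List[str] = field(default_factory=list)


# An action is any function that does something to an incident.
Action = Callable[[Incident], None]


def add_responder(name: str) -> Action:
    def action(incident: Incident) -> None:
        incident.log.append(f"added responder: {name}")
    return action


def subscribe_stakeholders(group: str) -> Action:
    def action(incident: Incident) -> None:
        incident.log.append(f"subscribed stakeholders: {group}")
    return action


def start_conference_bridge() -> Action:
    def action(incident: Incident) -> None:
        incident.log.append("started conference bridge")
    return action


@dataclass
class Workflow:
    name: str
    condition: Callable[[Incident], bool]  # the "if this"
    actions: List[Action]                  # the "then that", run in order

    def run(self, incident: Incident) -> bool:
        """Run every action if the condition matches; report whether it ran."""
        if not self.condition(incident):
            return False
        for action in self.actions:
            action(incident)
        return True


def run_workflows(incident: Incident, workflows: List[Workflow]) -> List[str]:
    """Return the names of the workflows that fired for this incident."""
    return [w.name for w in workflows if w.run(incident)]


# Two example workflows keyed on priority (assumed values "P1" and "P5").
major_incident = Workflow(
    name="major-incident",
    condition=lambda i: i.priority == "P1",
    actions=[
        add_responder("incident-commander"),
        subscribe_stakeholders("leadership"),
        start_conference_bridge(),
    ],
)

low_priority = Workflow(
    name="low-priority",
    condition=lambda i: i.priority == "P5",
    actions=[subscribe_stakeholders("service-owners")],
)
```

In this sketch, encoding a lesson from a past incident would simply mean appending another action to the relevant workflow's `actions` list, or adding a new `Workflow` with its own condition.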
Q: Which of these announcements do you think our customers will be most excited about?
It’s hard to pick just one, so I’ll tease two and you’ll have to check out my session to hear about all the goodness we’ve got in store for you. First, I think customers are going to be really excited about where we’re taking the future of Response Plays. We’ve already been getting some amazing feedback on how Incident Workflows will deliver a step-function-level improvement on Response Plays with the powerful UI and modular flexibility based on things like priority. I’m personally really excited to see what customers will do with Incident Workflows and how they make them their own. One of the beautiful parts of building this “in a platform way” is that, although we’re showcasing how it can be useful in major incidents, it can be used in a multitude of other ways. You can hear more about this in my session at Summit where Stephanie Gridley, a Resilience Manager from Wayfair, details how their team might use the functionality for both P1 and P5 incidents.
Customers will also be very happy to see updates to some core features that they’ve wanted for a long time, such as Status Update Notification Templates. What’ll get even more interesting is when these features eventually start feeding into each other to do even cooler things. It’s the nexus of these features working in context with one another that provides a multiplier impact greater than the sum of the parts.
If you want to learn more about what else is on the Incident Response roadmap for this year, check out Dan’s virtual keynote session: “Incident Response Keynote: Automated, Flexible, Proactive.” It’s not too late to register for PagerDuty Summit – register here.