PagerDuty Blog

Automate Major Incident Management Step-by-Step for Better, Faster Response

Organizations looking to win the market and drive great customer experiences need to deliver on the promise of exceptional service, meaning fewer interruptions and faster resolution. This can be done by embedding automation across the incident management lifecycle for major incidents, and bringing in humans where it makes sense.

Let’s walk through some of the steps in incident management that are primed for automation for immediate gains, either to eliminate the need for humans to intervene, or to support them in critical moments.

Before you know there’s an incident

Before responders know an incident is happening, there’s a great opportunity to let machines take the brunt of the work via event-driven automation. Event-driven automation begins at the event level when data is ingested from trusted sources such as monitoring tools. At this point, automation can do several things to ensure that incident management proceeds efficiently with as little interruption to SMEs as possible. Some of these include:

  • Reducing incident volume: When a responder does need to jump on an issue, being alerted multiple times for the same problem is annoying and disruptive. It also breaks concentration and slows down response. Using AI and automation to group alerts for related problems into a single incident cuts this noise so responders can concentrate on the problem at hand.
  • Reducing event volume for a better signal-to-noise ratio: Duplicate, informational, or transient events add noise for responders, making it hard to know what’s important and what’s not. Filtering these out ensures that only the most important events are surfaced.
  • Providing context from event data: Events can look very different depending on which services they pertain to, and they don’t always come with information that’s easy for responders to digest. Using automation, these events can be enriched with key information and standardized across the organization so that all responders can understand the context. Additionally, organizations can use custom fields to add even more context, such as labeling incidents as “major” or “production” immediately based on incoming data. In fact, this functionality is now available in early access.
  • Providing automatic diagnostic context: Running diagnostics is often a manual task that responders repeat for every incident, but there’s no need to spend responder capacity on it. Instead, automation can kick off diagnostics and attach the results before a responder ever looks at the incident.
  • Auto-remediation: According to our customers, about 15% of incidents can be resolved without any human intervention at all. These well-understood issues can be remediated entirely with automation, saving time and reducing customer impact. In many cases, the automation works fast enough that most customers don’t notice an incident at all.
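The event-level steps above (suppressing noise, dropping duplicates, grouping related alerts, and enriching incidents with custom fields) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not PagerDuty’s implementation: the `Event` and `Incident` types, the `process_events` function, the severity labels, and the 300-second grouping window are all hypothetical, and the grouping here is a simple time-window rule rather than anything AI-driven.

```python
from dataclasses import dataclass, field

# Every name in this sketch is hypothetical and chosen for illustration only.

@dataclass
class Event:
    service: str
    dedup_key: str
    severity: str      # assumed labels: "info", "warning", "critical"
    timestamp: float   # seconds since some reference point
    summary: str

@dataclass
class Incident:
    service: str
    opened_at: float
    events: list = field(default_factory=list)
    custom_fields: dict = field(default_factory=dict)

def process_events(events, window_seconds=300, production_services=frozenset()):
    """Suppress noise, drop duplicates, group by service, then enrich."""
    incidents = []
    latest_by_service = {}   # service -> most recently opened incident
    seen_keys = set()        # simplification: keys are remembered forever

    for event in sorted(events, key=lambda e: e.timestamp):
        # 1. Reduce event volume: informational events never open an incident.
        if event.severity == "info":
            continue
        # 2. Reduce event volume: skip events whose dedup key was already seen.
        if event.dedup_key in seen_keys:
            continue
        seen_keys.add(event.dedup_key)

        # 3. Reduce incident volume: attach the event to the service's current
        #    incident if that incident opened within the grouping window.
        incident = latest_by_service.get(event.service)
        if incident is None or event.timestamp - incident.opened_at > window_seconds:
            incident = Incident(service=event.service, opened_at=event.timestamp)
            incidents.append(incident)
            latest_by_service[event.service] = incident
        incident.events.append(event)

        # 4. Provide context: label the incident using the incoming data.
        if event.severity == "critical":
            incident.custom_fields["priority"] = "major"
        if event.service in production_services:
            incident.custom_fields["environment"] = "production"

    return incidents
```

In this sketch, two alerts for the same service that arrive a minute apart land in one incident, while an alert arriving well after the window opens a new one. A real deployment would tune the window and the deduplication rules per service.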

After automation handles these tasks, the remaining incidents that can’t be resolved are routed to the correct SME – often the service owner – for triage.
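The handoff described here, where automation tries a known fix first and routes to the service owner only when that fails, might look like the sketch below. It is an assumption-laden illustration: `RUNBOOKS`, `SERVICE_OWNERS`, `handle_incident`, the problem types, and the team names are all hypothetical, and the two remediation functions are stand-ins that simply report success or failure.

```python
# Every name in this sketch is hypothetical and chosen for illustration only.

def restart_worker(service):
    """Stand-in for a real remediation job; returns True when the fix works."""
    return True

def clear_cache(service):
    """Stand-in for a remediation attempt that does not fix the problem."""
    return False

# Well-understood problem types mapped to an automated fix.
RUNBOOKS = {
    "worker_stalled": restart_worker,
    "cache_full": clear_cache,
}

# Service -> owning on-call team.
SERVICE_OWNERS = {
    "checkout": "payments-oncall",
    "search": "search-oncall",
}

def handle_incident(service, problem_type, default_owner="incident-commander"):
    """Try auto-remediation first; route to the service owner if it fails."""
    runbook = RUNBOOKS.get(problem_type)
    if runbook is not None and runbook(service):
        return {"status": "resolved", "resolved_by": "automation", "assigned_to": None}
    # No runbook exists, or the runbook did not fix it: a human takes over.
    owner = SERVICE_OWNERS.get(service, default_owner)
    return {"status": "triggered", "resolved_by": None, "assigned_to": owner}
```

Falling back to a default owner for unmapped services is a design choice in this sketch, made so that no incident is left unassigned.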

During triage

Triage is the phase where the responder is trying to figure out what went wrong. But systems are complex and the answer is rarely straightforward, so this process can take a lot of time and mental capacity. Meanwhile, customers are waiting for service to return to normal. Responders should be able to spend their expertise on pinpointing the issue rather than digging through docs and postmortems, or pinging other SMEs for tribal knowledge. With machine learning and automated diagnostics, much of this groundwork is already populated on the incident by the time responders get to their desks.

Machine learning can surface system context for responders such as the probable origin of the incident, other teams experiencing the same problem, past incidents and how they were resolved, change events, and more.

Armed with this information, responders can act quickly and get up to speed on incidents without the toil involved in information gathering. This democratizes the information available to all responders, whether they’ve been at the organization for a decade and know everything about the system or have only just started.

While working to a resolution

Working towards an actual resolution is the part of response where SMEs are needed most. At this point, automation serves as an assistant: it can answer questions using AI and streamline workflows so the response team’s process stays codified and on track. Let’s look at each.

GenAI has been a game changer for many companies, but how you use it makes the difference. An important factor is being able to ask questions and interact with the AI to get the answers you want quickly. With a GenAI chatbot assisting incident management, you can preserve team capacity by asking it questions about the system to get a jumping-off point, an idea of impact, and more.

Combined with an AI co-pilot, Incident Workflows can also be a game changer. Not only do responders have answers to key questions at their fingertips, they also know what to do next, and toilsome steps such as creating communication channels and drafting updates are completed for them.

While communicating both internally and externally

Speaking of updates, communication is a key part of incident management, but it can be easy to miss in the heat of an incident. Effective communication covers three audiences: key stakeholders internally, customers externally, and other systems such as your ITSM. It’s important to use automation and GenAI in tandem to cover your bases and craft tailored communications for each audience.

Responders can accomplish this in several ways:

  • Custom fields: Write updates back to your ITSM, and update the incident with any relevant ITSM data so that all teams, whether IT or developers, are on the same page.
  • Status update templates: Use GenAI to craft updates and automatically publish them to key internal stakeholders based on pre-assembled groups.
  • Status pages: Update customers automatically on what to expect from the response effort and share out when an incident has come to a close.

Communicating throughout the incident helps build and preserve trust. Responders may need to send an early acknowledgement, regular updates, and then a closing response. Automation via incident workflows can keep responders on track, meaning nobody is left out of the loop all the way from acknowledging an incident to resolution.
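As a rough sketch of the acknowledge, update, and close cadence described above, the example below drafts one message per audience for each stage using Python’s standard `string.Template`. The stage names, the template wording, and the `draft_updates` function are hypothetical. Fixed templates stand in here for the GenAI-drafted text the article describes, and actually publishing to stakeholders or a status page is out of scope.

```python
from string import Template

# Every name and message in this sketch is hypothetical.

# Internal stakeholders get the technical detail.
INTERNAL = {
    "acknowledged": Template("[$service] Incident acknowledged. Responders are investigating."),
    "update": Template("[$service] Update: $detail"),
    "resolved": Template("[$service] Incident resolved. $detail"),
}

# Customers get a shorter message without internal detail.
EXTERNAL = {
    "acknowledged": Template("We are investigating an issue affecting $service."),
    "update": Template("We are continuing to work on the issue affecting $service."),
    "resolved": Template("The issue affecting $service has been resolved."),
}

def draft_updates(service, stage, detail=""):
    """Return one drafted message per audience for the given incident stage."""
    if stage not in INTERNAL:
        raise ValueError(f"unknown stage: {stage}")
    return {
        "stakeholders": INTERNAL[stage].substitute(service=service, detail=detail),
        "customers": EXTERNAL[stage].substitute(service=service),
    }
```

Calling `draft_updates("checkout", "update", "Rolled back the latest deploy")` yields a detailed internal note and a customer-facing line that omits the rollback detail. An incident workflow could trigger this at each stage so no audience is skipped.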

Leveraging AI and automation 

Leveraging AI and automation throughout the incident lifecycle can improve the experience for responders, stakeholders, and customers. It’s important to adopt these new ways of working and stay at the forefront of this technology. But it’s unlikely that machines will be able to resolve novel issues by themselves for a long time. In the meantime, it’s key to have a strategic partner that helps organizations make the most of AI and automation. If you want to learn what PagerDuty can do for you, try us out today.