What is Incident Management?

In today’s digital world, technology has become the focal point of business performance and customer satisfaction across industries of all sorts. Because of the increased complexities within infrastructure environments and the additional abstractions that are layered within applications and services, the need for a centralized incident management platform has never been greater. But what is incident management in the context of technology?

Incident management is the end-to-end business process of addressing an outage, service disruption, or other major incident, from the moment it is first detected to its full resolution. While this definition may sound simple, the lifecycle itself is complex: it involves cross-team collaboration, disparate technologies, and distributed systems, all of which must work together to resolve the incident efficiently without risking customer experience, brand reputation, and, most importantly, the bottom line of the business.

While the process of incident management can grow to be quite complex, you can break it down into these seven main stages:

Incident identification
Incident logging
Incident categorization
Incident prioritization
Incident assignment
Task creation and management
Incident response
- Diagnosis
- Escalation
- Investigation
- Resolution and recovery
- Postmortem

We’ll return to the process of managing an incident later. But first, let’s discuss what exactly an incident is, and what it isn’t.

Types of Incidents

Incidents occurring within a given IT environment can be categorized and defined in numerous ways. Some incidents are defined by severity or business impact, while others are defined by the root cause of the outage. For example, an incident can be as simple as network latency due to high traffic, or as complex as a container failure in a mission-critical, customer-facing application, which can cause widespread outages across a customer base.

In many businesses, incidents are defined by their severity or priority level, with labels that often look like:

Sev1
Sev2
Sev3
P1
P2
P3

The Incident Management Process

Step 1: Identifying an Incident

It may sound obvious, but the first step in managing an incident is to identify it. To do this, you must determine what defines an incident for your team. An incident is when your service experiences an unplanned interruption or reduction in quality. Since each company is different, as are its infrastructure and applications, it’s important to consider the specific types of incidents you might run into. For example, if your primary service includes an online shop, a possible incident could be slower page speeds caused by increased site traffic, perhaps during a big sale.

Step 2: Logging an Incident

Once an incident has been identified, the next step is to correctly log and track the incident. This will typically be done by your service desk. Incidents are logged as tickets, which should include the following information:

User’s name and contact information
Description of the incident
Date and time of the incident (needed for SLA clearance)
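As a rough illustration, the ticket fields listed above could be captured in a small record like the one below. This is a minimal sketch, not a real service desk schema: the `IncidentTicket` class, its field names, and the `log_incident` helper are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentTicket:
    """Hypothetical ticket record; field names are illustrative only."""
    reporter_name: str
    reporter_contact: str
    description: str
    # Timezone-aware timestamp so later SLA calculations are unambiguous.
    occurred_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def log_incident(name: str, contact: str, description: str) -> IncidentTicket:
    """Reject tickets with missing fields, then return a new ticket."""
    if not name.strip() or not contact.strip() or not description.strip():
        raise ValueError("name, contact, and description are all required")
    return IncidentTicket(name.strip(), contact.strip(), description.strip())


ticket = log_incident("Dana Lee", "dana@example.com", "Checkout page times out")
print(ticket)
```

Validating at logging time keeps incomplete tickets out of the queue, which makes the later categorization and prioritization steps easier.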

Step 3: Categorizing an Incident

Once an incident is logged, it must then be categorized. This is extremely important, and every incident should be assigned at least one category (such as “Network”) and subcategory (such as “Network Outage”). This will allow your service desk to easily sort through all incidents based on their categories and subcategories rather than having to sift through a sea of uncategorized tickets. We’ve all been there, and it’s not a fun place to be. Proper categorization of incidents can also help to show patterns, track how many times similar incidents occur, and diagnose larger problems and areas that may require additional training. For example, if you continuously run into speed issues, it may be time to discuss upgrading your infrastructure.
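To show how categories can surface patterns, here is a minimal sketch that counts tickets per category and subcategory pair and reports the ones that keep recurring. The sample tickets, the `recurring_issues` function, and the threshold of two are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical tickets, each reduced to a (category, subcategory) pair.
tickets = [
    ("Network", "Network Outage"),
    ("Network", "Latency"),
    ("Application", "Slow Page Load"),
    ("Network", "Latency"),
    ("Application", "Slow Page Load"),
    ("Application", "Slow Page Load"),
]


def recurring_issues(tickets, threshold=2):
    """Return (pair, count) for pairs seen at least `threshold` times,
    most frequent first."""
    counts = Counter(tickets)
    return [(pair, n) for pair, n in counts.most_common() if n >= threshold]


repeats = recurring_issues(tickets)
for (category, subcategory), count in repeats:
    print(f"{category} / {subcategory}: {count} incidents")
```

A report like this is one way to spot the kind of repeated speed issue that might justify an infrastructure upgrade.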

Step 4: Prioritizing Your Incidents

As with any task or to-do list, prioritization is key. Prioritizing incidents based on their severity will clearly point to major incidents that need to be solved right away, and minor incidents whose necessary resolution time is much more flexible. An incident’s priority and urgency will be based on the level of impact to users and their ability to use the service. With all incidents categorized, your team can automate how specific incident categories and subcategories should be prioritized.

Incidents are typically prioritized as:

Low-priority incidents: Users experience no interruption in service
Medium-priority incidents: Some internal staff affected with little to no interruption for users
High-priority incidents: A large number of users experience service interruption and reduction in quality. High-priority incidents often have negative financial impacts on the business.
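One way to automate prioritization by category is a simple lookup table. The sketch below is only an illustration: the rule table, the category names, and the choice to default unknown categories to low priority are assumptions, and a real team would tune them to its own services.

```python
from enum import Enum
from typing import Optional


class Priority(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical rules. A subcategory of None means "any subcategory".
PRIORITY_RULES = {
    ("Network", "Network Outage"): Priority.HIGH,
    ("Network", None): Priority.MEDIUM,
    ("Internal Tools", None): Priority.MEDIUM,
}


def prioritize(category: str, subcategory: Optional[str] = None,
               default: Priority = Priority.LOW) -> Priority:
    """Try the exact (category, subcategory) rule first, then the
    category-wide rule, then fall back to the default."""
    for key in ((category, subcategory), (category, None)):
        if key in PRIORITY_RULES:
            return PRIORITY_RULES[key]
    return default


print(prioritize("Network", "Network Outage").name)
print(prioritize("Network", "Latency").name)
print(prioritize("Printer", "Paper Jam").name)
```

Checking the most specific rule first lets a single subcategory override a broader category-wide setting.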

Step 5: Responding to an Incident

Once an incident has been identified, logged, categorized, and prioritized, it’s time to respond to the incident. This is a typical process of how an incident response is conducted:

First, your service desk will need to make an initial diagnosis, where the issue is clearly described and troubleshooting questions are answered.
Once the incident has been diagnosed, your service desk will determine whether or not an incident escalation is needed. An escalation is when there is advanced support needed to resolve an incident, in which case the incident will be assigned to the appropriate team.
Next, the assigned team will investigate and diagnose the incident. This is typically done during a troubleshooting phase after confirming the initial incident hypothesis. Once a diagnosis has been made, your team will apply the needed fix, such as a software patch, change in settings, new hardware, etc.
Finally, once an incident is fixed, your team can close the incident.
Following the incident closure, your team should have an internal review meeting, and conduct any needed postmortems. At this point, you’ll also need to determine whether any public postmortem is needed.
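The response steps above can be sketched as a small state machine that only permits the transitions described. The stage names and the `advance` and `run` helpers are hypothetical; they model the flow in this article rather than any particular tool.

```python
# Hypothetical stage names based on the response steps described above.
ALLOWED_TRANSITIONS = {
    "initial_diagnosis": {"escalated", "investigating"},
    "escalated": {"investigating"},
    "investigating": {"resolved"},
    "resolved": {"closed"},
    "closed": {"postmortem"},
    "postmortem": set(),
}


def advance(current: str, target: str) -> str:
    """Move to `target` if the transition is allowed, else raise."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"cannot move from {current!r} to {target!r}")
    return target


def run(path):
    """Walk a sequence of stages and return the final one."""
    state = path[0]
    for target in path[1:]:
        state = advance(state, target)
    return state


final = run(["initial_diagnosis", "escalated", "investigating",
             "resolved", "closed", "postmortem"])
print(final)
```

Encoding the allowed transitions makes it impossible to, say, close an incident that was never investigated.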

Don’t forget about incident communication with your users! While responding to an incident, your team should also stay in communication with its users as needed. Incident communication is essential to maintaining the trust of your users, as well as the credibility of your brand. Should an incident arise that impacts their ability to use the service without interruption, your team should immediately notify users of the incident (whether via email, social media, a designated page or plugin, etc.). Let them know your team is on it and provide them with regular updates throughout the incident response process.
Once an incident has been closed, notify users of the incident, how it’s been resolved, and whether or not any additional steps are needed.

Roles

Every organization typically has its own custom roles and responsibilities. Below are some of the most common incident management roles:

End user. This is the stakeholder who usually experiences the first sign of an outage or disruption and will flag it to initiate the incident management process.
Tier 1 Service Desk. Typically the first point of contact when there is an incident ticket or request incoming.
Tier 2 Service Desk. Composed of technicians with core knowledge of major incidents involving applications, infrastructure, and systems management.
Tier 3 (and above) Service Desk. Specialist technicians who have advanced knowledge of very specific areas of the company’s infrastructure. These professionals are usually brought in for complex maintenance and remediation.
Incident Manager. A key stakeholder in the incident management process who drives the entire lifecycle from diagnosis to resolution.
Process Owner. This person typically moderates the incident lifecycle, analyzes the process, and points out areas of improvement to make the management lifecycle more efficient for teams.

But how does the process of incident management actually work? With PagerDuty, the process can be broken down into these four stages of management:

Harness Data
Make Sense of Data
Respond & Engage Teams
Analyze and Learn

Harness Digital Data

When incidents do inevitably occur, understanding the makeup of an incident and its root cause is critical to diagnosing, and eventually mitigating, the issue and saving time and money for your business. While there is no uniform identity to an incident, you can follow the breadcrumbs based on the type of outage you are seeing. For example, if there is a load balancing issue with one of your external applications, you may want to dig deeper into your container environment to better understand the issue. Aggregating all of the digital data surrounding the incident helps you uncover the root cause, and it is the first step in orchestrating a coordinated, holistic response.

PagerDuty’s ecosystem of more than 350 integrations gives your teams a centralized view into your entire environment, with a single point of ingestion for data signals from any tool, webhook, system, or monitoring application.

Make Sense of the Data

Even with all of the data surrounding the incident in front of you, pinpointing the disruptive signal is nearly impossible, like searching for a needle in a haystack. To uncover the identity of the incident, you need the ability to aggregate and segment the data you are surveying so you can paint a better picture of the incident makeup and turn the data into meaningful signals.

With so much data constantly flowing in and out of a given environment, being able to make sense of the data and create actionable paths to mitigation is key to resolving the issue before it starts to cascade across the rest of the business and your customer base. Drawing on more than 10 years of historical data, PagerDuty helps aggregate, correlate, and connect similar incidents and events into a single instance in order to help orchestrate an efficient and collaborative response.
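To make the idea of correlating similar events concrete, here is a minimal sketch that groups events sharing the same service and check when they arrive close together in time. It is a deliberately simple illustration and not PagerDuty’s actual correlation method; the event fields, the `correlate` function, and the five-minute window are all assumptions.

```python
from datetime import datetime, timedelta


def correlate(events, window=timedelta(minutes=5)):
    """Group events with the same (service, check) key when each one
    arrives within `window` of the previous event in its group."""
    groups = []
    latest = {}  # key -> index of the most recent group for that key
    for event in sorted(events, key=lambda e: e["time"]):
        key = (event["service"], event["check"])
        idx = latest.get(key)
        if idx is not None and event["time"] - groups[idx][-1]["time"] <= window:
            groups[idx].append(event)
        else:
            latest[key] = len(groups)
            groups.append([event])
    return groups


start = datetime(2024, 1, 1, 12, 0)
events = [
    {"service": "checkout", "check": "latency", "time": start},
    {"service": "database", "check": "cpu", "time": start + timedelta(minutes=1)},
    {"service": "checkout", "check": "latency", "time": start + timedelta(minutes=2)},
    {"service": "checkout", "check": "latency", "time": start + timedelta(minutes=4)},
    {"service": "checkout", "check": "latency", "time": start + timedelta(minutes=30)},
]

groups = correlate(events)
for group in groups:
    first = group[0]
    print(first["service"], first["check"], len(group))
```

Here the three checkout latency alerts that arrive within minutes of each other collapse into one group, while the alert half an hour later starts a new one.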

Respond and Engage Teams

One of the most important functions of the incident management process is making sure the correct stakeholders and service owners are engaged and actively working to mitigate the issue at hand. By looping in key stakeholders, teams can take a proactive approach to addressing and remediating the issue, while also providing organizational visibility so other teams are aware of the ongoing response.

With PagerDuty, key stakeholders and responders are informed in real time as an incident is happening, so the incident is routed to the right team, which can take immediate action before the issue becomes customer- or revenue-impacting.
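As a simplified picture of routing, the sketch below maps an incident to an owning team and decides who else should be notified. The routing table, team names, and `route` function are hypothetical and do not represent PagerDuty’s API.

```python
# Hypothetical mapping from service name to the owning on-call team.
ROUTES = {
    "checkout": "payments-oncall",
    "database": "dba-oncall",
}
DEFAULT_TEAM = "service-desk"


def route(incident: dict) -> dict:
    """Pick the owning team and the list of parties to notify."""
    team = ROUTES.get(incident["service"], DEFAULT_TEAM)
    notify = [team]
    # Assumption: stakeholders are only looped in for high-priority incidents.
    if incident.get("priority") == "high":
        notify.append("stakeholders")
    return {"assigned_to": team, "notify": notify}


result = route({"service": "checkout", "priority": "high"})
print(result)
```

Falling back to a default team means an incident for an unmapped service still reaches someone instead of being dropped.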

Analyze and Learn

Once an incident is fully resolved, the postmortem stage is an important function of the incident lifecycle as it helps teams to better understand what happened and how they can prevent recurring incidents in the future. This enables teams to take a preventative approach to incident management and make sure, when things do inevitably happen, that they are dealt with in a timely and frictionless manner.

PagerDuty gives teams the tools and information necessary to better understand an incident’s makeup, along with actionable insights to prevent similar incidents from recurring in the future.

To learn more about how PagerDuty can improve your organization’s incident management process, try a 14-day free trial today.
