What is Incident Response?

Incident response (IR) is a process used by ITOps, DevOps, and dev teams to address and manage any major incident that may arise. The main goal of IT incident response is to organize an approach that limits damage, reduces recovery time and costs, and prevents the incident from happening again. Incident response generally includes an outline of the processes that need to be executed in the event of an IT incident.

An incident response process is something you hope never to need, but when you do, it’s critical that it covers all the steps necessary for the response to go smoothly. Normally, the knowledge of how to handle incidents within your company or organization is built up over time and gets better with each incident. Often, that knowledge is lost when a team member leaves, which makes it all the more crucial to have a documented process.

Nailing your incident response and learning how to deal with major incidents in a way that leads to the fastest possible recovery time is vital to the success of any team. Generally, your incident response documentation will outline not only how to prepare for an incident, but also what to do during and after one. It is intended to be used by on-call practitioners and others involved in an operational incident response process.

Importance of incident response

Incident response is used to address potential and active breaches quickly, efficiently and effectively. Having a strong incident response plan is important for the protection of three vital areas of your business: data, reputation, and revenue.

Today, the privacy and security of the data stored within your organization is paramount. We can’t lock up our customers’ secrets physically, but we can do everything possible to safeguard them virtually.

Losing control of the security of the information you’ve been entrusted with can cause a loss of trust in your company that damages your reputation for years to come, potentially permanently.

Plus, data breaches are immensely costly, often causing millions of dollars in losses for businesses. For example, in the Home Depot breach of 2014, the business recorded almost $200 million in breach-related pre-tax losses. 

Steps for successful incident response

For successful incident response, you must not only have a holistic view of the health of your IT infrastructure but also prepare your team to know how to respond and what roles to take on. This allows you to orchestrate the right response, resolve incidents faster, and reduce your mean time to resolution (MTTR).
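As a rough illustration of the metric mentioned above, MTTR can be computed as the average time between when an incident is opened and when it is resolved. The sketch below assumes each incident is recorded as a pair of timestamps; the function name and the sample data are hypothetical and not taken from any particular tool.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average (resolved - opened) over a list of (opened, resolved) pairs."""
    if not incidents:
        raise ValueError("need at least one resolved incident")
    total = sum((resolved - opened for opened, resolved in incidents), timedelta())
    return total / len(incidents)


# Hypothetical sample data: one incident resolved in 30 minutes, one in 90.
sample = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 30)),
    (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 15, 30)),
]
mttr = mean_time_to_resolution(sample)
print(mttr)  # 1:00:00
```

Tracking this number over time is one way to see whether changes to your response process are actually shortening recovery.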

By implementing different monitoring tools to cover disparate and new systems, you can gain full-stack visibility into the health of your IT infrastructure. There needs to be a way to normalize, de-dupe, correlate, and gain actionable insights from all this data, so all the events generated by these monitoring tools must be centralized in a single hub, from which they can be triaged and routed to the right on-call engineer.
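The idea of a central hub can be sketched in a few lines. This is a minimal illustration, not how any specific product works: the Event fields, the tool names, the on-call mapping, and the "fallback-on-call" queue are all assumptions, and correlation is reduced to treating events with the same service and check as one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str   # monitoring tool that emitted the event
    service: str  # affected service
    check: str    # what failed, e.g. "high_latency"


def normalize(raw):
    """Map a tool-specific payload (a dict) onto the common Event shape."""
    return Event(
        source=raw["source"].strip().lower(),
        service=raw["service"].strip().lower(),
        check=raw["check"].strip().lower(),
    )


def dedupe(events):
    """Keep the first event per (service, check), whichever tool reported it."""
    seen = {}
    for event in events:
        seen.setdefault((event.service, event.check), event)
    return list(seen.values())


def route(events, on_call):
    """Group events by on-call engineer; unmapped services go to a fallback queue."""
    queues = {}
    for event in events:
        engineer = on_call.get(event.service, "fallback-on-call")
        queues.setdefault(engineer, []).append(event)
    return queues


# Hypothetical payloads: two tools report the same checkout problem.
raw_events = [
    {"source": "ToolA", "service": "Checkout", "check": "high_latency"},
    {"source": "ToolB", "service": "checkout ", "check": "HIGH_LATENCY"},
    {"source": "ToolA", "service": "search", "check": "error_rate"},
]
on_call = {"checkout": "alice"}
queues = route(dedupe([normalize(r) for r in raw_events]), on_call)
```

Here the two checkout events collapse into one item for alice, and the search event lands in the fallback queue because no owner is mapped to that service.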

Before all else, it’s crucial for your team to have established guidelines for what to do when a major incident occurs. That means incident response documentation that outlines a process for going on-call, what to do when an incident arises, how to communicate with teams, and what post-mortem process to follow after an incident. If you need help getting started with establishing your own incident response process, check out PagerDuty’s incident response documentation for guidance.

All this sets the stage for streamlining the incident response process. When a major incident does occur, be sure you:

  • Assess
    Assess the situation and call in the right stakeholders as needed. Collaborate with subject matter experts if need be; otherwise, work with your incident commander, deputy, and customer liaison to assess the damage.
  • Resolve
    Once a plan of attack has been formulated, incident resolution begins. Determine what needs to be shared with the public, employees, and customers.
  • Learn
    Learning is arguably the most important step in the incident response process. It’s in the aftermath that your team is able to look back and see what went well, what didn’t, and what you can do to prevent things from happening again. Incident post-mortems are a great way for teams to learn continuously and to iteratively improve your infrastructure and incident response process. Check out our incident post-mortem template and handbook to get started.

Roles in incident response

Every organization typically has its own custom roles and responsibilities, but below are some of the most common incident management roles:

  • End user. This is the stakeholder who usually experiences the first sign of an outage or disruption and will flag it to initiate the incident management process.
  • Tier 1 Service Desk. Typically the first point of contact when there is an incident ticket or request incoming.
  • Tier 2 Service Desk. Comprised of technicians with primary knowledge around major incidents involving applications, infrastructure, and systems management.
  • Tier 3 (and above) Service Desk. Specialist technicians who have advanced knowledge of very specific areas of the company’s infrastructure. Usually these professionals are brought in for complex maintenance and remediation.
  • Incident Manager. A key stakeholder in the incident management process that drives the entirety of the lifecycle from diagnosis to resolution.
  • Process Owner. This person typically moderates the incident lifecycle, analyzes the process, and points out areas of improvement to make the management lifecycle more efficient for teams.

While the process of incident response can grow to be quite complex, you can break down the stages into these seven main categories:

  1. Incident identification
    • The first and most obvious step is identifying the problem. You can’t hope to solve a problem you can’t find. Identifying the problem isn’t just about finding the breach, though. It’s also a matter of answering who found it, where it was, how it happened, and what critical systems or information are being compromised by it.
  2. Incident logging
    • The next step is logging and tracking the problem to make sure each issue and contingency is being documented as it happens. Tracking is vital to ensure that the same breaches don’t happen more than once, and that teams can learn from past weaknesses and/or errors.
  3. Incident categorization
    • Find the problem. Track the problem. Classify the problem. Classifying the breach or incident into categories helps to show trends over time, which exposes recurring issues and vulnerabilities. This goes hand-in-hand with logging. Good documentation is key to incident response success.
  4. Incident prioritization
    • Many times, multiple incidents happen at the same time. Prioritizing the more important issues can be done in a number of different ways. Often, it’s done by determining how many users are affected by a particular incident. However, sometimes an interruption that affects just a small number of users can be highly impactful, so it’s important to create an internal procedure for prioritization that best suits your organization.
  5. Incident assignment
    • If you have an effective incident response plan in place, your roles and responsibilities should be clearly laid out. That means when something does happen, you’re able to swiftly assign tasks to people in key roles, and they’ll be prepared to handle them. Whether you outsource a response to a third party or take care of it in-house, having the assignments in place ahead of time will save time and resources when incidents do arise.
  6. Task creation and management
    • Incident prioritization leads to task prioritization. In incident response, each task matters, and its timing matters just as much. Each role should have responsibilities assigned to it, and each individual incident will require the creation of tasks for those roles to accomplish in the response cycle. It’s also important that management roles are in place to monitor, oversee, and assign as needed.
  7. Incident lifecycle
    • The entire lifecycle of the event needs to be tracked, logged, and reviewed after the fact. Hotwashes (debriefs held immediately after an incident) are important because they ensure that all information from the breach is fleshed out, reviewed, and applied to future incidents so those same issues don’t occur repeatedly.
      • Diagnosis
      • Escalation
      • Investigation
      • Resolution and recovery
      • Postmortem
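Several of the stages above (logging, categorization, prioritization, and assignment) can be sketched as a small data model. This is only an illustration under stated assumptions: the category names, the 100-user threshold, the two priority levels, the owner mapping, and the "incident-manager" fallback are hypothetical and should be replaced by whatever procedure suits your organization.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Incident:
    title: str
    reported_by: str                      # identification: who found it
    category: str                         # categorization (hypothetical names)
    users_affected: int
    affects_critical_users: bool = False
    assignee: Optional[str] = None
    log: list = field(default_factory=list)

    def record(self, message):
        """Incident logging: append a timestamped entry."""
        self.log.append((datetime.now(timezone.utc), message))


def priority(incident, many_users=100):
    """Assumed rule: 1 if many users or any critical users are affected, else 2."""
    if incident.affects_critical_users or incident.users_affected >= many_users:
        return 1
    return 2


def assign(incidents, owners):
    """Assign in priority order using a category-to-owner map agreed ahead of time."""
    ordered = sorted(incidents, key=priority)  # stable: ties keep arrival order
    for incident in ordered:
        incident.assignee = owners.get(incident.category, "incident-manager")
        incident.record(f"assigned to {incident.assignee} at P{priority(incident)}")
    return ordered


incidents = [
    Incident("Slow dashboard", "end user", "application", users_affected=12),
    Incident("Payments API down", "monitoring", "application",
             users_affected=3, affects_critical_users=True),
    Incident("VPN outage", "service desk", "network", users_affected=450),
]
owners = {"application": "app-on-call", "network": "network-on-call"}
ordered = assign(incidents, owners)
```

Note that the payments incident is handled first even though it affects only three users, which reflects the point that user count alone is not always the right measure of impact.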

Modern incident response lifecycle

Organizations are investing in many monitoring solutions to get visibility into their IT infrastructure so they can better deliver on rising customer demands. Making sense of the event data and taking action by automating the incident response lifecycle for your environment, from assess to resolve to learn, is critical. Knowing what to do when a major incident does occur is vital to the success of your team and your organization.

Learn more about incident response and the incident response lifecycle, which encompasses everything from assessment, triage, and resolution to learning and prevention, and supports developers as they move towards owning their code in production.