What is Incident Response?
Incident response (IR) is a process used by ITOps, DevOps, and dev teams to address and manage major incidents. The main goal of IT incident response is an organized approach that limits damage, reduces recovery time and costs, and prevents the incident from recurring. Incident response generally includes an outline of the processes to execute in the event of an IT incident.
An incident response process is something you hope to never need, but when you do, it’s critical that it encompasses all the steps necessary for the response to go smoothly. Normally, the knowledge of how to handle incidents within your company or organization is built up over time and gets better with each incident. Often, that knowledge is lost when a team member leaves, which makes a documented process all the more crucial.
Nailing your incident response and learning how to deal with major incidents in a way that leads to the fastest possible recovery time is vital to the success of any team. Generally, your incident response documentation will outline not only how to prepare for an incident, but also what to do during and after one. It is intended for on-call practitioners and anyone else involved in an operational incident response process.
Importance of incident response
Incident response is used to address potential and active breaches quickly, efficiently and effectively. Having a strong incident response plan is important for the protection of three vital areas of your business: data, reputation, and revenue.
Today, the privacy and security of data stored within your organization is paramount. We can’t lock up our customers’ secrets physically, but we can do all we can to safeguard them virtually.
Losing a handle on the security of the information with which you’ve been entrusted can cause a loss in company trust that can damage your reputation for years to come, potentially permanently.
Plus, data breaches are immensely costly, often causing millions of dollars in losses for businesses. For example, in the Home Depot breach of 2014, the business recorded almost $200 million in breach-related pre-tax losses.
Steps for successful incident response
For successful incident response, you need more than a holistic view of the health of your IT infrastructure. You also have to prepare your team to know how to respond and what roles to take on, so you can orchestrate the right response, resolve incidents faster, and reduce your mean time to resolution (MTTR).
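As a rough illustration of the MTTR metric mentioned above, the sketch below averages the time between detection and resolution across a set of incidents. The function name and the sample timestamps are hypothetical, and your organization may define the start and end points of an incident differently.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average (resolved - detected) over a list of (detected, resolved) pairs."""
    if not incidents:
        raise ValueError("need at least one resolved incident")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),  # start value so timedeltas can be summed
    )
    return total / len(incidents)


# Hypothetical sample data: two resolved incidents.
incidents = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 30)),    # 30 minutes
    (datetime(2024, 1, 5, 14, 0), datetime(2024, 1, 5, 15, 30)),  # 90 minutes
]

mttr = mean_time_to_resolution(incidents)
print(mttr)  # 1:00:00
```

Tracking this number over time, rather than for a single incident, is what shows whether process changes are actually helping.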
By implementing different monitoring tools to appropriately monitor disparate and new systems, you can gain full-stack visibility into the health of your IT infrastructure. You then need a way to normalize, de-dupe, correlate, and gain actionable insights from all this data. To make that possible, all the events generated by these monitoring tools must be centralized in a single hub, from which they can be triaged and routed to the right on-call engineer.
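To make the normalize, de-dupe, and route steps concrete, here is a minimal sketch of a central event hub. It is only an illustration: the field names (`service`, `host`, `check`, `alert_name`), the `ON_CALL` table, and the fallback responder are all assumptions, not the schema of any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    service: str
    check: str
    severity: str


def normalize(raw: dict) -> Event:
    # Hypothetical field names: different tools label the same facts differently.
    service = raw.get("service") or raw.get("host") or "unknown"
    check = raw.get("check") or raw.get("alert_name") or "unknown"
    severity = str(raw.get("severity", "info")).lower()
    return Event(service.lower(), check.lower(), severity)


def dedupe(events):
    # Keep the first event seen for each (service, check) pair.
    seen = set()
    unique = []
    for event in events:
        key = (event.service, event.check)
        if key not in seen:
            seen.add(key)
            unique.append(event)
    return unique


# Hypothetical on-call table mapping services to responders.
ON_CALL = {"checkout": "alice", "search": "bob"}


def route(event: Event, default: str = "ops-duty") -> str:
    return ON_CALL.get(event.service, default)


raw_events = [
    {"service": "Checkout", "check": "latency", "severity": "CRITICAL"},
    {"host": "checkout", "alert_name": "Latency", "severity": "critical"},
    {"service": "search", "check": "error_rate", "severity": "warning"},
]

events = dedupe([normalize(raw) for raw in raw_events])
assignments = {(e.service, e.check): route(e) for e in events}
print(assignments)
```

The first two raw events come from different tools but describe the same problem, so after normalization they collapse into a single event routed to one responder.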
Before all else, it’s crucial for your team to have established guidelines for what to do when a major incident occurs. Your incident response documentation should outline a process for going on-call, what to do when an incident arises, how to communicate with teams, and what post-mortem process to follow afterward.
All this sets the stage for a streamlined incident response process. When a major incident does occur, be sure you:
- Assess
Assess the situation and call in the right stakeholders as needed. Collaborate with subject matter experts if need be; otherwise, work with your incident commander, deputy, and customer liaison to assess the damage.
- Resolve
Once a plan of attack has been formulated, incident resolution begins. Determine what needs to be shared with the public, employees, and customers.
- Learn
Learning is arguably the most important step in the incident response process. It’s in the aftermath that your team is able to look back and see what went well, what didn’t, and what you can do to prevent the incident from happening again. Incident post-mortems are a great way for teams to learn continuously, and they serve as a way to iteratively improve your infrastructure and incident response process. Check out our incident post-mortem template and handbook to get started.
Roles in Incident Response
For major incidents, it’s key that teams can move as one to execute a variety of tasks. These tasks range from actually resolving the issue, to taking command and leading the resolution effort, to communicating internally and externally. It’s important to know who is responsible for what. Every organization typically has its own custom roles and responsibilities, but below are some of the most common incident response roles:
- Incident commander: Runs the incident from start to finish. Makes any necessary decisions during the response process and works with the team to share out communications.
- Deputy: The commander’s right hand. This is an active role, not just an observer. This person is responsible for keeping time during an incident and ensuring updates are given to the incident commander as expected.
- Scribe: Documenter of an incident. The scribe takes note of anything important during incidents, including communications, logs, screenshots, and more, then records it all and assembles a basic timeline of events.
- Internal liaison: Communicates to business stakeholders. During an incident, other teams within the organization need to be updated with the right content and context so they can take action to mitigate customer impact. The internal liaison ensures that communications go out on time and to the right people.
- Customer liaison: Communicates to the customers. Just as business stakeholders need to know what’s happening during an incident, so do the customers. A customer liaison might have responsibilities such as updating a status page, crafting customer emails, or even updating social feeds.
- Subject matter expert: Problem solvers. SMEs are the people actually working to draw the incident to a close. They’re usually the service owners who have deep knowledge of the impacted areas of your technical ecosystem.
Steps in an Incident Response Process
Incidents are chaotic enough. They can be even more difficult when resolved ad hoc. Organizations should work to codify the incident response process and have a dedicated incident response plan for when something in the technology ecosystem breaks. This helps improve response times and gives responders a better idea of what they should do and when. While the process of incident response can grow to be quite complex, you can break it down into these six stages:
- Detect. Anomalous behavior is detected within your system. Ideally, this would be discovered via monitoring tools. However, this can sometimes also come from internal technology teams or, unfortunately, customers. Once an issue is detected, you’ll want to route it to your alerting and on-call management tool so the right people can begin working on it.
- Prevent. Before you get all hands on deck, you want to make sure you’re protecting your team. This means preventing alert storms, or a series of alerts for the same problem, from overwhelming responders. You can also consider auto-remediation efforts here to resolve incidents before a person needs to attend to them, ensuring that people aren’t interrupted unless absolutely necessary.
- Mobilize. The team is assembling. This is where you loop in SMEs who can help resolve the issue, including those from other teams. You’ll also want to establish key incident processes to kick off, like spinning up a CollabOps channel for communication or starting a video conference. You can leverage automation to make these processes seamless.
- Diagnose. Next is data gathering. You need to know what’s happening in your system before you can fix it. You can run diagnostics to provide these crucial details, and even create automatic diagnostics so that responders are armed automatically with this information.
- Resolve. This part of the process is often the longest. This is where the team works to fix whatever is broken. A very important part of this phase is communication, both internally and with customers. Keep your stakeholders up to date on the situation and set expectations.
- Learn. Repeat incidents are no fun, especially if you can prevent them. After an incident concludes, distill any relevant learnings from the process so that your people, tools, and processes become more resilient.
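The Prevent stage above describes keeping alert storms from overwhelming responders. One common way to do that is to group repeated alerts for the same problem into a single incident, so people are paged once rather than once per alert. The sketch below assumes each alert carries a timestamp in seconds and a deduplication key; the function name, the 300-second window, and the sample data are hypothetical.

```python
def group_alerts(alerts, window_seconds=300):
    """Group (timestamp, key) alerts into incidents.

    An alert joins the open incident for its key if it arrives within
    window_seconds of that incident's most recent alert; otherwise it
    opens a new incident.
    """
    incidents = []
    open_by_key = {}
    for ts, key in sorted(alerts):
        incident = open_by_key.get(key)
        if incident is not None and ts - incident["last_seen"] <= window_seconds:
            incident["count"] += 1
            incident["last_seen"] = ts
        else:
            incident = {"key": key, "first_seen": ts, "last_seen": ts, "count": 1}
            incidents.append(incident)
            open_by_key[key] = incident
    return incidents


# Hypothetical sample: a burst of database alerts, one API alert,
# and a later, unrelated recurrence of the database alert.
alerts = [(0, "db-cpu"), (30, "db-cpu"), (60, "db-cpu"), (90, "api-5xx"), (1000, "db-cpu")]

incidents = group_alerts(alerts)
for incident in incidents:
    print(incident["key"], incident["count"])
```

Here five alerts become three incidents, so responders would be paged three times instead of five. A real system would likely also correlate across related keys, which this sketch does not attempt.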
Modern incident response lifecycle
Organizations are investing in many monitoring solutions to get visibility into their IT infrastructure so they can better deliver on rising customer demands. Making sense of the event data and taking action by automating the incident response lifecycle for your environment, from assess to resolve to learn, is critical. Knowing what to do when a major incident does occur is vital to the success of your team and your organization.
Learn more about incident response and the incident response lifecycle, which encompasses everything from assessment, triage, and resolution to learning and prevention, and which supports developers as they move toward owning their code in production.
If you need help getting started with establishing your own incident response process, check out PagerDuty’s incident response documentation for guidance.