(This blog post is inspired by the talk that I will be giving at DevOps Talks Conference Melbourne and DevOps Talks Conference Auckland. Hope to...by Matt Stratton
March 4, 2019
Incident response bottlenecks – you know they’re real and you know that your incident response system probably has a few, but they must be minimized as they hurt your on-call teams and your customers. Let’s take a look at some of the most critical bottlenecks and how to avoid them.
First, before you understand the bottlenecks in any process, you have to understand what the goals of that process are. What are the goals of incident response?
For most incident response teams, the basic list of goals would probably look something like this:
If the above are the basic goals of incident response, then the bottlenecks are likely to be conditions which make it difficult to meet those goals. The most important of these are:
Prioritization is the most important tool available for both incident resolution and confining incident impact. It’s what allows you to focus on the incidents that are most in need of attention based on their potential for serious impact. It allows you to set aside incidents which are relatively minor in their impact, but can take up a great deal of incident response team time and attention. When you fail to set adequate priorities, you almost guarantee that some major incidents will not be handled promptly, or possible not at all.
If your response team is overwhelmed by the volume of alerts, they may become effectively paralyzed, and unable to respond at all, simply because they don’t have the time to recognize which issues should have top priority, or to separate genuine incidents from alert noise. Eventually, this can lead to chronic alert fatigue, as team members develop the unconscious mental habit of blocking out most alerts, so that they’re able to concentrate on at least a few of them.
A (typically automated) system for filtering out alert noise is an absolute necessity. Non-actionable alerts should be suppressed, and related alert context should be grouped into a single incident. Ideally, all this should be done automatically via rules. Additionally, it’s crucial to implement a system for channeling alerts to the correct teams or team members, rather than broadcasting them to all teams and all members, as repeated alert fatigue and lack of accountability can also quickly become fatal.
Ideally, every incident response team should consist of highly-trained and experienced technicians who are able to diagnose problems quickly and who understand which tools and techniques they should use to fix each incident.
In practice, of course, it isn’t so simple. High turnover and the need for more responders may result in response teams in which most or even all of the members have little or no experience. When this happens, considerable time may be lost as new team members learn things that experienced responders already know. When there is a complete break in continuity (an entirely new team), the situation may be made much worse, because the old team’s knowledge is now “lost wisdom” and often unrecoverable.
The best ways to minimize such problems are to have a formal system of training for incident responders, to place new team members in teams with experienced responders whenever possible, and to have adequate documentation available to response teams. Documentation to ensure consistent, repeatable best practices should include some kind of basic manual of procedures and a well-indexed, easily searchable, cross-referenced database of past incidents, such as a runbook.
A new release of a major application is coming up — or maybe it’s an entirely new application or service — is your response team ready to handle the volume of alerts if it turns out that the developers missed a few serious bugs? Murphy’s Law is, after all, waiting around every corner. All it takes is an update to a widely used program and one or two bugs of the kind that produce a cascade of not-so-easy-to-trace errors. If your response team isn’t prepared, you may find that all of your time and resources are taken up by a storm of high-priority alerts, leaving you with very few reserves for handling any other unrelated incidents that may come up.
Ideally, of course, the update will be adequately tested before full release, with some kind of limited A/B or canary type of deployment. As long as the response team is part of this deployment, they will have the opportunity to deal with problems that do arise on a much smaller scale. The decision to start with a limited deployment, however, is likely to be out of the hands of the incident-response team, and they may have to deal with an inadequately tested release going directly to full deployment.
When this happens, it may be necessary to place all responders on-call — or to designate a special team to handle all update-related problems, freeing up at least some responders to handle unrelated, unplanned issues that also must be addressed. Which approach works best depends at least in part on the scope of the update, and the response team resources that are available. However, plans can always be iterated on as needed and having some plan in place will make a significant difference in comparison to being completely unprepared.
There are plenty of other bottlenecks, of course, including those that arise from outdated, failure-prone, or overloaded infrastructure, as well as the kind that result from response team time being co-opted for non-incident-related tasks. But the bottlenecks that we’ve listed account for much of the time lost by incident response teams, and the remediation approaches that we’ve suggested help to clear up most of them.
Try PagerDuty free for 14 days!