(This blog post is inspired by the talk that I will be giving at DevOps Talks Conference Melbourne and DevOps Talks Conference Auckland. Hope to...
by Matt Stratton
March 4, 2019
“Incident lifecycle management? If we manage to stay alive from one incident to the next, it’s a good day. On a bad day, it’s all panic mode.”
Unfortunately, that’s the reality of incident lifecycle management for far too many software and IT companies — but it doesn’t have to be that way. The truth is that genuine, proactive incident lifecycle management can keep incident-response teams from falling into chronic survival or panic mode.
Incident lifecycle management is a framework for categorizing, responding to, resolving, and documenting incidents so that they can be handled effectively with minimal loss of services and with well-organized follow-up. An end-to-end incident resolution framework is crucial for maintaining critical services.
Most modern incident management systems are based to one degree or another on the ITIL model, first developed in the 1980s by the British government’s Central Computing and Telecommunications Agency. The ITIL model is centered around maintaining services to clients and customers, as opposed to maintaining key systems strictly according to technical specifications. This makes it an ideal model for incident response in outward-facing applications, where maintenance of user services is of high importance. The most important elements of the ITIL model to keep in mind when setting up an incident lifecycle management framework are:
Incident logging and categorization is the phase during which incoming alerts are logged, categorized, and routed to the appropriate teams. In many respects, this is the most important part of the incident management lifecycle, because it is when you detect issues, filter out noise (non-actionable alerts), set priorities, and determine where each alert should be routed.
Failure to adequately manage this part of the process can result in important alerts being missed, handled at too-low priority, or routed to the wrong responders, as well as unbalanced workloads for response teams.
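To make the categorize-and-route step concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `Alert` and `Incident` shapes, the 1-to-5 severity scale, the routing table, and the team names are hypothetical and do not come from ITIL or from any particular tool.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical routing table: alert category -> responsible team.
ROUTES = {"database": "data-platform", "network": "netops"}
DEFAULT_TEAM = "level-1-support"


@dataclass
class Alert:
    source: str
    category: str
    severity: int      # assumed scale: 1 = most severe, 5 = least severe
    actionable: bool   # False marks noise, such as purely informational alerts


@dataclass
class Incident:
    alert: Alert
    priority: str
    team: str


def triage(alert: Alert) -> Optional[Incident]:
    """Categorize and route one alert; return None when the alert is noise."""
    if not alert.actionable:
        return None  # filter out non-actionable alerts before they reach anyone
    priority = "high" if alert.severity <= 2 else "normal"
    team = ROUTES.get(alert.category, DEFAULT_TEAM)
    return Incident(alert=alert, priority=priority, team=team)
```

Even in a toy version like this, the three failure modes described above are visible: a wrong `actionable` flag drops a real alert, a wrong severity threshold lowers its priority, and a missing routing entry sends it to the default team.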
After an alert has been categorized, it is sent to a Level 1 response team. Level 1 teams are the first responders; their job is to resolve the incident to the customer’s satisfaction, typically within a specified time frame. The Level 1 team will investigate the incident, figure out what the basic problem is, and apply known or recommended remediations wherever possible.
Level 1 support also monitors the status of the incident, particularly with regard to escalation. Another key responsibility of Level 1 support is to maintain communication with the affected customer or client and provide status updates at intervals which may be set by contract, or by organizational policies. This makes it possible to maintain a consistent channel of communication and support, even if the incident has been passed on to higher-level support.
If an incident is beyond Level 1 support’s capacity for diagnosis and quick resolution, it is typically passed on to a Level 2 support team, which will generally be able to bring more resources and experience into play.
Level 2 teams are also able to call in specialized and third-party support (from manufacturers, vendors, etc.). The basic goal of Level 2 support remains the same as Level 1—to restore service to the customer or client as quickly as possible.
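The hand-off from Level 1 to Level 2 can be sketched as a simple rule. This is only an illustration under stated assumptions: the 30-minute window, the `OpenIncident` fields, and the `known_fix_available` flag are hypothetical, and real thresholds would come from your contracts or organizational policies.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed Level 1 resolution window; real values are set by contract or policy.
LEVEL_1_WINDOW = timedelta(minutes=30)


@dataclass
class OpenIncident:
    opened_at: datetime
    level: int = 1
    known_fix_available: bool = True  # hypothetical flag set by the Level 1 responder


def should_escalate(incident: OpenIncident, now: datetime) -> bool:
    """Escalate from Level 1 when no known fix applies or the window has passed."""
    if incident.level != 1:
        return False  # already with Level 2 or beyond
    timed_out = now - incident.opened_at > LEVEL_1_WINDOW
    return timed_out or not incident.known_fix_available


def escalate(incident: OpenIncident) -> OpenIncident:
    """Hand the incident to Level 2; customer updates still flow through Level 1."""
    incident.level = 2
    return incident
```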
The formal ITIL model breaks post-resolution work down into two processes: Closure and Evaluation, and Incident Management Reporting. For many organizations, particularly smaller ones, it may be more convenient to combine them into a single process.
The key elements of any post-resolution wrap-up are to verify, record, and evaluate the resolution (or lack of one), and to fully report the details of the incident (typically with a post-mortem report). Incident post-mortem reports should be entered into an information base that is available to response teams and managers, and which is sufficiently indexed and searchable to serve as an easily accessible source of information for responding to (and hopefully preventing) future incidents.
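As a rough sketch of what "indexed and searchable" can mean at its simplest, the following Python keeps post-mortem reports in memory and looks them up by keyword. The `PostMortem` fields and the `PostMortemIndex` class are hypothetical names chosen for this example; a real team would more likely rely on the search built into its wiki or ticketing system.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class PostMortem:
    incident_id: str
    summary: str
    resolution: str


class PostMortemIndex:
    """A tiny in-memory keyword index over post-mortem reports."""

    def __init__(self):
        self._reports = {}              # incident_id -> PostMortem
        self._index = defaultdict(set)  # lowercase word -> set of incident_ids

    def add(self, report):
        self._reports[report.incident_id] = report
        text = f"{report.summary} {report.resolution}".lower()
        for word in text.split():
            self._index[word.strip(".,;:()")].add(report.incident_id)

    def search(self, query):
        """Return the reports that contain every word in the query."""
        words = query.lower().split()
        if not words:
            return []
        ids = set.intersection(*[self._index.get(w, set()) for w in words])
        return [self._reports[i] for i in sorted(ids)]
```

A responder facing a full disk could call `search("disk full")` and get back every earlier report that mentions both words, along with the resolution that was recorded at the time.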
In addition to the elements listed above, the ITIL model includes two other factors which come into play in any realistic incident lifecycle management system: major incidents and workarounds.
Major incidents are typically those which present an immediate, serious threat to the operation or security of basic infrastructure or key services. The objective is still to get the system up and running as quickly as possible, but the priority and initial level of response may be much higher. A major incident may go directly to Level 2, to a specialized support team, or even to third-party support (for example, if an important component of the hardware infrastructure breaks down).
Each organization may have its own standards for what constitutes a major incident, but for most organizations, it is important to recognize that major incidents form their own category, with a significantly higher level of priority and response.
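One way to picture a separate major-incident path is a small classification rule that picks the initial responder. The criteria and responder names below are assumptions for illustration only; as noted above, each organization sets its own standard for what counts as major.

```python
from dataclasses import dataclass


@dataclass
class IncidentReport:
    affects_core_infrastructure: bool
    security_threat: bool
    hardware_failure: bool = False


def is_major(report: IncidentReport) -> bool:
    """Assumed definition: a threat to core infrastructure or to security."""
    return report.affects_core_infrastructure or report.security_threat


def initial_responder(report: IncidentReport) -> str:
    """Pick where the incident goes first; major incidents skip Level 1."""
    if not is_major(report):
        return "level-1"
    if report.hardware_failure:
        return "third-party-vendor"  # e.g. a failed hardware component
    return "level-2"
```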
Because one of the top priorities of incident management in the ITIL model is to maintain or restore customer service as quickly as possible, the initial resolution may involve workarounds — a rollback, for instance. This is true at all levels. The logic is simple: If you restore customer service now, you’ve solved the immediate problem and the IT or development team can then take as much time as necessary to resolve the underlying issues.
It is important to log and identify all workarounds, both in the incident report system, and when scheduling IT and development updates, because every workaround results in technical debt, the cost of which generally becomes higher the longer it goes unpaid. This means that workarounds resulting from incident response should be replaced with solutions conforming to system design standards as soon as it is practical to do so. In many respects, an incident isn’t fully resolved until any workarounds have been replaced by more permanent solutions.
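The idea that an incident stays open until its workarounds are replaced can be captured in a small record structure. This is a sketch under assumptions: the `Workaround` and `IncidentRecord` types and the `debt_age_days` helper are hypothetical, not part of any incident management product.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Workaround:
    description: str
    applied_on: date
    replaced_on: Optional[date] = None  # set once a permanent fix ships


@dataclass
class IncidentRecord:
    incident_id: str
    service_restored: bool = False
    workarounds: List[Workaround] = field(default_factory=list)

    def open_workarounds(self) -> List[Workaround]:
        """Workarounds that have not yet been replaced by a permanent fix."""
        return [w for w in self.workarounds if w.replaced_on is None]

    def fully_resolved(self) -> bool:
        """Service is back and no workaround is still standing in for a fix."""
        return self.service_restored and not self.open_workarounds()


def debt_age_days(workaround: Workaround, today: date) -> int:
    """How long the workaround has been (or was) in place, in days."""
    end = workaround.replaced_on or today
    return (end - workaround.applied_on).days
```

Sorting open workarounds by `debt_age_days` gives a simple view of which technical debt has gone unpaid the longest when scheduling IT and development updates.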
There really is no need for your incident response team to operate in survival mode from day to day. In a world where it has never been more expensive to be unprepared for customer-impacting issues, operating that way only adds chaos and anxiety.
With an incident lifecycle management framework tailored to the needs of your organization, you can keep critical applications and infrastructure running with minimal service interruption and minimal stress. Implementing a best-practice incident lifecycle is the key to reliability, and reliability itself is an indispensable service that will help define your long-term success.