“Incident lifecycle management? If we manage to stay alive from one incident to the next, it’s a good day. On a bad day, it’s all panic mode.”
Unfortunately, that’s the reality of incident lifecycle management for far too many software and IT companies — but it doesn’t have to be that way. The truth is that genuine, proactive incident lifecycle management can keep incident-response teams from falling into chronic survival or panic mode.
Incident lifecycle management is a framework for categorizing, responding to, resolving, and documenting incidents so that they can be handled effectively with minimal loss of services and with well-organized follow-up. An end-to-end incident resolution framework is crucial for maintaining critical services.
Most modern incident management systems are based to one degree or another on the ITIL model, first developed in the 1980s by the British government’s Central Computing and Telecommunications Agency. The ITIL model is centered around maintaining services to clients and customers, as opposed to maintaining key systems strictly according to technical specifications. This makes it an ideal model for incident response in outward-facing applications, where maintenance of user services is of high importance. The most important elements of the ITIL model to keep in mind when setting up an incident lifecycle management framework are:
The first phase is the one in which incoming alerts are logged, categorized, and routed to the appropriate teams. In many respects, this is the most important part of the incident management lifecycle, because it is when you detect issues, filter out noise (non-actionable alerts), set priorities, and determine where each alert should be routed.
Failure to adequately manage this part of the process can result in important alerts being missed, handled at too-low priority, or routed to the wrong responders, as well as unbalanced workloads for response teams.
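As a rough illustration of this phase, the sketch below filters out noise, assigns a priority, and routes each remaining alert to a team. The severity names, the routing table, and the team names are all hypothetical; a real system would load them from configuration rather than hard-code them.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical routing table and severity ranking, for illustration only.
ROUTES = {"database": "db-oncall", "network": "netops", "web": "frontend-oncall"}
PRIORITY = {"critical": 1, "error": 2, "warning": 3}
DEFAULT_TEAM = "level1-support"


@dataclass
class Alert:
    source: str    # component that raised the alert, e.g. "database"
    severity: str  # "critical", "error", "warning", or anything else
    message: str


@dataclass
class Ticket:
    alert: Alert
    priority: int  # 1 is the most urgent
    team: str


def triage(alert: Alert) -> Optional[Ticket]:
    """Return a routed ticket, or None if the alert is non-actionable noise."""
    priority = PRIORITY.get(alert.severity)
    if priority is None:
        return None  # severity not in the ranking: treated as noise here
    team = ROUTES.get(alert.source, DEFAULT_TEAM)  # unknown sources go to the default queue
    return Ticket(alert=alert, priority=priority, team=team)


def triage_all(alerts: List[Alert]) -> List[Ticket]:
    """Drop the noise and return the remaining tickets, most urgent first."""
    tickets = [t for t in map(triage, alerts) if t is not None]
    return sorted(tickets, key=lambda t: t.priority)
```

Keeping the noise filter, the priority ranking, and the routing table as separate pieces of data makes each one easy to review on its own, which matters because a mistake in any of them produces one of the failures described above.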
After an alert has been categorized, it is sent to a Level 1 response team. Level 1 teams are the first responders; their job is to resolve the incident to the customer’s satisfaction, typically within a specified time frame. The Level 1 team will investigate the incident, figure out what the basic problem is, and apply known or recommended remediations wherever possible.
Level 1 support also monitors the status of the incident, particularly with regard to escalation. Another key responsibility of Level 1 support is to maintain communication with the affected customer or client and provide status updates at intervals that may be set by contract or by organizational policy. This keeps a consistent channel of communication and support open, even if the incident has been passed on to higher-level support.
If an incident is beyond Level 1 support’s capacity for diagnosis and quick resolution, it is typically passed on to a Level 2 support team, which will generally be able to bring more resources and experience into play.
Level 2 teams are also able to call in specialized and third-party support (from manufacturers, vendors, etc.). The basic goal of Level 2 support remains the same as Level 1—to restore service to the customer or client as quickly as possible.
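The hand-off from Level 1 to Level 2 can be sketched as a simple rule. The example below assumes two triggers for escalation: no known remediation exists, or the Level 1 time frame has run out. Both the 30-minute window and the field names are assumptions made for illustration, not values from any standard.

```python
from dataclasses import dataclass

# Hypothetical response window; real values would come from a contract or policy.
LEVEL1_WINDOW_MINUTES = 30


@dataclass
class Incident:
    summary: str
    minutes_open: int    # time since the incident was opened
    has_known_fix: bool  # whether a known or recommended remediation applies
    level: int = 1       # current support level handling the incident


def should_escalate(incident: Incident) -> bool:
    """Decide whether a Level 1 incident should move to Level 2."""
    if incident.level != 1:
        return False  # already escalated
    out_of_time = incident.minutes_open > LEVEL1_WINDOW_MINUTES
    return out_of_time or not incident.has_known_fix


def escalate(incident: Incident) -> Incident:
    """Raise the incident to Level 2 when the rule above says so."""
    if should_escalate(incident):
        incident.level = 2
    return incident
```

Note that escalation changes only who works on the incident. Customer communication can stay with Level 1, as described above.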
The formal ITIL model breaks the post-resolution wrap-up into two processes: Closure and Evaluation, and Incident Management Reporting. For many organizations, particularly smaller ones, it may be more convenient to combine them into a single process.
The key elements of any post-resolution wrap-up are to verify, record, and evaluate the resolution (or lack of one), and to fully report the details of the incident (typically with a post-mortem report). Incident post-mortem reports should be entered into an information base that is available to response teams and managers, and which is sufficiently indexed and searchable to serve as an easily accessible source of information for responding to (and hopefully preventing) future incidents.
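To make "indexed and searchable" concrete, here is a minimal sketch of a keyword index over post-mortem reports. The class name, the report IDs, and the plain word-matching approach are assumptions for illustration; most teams would use an existing search tool or wiki rather than build their own.

```python
from collections import defaultdict
from typing import Dict, List, Set

PUNCTUATION = ".,;:()"


class PostMortemIndex:
    """A tiny inverted index mapping each word to the reports that contain it."""

    def __init__(self) -> None:
        self._reports: Dict[str, str] = {}
        self._index: Dict[str, Set[str]] = defaultdict(set)

    def add(self, report_id: str, text: str) -> None:
        """Store a report and index every word in it."""
        self._reports[report_id] = text
        for word in text.lower().split():
            self._index[word.strip(PUNCTUATION)].add(report_id)

    def search(self, query: str) -> List[str]:
        """Return sorted IDs of reports that contain every word in the query."""
        words = [w.strip(PUNCTUATION) for w in query.lower().split()]
        if not words:
            return []
        matches = set.intersection(*(self._index.get(w, set()) for w in words))
        return sorted(matches)
```

A responder facing a new incident could then call `search("rolled back")` to find earlier incidents where a rollback was part of the resolution.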
In addition to the elements listed above, the ITIL model includes two other factors which come into play in any realistic incident lifecycle management system:
Major incidents are typically those that present an immediate, serious threat to the operation or security of basic infrastructure or key services. The objective is still to get the system up and running as quickly as possible, but the priority and initial level of response may be much higher. A major incident may go directly to Level 2, to a specialized support team, or even to third-party support (for example, if an important component of the hardware infrastructure breaks down).
Each organization may have its own standards for what constitutes a major incident, but for most organizations, it is important to recognize that major incidents form their own category, with a significantly higher level of priority and response.
Because one of the top priorities of incident management in the ITIL model is to maintain or restore customer service as quickly as possible, the initial resolution may involve workarounds — a rollback, for instance. This is true at all levels. The logic is simple: If you restore customer service now, you’ve solved the immediate problem and the IT or development team can then take as much time as necessary to resolve the underlying issues.
It is important to log and identify all workarounds, both in the incident report system and when scheduling IT and development updates, because every workaround creates technical debt, and the cost of that debt generally grows the longer it goes unpaid. This means that workarounds resulting from incident response should be replaced with solutions conforming to system design standards as soon as it is practical to do so. In many respects, an incident isn’t fully resolved until any workarounds have been replaced by more permanent solutions.
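One way to keep workarounds visible is to record each one with the date it was applied and the date it was replaced. The sketch below is a hypothetical log, not a feature of any particular tool: it lists the workarounds still outstanding, oldest first, and treats an incident as fully resolved only once all of its workarounds have been replaced.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Workaround:
    incident_id: str
    description: str
    applied_on: date
    replaced_on: Optional[date] = None  # None while the workaround is still in place

    def age_in_days(self, today: date) -> int:
        """Days the workaround was (or has been) in place."""
        end = self.replaced_on or today
        return (end - self.applied_on).days


def open_workarounds(workarounds: List[Workaround]) -> List[Workaround]:
    """Workarounds not yet replaced, oldest first."""
    pending = [w for w in workarounds if w.replaced_on is None]
    return sorted(pending, key=lambda w: w.applied_on)


def fully_resolved(incident_id: str, workarounds: List[Workaround]) -> bool:
    """True when every workaround logged for the incident has been replaced."""
    return all(
        w.replaced_on is not None
        for w in workarounds
        if w.incident_id == incident_id
    )
```

Sorting by age puts the oldest debt at the top of the list, which is a reasonable default when deciding what to schedule first.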
There really is no need for your incident response team to operate in survival mode from day to day. At a time when it has never been more expensive to be unprepared for customer-impacting issues, operating that way introduces avoidable chaos and anxiety.
With an incident lifecycle management framework tailored to the needs of your organization, you can keep critical applications and infrastructure running with minimal service interruption and minimal stress. Implementing a best-practice incident lifecycle is the key to reliability, and reliability itself is an indispensable service that will help define your long-term success.