Turn any signal into insight and action. See how PagerDuty Digital Operations Management Platform integrates machine data and human intelligence to improve visibility and agility across organizations.
Check out the latest capabilities we released.
Flexible schedules, escalations, & alerting
Automated, best practice incident response
Powerful context & noise reduction at scale
Quantify real-time business & technical impact
Improve with modern, prescriptive insights
Over 300 Integrations
Discover DevOps best practices with our library of webinars, whitepapers, reports, and much more.
Learn best practices and get support help with resources from our award-winning support team.
See how PagerDuty works with our live product demo — twice a week, every week.
We've created a maturity model to assist on the journey to digital operations excellence. Take our short assessment to find out where your team falls!
Interactive, simple-to-use API and technical documentation enables users to easily try updates and extend PagerDuty.
Engage with users and PagerDuty experts from our global community of 200k+ users. Become a member, connect, and share insights for success.
Get all your PagerDuty-related questions answered by exploring our in-depth support documentation and community forums.
In part 2 of our postmortem series, we dig into how to establish a culture of continuous learning, from getting leadership on board to invoking...
PagerDuty helps organizations transform their digital operations. Learn more about PagerDuty's mission and what we do.
Meet our experienced and passionate executive team.
We are risk-taking innovators dedicated to delivering amazing products and delighting customers. Join us and do the best work of your career.
With the PagerDuty Foundation, we are committed to doing our part in giving back to the community.
Incident response bottlenecks – you know they’re real and you know that your incident response system probably has a few, but they must be minimized as they hurt your on-call teams and your customers. Let’s take a look at some of the most critical bottlenecks and how to avoid them.
First, before you understand the bottlenecks in any process, you have to understand what the goals of that process are. What are the goals of incident response?
For most incident response teams, the basic list of goals would probably look something like this:
If the above are the basic goals of incident response, then the bottlenecks are likely to be conditions which make it difficult to meet those goals. The most important of these are:
Prioritization is the most important tool available for both incident resolution and confining incident impact. It’s what allows you to focus on the incidents that are most in need of attention based on their potential for serious impact. It allows you to set aside incidents which are relatively minor in their impact, but can take up a great deal of incident response team time and attention. When you fail to set adequate priorities, you almost guarantee that some major incidents will not be handled promptly, or possible not at all.
If your response team is overwhelmed by the volume of alerts, they may become effectively paralyzed, and unable to respond at all, simply because they don’t have the time to recognize which issues should have top priority, or to separate genuine incidents from alert noise. Eventually, this can lead to chronic alert fatigue, as team members develop the unconscious mental habit of blocking out most alerts, so that they’re able to concentrate on at least a few of them.
A (typically automated) system for filtering out alert noise is an absolute necessity. Non-actionable alerts should be suppressed, and related alert context should be grouped into a single incident. Ideally, all this should be done automatically via rules. Additionally, it’s crucial to implement a system for channeling alerts to the correct teams or team members, rather than broadcasting them to all teams and all members, as repeated alert fatigue and lack of accountability can also quickly become fatal.
Ideally, every incident response team should consist of highly-trained and experienced technicians who are able to diagnose problems quickly and who understand which tools and techniques they should use to fix each incident.
In practice, of course, it isn’t so simple. High turnover and the need for more responders may result in response teams in which most or even all of the members have little or no experience. When this happens, considerable time may be lost as new team members learn things that experienced responders already know. When there is a complete break in continuity (an entirely new team), the situation may be made much worse, because the old team’s knowledge is now “lost wisdom” and often unrecoverable.
The best ways to minimize such problems are to have a formal system of training for incident responders, to place new team members in teams with experienced responders whenever possible, and to have adequate documentation available to response teams. Documentation to ensure consistent, repeatable best practices should include some kind of basic manual of procedures and a well-indexed, easily searchable, cross-referenced database of past incidents, such as a runbook.
A new release of a major application is coming up — or maybe it’s an entirely new application or service — is your response team ready to handle the volume of alerts if it turns out that the developers missed a few serious bugs? Murphy’s Law is, after all, waiting around every corner. All it takes is an update to a widely used program and one or two bugs of the kind that produce a cascade of not-so-easy-to-trace errors. If your response team isn’t prepared, you may find that all of your time and resources are taken up by a storm of high-priority alerts, leaving you with very few reserves for handling any other unrelated incidents that may come up.
Ideally, of course, the update will be adequately tested before full release, with some kind of limited A/B or canary type of deployment. As long as the response team is part of this deployment, they will have the opportunity to deal with problems that do arise on a much smaller scale. The decision to start with a limited deployment, however, is likely to be out of the hands of the incident-response team, and they may have to deal with an inadequately tested release going directly to full deployment.
When this happens, it may be necessary to place all responders on-call — or to designate a special team to handle all update-related problems, freeing up at least some responders to handle unrelated, unplanned issues that also must be addressed. Which approach works best depends at least in part on the scope of the update, and the response team resources that are available. However, plans can always be iterated on as needed and having some plan in place will make a significant difference in comparison to being completely unprepared.
There are plenty of other bottlenecks, of course, including those that arise from outdated, failure-prone, or overloaded infrastructure, as well as the kind that result from response team time being co-opted for non-incident-related tasks. But the bottlenecks that we’ve listed account for much of the time lost by incident response teams, and the remediation approaches that we’ve suggested help to clear up most of them.
Try PagerDuty free for 14 days!
SIGN UP NOW
In the United States, it’s almost that time of year again where we count our blessings and give thanks. For retail workers, it’s also that...
A long time ago, back in the early days of 2017, we open-sourced our Incident Response Documentation, the reference point for all our internal processes...
600 Townsend St., #200
San Francisco, CA 94103
905 King Street West, Suite 600
Toronto, ON, M6K 3G9, Canada
1416 NW 46th St., St. 301
Seattle, WA 98107
5 Martin Place
1 Fore St,
London EC2Y 9DT
© 2009 - 2019