We know that alert fatigue is a big concern for our users. When everything is important, nothing is important. But “non-critical” is not the same thing as “insignificant”; in fact, non-critical issues are often indicative of a larger problem down the road. So now, with Incident Urgencies, users can confidently track all of these events, and only get woken up for the most important incidents that result.
A big part of what has made PagerDuty useful for our customers is analytics and being able to see what’s going on with incidents across all of their systems and monitoring tools. Keeping non-critical events out of PagerDuty means those incident analytics are only telling part of the story. And the more data you have, the easier it is to prevent incidents from occurring in the future.
PagerDuty has always helped on-call engineers resolve incidents that require immediate attention. But what about less urgent issues? Because they couldn’t sort or snooze alerts, some users kept minor events out of PagerDuty to avoid getting woken up in the middle of the night. As a result, their analytics were missing key information about potential trouble brewing, leaving their teams less prepared to handle larger problems when they came along.
Until now, PagerDuty users were unable to sort their incidents by urgency: every incident was registered at the same level of importance. A disk approaching 80% capacity was treated the same as your server going up in flames, so it wasn’t always easy to separate what really needed attention from what could wait.
But now, users can designate services with a high- or low-urgency setting, and customize their incident notification rules based on urgency. When incidents are triggered on a low-urgency service, notifications will follow users’ low-urgency notification rules and won’t escalate. For example, incidents set to ‘low urgency’ could use email only for notifications instead of SMS and phone calls.
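To make the rule concrete, here is a minimal sketch in Python of how urgency-based notification rules might be modeled. It is an illustration only, not PagerDuty's API or data model: the names `NotificationRules`, `channels_for`, and `should_escalate`, and the default channel lists, are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NotificationRules:
    """Hypothetical per-user notification rules, keyed by urgency."""
    # Assumed defaults: high urgency pages loudly, low urgency is email only.
    high: List[str] = field(default_factory=lambda: ["push", "sms", "phone"])
    low: List[str] = field(default_factory=lambda: ["email"])


def channels_for(service_urgency: str, rules: NotificationRules) -> List[str]:
    """Return the contact channels to use for an incident on a service."""
    if service_urgency == "high":
        return rules.high
    if service_urgency == "low":
        return rules.low
    raise ValueError(f"unknown urgency: {service_urgency!r}")


def should_escalate(service_urgency: str) -> bool:
    """Incidents on low-urgency services do not escalate."""
    return service_urgency == "high"
```

With the assumed defaults, `channels_for("low", NotificationRules())` yields only email, and `should_escalate("low")` is false, matching the email-only example above.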
Particularly advanced teams can leverage Custom Incident Urgencies not only to specify how different levels of urgency come through, but also to change those rules based on time of day. A common use case: during working hours, some incidents are urgent, but during nights and weekends, they’re not. To illustrate, think of a broken staging environment. On the weekend or at nighttime, it’s not a problem, but come 9am on Monday, it needs to be fixed immediately or business as usual is at risk. Custom Incident Urgencies allow businesses to set those parameters.
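The staging-environment case can be sketched as a small time-of-day rule. This is a hedged illustration in Python rather than how PagerDuty implements the feature: the function name `urgency_at` is hypothetical, and the 5pm end of the business day is an assumption (the example above only fixes the 9am start).

```python
from datetime import datetime, time

# Assumed support window: Monday to Friday, 09:00 to 17:00 local time.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)  # assumption; only the 9am start comes from the text


def urgency_at(when: datetime) -> str:
    """Return 'high' during business hours and 'low' on nights and weekends."""
    is_weekday = when.weekday() < 5  # Monday is 0, Friday is 4
    in_hours = BUSINESS_START <= when.time() < BUSINESS_END
    return "high" if is_weekday and in_hours else "low"
```

Under these assumptions, a staging failure at 9am on a Monday is treated as high urgency, while the same failure on a Saturday afternoon or at 3am on a weekday is low urgency and can wait.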
Users can also use our new snooze button to stop alerting for an incident that can’t be resolved right away (or doesn’t need to be!). Prior to this feature, users only had the options to acknowledge or escalate, neither of which helped teams keep track of that incident’s status. By accepting more events and allowing you to set the resulting incident urgency, PagerDuty provides a central system dashboard for all of your incidents, helping your organization work better to prioritize, fix, and prevent issues in the future.