August 24, 2017
Many solutions offer email alerts to notify customers of an issue. Email alerts are effective if you’re in front of your inbox all day, but the reality is we usually aren’t. Missed alerts extend outages and impact your company’s revenue and customer loyalty. To know about issues quickly, thousands of customers have chosen PagerDuty for effective incident alerting. This post will explain PagerDuty alerting concepts and best practices around how to set them up so you can increase uptime.
Make Alerts Work For You
Each PagerDuty User can customize their Contact Methods and Notification Rules to get alerted the way they want. If the primary on-call engineer misses an alert, Escalation Policies send it to other teammates until someone responds.
We recommend that all users set up at least 3 Contact Methods and 3 Notification Rules so they never miss alerts. By default, there is a Notification Rule that notifies the incident owner via email immediately when an incident is assigned to them.
Tip: Depending on the type of incidents that occur in your system, set up alerts based on your cost of downtime and customer service-level agreements (SLAs).
Escalation Policies are safety nets for missed incidents, and they automatically re-route alerts to specific Users or On-Call Schedules.
We recommend an Escalation Policy for every incident. If you typically have high-severity incidents, escalate to another person sooner rather than later so that they get addressed quickly.
Note: Escalation Policies override personal Notification Rules, so each User should make their Notification Rules tighter than their Escalation Policies. If you escalate issues after 30 minutes, have all your personal alerts completed within that timeframe. This helps ensure you receive all your alerts and have the chance to respond before the incident is escalated to another teammate.
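As a rough illustration of the note above, the following sketch checks a set of personal Notification Rules against an escalation timeout. The `NotificationRule` class, its fields, and the helper function are hypothetical names made up for this example and are not part of any PagerDuty API. It also assumes that a rule scheduled exactly at the escalation time counts as too late.

```python
from dataclasses import dataclass


@dataclass
class NotificationRule:
    # Hypothetical model of a personal rule, not a PagerDuty object.
    delay_minutes: int  # minutes after assignment before this rule fires
    method: str         # e.g. "email", "sms", "phone"


def rules_missed_before_escalation(rules, escalation_timeout_minutes):
    """Return the rules that would fire at or after the escalation timeout."""
    return [r for r in rules if r.delay_minutes >= escalation_timeout_minutes]


rules = [
    NotificationRule(0, "email"),
    NotificationRule(5, "sms"),
    NotificationRule(35, "phone"),  # fires after a 30-minute escalation
]

late = rules_missed_before_escalation(rules, escalation_timeout_minutes=30)
print([r.method for r in late])
```

With a 30-minute escalation, the phone rule at 35 minutes is flagged, which suggests moving it earlier so that every personal alert fires before the incident moves on.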
Default PagerDuty Safety Nets
Alerts can be acknowledged, re-assigned, or resolved. In case an acknowledged alert is forgotten, all Services are set with a default 30-minute Incident Acknowledge Timeout. This returns the incident to the Triggered state and restarts alerting. Additionally, if an incident is accidentally left open, PagerDuty will by default Auto-Resolve incidents that have been open for 4 hours.
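To make the two defaults concrete, here is a minimal sketch of how they might interact. It is an illustration only, not PagerDuty's implementation: the function name and state strings are hypothetical, and it assumes the acknowledge timeout is measured from the acknowledgement and the auto-resolve window from the moment the incident opened.

```python
from datetime import datetime, timedelta

ACK_TIMEOUT = timedelta(minutes=30)  # default Incident Acknowledge Timeout
AUTO_RESOLVE = timedelta(hours=4)    # default Auto-Resolve window


def effective_state(state, opened_at, acked_at, now):
    """Return the state an incident would be in at `now` under both defaults."""
    if state == "resolved":
        return "resolved"
    # Assumption: auto-resolve is measured from when the incident opened.
    if now - opened_at >= AUTO_RESOLVE:
        return "resolved"
    # A forgotten acknowledgement falls back to triggered, restarting alerts.
    if state == "acknowledged" and acked_at is not None:
        if now - acked_at >= ACK_TIMEOUT:
            return "triggered"
    return state


opened = datetime(2017, 8, 24, 9, 0)
acked = opened + timedelta(minutes=5)

still_acked = effective_state("acknowledged", opened, acked, opened + timedelta(minutes=20))
re_triggered = effective_state("acknowledged", opened, acked, opened + timedelta(minutes=40))
auto_resolved = effective_state("triggered", opened, None, opened + timedelta(hours=4))
```

Here the incident stays acknowledged at the 20-minute mark, returns to triggered once 30 minutes have passed since the acknowledgement, and is resolved after 4 hours regardless of who was holding it.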
Reduce Alert Fatigue
Now that you have told us how you’d like to be contacted when incidents occur, PagerDuty helps decrease alerting headaches by de-duping, bundling, and appending alerts. Incidents from API-based integrations are de-duped, bundled, and appended automatically. With email-based integrations, you can set specific filters to reduce alert fatigue.
During an outage, multiple alerts for the same issue make it difficult to get to the root of the problem. Spend less time diagnosing and more time fixing with PagerDuty. These three features make it easier for users to be aware of critical issues, faster. With PagerDuty, you can decrease the alerting noise and decrease downtime.
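The idea behind de-duping and appending can be sketched in a few lines. This is a simplified illustration, not PagerDuty's actual logic: it assumes each alert carries a `key` field identifying the underlying issue, and both the field names and the `group_alerts` helper are hypothetical.

```python
def group_alerts(alerts):
    """Group alerts into incidents by a shared key, appending repeats."""
    incidents = {}
    for alert in alerts:
        key = alert["key"]
        if key not in incidents:
            # First alert for this issue opens a new incident.
            incidents[key] = {"key": key, "alerts": []}
        # Later alerts with the same key are appended, not re-opened.
        incidents[key]["alerts"].append(alert["message"])
    return list(incidents.values())


alerts = [
    {"key": "db-01/disk", "message": "Disk usage at 91%"},
    {"key": "web-03/http", "message": "HTTP 500 rate above threshold"},
    {"key": "db-01/disk", "message": "Disk usage at 95%"},
]

incidents = group_alerts(alerts)
for incident in incidents:
    print(incident["key"], len(incident["alerts"]))
```

Three incoming alerts become two incidents, and the second disk alert is attached to the existing incident instead of paging the on-call engineer again.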
Put PagerDuty Alerting Concepts Into Action
1. When PagerDuty receives an alert from your monitoring system, an incident is created in PagerDuty. If there are multiple alerts for the same issue, PagerDuty will de-dupe the alerts into one incident to reduce alerting noise.
2. Multiple on-call teams can be connected to PagerDuty, and PagerDuty routes each alert to the right on-call person to fix the issue. Teams set Escalation Policies to determine who should be notified if the primary person misses their alerts.
3. Once the primary on-call person is identified, alerts are sent through the combination of Contact Methods they have chosen. If the primary person doesn’t respond, the team’s Escalation Policy calls the next on-call superhero into action.
4. When Users receive alerts, they can choose to acknowledge, resolve, or reassign the incident with an SMS or phone call reply, or within the mobile app or web UI.
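The escalation step in the flow above can be sketched as a small lookup. This is a hypothetical model, not PagerDuty's API: an Escalation Policy is represented as an ordered list of `(responder, escalate_after_minutes)` pairs, and the sketch assumes nobody acknowledges the incident and that the last level keeps it indefinitely.

```python
def current_responder(levels, minutes_since_trigger):
    """Return who holds an unacknowledged incident at the given time.

    `levels` is an ordered list of (responder, escalate_after_minutes) pairs.
    """
    elapsed = 0
    for responder, escalate_after in levels:
        elapsed += escalate_after
        if minutes_since_trigger < elapsed:
            return responder
    # Assumption: the final level holds the incident once all timeouts pass.
    return levels[-1][0]


policy = [("primary", 30), ("secondary", 15), ("manager", 15)]

for minute in (0, 30, 50, 100):
    print(minute, current_responder(policy, minute))
```

With this policy the primary holds the incident for the first 30 minutes, the secondary for the next 15, and the manager from minute 45 onward.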