
Control Downtime with Incident Alerting Best Practices

Many solutions offer email alerts to notify customers of an issue. Email alerts are effective if you’re in front of your inbox all day, but the reality is that most of us aren’t. Missed alerts extend outages and hurt your company’s revenue and customer loyalty. To learn about issues quickly, thousands of customers have chosen PagerDuty for incident alerting. This post explains PagerDuty alerting concepts and best practices for setting them up so you can increase uptime.

Make Alerts Work For You

Each PagerDuty User can customize their Contact Methods and Notification Rules to be alerted the way they want. If the primary on-call engineer misses an alert, Escalation Policies send it to other teammates until someone responds.

  • Contact Methods are the ways we can contact you including phone – cell, home and work – SMS, email and push notifications.
  • Notification Rules are the combinations of Contact Methods you want us to notify you by.
  • Escalation Policies are how we automatically re-route alerts to another person or team if they are missed by the primary incident owner.

Notification Rules

We recommend that all Users set up at least 3 Contact Methods and 3 Notification Rules so they never miss alerts. By default, a Notification Rule notifies the incident owner by email immediately when an incident is assigned to them.

Tip: Depending on the type of incidents that occur in your system, set up alerts based on your cost of downtime and customer service-level agreements (SLAs).

Escalation Policies are safety nets for missed incidents: they automatically re-route alerts to specific Users or On-Call Schedules.

We recommend Escalation Policies for every incident. If you typically have high-severity incidents, escalate to another person sooner rather than later so that they are addressed quickly.

Note: Escalation Policies override personal Notification Rules, so each User should make their Notification Rules tighter than their Escalation Policies. If you escalate issues after 30 minutes, have all your personal alerts complete within that timeframe. This helps ensure you receive all your alerts and have the chance to respond before the incident is escalated to another teammate.
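As a rough illustration of the note above, the check below flags personal rules that would only fire once the incident has already been escalated. It is a minimal sketch, not PagerDuty's API: `NotificationRule`, `rules_missed_by_escalation`, and the example delays are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NotificationRule:
    """Hypothetical model of one personal rule (not a PagerDuty type)."""
    method: str          # e.g. "email", "sms", "phone", "push"
    delay_minutes: int   # minutes after the incident is assigned


def rules_missed_by_escalation(rules, escalation_after_minutes):
    """Return the rules that would fire at or after the escalation point."""
    return [r for r in rules if r.delay_minutes >= escalation_after_minutes]


# Example values are assumptions: three rules, escalation after 30 minutes.
rules = [
    NotificationRule("email", 0),
    NotificationRule("sms", 5),
    NotificationRule("phone", 35),
]

late = rules_missed_by_escalation(rules, escalation_after_minutes=30)
for rule in late:
    print(f"{rule.method} at {rule.delay_minutes} min fires after escalation")
```

In this example the phone rule would need to move inside the 30-minute window to be useful.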

Default PagerDuty Safety Nets

Alerts can be acknowledged, re-assigned or resolved. In case an acknowledged alert is forgotten, all Services are set with a default 30-minute Incident Acknowledge Timeout. When it expires, the incident returns to the Triggered state and alerts restart. Additionally, if an incident is accidentally left open, PagerDuty will by default Auto-Resolve incidents that have been open for 4 hours.
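The two defaults can be pictured as a small state check. The sketch below is only a model of the behavior described here, assuming the 30-minute and 4-hour defaults. The function name, the state strings, and the choice to treat the limits as inclusive and to check auto-resolve first are assumptions, not documented PagerDuty behavior.

```python
ACK_TIMEOUT_MINUTES = 30        # default Incident Acknowledge Timeout
AUTO_RESOLVE_MINUTES = 4 * 60   # default Auto-Resolve window


def next_state(state, minutes_open, minutes_since_ack=None):
    """Return the incident state after the default safety nets are applied.

    `state` is one of "triggered", "acknowledged" or "resolved".
    """
    if state == "resolved":
        return "resolved"
    # Safety net 2: an incident left open too long is resolved automatically.
    if minutes_open >= AUTO_RESOLVE_MINUTES:
        return "resolved"
    # Safety net 1: a forgotten acknowledgement goes back to triggered,
    # which restarts alerting.
    if (
        state == "acknowledged"
        and minutes_since_ack is not None
        and minutes_since_ack >= ACK_TIMEOUT_MINUTES
    ):
        return "triggered"
    return state


print(next_state("acknowledged", minutes_open=45, minutes_since_ack=31))
print(next_state("triggered", minutes_open=240))
```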


Reduce Alert Fatigue

Now that you have told us how you’d like to be contacted when incidents occur, PagerDuty helps decrease alerting headaches by de-duping, bundling, and appending alerts. Incidents from API-based integrations are de-duped, bundled, and appended automatically. With email-based integrations, you can set specific filters to reduce alert fatigue.

  • If identical events are sent to PagerDuty, they are de-duped and only one incident is created. This avoids multiple alerts for the same incident: only one incident sets off alerts based on a User’s Notification Rules.
  • If events for the same open incident come in, they are appended to the open incident and no new alert will be sent out.
  • If multiple incidents are triggered at the same time and assigned to the same user, the user will receive a bundled alert notifying them of those incidents.
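The three behaviors above can be sketched in a few lines. This is a simplified model rather than PagerDuty's implementation: the `key` field used to recognize matching events, the event dictionaries, and both function names are hypothetical, and the sketch ignores timing and incident state (it treats every incident as open and triggered at the same time).

```python
from collections import defaultdict


def ingest(events):
    """Fold events into incidents; events sharing a key join one incident."""
    incidents = {}  # key -> incident; dicts keep insertion order
    for event in events:
        key = event["key"]
        if key in incidents:
            # De-dupe/append: attach to the open incident, no new alert.
            incidents[key]["events"].append(event)
        else:
            incidents[key] = {
                "key": key,
                "assignee": event["assignee"],
                "events": [event],
            }
    return list(incidents.values())


def bundle_by_assignee(incidents):
    """Group incident keys per assignee so each user gets one bundled alert."""
    bundles = defaultdict(list)
    for incident in incidents:
        bundles[incident["assignee"]].append(incident["key"])
    return dict(bundles)


events = [
    {"key": "db-down", "assignee": "alice"},
    {"key": "db-down", "assignee": "alice"},      # same issue, appended
    {"key": "disk-full", "assignee": "alice"},
    {"key": "api-latency", "assignee": "bob"},
]

incidents = ingest(events)
bundles = bundle_by_assignee(incidents)
print(len(incidents), bundles)
```

Four events become three incidents, and each assignee receives a single bundled alert.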

During an outage, multiple alerts for the same issue make it difficult to get to the root of the problem. Spend less time diagnosing and more time fixing with PagerDuty. These three features make it easier for users to be aware of critical issues, faster. With PagerDuty, you can decrease the alerting noise and decrease downtime.

Put PagerDuty Alerting Concepts Into Action

PagerDuty alert routing

1. When PagerDuty receives an alert from your monitoring system, an incident is created in PagerDuty. If there are multiple alerts for the same issue, PagerDuty will de-dupe the alerts into one incident to reduce alerting noise.

2. Multiple on-call teams can be connected to PagerDuty, and PagerDuty routes each alert to the right on-call person to fix the issue. Teams set Escalation Policies to determine who should be notified if the primary person misses their alerts.

3. Once the primary on-call person is found, alerts will be sent in the combination of their choosing. Based upon the team’s Escalation Policies, if the primary person doesn’t respond, the next on-call superhero is called into action.

4. When Users receive alerts, they can acknowledge, resolve or reassign the incident by replying to an SMS or phone call, or from the mobile app or web UI.
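To make the escalation part of this flow concrete, here is a minimal sketch of walking an Escalation Policy as time passes without a response. The `(responder, escalate_after_minutes)` layout, the function name, and the example timings are assumptions for illustration. Staying on the last level once the policy is exhausted is also an assumption.

```python
def who_to_notify(levels, minutes_unacknowledged):
    """Return the responder at the escalation level reached so far.

    `levels` is an ordered list of (responder, escalate_after_minutes)
    pairs; each level hands off to the next once its window has elapsed.
    """
    elapsed = 0
    for responder, escalate_after in levels:
        elapsed += escalate_after
        if minutes_unacknowledged < elapsed:
            return responder
    # Policy exhausted: this sketch keeps notifying the last level.
    return levels[-1][0]


# Hypothetical policy: primary for 30 minutes, then 15 minutes per level.
levels = [("primary", 30), ("secondary", 15), ("manager", 15)]

for minutes in (10, 30, 50, 120):
    print(minutes, who_to_notify(levels, minutes))
```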
