Getting the most out of PagerDuty: Incident De-Duping
Tired of getting a flood of PagerDuty incidents whenever a problem occurs with one of your systems? Do many of the incidents seem identical? Do you spend valuable time trying to fend off the seemingly never-ending PagerDuty phone calls and SMS messages while you should be fixing the actual problem? Then you, my friend, might be interested in hearing more about our incident de-duping feature.
This feature, when enabled for Generic Email Services, ensures that any incoming email whose subject line matches that of an existing open incident gets appended to that incident's log instead of creating a new incident. So only one incident with a given subject line can be open at a time. (If you are familiar with our integration API, it is the same concept as the “incident_key”.)
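If it helps to see the idea in code, here is a minimal sketch of subject-based de-duping. It is only an illustration, not how PagerDuty implements the feature: the `Incident` and `EmailService` classes, the `receive` method, and the status names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    """Hypothetical incident record; not PagerDuty's real data model."""
    subject: str
    status: str = "triggered"  # assumed states: "triggered" or "resolved"
    log: list = field(default_factory=list)


class EmailService:
    """Toy email service that de-dupes trigger emails by subject line."""

    def __init__(self):
        self.incidents = []

    def receive(self, subject, body):
        """Return (incident, created), where created is True for a new incident."""
        for incident in self.incidents:
            # An open incident with the same subject absorbs the email.
            if incident.subject == subject and incident.status != "resolved":
                incident.log.append(body)
                return incident, False
        # No open incident has this subject, so a new one is triggered.
        incident = Incident(subject=subject, log=[body])
        self.incidents.append(incident)
        return incident, True


service = EmailService()
first, created_first = service.receive("DB down", "replica lag rising")
second, created_second = service.receive("DB down", "replica unreachable")
```

In this sketch the second email lands in the log of the first incident, so only one page goes out. Once that incident is resolved, the next email with the same subject opens a fresh one.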
This feature has been part of PagerDuty’s Generic Email Services for a long time, but we’ve noticed that many users are not taking advantage of it, because the default setting opens a new incident for each incoming trigger email. Going forward, newly created email services will instead default to opening a new incident only for each new trigger email subject, so be warned.
To change the setting (and only get paged once per problem by PagerDuty), simply edit your Generic Email Services and tick “Open a new incident for each new trigger email subject” under the “Incident creation” settings.
You might be saying to yourself: “Well, that persistent barrage of PagerDuty phone calls certainly drives me insane when large-scale problems happen with my systems, but if I collapse them into only one phone call, what happens if that phone call is missed?”
A good Escalation Policy is what you are looking for here. If the alert is missed, it will move on to the next person in line. Just make sure that your service’s escalation policy loops back to the beginning when it reaches the end of its escalation rules. To set this up, edit the policy associated with the service in question and tick the checkbox labeled “If an incident runs out of escalation rules, loop back to rule 1.”
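To make the looping behavior concrete, here is a rough sketch of how a missed alert might walk through escalation rules. The `escalation_order` function, its parameters, and the names in the rule list are made up for illustration; timeouts and acknowledgements are left out, and real escalation policies have more moving parts.

```python
def escalation_order(rules, loop=True, max_notifications=6):
    """Return who gets notified, in order, while the alert stays unacknowledged.

    rules: people to notify, one per escalation rule (hypothetical).
    loop: whether to go back to rule 1 after the last rule.
    max_notifications: cap so this illustration always terminates.
    """
    notified = []
    if not rules:
        return notified
    index = 0
    while len(notified) < max_notifications:
        if index >= len(rules):
            if not loop:
                break  # out of rules and no looping: nobody else is paged
            index = 0  # loop back to rule 1
        notified.append(rules[index])
        index += 1
    return notified


with_loop = escalation_order(["alice", "bob"], loop=True, max_notifications=5)
without_loop = escalation_order(["alice", "bob"], loop=False, max_notifications=5)
```

Without looping, the sketch stops after the last rule, which is exactly the case where a single collapsed incident could go unnoticed. With looping, the rotation starts over at the first rule.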