We’re happy to announce we’ve released the new version of PagerDuty, which has multi-incident support. To try it out, just log into your PagerDuty account.
This new feature corrects an over-simplification in PagerDuty’s design: up to now, PD required you to create a new alarm for each type of problem that your monitoring systems are capable of detecting. Unfortunately, this doesn’t work very well if you’re using a monitoring tool like Nagios, which can monitor thousands of hosts and services at once. The new release can now handle multiple open incidents from a single monitoring system; we call this “multi-incident support”.
Here’s a quick summary of the changes in the new release:
The new version of PD is 100% backwards compatible with the previous version. Yes, we’ve renamed a bunch of stuff, but we’ve been very careful to retain the same behavior as the old version for your existing services. Read on for more details.
PagerDuty is now capable of tracking multiple open concurrent incidents. Put another way, your monitoring system can tell PagerDuty about 100 simultaneous and independent problems without you needing to create 100 PagerDuty alarms (as was the case in the old version of PD).
PagerDuty now uses “incidents” rather than “alarms” as the main object. Your support team will be acknowledging, escalating, and resolving incidents, instead of alarms. Incidents in PagerDuty are similar to tickets in a bug tracking system: they are created when a problem is detected, and are resolved or closed when the problem is fixed.
Since PagerDuty can now handle hundreds of open incidents at once, we’ve tried to carefully design PagerDuty’s interface to make it easy to work with large collections of incidents. The new Incidents and Dashboard tabs feature tables that let you see all of the open incidents assigned to you at a glance. You can also easily triage your incidents straight from these pages using the controls located at the top of the table.
By default, the PagerDuty services still work the same way they’ve always worked: they can only have one incident open at once. The reason for this is to maintain backwards compatibility.
You can enable multi-incident support for any existing service. Here’s how:
The second option, if selected, will cause the service to open a new incident based on the email subject: if an open incident with the same subject already exists, the email is appended to this incident; if not, a new incident is created.
The third option, which should be selected by default for an existing service, allows a service to maintain the behavior of the old version of PagerDuty. It basically turns multi-incident support off: if selected, the service can only have one open incident at any one time. When the service receives a trigger email, it opens a new incident if the service doesn’t already have an open incident; otherwise, it appends the email to the open incident.
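To make the difference between these two options concrete, here is a minimal sketch of the routing logic in Python. It is an illustration only, not PagerDuty's implementation: the class names, the mode constants `BY_SUBJECT` and `SINGLE_INCIDENT`, and the `receive_email` method are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical mode names; the post describes the behaviors, not these identifiers.
BY_SUBJECT = "by_subject"        # the "second option": one open incident per email subject
SINGLE_INCIDENT = "single"       # the "third option": legacy, one open incident per service


@dataclass
class Incident:
    subject: str
    emails: List[str] = field(default_factory=list)
    is_open: bool = True


@dataclass
class Service:
    mode: str = SINGLE_INCIDENT
    incidents: List[Incident] = field(default_factory=list)

    def _find_open(self, subject: Optional[str] = None) -> Optional[Incident]:
        """Return the first open incident, optionally restricted to one subject."""
        for incident in self.incidents:
            if incident.is_open and (subject is None or incident.subject == subject):
                return incident
        return None

    def receive_email(self, subject: str, body: str) -> Incident:
        """Append the email to a matching open incident, or open a new one."""
        if self.mode == BY_SUBJECT:
            incident = self._find_open(subject)
        elif self.mode == SINGLE_INCIDENT:
            incident = self._find_open()
        else:
            raise ValueError(f"unknown mode: {self.mode}")

        if incident is None:
            incident = Incident(subject=subject)
            self.incidents.append(incident)
        incident.emails.append(body)
        return incident
```

With `mode=BY_SUBJECT`, two emails with the same subject land in one incident while an email with a different subject opens a second one. With the default `SINGLE_INCIDENT` mode, every email is appended to the one open incident until that incident is closed.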
We’ve renamed “alarms” to “services”. Services are now used only to represent an integration point between PagerDuty and your monitoring services. Currently, the PagerDuty services integrate with your monitoring systems via email integration (just like in the old version of PD). In the coming weeks, we will also add support for an HTTP-based API for the PagerDuty services. This will allow your monitoring systems to trigger/acknowledge/resolve incidents in PagerDuty via a synchronous API call.
For similar reasons, we’ve renamed “alarm groups” to “escalation policies”. We think the new name better captures the use of these objects.
We’ve also renamed incident “suppression” to “acknowledge”. As before, this feature is used to temporarily prevent an incident from generating alerts. We thought the word “acknowledge” better captured the purpose of the feature: “stop bothering me about this problem for now… I’m working on it!”.
We’ve also made the acknowledgement timeout configurable on a service-by-service basis. This means that you can set the amount of time that an incident stays in the Acknowledged state before it reverts back to Triggered and alerts you again. The timeout is set to 30 minutes by default for each service, but you can easily change it or even turn it off.
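The acknowledgement timeout can be pictured as a small state machine. The sketch below is an assumption-laden illustration, not PagerDuty's code: the state constants, the `Incident` class, and the `tick` method are hypothetical, and it assumes the incident reverts exactly when the timeout has fully elapsed. A timeout of `None` stands in for "turned off".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

TRIGGERED = "triggered"
ACKNOWLEDGED = "acknowledged"
RESOLVED = "resolved"


@dataclass
class Incident:
    # 30 minutes mirrors the default described above; None disables the timeout.
    ack_timeout: Optional[timedelta] = timedelta(minutes=30)
    state: str = TRIGGERED
    acked_at: Optional[datetime] = None

    def acknowledge(self, now: datetime) -> None:
        """Silence alerts for this incident, unless it is already resolved."""
        if self.state != RESOLVED:
            self.state = ACKNOWLEDGED
            self.acked_at = now

    def resolve(self) -> None:
        """Close the incident; it will never revert to triggered."""
        self.state = RESOLVED
        self.acked_at = None

    def tick(self, now: datetime) -> str:
        """Revert an acknowledged incident to triggered once the timeout elapses."""
        if (
            self.state == ACKNOWLEDGED
            and self.ack_timeout is not None
            and self.acked_at is not None
            and now - self.acked_at >= self.ack_timeout
        ):
            self.state = TRIGGERED
            self.acked_at = None
        return self.state
```

Calling `tick` periodically with the current time is enough to drive the behavior: an incident acknowledged at noon stays quiet until 12:30 and then starts alerting again, unless it was resolved in the meantime or its timeout was disabled.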
Next up is support for a PagerDuty API. This will make it easier to integrate PagerDuty with popular monitoring tools like Nagios, Zenoss, monit, Munin and many others. The API will allow your monitoring system to trigger, acknowledge and resolve incidents directly in PagerDuty, via a synchronous call to the API.