August 24, 2017
Since our first on-call best practices post back in March 2011, on-call scheduling methods have remained mostly unchanged. Many teams start by sending email alerts to the entire team, and then someone volunteers to resolve the incident. With this model, a few superhero team members end up handling a disproportionate number of incidents, while new hires don’t get the opportunity to learn how to fix them.
Worst of all, everyone is on call all the time. As your team grows and responsibilities are divided, an on-call rotation system is needed. It’s not easy to implement, though: your teammates may be based in multiple cities, schedules change, and each engineer has their own preferred method of being alerted. You need a system that’s flexible enough to address these issues and robust enough to perform reliably.
The Current State of On-Call Scheduling
There are several on-call scheduling methods organizations use today. Some are more sophisticated than others, but each has its own limitations.
1. Unfair On-Call Burden
A simple, common on-call solution is to use a single dedicated phone or pager that gets handed off to the next on-call engineer. Although this may sound antiquated, many organizations we talked to have used this method. If your team is spread across various cities, members who are out of range cannot participate. This creates an unfair burden for your superhero teammates.
2. Delayed Response Time
Another simple but labor-intensive option is to staff a 24/7 network operations center (NOC). This method involves paying staff to monitor metrics all day and identify problems themselves. When an issue arises, they have to look up the appropriate contacts in a directory and notify the on-call personnel to resolve the situation. It would be much easier for your NOC team to centrally manage an on-call scheduling system that directly notifies the right on-call person and decreases your mean time to response.
3. Alert Fatigue
Some companies keep it simple by sending email blasts to their entire team. In this model, the team members scheduled to be on call are responsible for monitoring their email 24/7; everyone else on the email list has to manually delete the alerts. This creates spam and decreases the sense of urgency when alerts are received.
4. Alerts Slip Through the Cracks
A more sophisticated option involves automating around the alert email address in your monitoring tool. For example, you could set up Google Calendar with the on-call schedule and use a script that polls the calendar. The script would take the email address of the current on-call engineer and update the monitoring tool whenever there is a change. However, this solution only supports single-level on-call scheduling. It doesn’t allow for escalation scenarios, where the primary engineer misses the first alert and the secondary on-call teammate needs to be notified.
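To make the escalation gap concrete, here is a minimal Python sketch of what a multi-level policy might look like. It is only an illustration under stated assumptions: the `EscalationLevel` class, the `who_to_notify` function, the example addresses, and the 15-minute timeouts are all hypothetical and do not correspond to any particular monitoring tool's API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EscalationLevel:
    contact: str          # hypothetical address of the engineer at this level
    timeout_minutes: int  # how long to wait for an acknowledgement before escalating


def who_to_notify(levels: List[EscalationLevel],
                  minutes_unacknowledged: int) -> Optional[str]:
    """Return the contact who should currently be alerted.

    Returns None once every level has timed out without an acknowledgement.
    """
    elapsed = 0
    for level in levels:
        elapsed += level.timeout_minutes
        if minutes_unacknowledged < elapsed:
            return level.contact
    return None


# Assumed two-level policy: primary first, then secondary after 15 minutes.
policy = [
    EscalationLevel("primary@example.com", 15),
    EscalationLevel("secondary@example.com", 15),
]
```

A calendar-polling script only ever knows the first entry in a list like `policy`. Handling a missed alert requires tracking how long the alert has gone unacknowledged and moving down the list.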
5. No Central Source Of On-Call Schedules
Some monitoring tools support on-call scheduling natively via CSV uploads, but with limited flexibility. Often, your choices are limited to daily (as opposed to hourly) rotations or simplistic schedules. They don’t allow for more complex on-call scheduling such as follow-the-sun schedules. Many companies have multiple monitoring tools for their website, server, database, etc. Setting up and managing multiple monitoring tools just for on-call scheduling is a pain.
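As a rough illustration of why daily rotations fall short, a follow-the-sun schedule needs hour-level handoffs between regions. The Python sketch below assumes three hypothetical teams and shift boundaries expressed in UTC hours; the `SHIFTS` table and `on_call_team` function are invented for this example.

```python
from datetime import datetime, timezone

# Hypothetical shifts as (start_hour, end_hour, team), in UTC hours.
SHIFTS = [
    (0, 8, "Sydney"),
    (8, 16, "London"),
    (16, 24, "San Francisco"),
]


def on_call_team(moment: datetime) -> str:
    """Return the team on call at a timezone-aware moment."""
    # Normalize to UTC so the lookup does not depend on the caller's time zone.
    hour = moment.astimezone(timezone.utc).hour
    for start, end, team in SHIFTS:
        if start <= hour < end:
            return team
    raise ValueError(f"shifts do not cover hour {hour}")
```

Because the handoffs happen every eight hours, a tool that only supports daily rotations cannot express even this simple table.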
If you suffer from any of the issues above, you’re in need of a cure. It’s time you turn to an incident management remedy to alleviate your on-call scheduling ailments, and to preserve your mental health. Don’t be shy if you are feeling these discomforts. We have personally experienced these symptoms and that’s why we created the PagerDuty cure.