On-Call Rotations and Schedules

Just as doctors go on-call to support emergency patient needs around the clock, IT organizations task dedicated groups of engineers with going on-call to fix issues for software services as they arise. These engineers are put on an on-call rotation, a method of rotating scheduled shift work across everyone on the team that is responsible for maintaining software availability.

During their shift, should something break, the on-call engineer gets paged (via a smartphone push notification, phone call, text, email, or, at older organizations, possibly even a BlackBerry or pager that gets passed around). The on-call engineer is responsible for acting on the page immediately and must either fix the issue quickly or escalate it. Because they must be available to troubleshoot at any point during their shift, rotating on-call responsibilities among multiple individuals or teams is important for reducing alert fatigue and protecting work-life balance.

The practice of having an on-call rotation is typically an organization’s first step towards committing to reliability for customers and users. On-call engineers are the first line of defense in ensuring customer-impacting outages are quickly noticed and resolved by someone on the team. That is why setting up such a process is critical for having 24x7x365 coverage in managing issues as they arise. And by tying a timeout threshold to each tier of an escalation policy (e.g., the incident must be acknowledged or resolved within 30 minutes before it’s auto-escalated to the next line of defense), organizations can make sure that when something breaks, someone will be on it fast. They can better meet their SLAs, instead of collectively falling asleep at the wheel during a customer-impacting issue because the right information wasn’t quickly routed to the right person.

Creating an effective on-call schedule

Some organizations manage on-call rotation schedules manually in wiki pages or spreadsheets. However, changes often don’t propagate in real time, and getting the right people on issues can quickly become challenging if, for example, contact information is outdated or time zone math is incorrect. At the same time, organizations are finding that every minute of downtime can cost thousands of dollars and do irreversible damage to brand reputation. Fumbling through a wiki page or static spreadsheet to find and notify the right on-call engineer is quickly becoming a very costly way to manage on-call rotation information.

Example of an On-Call Rotation Schedule using PagerDuty

On-call rotation best practices to keep in mind

Here are a few steps you can take to create and manage on-call rotations that meet the needs of your team:

Consider software for automation

On-call scheduling software can be a great investment for your team. It saves time and minimizes manual overhead by automatically routing notifications via engineers’ preferred contact methods based on predefined schedules. This removes several steps in getting the right information to the right expert when every minute counts.

Set up teams

Define the teams of individuals that have on-call responsibilities for every service. Be sure to set up both service-level and server-level monitoring and dashboards so teams can understand system performance and health. Whenever an issue arises, it should route to the on-call engineer on the team that manages that service. The on-call engineer should also be able to immediately pull in other teammates as needed to work on issue resolution together through a collaboration tool, such as conferencing or chat.
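As a rough illustration of routing an issue to the team that owns the affected service, here is a minimal Python sketch. The service names, team names, engineers, and the catch-all fallback team are all hypothetical; the fallback in particular is an assumption added for the example, not something prescribed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Team:
    name: str
    on_call: str  # engineer currently on call for this team


# Hypothetical mapping of service name -> owning team.
SERVICE_OWNERS = {
    "checkout-api": Team("payments", "alice"),
    "search": Team("discovery", "bob"),
}

# Assumption: alerts for services with no registered owner go to a catch-all team.
FALLBACK_TEAM = Team("operations", "carol")


def route_alert(service: str) -> str:
    """Return the engineer who should be paged for an alert on `service`."""
    team = SERVICE_OWNERS.get(service, FALLBACK_TEAM)
    return team.on_call
```

Keeping ownership in one lookup table means a new service only needs one entry to start paging the right team.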

Define escalation policies

Determine who should be in the respective lines of defense and what actions must take place when an incident occurs. For instance, the first tier of defense might be the software engineer who wrote the code, while the second tier consists of someone from the operations team who better understands the underlying network and hardware infrastructure — or vice versa.

Establish time limits

If you have an availability SLA with your customers or end users, it is critical to define time limits. This way, if the first responder doesn’t take action within the timeframe, the issue automatically gets escalated and won’t be missed.
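To make the idea of per-tier time limits concrete, the following Python sketch works out which tier should currently hold an unacknowledged page. It is only an illustration under assumptions: `Tier`, `current_tier`, and the timeout values are hypothetical, and the last tier is assumed to keep the page indefinitely.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Tier:
    responder: str
    timeout_minutes: int  # time allowed to acknowledge before escalating


def current_tier(tiers: List[Tier], minutes_unacknowledged: float) -> Tier:
    """Return the tier that should hold the page after the given unacknowledged time."""
    if not tiers:
        raise ValueError("an escalation policy needs at least one tier")
    remaining = minutes_unacknowledged
    for tier in tiers[:-1]:
        if remaining < tier.timeout_minutes:
            return tier
        # This tier's time limit has passed, so move on to the next one.
        remaining -= tier.timeout_minutes
    # Assumption: the final tier keeps the page no matter how long it takes.
    return tiers[-1]
```

With the hypothetical policy `[Tier("primary", 30), Tier("secondary", 15)]`, a page that has gone unacknowledged for 10 minutes stays with the primary, and one that has gone unacknowledged for 30 minutes or more moves to the secondary.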

Enable easy overrides

Make sure there’s an easy way for people to edit the schedule to accommodate shift swaps when an unexpected event comes up, such as an appointment or PTO.
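One way to picture overrides is a lookup that checks for a temporary override before falling back to the regular rotation. The Python sketch below is a minimal illustration, not a description of how any particular scheduling product works; `Override`, `on_call_at`, and the example names are hypothetical, and all datetimes are assumed to be timezone-aware.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass(frozen=True)
class Override:
    start: datetime  # inclusive
    end: datetime    # exclusive
    engineer: str    # person covering during the override


def on_call_at(
    when: datetime,
    rotation: List[str],
    rotation_start: datetime,
    shift_length: timedelta,
    overrides: List[Override],
) -> str:
    """Return who is on call at `when`, letting overrides win over the rotation."""
    for override in overrides:
        if override.start <= when < override.end:
            return override.engineer
    if when < rotation_start:
        raise ValueError("`when` is before the rotation starts")
    # Count whole shifts since the start, then wrap around the rotation.
    shifts_elapsed = (when - rotation_start) // shift_length
    return rotation[shifts_elapsed % len(rotation)]


# Example: a weekly rotation where carol covers one day of bob's week.
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
swap = Override(
    start=datetime(2024, 1, 10, tzinfo=timezone.utc),
    end=datetime(2024, 1, 11, tzinfo=timezone.utc),
    engineer="carol",
)
covering = on_call_at(
    datetime(2024, 1, 10, 12, tzinfo=timezone.utc),
    ["alice", "bob", "carol"],
    start,
    timedelta(days=7),
    [swap],
)
```

Because the override is stored separately, the underlying rotation never has to be edited for a one-off swap, and the schedule returns to normal once the override ends.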

24×7 coverage

Lay out shifts to see if there are any gaps and ensure complete coverage that correctly takes time zones into account.
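A gap check like this can be automated by comparing timezone-aware shift boundaries against the window that needs coverage. The following Python sketch is one possible approach; `find_gaps` and the three regional shifts are hypothetical, and it assumes fixed UTC offsets for brevity, whereas a real schedule would likely use named time zones so that daylight saving changes are handled.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple

# (start, end) pairs of timezone-aware datetimes; end is exclusive.
Interval = Tuple[datetime, datetime]


def find_gaps(
    shifts: List[Interval], window_start: datetime, window_end: datetime
) -> List[Interval]:
    """Return the parts of [window_start, window_end) not covered by any shift."""
    gaps: List[Interval] = []
    cursor = window_start  # everything before the cursor is known to be covered
    for start, end in sorted(shifts, key=lambda shift: shift[0]):
        if end <= cursor:
            continue  # shift ends before the uncovered region begins
        if start >= window_end:
            break  # remaining shifts start after the window
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
        if cursor >= window_end:
            break
    if cursor < window_end:
        gaps.append((cursor, window_end))
    return gaps


# Assumed fixed offsets for illustration only (no daylight saving handling).
LONDON = timezone(timedelta(hours=0))
NEW_YORK = timezone(timedelta(hours=-5))
SYDNEY = timezone(timedelta(hours=11))

shifts = [
    (datetime(2024, 1, 15, 8, tzinfo=LONDON), datetime(2024, 1, 15, 16, tzinfo=LONDON)),
    (datetime(2024, 1, 15, 9, tzinfo=NEW_YORK), datetime(2024, 1, 15, 17, tzinfo=NEW_YORK)),
    (datetime(2024, 1, 16, 9, tzinfo=SYDNEY), datetime(2024, 1, 16, 17, tzinfo=SYDNEY)),
]
window_start = datetime(2024, 1, 15, 8, tzinfo=timezone.utc)
window_end = datetime(2024, 1, 16, 8, tzinfo=timezone.utc)
gaps = find_gaps(shifts, window_start, window_end)
```

In this example the three regional shifts leave a single uncovered stretch, from 06:00 to 08:00 UTC on January 16, which is the kind of hole that is easy to miss when doing time zone math by hand.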

Transparency and communication

Everyone should be notified of and kept in the loop on changes to the schedule, so no one is caught off guard or unknowingly has a weekend ruined because of a last-minute change that wasn’t communicated.

Be aware of on-call hours

In the same spirit of transparency and communication, help people know ahead of time when they’ll be on call and when they’ll be off, so they never miss a shift and can plan activities accordingly. This can easily be done by making upcoming on-call shifts visible to everyone.

Benefits of an effective on-call rotation

There are several benefits that make establishing an effective on-call rotation a highly worthwhile investment:

  • Improved team transparency and accountability in handling issues
  • Better service reliability by quickly acting on and resolving alerts
  • Happier customers, who can reach on-call staff about urgent issues at any time or rest assured that issues will always be fixed quickly
  • Less wasted time in getting on-call staff on issues

Collectively, all of this leads to shorter service disruptions, less loss of revenue and customers, and better brand reputation.

Who goes on-call?

Traditionally, on-call rotation responsibilities have been delegated to sysadmins or operations engineers (including HelpDesk and the NOC). Development teams would primarily be responsible for designing, building, and shipping new services and functionality. They would then “throw code over the wall” to operations teams, who would debug, run, operate and maintain the code.

However, this siloed process created some significant challenges in accountability, cross-functional alignment, scalability, and reliability. Developers felt less ownership of the customer experience, and when they didn’t have experience handling production workloads, they were more likely to deliver non-performant code that didn’t fully scale or carried a high operational load. Operations engineers would often take longer to fix broken code that was written by someone else and sometimes ended up having to escalate to the developer anyway.

As a result, while operations in most enterprises have largely been centralized to date, many organizations are beginning to distribute operational responsibilities to improve the performance of services and applications, instead of operating monolithic systems. Increasingly, developers are going on-call for their own code, which closes the feedback loop by encouraging collaboration between development and operations to proactively build more resilient, production-ready services. New roles have also emerged, such as DevOps Engineer and Site Reliability Engineer. These roles often focus on faster and safer releases, improving reliability via automation, and improving the software lifecycle by building internal tools that automate the manual labor typically involved in operations (triaging, change management, monitoring, etc.). As more groups within an organization take on operational responsibilities, as opposed to the NOC triaging all issues and trying to route them to the right people, cross-functional teams can typically focus on higher-value customer experience metrics and work together to improve them.

What on-call rotation schedules does PagerDuty support?

PagerDuty can support any kind of custom on-call rotation type, including on-call after-hours support, follow-the-sun, daily, weekly, round robin, or split shift rotations. We enable you to create multiple scheduling layers (a group of people who rotate on-call responsibilities through the same shift) within a single schedule. Below, we’ve highlighted some common configurations and on-call schedule templates from our Support Knowledge Base.

  • Getting Started – Learn the basics of how to create an on-call schedule, including how to add users, define rotation frequencies and time-of-day restrictions, and more.
  • Complex irregular schedules – This schedule is set up for teams that rotate shifts that are on for one week, and then off for a few.
  • Complex schedule for 2 users on a 2-day rotation with separate weekends – This example shows a complex schedule for two users who are on a two-day rotation. However, on Saturday and Sunday, the on-call user is on call for 24 hours.
  • Complex schedule with restrictions – PagerDuty enables you to build complex schedules where users trade off the early morning, morning, evening, weekend, and other shifts for varying numbers of hours.
  • Complex split shift rotation – This example shows you how to create a time-restricted rotation in which each shift is split among multiple users.
  • Creating primary and secondary on-call schedules – Creating primary and secondary on-call schedules creates multiple lines of defense if the primary on-call engineer misses a notification. You can add multiple schedules as progressive levels of an escalation policy to ensure a backup user will respond to an incident.
  • Follow-the-sun schedule – The follow-the-sun schedule is used by teams that may work internationally in different time zones, and ensures full 24/7 coverage.
  • Inverse schedules on an escalation policy – If you have two or more users that rotate primary and secondary on-call shifts, then you will want to create two on-call schedules and add each of those schedules to a separate level of an escalation policy.
  • Schedule users on-call every other week – You can create multiple layers within your schedule to accommodate multiple users that hand off every other week (for example, 2 on-call engineers who cover weekdays and 2 who cover weekends, who rotate weekly).
  • Expert who is always on-call – You can create an additional layer to always route certain types of issues to specific experts (for instance, a DBA, Network Architect, etc.).

Contact support@pagerduty.com if you have any questions. We’re more than happy to help you with any custom schedule management needs and set up ideal on-call rotations for developers, NOC teams, support teams, security teams, and more.

How to get the most out of on-call scheduling

PagerDuty streamlines on-call management for any kind of rotation type or team. Our on-call scheduling capability includes simplified editing, SSO integration, automated escalations, and much more. Try it out now for yourself with a free 14-day trial.

We hope these resources enable you to formalize your on-call rotation process to make it as easy as possible for your team to respond to issues.