Just as doctors go on-call to support emergency patient needs around the clock, IT organizations task dedicated groups of engineers with going on-call to fix issues for software services as they arise. These engineers are put on an on-call rotation, a method of rotating scheduled shift work across everyone on the team that is responsible for maintaining software availability. If something breaks during their shift, the on-call engineer gets paged (via a smartphone push notification, phone call, text, email, or possibly even a Blackberry or pager that gets passed around if it’s an older organization). The on-call engineer is responsible for acting on the page immediately and must either fix the issue quickly or escalate it if they can’t. Because on-call engineers must be available to troubleshoot at any point during their shift, rotating on-call responsibilities among multiple individuals or teams is important for overcoming alert fatigue and protecting work-life balance.
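To make the idea of a rotation concrete, here is a minimal sketch of a round-robin schedule in Python. The function name `on_call_engineer`, the team members, and the one-week shift length are illustrative assumptions, not part of any particular scheduling product.

```python
from datetime import datetime, timedelta, timezone


def on_call_engineer(engineers, rotation_start, shift_length, at):
    """Return who is on call at time `at` in a simple round-robin rotation.

    Hypothetical helper for illustration: each engineer takes one shift of
    `shift_length`, in list order, and the list repeats indefinitely.
    """
    if not engineers:
        raise ValueError("rotation needs at least one engineer")
    if at < rotation_start:
        raise ValueError("rotation has not started yet")
    # Dividing two timedeltas with // yields the number of whole shifts elapsed.
    shifts_elapsed = (at - rotation_start) // shift_length
    return engineers[shifts_elapsed % len(engineers)]


# Assumed example data: a three-person team rotating weekly.
team = ["alice", "bob", "carol"]
start = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
week = timedelta(weeks=1)

first = on_call_engineer(team, start, week, start)
second = on_call_engineer(team, start, week, start + timedelta(days=8))
wrapped = on_call_engineer(team, start, week, start + timedelta(days=21))
```

Because the shift index wraps around with the modulo operator, the load is spread evenly: after every engineer has taken a shift, the rotation returns to the first person.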
The practice of having an on-call rotation is typically an organization’s first step towards committing to reliability for customers and users. On-call engineers are the first line of defense in ensuring customer-impacting outages are quickly noticed and resolved by someone on the team. That is why setting up such a process is critical for having 24x7x365 coverage in managing issues as they arise. And by tying a timeout threshold to each tier of an escalation policy (e.g. the incident must be acknowledged or resolved within 30 minutes before it’s auto-escalated to the next line of defense), organizations can make sure that when something breaks, someone will be on it fast. They can better meet their SLAs, instead of collectively falling asleep at the wheel during a customer-impacting issue because the right information wasn’t quickly routed to the right person.
Some organizations manage on-call rotation schedules manually with wiki pages or spreadsheets. However, changes often don’t propagate in real time, and getting the right people on issues can quickly become challenging if contact information is outdated or time zone math is incorrect, among other things. At the same time, organizations are finding that every minute of downtime can cost thousands of dollars and do irreversible damage to brand reputation. Fumbling through a wiki page or static spreadsheet to find and notify the right on-call engineer is quickly becoming a very costly method of managing on-call rotation information.
Here are a few steps that you can take to create and manage on-call rotations that meet the needs of your team:
On-call scheduling software can be a great investment for your team. It saves time and minimizes manual overhead by automatically routing notifications via engineers’ preferred contact methods based on predefined schedules. This removes several steps in getting the right information to the right expert when every minute counts.
Define the teams of individuals that have on-call responsibilities for every service. Be sure to set up both service and server-level monitoring and dashboards so teams can understand system performance and health. Whenever an issue arises, it should route to the on-call engineer on the team that manages that service. The on-call engineer should also be able to immediately recruit other teammates as needed to work on issue resolution together through a collaboration tool, such as conferencing or chat.
Determine who should be in the respective lines of defense and what actions must take place when an incident occurs. For instance, the first tier of defense might be the software engineer who wrote the code, while the second tier consists of someone from the operations team who better understands the underlying network and hardware infrastructure — or vice versa.
If you have an availability SLA with your customers or end users, it is critical to define time limits for each line of defense. This way, if the first responder doesn’t take action within the timeframe, the issue automatically gets escalated and won’t be missed.
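The tiered escalation described above can be sketched as a small lookup: given how long an incident has gone unacknowledged, which line of defense should currently be paged? The class and function names, the tier labels, and the timeout values below are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationTier:
    responder: str
    timeout_minutes: int  # time allowed to acknowledge before escalating


def current_tier(tiers, minutes_unacknowledged):
    """Return the tier that should be paged for an unacknowledged incident.

    Hypothetical helper: walks the tiers in order, adding up their timeouts,
    and stops at the first tier whose window has not yet expired.
    """
    if not tiers:
        raise ValueError("escalation policy needs at least one tier")
    elapsed = 0
    for tier in tiers:
        elapsed += tier.timeout_minutes
        if minutes_unacknowledged < elapsed:
            return tier
    # Every timeout has expired: keep paging the last line of defense.
    return tiers[-1]


# Assumed example policy with a 30-minute threshold on the first two tiers.
policy = [
    EscalationTier("service developer", 30),
    EscalationTier("operations engineer", 30),
    EscalationTier("engineering manager", 60),
]

after_10 = current_tier(policy, 10).responder
after_30 = current_tier(policy, 30).responder
after_90 = current_tier(policy, 90).responder
```

In this sketch an incident escalates the moment a tier's timeout is reached, so at exactly 30 minutes the second tier is paged. Whether the boundary is inclusive is a policy decision worth stating explicitly in a real escalation policy.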
Make sure there’s an easy way for people to edit the schedule to accommodate shift swaps when something unexpected comes up, such as an appointment or PTO.
Lay out shifts to see if there are any gaps and ensure complete coverage that correctly takes time zones into account.
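One way to check for gaps is to compare all shifts on a single timeline after accounting for each responder's UTC offset. The sketch below assumes shifts are given as pairs of timezone-aware start and end times. The function name `coverage_gaps`, the office locations, and the fixed UTC offsets are illustrative assumptions (real schedules would also need to handle daylight saving changes).

```python
from datetime import datetime, timedelta, timezone


def coverage_gaps(shifts, window_start, window_end):
    """Return uncovered (start, end) intervals, in UTC, within the window.

    Hypothetical helper: `shifts` is a list of (start, end) pairs of
    timezone-aware datetimes. Aware datetimes compare by absolute instant,
    so shifts defined in different time zones sort correctly together.
    """
    utc = timezone.utc
    gaps = []
    covered_until = window_start
    for start, end in sorted(shifts):
        if start > covered_until:
            gap_end = min(start, window_end)
            gaps.append((covered_until.astimezone(utc), gap_end.astimezone(utc)))
        covered_until = max(covered_until, end)
        if covered_until >= window_end:
            break
    if covered_until < window_end:
        gaps.append((covered_until.astimezone(utc), window_end.astimezone(utc)))
    return gaps


# Assumed fixed offsets for three offices (no daylight saving handling).
sydney = timezone(timedelta(hours=11))
london = timezone(timedelta(hours=0))
new_york = timezone(timedelta(hours=-5))

shifts = [
    (datetime(2024, 1, 1, 8, tzinfo=london), datetime(2024, 1, 1, 16, tzinfo=london)),
    (datetime(2024, 1, 1, 11, tzinfo=new_york), datetime(2024, 1, 1, 19, tzinfo=new_york)),
    (datetime(2024, 1, 1, 11, tzinfo=sydney), datetime(2024, 1, 1, 18, tzinfo=sydney)),
]

day_start = datetime(2024, 1, 1, tzinfo=timezone.utc)
day_end = datetime(2024, 1, 2, tzinfo=timezone.utc)
gaps = coverage_gaps(shifts, day_start, day_end)
```

In this example the Sydney shift ends at 07:00 UTC and the London shift does not begin until 08:00 UTC, so the check reports a one-hour hole that would be easy to miss when each shift is written down in local time.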
Everyone should be notified and kept in the loop about changes to the schedule, so no one is caught off guard or unknowingly has a weekend ruined because of a last-minute change that wasn’t communicated.
In the same spirit of transparency and communication, help people know well in advance when they’ll be on-call and when they’ll be off, so they never miss a shift and can plan activities accordingly. This can be easily done with an on-call timeline.
There are several benefits that make establishing an effective on-call rotation a highly worthwhile investment:
Collectively, all of this leads to shorter service disruptions, less loss of revenue and customers, and better brand reputation.
Traditionally, on-call rotation responsibilities have been delegated to sysadmins or operations engineers (including HelpDesk and the NOC). Development teams would primarily be responsible for designing, building, and shipping new services and functionality. They would then “throw code over the wall” to operations teams, who would debug, run, operate and maintain the code.
However, this siloed process created some significant challenges in accountability, cross-functional alignment, scalability, and reliability. Developers felt less ownership of the customer experience, and when they didn’t have experience handling production workloads, they were more likely to deliver non-performant code that didn’t fully scale or carried a high operational load. Operations engineers would often take longer to fix broken code that was written by someone else and sometimes ended up having to escalate to the developer anyway.
As a result, while most operations in enterprises to date have largely been centralized, many organizations are beginning to distribute operational responsibilities to improve the performance of services and applications, instead of operating monolithic systems. Increasingly, developers are going on-call for their own code, which closes the feedback loop by encouraging collaboration between development and operations to proactively build more resilient, production-ready services. New roles have also emerged, such as DevOps Engineer and Site Reliability Engineer. These roles often focus on faster and safer releases, improving reliability via automation, and improving the software lifecycle by building internal tools that automate the manual, human labor typically involved in operations (triaging, change management, monitoring, etc.). As more groups within an organization take on operational responsibilities, as opposed to the NOC triaging all issues and trying to route them to the right people, cross-functional teams can typically focus on higher-value customer experience metrics and work together to improve them.
PagerDuty can support any kind of custom on-call rotation type, including on-call after-hours support, follow-the-sun, daily, weekly, or split shift rotations. We enable you to create multiple scheduling layers (a group of people who rotate on-call responsibilities through the same shift) within a single schedule. Below, we’ve highlighted some common configurations and on-call schedule templates from our Support Knowledge Base.
And much more! Contact email@example.com if you have any questions. We’re more than happy to help you with any custom schedule management needs and set up ideal on-call rotations for developers, NOC teams, support teams, security teams, and more.
PagerDuty streamlines on-call rotation management for any kind of rotation type or team. Our on-call scheduling capability includes simplified editing, SSO integration, automated escalations, and much more. Try it out now for yourself with a free 14-day trial.
We hope these resources enable you to formalize your on-call rotation process to make it as easy as possible for your team to respond to issues.