New Ops Guide: Best Practices for On-Call Teams
The always-on, always-available expectations of digital services mean technical teams increasingly need to be ready to respond around the clock. For teams new to this concept, introducing on-call can be stressful and complex. On-call management is a core part of PagerDuty’s platform and key to our business, but the non-technical aspects are also important for teams to consider.
We’ve collected a number of PagerDuty’s resources into a cohesive guide to help you navigate the murky waters of on-call with teams that haven’t been on call before. Hopefully you’re familiar with our Ops Guides, but if not, this one is a great place to start!
Establish Why You Need an On-Call Program
If your team is totally new to being in a prescriptive on-call rotation, it’s important to establish why the team is going to take on the on-call responsibilities. There might be any number of reasons why it makes sense for your organization to include more teams for on-call.
If your team has been relying on a network operations center (NOC) or other external first-level responders, a high number of escalations and/or unresolved alerts might be a metric driving you to implement a more robust on-call plan. If your NOC is unable to resolve an alert and then has to escalate to a team without a prescriptive on-call rotation, the delays incurred by that handoff can drive up recovery times. Remember, every handoff that happens in the resolution of an incident costs valuable time. And every new responder added to the incident needs time to gather information and context about the incident.
Delays and confusion also come into play if application development teams have been relying on separate operations teams for their production environments. As in the NOC example, when errors and incidents are related to application code, operations team responders end up spending time finding someone on the development team to help resolve them.
The separation of duties for incidents can also lead to a delay in when issues are permanently fixed in the application code. No one wants to respond to the same error over and over again because it hasn’t been permanently remedied in the application. Adding a card to the backlog to fix an error isn’t actually fixing the issue; the fix has to be prioritized and worked on. If the ROI for making a fix isn’t worth it, then documentation for the next folks on how to handle it is the next best option.
So you may find yourself in an organization that expects application developers to take a more visible role in on-call duties for their applications to reduce the time it takes to resolve an issue, and to reduce the time it takes to create a permanent solution.
One of the biggest challenges for teams taking on a new on-call responsibility is the reputation that on-call is disruptive to responders’ lives in a very detrimental way. No one wants to miss family events, holidays, or sleep.
Creating a better on-call experience for your team requires good technical and cultural practices. Your team will want to clean up noisy alerts, whether that means fixing the issues permanently, creating automation to handle common issues, or de-prioritizing alerts that have minimal user impact.
When an alert can potentially wake someone up at 2 a.m., it should be worth it!
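One way to make that rule concrete is a small triage check that only pages immediately when the alert justifies it. The sketch below is illustrative only: the quiet-hours window, the "high"/"low" impact labels, and the function names are all assumptions for this example, not PagerDuty settings or APIs.

```python
from datetime import time

# Assumed quiet-hours window for this example (10 p.m. to 7 a.m.).
QUIET_START = time(22, 0)
QUIET_END = time(7, 0)


def in_quiet_hours(at):
    """Return True if the given time falls in the overnight window."""
    # The window wraps past midnight, so it is an "or", not an "and".
    return at >= QUIET_START or at < QUIET_END


def should_page_now(user_impact, at):
    """Decide whether an alert should page a responder immediately.

    user_impact is a hypothetical label: "high" or "low".
    High-impact alerts always page; low-impact alerts wait until
    quiet hours are over.
    """
    if user_impact == "high":
        return True
    return not in_quiet_hours(at)


print(should_page_now("low", time(2, 0)))   # low impact at 2 a.m.
print(should_page_now("high", time(2, 0)))  # high impact at 2 a.m.
```

A real team would agree on its own impact levels and hours; the point is that the decision is written down rather than left to whoever built the alert.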
You should also establish guidelines for your team to swap shifts and cover for each other when things come up. Life happens, and you can’t always plan for absolutely everything, so make it easy for your team to move shifts around when they need to.
Use Your Tools
The PagerDuty platform has a number of useful tools to help make sure your team is ready to go on call. Among these tools are the On-Call Readiness Reports.
These reports will help your team stay connected to PagerDuty in the ways you want them to be connected.
The options on the Readiness Report will depend on what requirements you have set for your team and will show you which team members have configured their accounts appropriately. You can decide with your team which notification methods will work best for the services you will be supporting. For low-priority responsibilities, you might choose “must include phone.” For teams managing key customer-facing services, you might want something more like “never miss a page,” encouraging your team to set up their accounts with email, phone, SMS, and push notifications from the PagerDuty mobile app.
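To illustrate the idea of checking accounts against a requirement, here is a minimal sketch in Python. It does not use the PagerDuty API or reproduce how the Readiness Report is computed; the `Responder` class, the `REQUIREMENTS` mapping, and the method names are hypothetical, and the set of methods behind each requirement is an assumption based on the examples above.

```python
from dataclasses import dataclass, field

# Hypothetical requirement levels, named after the examples in the text.
# Which methods each level demands is an assumption for this sketch.
REQUIREMENTS = {
    "must include phone": {"phone"},
    "never miss a page": {"email", "phone", "sms", "push"},
}


@dataclass
class Responder:
    name: str
    methods: set = field(default_factory=set)  # configured notification methods


def missing_methods(responder, requirement):
    """Return the notification methods the responder still needs to set up."""
    return REQUIREMENTS[requirement] - responder.methods


def readiness_report(team, requirement):
    """Map each responder's name to a sorted list of missing methods."""
    return {r.name: sorted(missing_methods(r, requirement)) for r in team}


team = [
    Responder("alex", {"email", "phone", "sms", "push"}),
    Responder("sam", {"email", "phone"}),
]

report = readiness_report(team, "never miss a page")
for name, missing in report.items():
    status = "ready" if not missing else "missing: " + ", ".join(missing)
    print(f"{name}: {status}")
```

With the sample team above, one responder meets the stricter requirement and the other still needs two methods, which is the kind of gap the report is meant to surface before a shift starts.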
Share Your Thoughts
We hope you’ll give the new ops guide a read! Then join us in the community forums to let us know what you think and whether we missed something. Also, if you have any “must have” items for our On-Call Checklist, tell us in this thread. We’ll collect the responses and add a downloadable checklist to the guide.