The Human Side of Being On-call: 5 Lessons for Managing Stress, Anxiety, and Life While Being On-call
Within DevOps, we talk a lot about the on-call process—but what about the human side of being on-call? For example, what are effective ways of managing stress and anxiety during a shift? How can one manage life situations that make being on-call difficult—such as being responsible for watching the kids during an on-call rotation? And how can an empathic team culture help prevent burnout and turnover?
In November and December 2021, on-call engineers from nine teams at PagerDuty met to discuss the human side of being on-call. Here are the five key takeaways from those sessions:
- Team empathy is critical
- Don’t watch graphs all day
- Postmortems can be stressful and require a lot of work
- Low-urgency alerts reduce overnight noise
- Week-long on-call can lead to burnout
Before diving into each key takeaway, let’s look at some metrics tied to the teams we talked with.
By the numbers
Here are key data points for the teams that joined the “human side of on-call” sessions:
- What’s your on-call rotation size? The average on-call rotation size was 5 engineers.
- Do you have a secondary on-call? 60% of teams said “yes.”
- How often are you on-call? The average on-call frequency was every 3.5 weeks.
- How long is your on-call shift? The average shift length was one week—several teams split this by weekdays/weekends.
- How much time do you spend on-call per week (median)? The median time spent on-call per week was 4 hours. On two of the nine teams, the on-call engineer spent most of their business hours tackling on-call issues.
We plotted the hours spent on-call in this histogram. As you can see, 55% of the teams surveyed spent 0-5 hours per week on-call, 22% of teams spent 5-10 hours on-call, 11% spent 30-35 hours on-call, and 11% spent 40 hours on-call:
Histogram: Hours spent on-call by team
Now that we’ve shared some of the details around our focus group, let’s dive into each lesson in more detail.
Lesson 1: Team empathy is critical
Team culture is everything: it sets the foundation for creating a safe space. Putting norms in place that reinforce (with words and actions) that it’s okay to ask for overrides during your on-call week is a crucial part of setting the tone for your team’s on-call experience. Cultural change doesn’t happen overnight, but it can be developed and molded over time. While this shift is taking place, actively encourage it in whatever way makes the most sense for the team. For example, after a colleague covers an override for you, you can thank them during a team retrospective as positive reinforcement. If your team has its norms documented, you can also suggest that “it’s okay to ask for overrides” be added there. Additionally, as a peer or a manager, it’s important to check in on how the on-call engineer is doing, especially after major incidents, and even more so if it’s a person’s first major incident.
Perhaps most importantly, there needs to be empathy from the team and manager towards each on-call engineer’s unique life situation. For example, having pets or kids or elderly parents can make managing on-call trickier. Additionally, going through a stressful life event, such as the death of a loved one, can compound the stress felt while on-call. In these situations, it’s important to be proactive about suggesting that an engineer come off the rotation for a particularly rough period of time.
Lesson 2: Don’t watch graphs all day
It’s important to remember that being on-call doesn’t mean that it’s your job to watch everything all day. You need to trust that the system will page you if something goes wrong. Let go of what you can’t control, and be vigilant over what you can control. Rely on a team ops review meeting to do a hand-off between rotations so you are prepared for your shift. And remember that low-urgency incidents don’t need push notifications, so there’s no need to raise your stress levels over them.
When time permits during your on-call rotation, focus your efforts on improving the on-call situation for the next on-call engineer. For example, if a particular issue keeps happening (e.g. disks that fill up, logs that need rotating, noisy alerts), take on a task that fixes it for the long term.
Lesson 3: Postmortems can be stressful and require a lot of work
Major incidents—which require a coordinated response between multiple teams—can be very stressful, and the additional workload of postmortems can cause even more stress. It’s one thing to handle the incident itself, but quite another when you have another week of stress after it. If resourcing allows, it can be helpful to create a working agreement for having the postmortem be completed by someone other than the primary responder on the incident. Additionally, providing recognition of the stress involved and allowing for decompression time after the incident is resolved can help. This might mean giving the on-call engineer a “cool down” period where they have more flexibility over their work schedule and ability to catch up in other areas of their life.
Lesson 4: Low-urgency alerts reduce overnight noise
When there is no immediate danger, an alert can be configured as low-urgency to ensure the on-call engineer doesn’t get paged while sleeping. To make this work effectively, the team needs to pair low-urgency configuration of alerts with onboarding of on-call engineers, so that their alert settings ensure they aren’t disrupted by low-urgency alerts. Effective on-call engineer onboarding should cover how to set up user notification settings, serving as a checkpoint to make sure a new hire’s settings are correct before they are put on rotation in PagerDuty.
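To make the idea concrete, here is a minimal sketch of the decision a notification setup is meant to encode: high-urgency alerts always interrupt, while low-urgency alerts wait until quiet hours are over. This is an illustration only, not PagerDuty's API or configuration format. The `Alert` class, the `should_page_now` function, the `"high"`/`"low"` labels, and the 22:00 to 07:00 quiet window are all assumptions made up for this example.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class Alert:
    summary: str
    urgency: str  # hypothetical labels: "high" or "low"


def should_page_now(alert: Alert, now: time,
                    quiet_start: time = time(22, 0),
                    quiet_end: time = time(7, 0)) -> bool:
    """Return True if the on-call engineer should be interrupted right away."""
    # High-urgency alerts always interrupt, regardless of the hour.
    if alert.urgency == "high":
        return True
    # Quiet hours may wrap past midnight (e.g. 22:00 to 07:00), so handle
    # both the wrapping and the non-wrapping case.
    if quiet_start <= quiet_end:
        in_quiet_hours = quiet_start <= now < quiet_end
    else:
        in_quiet_hours = now >= quiet_start or now < quiet_end
    # Low-urgency alerts are held back until quiet hours are over.
    return not in_quiet_hours
```

With these assumed defaults, a low-urgency disk warning at 3 a.m. would wait until morning, while a high-urgency outage at the same hour would still page. Walking a new on-call engineer through a rule like this during onboarding is one way to confirm their settings match the team's intent.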
Lesson 5: Week-long on-call can lead to burnout
Going on-call for an entire week can be a mental grind, since you are not fully off work for the entire week. This is true even if you aren’t paged during your shift, as you are still anticipating getting paged. Finding the on-call rotation length sweet spot is tricky—it’s dependent on multiple factors, including:
- The preferences of the on-call engineers on the team. This can be gauged through a survey sent to the team to collect their thoughts around on-call scheduling.
- How on-call engineers feel after they finish their shift. This can be tracked over time using an end-of-shift on-call “Yelp review rating” of 1 (worst) to 5 (best).
- How noisy the team’s services are. More noise means more stress, in which case, a shorter on-call rotation would be preferable.
Instead of being on-call for a week, other options to consider include weekday/weekend rotations, business hours/after hours rotations, or splitting each week into three shorter shifts of 2 days, 2 days, and 3 days.
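If your team collects the end-of-shift rating described above, a small script can help compare rotation styles over time. The sketch below is one possible way to do that, assuming each review is recorded as a (rotation style, rating) pair. The function name and the sample reviews are made up for illustration and are not data from the sessions.

```python
from statistics import mean


def average_rating_by_rotation(reviews):
    """Average end-of-shift ratings (1 = worst, 5 = best) per rotation style."""
    grouped = {}
    for rotation, rating in reviews:
        if not 1 <= rating <= 5:
            raise ValueError(f"rating must be between 1 and 5, got {rating}")
        grouped.setdefault(rotation, []).append(rating)
    return {rotation: mean(ratings) for rotation, ratings in grouped.items()}


# Illustrative, made-up reviews: (rotation style, end-of-shift rating).
sample_reviews = [
    ("week-long", 2),
    ("week-long", 3),
    ("weekday/weekend", 4),
    ("weekday/weekend", 5),
]
print(average_rating_by_rotation(sample_reviews))
```

A handful of ratings won't settle the question on their own, but a trend across several rotations, read alongside the team survey, can help show whether a schedule change is actually helping.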
Best practices for on-call teams
Being on-call can be stressful, but having an empathic team culture and on-call configuration that works best for the team’s preferences goes a long way in reducing burnout. Interested in learning more about on-call best practices, and how to develop an empathic team culture? Check out our Best Practices for On Call Teams guide.
- Thanks so much to Amy Wood, Ashwin Jiwane, Charlotte Sarfati, Chelsea Vandermeer, Hunter Watson, Japa Swadia, Katherine ChengLi, KP Singh, Liam Stewart, Marcos Wright-Kuhns, Mandi Walls, Possum Nuada, Quintessence Anx, Roma Shah, Russ Smith, Todd Whitney, Tom Graft, and Vivian Chan for your contributions to these discussions and blog post.