March 26, 2019
In The Hitchhiker’s Guide to the Galaxy, a group of scientist mice built a mega-computer named “Deep Thought” to Answer “The Ultimate Question of Life, the Universe, and Everything.” After 7.5 million years, the machine produced “42.”
At PagerDuty, we did something similar, except we didn’t have scientist mice or wait 7.5 million years. Instead, we had a data scientist and nine years of PagerDuty on-call notification data, which we compared across 10,000 PagerDuty customers, 50,000 responders, and 760 million notifications. Our number was “71.”
What this means: Our Operations Health Management Service (OHMS) found that responders who maintained an average health score of 71 or higher were more likely to stay at their companies for more than 18 months.
Let me back up a little bit and explain “71.” Over the past year, my team (aka the Digital Insights team) created an algorithm to contextualize on-call pain.
The output of the algorithm was a number from 0 to 100. A health score of 100 means you’ve never received a notification within a specific time period (week, month, or year)—therefore, you might not be a responder, and we remove folks like you from our study calculations (and you’re perfectly healthy). In contrast, the closer to 0 you are, the more on-call pain you’re experiencing.
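To make that filtering rule concrete, here is a minimal sketch in Python. It is not the OHMS algorithm, which this post doesn't spell out; the function name `study_averages`, the constants, and the input shape (a list of per-period health scores for each responder) are all assumptions made for illustration. It drops anyone who scored 100 in every period and averages the scores of everyone else, so the result can be compared against 71.

```python
from statistics import mean

NO_NOTIFICATIONS = 100    # a score of 100 means no notifications in that period
RETENTION_THRESHOLD = 71  # the average health score reported by the study


def study_averages(period_scores):
    """Return the average health score per responder.

    `period_scores` maps a responder name to a list of health scores (0-100),
    one per period. This shape is hypothetical, not the OHMS data model.
    Anyone who scored 100 in every period was never notified, so they are
    left out, mirroring how the study removes likely non-responders.
    """
    averages = {}
    for name, scores in period_scores.items():
        if not scores or all(s == NO_NOTIFICATIONS for s in scores):
            continue  # never notified: probably not a responder
        averages[name] = mean(scores)
    return averages


def meets_threshold(average_score):
    """True when an average health score is 71 or higher."""
    return average_score >= RETENTION_THRESHOLD
```

Calling `study_averages` on a few weeks of scores and passing each average to `meets_threshold` shows who sits at or above the line and who falls under it.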
This health score is a product of 16 different facets. We took into consideration the following:
Two people with the same health score might have completely different contribution points, as seen in these screenshots:
The person on the left received only three sleep notifications but also has a health score of 64, compared to the person on the right, who had seven. The algorithm not only takes your day into consideration, but also looks at the volume of notifications on the days before. Looking at long-term on-call pain trends is the only way to accurately tell the story of a responder’s on-call health.
On-call pain manifests differently to different people, but in short:
On-call pain can lead to a number of problems, including persisting grouchiness, loss of productivity, responder burnout (leading to them ignoring pages/alerts), and abhorrent misuse of pop culture references.
So we want to avoid all those problems, right? After all, better work-life balance = happier workers = better productivity.
With our OHMS study, we were able to identify the responders who were beyond burnt out and most likely to leave. Replacing an average on-call responder could cost up to $300k. Here’s an overview of why it costs so much:
Some (those who are not on call) may argue that being on call is part of the job and ask: “What’s the big deal?” If you work with on-call responders, I invite you to add yourself as a shadow on Escalation Policies for one week to understand their pain.
Because, yes, being woken up one night a week might not be that big of a deal. But what about two nights in a row? Or three? On-call responders and new parents know that, despite how sleep deprived they might be, they’re still expected to show up to work on time the next morning, carry on with project delivery, be a sociable coworker, and still respond to incidents as they come in.
This is where the health score comes in: Putting a number to your on-call pain lets someone know when you need help and also informs your managers that someone else needs to take over an on-call shift.
This is also beneficial to your team as a whole because, as I explained earlier, employees experiencing excessive on-call pain (an average health score below 71) are more likely to leave and find another job, which further exacerbates the pain for everyone else staying behind.
Oh, that’s the wrong cultural reference. Oops. Anyway, now that you know on-call pain has real consequences (and hopefully you’re going to try being on call yourself), did you also know you can do something about it?
Check out the health scores of your team using PagerDuty’s Operations Health Management Service (OHMS). With OHMS, you’ll receive a weekly email that calls out the top 3 responders, top 3 teams, and top 3 services with a health score below 71. You’ll also have access to consultants who work with you to maximize your PagerDuty investment by recommending best practices and helping implement the features that best fit the needs of your teams.
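If you want a feel for what such a call-out looks like, here is a rough sketch in Python. It is only an illustration, not how OHMS builds its email: the function name `weekly_callouts` and the input shape are hypothetical, and it assumes “top 3” means the three lowest scores under 71.

```python
def weekly_callouts(health_scores, threshold=71, limit=3):
    """Pick the entries most in need of attention.

    `health_scores` maps a name (responder, team, or service) to its
    health score for the week. Assumption: "top 3" means the three
    lowest scores that fall below the threshold.
    """
    below = [(name, score) for name, score in health_scores.items()
             if score < threshold]
    # Lowest score first: the closer to 0, the more on-call pain.
    below.sort(key=lambda pair: pair[1])
    return below[:limit]
```

The same function works for responders, teams, or services, since each is just a name paired with a score.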
So what is the Answer to the Ultimate Question of Life, the Universe and Everything?
More stable systems and happier employees. That’s exactly 42 characters! WOW!