If you’ve ever been on call, you know that the incidents don’t stop because you have the flu. Or when you’re attending your child’s high school graduation. Or, as I found out firsthand, even when you’re at your own wedding. Confucius once said, “If you have never had a major occasion happen while you are on call, then you may not have ever lived.” (Okay, I totally made that one up.)
All joking aside, life happens. A Schedule Override (or, as we affectionately call it, an “Override”) is a setting in PagerDuty schedules that lets you ask someone else to take over part or all of your on-call shift. This is useful for planned vacations, unplanned illnesses, or other life events that fall during your shift, because you can change the on-call responder without changing the whole on-call rotation or schedule.
Why else is this awesome? Because, as I also found out firsthand, instead of lugging your laptop to your dog’s first birthday party, you can ask your favorite teammate if they would be willing to take over being on call for a few hours while you celebrate the end of bi-monthly vet visits.
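To make the idea concrete, here is a minimal sketch of how an override can sit on top of a rotation without touching the rotation itself. It is not PagerDuty’s implementation or API: the `Schedule` and `Override` classes, their fields, and the responder names are all hypothetical, invented for illustration. The one assumption worth noting is that when two overrides overlap, the most recently added one wins.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Override:
    """Hypothetical model: `user` covers the half-open window [start, end)."""
    user: str
    start: datetime
    end: datetime


@dataclass
class Schedule:
    """Hypothetical model: a fixed-length rotation with overrides layered on top."""
    rotation: list[str]        # responders, in rotation order
    rotation_start: datetime   # when the first responder's shift begins
    shift_length: timedelta    # e.g. 7 days for a weekly rotation
    overrides: list[Override] = field(default_factory=list)

    def scheduled(self, at: datetime) -> str:
        """Who the base rotation puts on call at `at`, ignoring overrides."""
        shifts_elapsed = (at - self.rotation_start) // self.shift_length
        return self.rotation[shifts_elapsed % len(self.rotation)]

    def on_call(self, at: datetime) -> str:
        """Who actually gets paged at `at`; the latest matching override wins."""
        for override in reversed(self.overrides):
            if override.start <= at < override.end:
                return override.user
        return self.scheduled(at)


schedule = Schedule(
    rotation=["alice", "bob", "carol"],
    rotation_start=datetime(2024, 1, 1),
    shift_length=timedelta(days=7),
)

# Alice is on call the first week; Bob covers three hours on Saturday afternoon.
schedule.overrides.append(
    Override("bob", datetime(2024, 1, 6, 14), datetime(2024, 1, 6, 17))
)

print(schedule.on_call(datetime(2024, 1, 6, 15)))  # bob (inside the override)
print(schedule.on_call(datetime(2024, 1, 6, 18)))  # alice (override has ended)
print(schedule.scheduled(datetime(2024, 1, 8)))    # bob (rotation is unchanged)
```

The rotation list is never modified: the override only changes the answer to “who gets paged right now?” for the window it covers, which is why the rest of the team’s schedule stays exactly as planned.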
A lot of our customers have a DevOps culture or are transitioning to a DevOps structure. In a DevOps culture, engineers are encouraged to code it, ship it, and own it, which means that if their team’s code breaks, they are the ones responsible for fixing it. This culture encourages the team to do a number of things: write better code, write better tests, have more stable deploys, and pre-emptively have a rollback plan. As a result, when the team does have to wake up in the middle of the night for an incident, the cause is less likely to be their own code. And since the engineers are now also the responders, we eliminate the classic “over the wall” dilemma.
We built PagerDuty with every intention of empowering each engineer/responder to manage their own on-call life, in addition to their code. In PagerDuty, each user determines what services they’re responsible for and what their on-call rotation looks like, including when to schedule overrides.
The Override feature is the most human feature in PagerDuty. As you may have learned from one of our previous blog posts about Operations Health, employees who bear the majority of the on-call burden get burnt out. Burnt-out employees don’t perform as well on the job, make more mistakes, and ultimately cost the company time and resources. Not only that, but they may quit out of sheer exhaustion or outright frustration at having their lives constantly interrupted by work-related calls, which means their company loses a skilled responder at a cost of up to $300,000 per person.
We work in an industry where we have tons of tools to measure the health of servers, the stability of applications, and the responsiveness of web pages, and even another tool on top of those tools to notify you of unhealthy servers, unstable applications, and unresponsive web pages! Yet we keep our customers happy and our business productive at the expense of the health of our responders, who work all day and night to fix a bug or miss their third-grader’s first theater debut to fix a deployment issue. We often neglect the health of these real people, who spend their weekends, evenings, and sometimes even sleeping hours ensuring our digital systems are up and running.
This is where overrides can help. This year, during the PagerDuty University event at Summit, I talked to a gentleman who had his own ideas for scheduling overrides. Dan Wade from Vacasa shared that his team is scheduled on a 24/7 weekly rotation, where each responder is on call for 7 days at a time. He noticed that one of his teammates had a particularly rough on-call rotation: a few Severity 1 incidents occurred while she was on call, and each one took days to resolve. Knowing that she hadn’t slept for a few days, he took it upon himself to take over the remainder of her on-call shift so she could get some much-needed rest. Dan’s teammate ended up being a happier, more productive employee because he showed empathy for her situation.
Dan was not only a hero for his team, but also a role model we can all learn from. For a modern-day tech worker, being on call is no longer limited to the Ops guys/gals; it applies to anyone working with a digital signal. Digital signals are indifferent to time of day, special occasions, life events, or fatigue. It falls to you, as a coworker, to step up and share some of your available resources, whether that be time, energy, or love.
Remember: The next time you’re on call, do you want it to be the “Boulevard of Broken Dreams” or “Wake Me Up When September Ends”?