Why We Use On-Call Shadowing
On-call shadowing is an essential practice at PagerDuty. For a new engineer, a shadowing period serves as a kinder, smoother...
by Max Timchenko
March 26, 2019
A few weeks ago, I went to my first baseball game. The San Francisco Giants were playing the San Diego Padres at AT&T Park, and my relatives had an extra ticket for me. I met them at the front entrance of the park, and when we entered I took the whole spectacle in: the big LED lights, the endless rows, the infinite hallway of food stalls. After gathering the necessary garlic fries and chicken tenders, we made our way to our seats.
The Giants scored three runs in the first inning, and the whole stadium was electric with excitement. But by the sixth inning, the lead had flipped, and the Padres were beating the Giants by three runs. Tensions were high in the ballpark, with the few fans clad in blue getting louder and louder, while those sporting orange grew silent.
The Ghirardelli man was making his rounds, and my cousin flagged him down to get us some hot chocolate. He makes his way to our row, pours a cup and passes it down to me. My hands grab the cup, and then my phone goes off. I’m startled by the loud ring and vibration, and the cup of hot chocolate slips from my hands. My cousin sitting beside me catches it, though my jeans take some stains. The spectators behind us complain, telling me to silence my phone. My phone was on silent, though. I had configured it to make a sound only if it was an alert from PagerDuty.
“Hold that for me, there’s something I have to do,” I tell her.
“You okay? What’s wrong?” my cousin asks.
“There’s been an incident, I need to go.”
I grab my headphones from my purse, stand up, push my way past the legs of the seated spectators of row three, and run up the stairs.
I roamed around trying to find a private place to take the call, but everywhere I went the speakers blared and the cheers reverberated throughout the stadium. At the end of the food hall, I spot the illuminated sign and book it into the bathroom. The acoustics only amplify the crowd’s jeering, but I am running out of time. I pick the stall farthest from the entrance, put the toilet seat down, plug my headphones in and join the call. I mute my microphone; I do not want the background noise to bother anybody. I am the third person to join the call, and I enter mid-conversation.
“We’re waiting for the on-call member from the EM team,” a voice says.
“All right, who is the EM on-call?” another asks.
“I’m not sure. We’ll just wait and-” the first voice is interrupted.
“Hello?” A third voice.
“Hi there,” someone replies.
“Hi, this is the EM on-call.”
“Hello, what’s the situation, and what’s your status on resolving it?”
“I already resolved it, but let me get on the portal to make sure everything’s okay.”
“What!” I yell in disbelief. I cover my mouth, then realize (with relief) that they cannot hear me. It had only been two minutes since the initial alert was sent, and the on-call engineer had already resolved the incident before joining the call. Over the next few minutes, the three voices rattled off numbers and analyzed metrics. While I had no idea what any of it meant, I took it from their calm tone and lack of swearing that we were out of trouble.
“Yeah, it’s back to normal now.”
“Awesome. Do you have any reason to think this will happen again?”
“No, I don’t think that this will come up again, but I will keep an eye out.”
“All right then. Well, thanks for handling this.”
“No problem, thank you everybody for being here. Goodbye.”
“Goodbye, have a good weekend.”
The conference call ends, and I look at my phone screen. 8 minutes and 38 seconds. 8 minutes to resolve an incident, or to talk about it anyway. I sit there in the bathroom stall, dumbfounded. I come out of the stall to wash my hands, and notice in the mirror that I have not attended to the dark hot chocolate stains on my jeans.
As I start trying to wipe the splotches away, I realize how grossly underprepared I was for what had happened. I was stressed and flustered, and I was only shadowing. One, I did not have my laptop with me. Two, my phone was at 15% charge. Three, I had had one too many beers; I doubt I could have solved any sort of technical problem, let alone explained what I was solving to someone else. If I had been the on-call engineer, I would have struck out. I would have let my team down.
That evening, the Giants came back in the bottom of the ninth inning, and I realized that being on-call is somewhat like baseball. Specifically, being on-call is like being the batter when your team has two outs, has a runner on third base, and is down by one run in the bottom of the ninth inning. In that moment, the team’s success rides on you and you alone. In front of you, you have teammates on the bases, and their success is entirely dependent on yours. Behind you, you have the rest of the team in the dugout, waiting to see whether you fail or fly.
The batter swings and the ball is in play. That was when it clicked. With PagerDuty, the on-call engineer is no longer the lone batter, and is instead one of the players on the field. With PagerDuty, being on-call ceases to be an individual endeavor: it becomes a team sport. Instead of having to sift through thousands of alerts to find the problem and solve it on his own, the on-call engineer has a team to support him, a central line on which they can communicate, and a platform that filters out all the unnecessary noise. When the ball is in play, they assess the situation and pass it to whoever is best positioned to solve the problem, all with the common goal of resolving the issue before it shows up on the customer’s screen.
I do not have a technical background in engineering or computer science, nor am I a huge sports fan, so I find it humorously ironic that I was able to make sense of both of these things by putting them together.