Becoming a new parent has been one of the most difficult challenges I’ve ever faced. That wasn’t a huge surprise; I’d been fully expecting it. But what I did find surprising was that my experiences with PagerDuty helped prepare me in ways I never anticipated. The following highlights some of my adventures in fatherhood so far, and how years of working as an on-call engineer have given me some much-needed assistance during these first few months.
Rich with his daughter, the new BabyDuty “service”.
It’s worth noting that this is a little tongue-in-cheek, and that I’m not really an emotionless robot who sees my daughter as a PagerDuty service… Not yet anyway.
When my wife and I found out that we were pregnant (let’s be honest though, one of us was more pregnant than the other), one of my concerns was that when labor started, I would be somehow stuck down a mine shaft without a cell signal and would have no way of knowing. This feeling was exacerbated by the fact we’d just moved to a new city and my office was no longer a 10-minute drive from home.
I wanted to be absolutely sure that I didn’t miss a message from my wife telling me labor had started, so PagerDuty was an obvious solution. All I had to do was create a new PagerDuty service, give my wife the contact details, and set the escalation policy to page me via every possible notification method — every minute, on a loop — until I acknowledge. Unsurprisingly, this works really well!
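For anyone curious what the trigger side of such a setup could look like, here is a minimal sketch that builds a PagerDuty Events API v2 "trigger" event. The post does not say how the page was actually sent, so treat this as one possible route: the helper name `build_trigger_event`, the routing key, the `source` value, and the summary text are all placeholders invented for illustration. The sketch only builds and prints the JSON body and does not make a network call.

```python
import json

# Public Events API v2 endpoint; an event is sent as an HTTP POST of the JSON body.
EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"


def build_trigger_event(routing_key, summary, source="family-phone", severity="critical"):
    """Build the body of an Events API v2 'trigger' event.

    routing_key is the integration key of the target service (placeholder here).
    source and the default values are hypothetical, chosen for this example.
    """
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
        },
    }


# Example: print the body that would be POSTed to EVENTS_API_URL.
event = build_trigger_event("YOUR_INTEGRATION_KEY", "Labor has started")
print(json.dumps(event, indent=2))
```

Note that the "every minute, on a loop, until I acknowledge" behavior is not part of the event itself. It comes from how notifications and escalation are configured for the service and the user on the PagerDuty side.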
As with all good alert management mechanisms, something I’ve learned the hard way is that you need to make sure to test that it works too. I would occasionally ask my wife to test it, putting my mind at ease that everything was still set up correctly. Knowing that we had this set up helped to keep me calm and prepared in the weeks leading up to the big day. This kind of preparation would only help in the days and weeks to come.
It turns out the PagerDuty alert we set up for labor has actually come in useful after my daughter was born too.
Picture the scene. It’s the middle of the night of our second week home, the perfect time to get paged. My daughter has woken up and it’s my turn to change and feed her. Unfortunately, I’m a heavy sleeper, and a baby’s cries don’t really wake me, nor does the nudging of my wife. One thing definitely does wake me though: my PagerDuty notification sound! It’s a custom notification sound I’ve used for the last 15 years of being on-call, and it is a unique ringtone that no one else has. Merely hearing that sound sends me from deep sleep to adrenaline-fueled awakeness in a matter of microseconds.
Whenever my wife needs my help, whether it’s day or night, she knows that she can page me and I’ll always respond in seconds. We’ve used this in all sorts of situations, such as when my wife was trapped with a sleeping baby in her arms and needed me to bring her food and water while I was working in the yard. She could page me with a single touch of her phone rather than trying to type a message or call to me and potentially wake our sleeping baby.
It’s also been useful as a second alarm clock. If I’m so exhausted that I sleep through my normal alarm, my wife, who’s already up with the baby, sometimes pages me to let me know I’ll be late for work otherwise.
I’m still working on putting some of my hard-earned automation skills to good use by wiring up a baby monitor to trigger PagerDuty alerts, thus skipping the need for my wife to manually page me. Keep an eye out for the “Baby Pager 5000”, coming soon.
One of my roles at PagerDuty is that of an Incident Commander. I’ve been heavily involved in PagerDuty’s incident response process since I joined the company, and earlier this year I open-sourced our incident response documentation to share some of the things we’ve learned. I also started, and continue to run, our incident commander training sessions.
Having this kind of incident response experience has been invaluable in my life outside of PagerDuty in ways I never would have expected. I used to get easily frustrated if I couldn’t solve problems quickly, but I’ve started to notice that I’m now a lot calmer when faced with these types of issues, and work the problem instead. I find myself not as worried when new and unique problems pop up; I’m now able to approach the problem methodically rather than just panicking. I trace this all back to my incident response experience.
During labor there was a lot going on. My wife was hooked up to all sorts of machines that go beep and bing; there were tubes, needles, and all sorts of things attached with sticky tape. A variety of different people kept coming and going from the room, each giving different (and sometimes contradictory) instructions: “Count up to 10”, “Count down from 10”, “Push”, “Keep going”, “Breathe”, “Slower”, “Faster”. My wife also had worryingly high blood pressure during the entire process and required additional beeping machines and medications to be administered. To say it was a high-stress environment would be putting it mildly. But that made it all the more important to try and remain as calm as possible.
At some point, I ended up being the only one talking to my wife, repeating the important instructions with a calm cadence I never knew was possible in these moments. As incident commanders we’re taught to remain calm, provide clear and concise instructions, and to keep things progressing towards a resolution while ensuring others don’t get burnt out. All of this training came into play here, and the results were a huge success.
My wife made a point to mention how calm I was throughout the whole experience and how it really helped her to stay calm and focused. I am absolutely sure that without my incident response training, I would not have had such a calm exterior. I can also reliably assure you that it was only a calm exterior. Anyone who says they weren’t at least mildly panicking on the inside during the birth of their child has to be lying, or a robot.
Incident response training has been applicable in many different ways since. Staying calm and focused was practical when driving our tiny baby home for the first time. Quickly sizing up a situation and prioritizing actions was significant when we ended up having to take our daughter back to the hospital less than a day after taking her home. It came up again when driving her home for the second time. Something I also learned is that babies can sense your moods, so if you get frustrated when trying to put them to sleep, they will get frustrated too. Staying calm in these moments is absolutely essential.
One final tip though: I wouldn’t recommend directly applying every aspect of incident response training to childbirth. For example, labor and delivery is not the right time to ask if there are any strong objections.
Scribing during incident response is the process of keeping an accurate timeline of events and making notes of critical information and decisions. This ensures that everyone is on the same page and can review the key events after the fact.
In the first few weeks after your child is born, you have to keep track of all of their er… inputs and outputs. You need to keep a thorough log so that the doctors can see if anything is wrong. If they aren’t outputting often enough, it can be a sign of bigger issues. The problem is that you’re both exhausted. My wife had a real excuse; she’d just given birth. I had less of an excuse.
You don’t want to miss things because you’re tired. It’s very tempting during a feeding or diaper change to just say “Oh, I’ll log this later.” Just as it’s tempting to say “I’ll log this bug later” in software development. Don’t fall into the trap! We both made sure that we logged everything as it happened, without delay.
The result is a meticulous log of our daughter’s eating and expelling patterns, which has been invaluable. It’s incredibly useful to be able to share routines with caregivers, doctors, and each other. It also makes it very easy to fill in the countless wellness questionnaires your doctors will send.
Tracking baby metrics, such as time feeding and sleeping, is invaluable for staying on top of things. This image shows our daughter’s sleeps (yellow) and feeds (blue) since she was born. Each vertical line represents a day going top to bottom.
We got so good at keeping the log that we’ve continued to do it to this day. It’s fantastic to be able to see the data change as our daughter’s sleeping patterns became more organized, and it provided a good motivational boost in those early weeks. As you can see, things really got better as time went on. As any operations team will tell you, the key is to track the metrics.
Monitoring metrics is only important if you can raise actionable alerts on those metrics when they don’t do what you expect.
While we track how many times a day our daughter has done her business, what happens if she hasn’t in a while? Would we even notice? What’s great about tracking the metrics is that you can also set alerts for them. Which is exactly what we did!
If we haven’t logged a feed in over 5 hours, we have set a rule to page us both. If our baby hasn’t left us a gift in over 24 hours, same thing. Automating these alerts has taken a load off our minds, and helped to reduce our stress levels. It’s also been useful when I’m taking care of her to know when the last time she fed was, since I might not have been home at the time of her last feeding. It’s essentially automated an on-call shift handover.
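The two rules above boil down to a staleness check over the log. Here is a minimal sketch of that idea, assuming the log is simply a list of `(kind, timestamp)` tuples. The post does not describe the actual app or rule engine we used, so the function name `overdue_events`, the log format, and the sample timestamps are all made up for illustration; only the 5-hour and 24-hour thresholds come from the text.

```python
from datetime import datetime, timedelta

# Thresholds from the text: a feed at least every 5 hours, a diaper every 24 hours.
THRESHOLDS = {
    "feed": timedelta(hours=5),
    "diaper": timedelta(hours=24),
}


def overdue_events(log, now, thresholds=THRESHOLDS):
    """Return the event kinds whose most recent log entry is older than its threshold.

    log is a list of (kind, timestamp) tuples (hypothetical format).
    A kind that has never been logged is treated as overdue.
    """
    latest = {}
    for kind, ts in log:
        if kind not in latest or ts > latest[kind]:
            latest[kind] = ts

    overdue = []
    for kind, limit in thresholds.items():
        last = latest.get(kind)
        if last is None or now - last > limit:
            overdue.append(kind)
    return overdue


# Made-up sample data: last feed 5.5 hours ago, last diaper 3 hours ago.
now = datetime(2020, 1, 1, 12, 0)
log = [
    ("feed", datetime(2020, 1, 1, 2, 0)),
    ("feed", datetime(2020, 1, 1, 6, 30)),
    ("diaper", datetime(2020, 1, 1, 9, 0)),
]
print(overdue_events(log, now))  # ['feed']
```

Running a check like this on a schedule and paging whenever the returned list is non-empty gives the safety net described above. The same "latest entry per kind" lookup is also what answers the handover question of when she last fed.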
Sometimes you freak out thinking that you’ve forgotten something important and you’re a horrible parent, but by putting in the time to pre-define some alerts, it’s taken away a huge source of stress, as we now know we’ve got an additional safety net. (Jury’s still out on if I’m a horrible parent. Ask again in a few years).
Not to mention all this will give me some great lines when my daughter is old enough: “You’ve been a pain in my ass since you were born and I have the metrics and graphs to prove it.”
“Daddy, there’s no way I pooped that many times on Tuesday…”
One thing that I’ve always had when on-call is a backup. A safety net. Knowing that if I can’t figure out a problem, I can call in someone to work it with me. When it comes to personal life, my wife is my backup, and I’m hers. We rely on each other to provide assistance and help whenever we need it.
But the problem now is that we’re both worn out most of the time. Four months in, our daughter went through a sleep regression. We were both physically and emotionally exhausted. It was my turn to put her back to sleep while my wife tried to sleep. Our daughter was crying and kicking because she was over-tired or just didn’t want to sleep.
I couldn’t escalate to my wife, since she had to get some much-needed rest. There was no one else I could hand over to. I couldn’t just give up and call it a day. What can you do when escalation isn’t an option?
What it comes down to is the same for our Incident Commander process: keep calm, identify the symptoms, work the problem. If I’m on an incident call and am presented with a problem without any backup, what do I do? Do I just give up and call it a day? No! I work the problem and figure it out. It might take longer, but the problem will get solved eventually.
In this case, I systematically worked through the things that could be wrong. Checked her diaper, tried to burp her, bounced her, walked around the room, shushed her with calming sounds, etc. Methodically work through each thing until you find something that works.
Everyone told me before my daughter was born that the quicker you can get into a pattern, the better it’ll be. Easy for people to say, a lot harder to actually do it when you have a screaming two-week-old in your arms. Any system you come up with gets thrown out of the window pretty quickly.
In the first few weeks, my wife and I worked in shifts so the other could eat and rest. This worked really well for a bit, until I had to return to work — completely throwing off our pattern.
But we came up with a new schedule and eventually got into a new pattern. This has become our on-call rotation, and has worked very well. Any time during our rotation, any incident (crying, etc.) that occurs is handled by the current on-call, while the other is free to do what they want.
This also comes in useful for other things such as conferences, friends visiting, dinners, etc. as we can just swap our on-call shift for another time. We follow the same patterns as we would if it were a real on-call shift.
Much to my disappointment, you are not provided with a simple runbook or flow chart for how to take care of a new baby (start-up idea!). The standard practice in cloud computing of resources being disposable doesn’t really apply here. You cannot reboot a baby. If a server is misbehaving, you can just terminate it and get a new one provisioned. You definitely can’t do that with a baby!
There are actually a vast number of resources available about how to take care of babies. Unfortunately, most of the information is conflicting and doesn’t really have much data or science backing it up. It’s like having a collection of out-of-date documentation. You don’t know which bits will work and which won’t.
But having a runbook is vital when dealing with the same situation over and over again. You can’t automate putting a newborn to sleep (another start-up idea!), but you can document a routine that works for you and make sure everyone who puts your baby to sleep follows the same procedure.
We tried out various routines for nap and bedtime, working through trial and error to find out a solution that got her to sleep effectively. Once we had a pattern that worked, we documented it with clear steps, providing instructions on what to do at each step if something didn’t work: “If she’s still fussy after 2 minutes, go to section 3”, etc. This has become our runbook, and has worked amazingly well.
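A runbook with "if that didn’t work, go to section N" branches maps naturally onto a small data structure. The sketch below is only an illustration of that shape: the section contents are invented (borrowing the kinds of checks mentioned earlier in this post), they are not our actual bedtime routine, and the names `RUNBOOK` and `walk_runbook` are hypothetical.

```python
# Each section names an action, how long to give it, and where to go if it fails.
# The contents are illustrative only, not the real routine.
RUNBOOK = {
    "1": {"action": "Check diaper", "wait_minutes": 2, "if_still_fussy": "2"},
    "2": {"action": "Try to burp her", "wait_minutes": 2, "if_still_fussy": "3"},
    "3": {"action": "Walk around the room and shush", "wait_minutes": 5, "if_still_fussy": None},
}


def walk_runbook(runbook, start, is_settled):
    """Follow the runbook from `start` and return the list of sections visited.

    is_settled(section) should return True if the step in that section worked.
    Stops when a step works, when a section has no follow-up, or when a
    section would be visited twice (to avoid looping forever).
    """
    visited = []
    section = start
    while section is not None and section not in visited:
        visited.append(section)
        if is_settled(section):
            break
        section = runbook[section]["if_still_fussy"]
    return visited


# Example: pretend the burping step (section 2) is the one that works.
for section in walk_runbook(RUNBOOK, "1", lambda s: s == "2"):
    step = RUNBOOK[section]
    print(f"Section {section}: {step['action']} (wait {step['wait_minutes']} min)")
```

Writing the routine down in this form forces every step to say what happens next when it fails, which is exactly the part that is easy to leave out of a runbook.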
It ensures that everyone who might be putting our daughter down for a nap or bedtime follows the same pattern, every time. It keeps things consistent for our daughter so that she knows what’s coming up, and means we get her to sleep with the minimal amount of fuss.
Documenting these types of routines has become useful for other things too. Checking weight and growth milestones follows a similar pattern. What to feed and when to feed, etc.
When you don’t have a runbook, you make your own based on past experience and build it up over time. Turns out it’s not much different with a baby. Just keep good documentation and learn from experience.
And just like with code, there’s no such thing as self-documenting babies.
Going on-call has always involved preparation. Specifically with making sure things are charged. A day before I go on-call, my routine is to make sure my laptop and mi-fi are both charging. I also make sure my phone backup batteries are fully charged, and I have one in each of my backpacks so I won’t get caught without them.
Whereas before with on-call I would have to keep my laptop with me at all times when out and about, it’s now a diaper bag. But the steps are remarkably similar. Before an outing, make sure supplies are properly stocked, and that you have extras for redundancy. Make sure I have multiples in various bags that I might be using. And finally, make sure I have my phone backup batteries fully charged. Because some things never change.
Just like a major incident call, looking after a baby is a serious responsibility. It’s no time for jokes. You need to remain professional the entire time, never breaking from your seriousness. Except, that’s a complete lie.
A few weeks after our daughter was born, we were having trouble putting her to sleep. She was growing increasingly fussy, and nothing we tried seemed to work. My wife and I had swapped between ourselves about three times as we each got burnt out, and we were both close to breaking down. I cradled my daughter in my arms, and held her close to try and rock her to sleep. Then it happened. She obviously thought it was feeding time, but wasn’t aware that her father isn’t able to feed her in the way she wanted. She wasn’t going to let that stop her from trying though. My wife and I broke out in laughter, and in turn so did our daughter. The seriousness of the situation was broken and we just couldn’t stop laughing. After hours of having a fussy baby, a lighter mood was just what we needed to keep our sanity and turn a stressful environment into a relaxed one.
During times of high stress, taking a moment to laugh can be one of the best things for your mindset. Newcomers to PagerDuty are often surprised when joining an incident call to hear a joke every now and then, and people laughing. “Why are people laughing if there’s a critical incident in progress? Surely they should be solving the incident instead?” While moments should be chosen wisely, and there are certainly times when it would be inappropriate, I’ve found that injecting a joke now and then to lighten the mood actually improves the incident response process. Sometimes people just need a chance to relax a little, even in serious times.
Despite all the things I’ve learned at PagerDuty and from previous engineering/on-call jobs that have helped with parenting, the one thing everyone always said was that it’ll be just like being on-call. It isn’t.
In the weeks and months leading up to the birth of my daughter, pretty much everyone constantly wanted to remind me that I’m “obviously super prepared to be woken up in the middle of the night thanks to years of being on-call.” I’d just fake a smile and laugh. Turns out it doesn’t prepare you. Not even close.
I’ve spent years being woken up by pages at 3am, glaring at a computer screen trying to fix an issue. Things are different when it’s a baby crying and screaming in your face. I’ve been on-call in the airline industry, where minutes matter, and one mistake in a weight and balance calculation can cause a plane to crash. That wasn’t as stressful as the first few weeks as a new parent.
But like on-call engineers, there’s one thing I can always count on: the compassion of others in the same position. As on-call engineers, we regale each other with stories of our worst and best times being on-call, with the problem we solved late at night in a moment of clarity. We have each other’s backs if we get hit with a particularly hard problem.
The support of other parents is something I’ve been overwhelmed with. The stories are different, the problems are similar, but the community element is something I would never take for granted.
Whether you’re a parent or not, there are lessons to be learned and applied from the things you do. Incident response training is useful in many situations, whether it’s a production server exploding, a fender bender on the highway, or a midnight explosive diaper change. The lessons and tactics we use as engineers can be applied to situations we’ve probably never even considered.
“Daddy, I can’t even walk yet. Why are you using me for commercial branding? I better get royalties for this!”