What have you done to improve the lives of your co-workers who are on-call? I posed this question to my Twitter followers to see what folks are doing to make things as delightful as possible for the people keeping a watchful eye on our critical systems. The following is a small sampling of what I learned.
write notes of what i did on every incident, document workarounds, tune alert timing, tune log searches that trigger on call to exclude non-emergency conditions
— alimac (@alimacio) December 12, 2017
@alimac, Operations Engineer
When you are in the middle of an incident, the last thing you want to do is chase down tribal knowledge. Reinventing the wheel on every outage wastes time and costs the organization money. Easy access to historical information makes fixes and resolutions reproducible.
One method to help provide context to folks in the middle of a firefight is PagerDuty’s alert grouping capability, which automatically groups related alerts into a single rich incident to reduce noise while centralizing context. Similarly, if you use a tool like Slack to collect your details during the incident, PagerDuty’s Postmortems feature can ingest them into the report.
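To illustrate the general idea of grouping, here is a minimal sketch that folds alerts for the same service into one incident when they arrive close together in time. This is not PagerDuty's grouping algorithm. The `Alert` and `Incident` types, the `group_alerts` function, and the five-minute window are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Alert:
    service: str
    summary: str
    received_at: datetime


@dataclass
class Incident:
    service: str
    alerts: list = field(default_factory=list)


def group_alerts(alerts, window=timedelta(minutes=5)):
    """Group alerts per service: an alert joins the service's latest
    incident if it arrives within `window` of that incident's last alert.
    (Hypothetical heuristic for illustration only.)"""
    incidents = []
    latest_by_service = {}
    for alert in sorted(alerts, key=lambda a: a.received_at):
        current = latest_by_service.get(alert.service)
        if (current is not None
                and alert.received_at - current.alerts[-1].received_at <= window):
            current.alerts.append(alert)
        else:
            # Start a new incident for this service.
            current = Incident(service=alert.service, alerts=[alert])
            incidents.append(current)
            latest_by_service[alert.service] = current
    return incidents
```

With this sketch, two database alerts three minutes apart become one incident, while a third database alert twenty minutes later opens a new one. The responder sees fewer pages, each carrying all of its related alerts.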
We talk about every alert from the last 24/weekend every day. No broken windows.
— Andy Fleener (@andyfleener) December 13, 2017
If we don’t treat every outage or alert as something to learn from or improve, we risk the Normalization of Deviance effect: we start to accept alerts and degradations as normal. Our standards suffer, and we let things slip through the cracks.
To prevent this, it’s essential to drive towards a culture of learning. According to Ron Westrum, in a generative, performance-oriented organization, “failure leads to inquiry.” Failures are opportunities to make things better—but only if we take the time and effort to learn from them.
From sales : set expectations.
— eric snyder (@esnyds) December 13, 2017
“Sales and Account Management are often in the middle of an escalation,” said Eric Snyder, Sr. Director of Channels at Auth0. “Relaying the sensitivity and providing context to the on-call team is as important as setting expectations with the customer. Managing customer expectations and communications gives the on-call team space and time to get the fix in place. A best practice is for Sales / Account Management to know their own teams just as well as they know their customer.”
Let’s make sure that we are setting the right expectations. Don’t demand five nines of reliability just because “well, five is better than four.” Why do you need five? Have you tied your metrics to a business outcome?
Likewise, your speed metrics shouldn’t be “faster than last month.” And beware of inaccurate extrapolation. You might have data suggesting that if your page load time increases by a second, conversion drops by 50 percent. But that doesn’t mean that if you reduce load time by a second, conversion will increase by 50 percent. Correlation doesn’t always equal causation, and the same numbers don’t move the dials in both directions.
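To make the cost of each extra nine concrete, a quick calculation of the yearly downtime budget can help frame the conversation. The sketch below assumes a 365-day year and counts all downtime against the budget. The function name `downtime_budget_minutes` is chosen for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def downtime_budget_minutes(nines: int) -> float:
    """Allowed downtime per year, in minutes, for an availability target
    of `nines` nines (e.g. 3 -> 99.9%)."""
    unavailability = 10 ** -nines
    return MINUTES_PER_YEAR * unavailability


for n in range(2, 6):
    print(f"{n} nines: {downtime_budget_minutes(n):.1f} minutes/year")
```

Four nines allows a little under an hour of downtime per year. Five nines allows roughly five minutes, which is less time than it takes most people to wake up and open a laptop. Whether the business outcome justifies that difference is the question worth asking.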
The whole @cookieopsdotcom thing. It's died down but we baked a *lot* of cookies for ops folks
— Chris Corriere (@cacorriere) December 13, 2017
Everyone has a driver. Some people are recognition oriented. Some are money driven. But almost everyone likes cookies, or other treats.
“The best cookies are grandma’s cannoli but they’re a lot of work,” said Chris Corriere, Systems Engineer of Ecology Computing, which provides system modeling, mapping, and adaptation services. “But we have chocolate chip on lockdown and can churn them out.”
You don’t need to be a great baker (but I think everyone should try to make Pete Cheslock’s Chocolate Crinkle Cookies). The key is to show appreciation, and to show it with more than a Slack message or a $5 Starbucks gift card. A gesture like this demonstrates that you took the time to reflect on the great responsibility your co-workers took on: watching over the business you both work for during the wee hours of the night. It’s the least you could do.
Different policies for daytime and nighttime alerts so we’re woken up less during the night.
— Emma Sax (@emma_sax4) December 12, 2017
Likewise, remember that there’s a person at the other end of that alert. Context matters for when and how you are alerted: the same failure can have a different criticality depending on the time of day. For example, the General Ledger application for your business may only be used by U.S.-based employees and only during the workday. If the system fails to respond during this time, that is a high-urgency alert, and you should be informed immediately! But if it fails at 1 a.m., is it worth getting the on-call engineer out of bed? Likely not.
If you use PagerDuty, you can configure the urgency for an individual service based on the time of day (or on other criteria, such as payload information). For the General Ledger service in the example above, we could set a high urgency during working hours and a much lower one outside of them.
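The decision logic behind such a rule is simple enough to sketch. The example below is not PagerDuty configuration or API code. It is a hypothetical `urgency_for` function that assumes a Monday-to-Friday, 9 a.m. to 5 p.m. workday and an alert timestamp already expressed in the users' local time.

```python
from datetime import datetime, time

# Assumed working hours for the hypothetical General Ledger service.
WORKDAY_START = time(9, 0)
WORKDAY_END = time(17, 0)


def urgency_for(alert_time: datetime) -> str:
    """Return "high" for alerts during working hours on weekdays,
    "low" otherwise. `alert_time` is assumed to be in local time."""
    is_weekday = alert_time.weekday() < 5  # Monday=0 ... Friday=4
    in_hours = WORKDAY_START <= alert_time.time() < WORKDAY_END
    return "high" if is_weekday and in_hours else "low"


print(urgency_for(datetime(2017, 12, 12, 14, 0)))  # Tuesday afternoon: high
print(urgency_for(datetime(2017, 12, 12, 1, 0)))   # Tuesday, 1 a.m.: low
```

A low-urgency result does not mean the alert is dropped. It can still open an incident that waits for the morning instead of waking someone up.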
What is one thing you have done to improve the lives of your on-call colleagues? Let us know in the comments or tweet @pagerduty with the hashtag #pageitforward!