What have you done to improve the lives of your co-workers who are on-call? I posed this question to my Twitter followers to see what folks are doing to make things as delightful as possible for the people keeping a watchful eye on our critical systems. The following is a small sampling of what I learned.
@alimac, Operations Engineer
When you are in the middle of an issue, the last thing you want to do is chase down tribal knowledge. Reinventing the wheel on every outage wastes time and costs the organization money. Easy access to historical information makes fixes and resolutions reproducible.
One method to help provide context to folks in the middle of a firefight is PagerDuty’s alert grouping capability, which automatically groups related alerts into a single rich incident to reduce noise while centralizing context. Similarly, if you use a tool like Slack to collect your details during the incident, PagerDuty’s Postmortems feature can ingest them into the report.
Review All The Things
Andy Fleener, Platform Operations Manager, SportsEngine
If we don’t treat every outage or alert as something to learn from or something to improve, we run the risk of the Normalization of Deviance effect: we start to treat alerts or degradations as normal. Our standards suffer. We let things slip through the cracks.
To prevent this, it’s essential to drive towards a culture of learning. According to Ron Westrum, in a generative, performance-oriented organization, “failure leads to inquiry.” Failures are opportunities to make things better—but only if we take the time and effort to learn from them.
“Sales and Account Management are often in the middle of an escalation,” said Eric Snyder, Sr. Director of Channels at Auth0. “Relaying the sensitivity and providing context to the on-call team is as important as setting expectations with the customer. Managing customer expectations and communications gives the on-call team space and time to get the fix in place. A best practice is for Sales / Account Management to know their own teams just as well as they know their customer.”
Let’s make sure that we are setting the proper expectations. We don’t want to expect five 9s of reliability just because “well, five is better than four.” Why do you need five? Have you tied your metrics to a business outcome?
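To make the difference between four and five 9s concrete, it helps to turn each target into a yearly downtime budget. The snippet below is a minimal sketch of that arithmetic, not anything PagerDuty-specific; the function name `downtime_budget_minutes` is made up for illustration, and it assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(nines: int) -> float:
    """Allowed downtime per year, in minutes, for an availability of `nines` 9s.

    Three 9s means 99.9% availability, i.e. 0.1% (10 ** -3) unavailability.
    """
    unavailability = 10 ** -nines
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"{n} nines: {downtime_budget_minutes(n):.2f} minutes/year")
```

Four 9s leaves roughly 52.6 minutes of downtime a year; five 9s leaves about 5.3. Whether the business outcome justifies the cost of closing that gap is the question to answer before picking the number.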
Likewise, your speed metrics shouldn’t be “faster than last month.” And beware of inaccurate extrapolation. You might have data suggesting that if your page load time increases by a second, conversion drops by 50 percent. But that doesn’t mean that if you reduce load time by a second, conversion will increase by 50 percent. Correlation doesn’t always equal causation, and the same numbers don’t move the dials in both directions.
People Are People
Everyone has a driver. Some people are recognition oriented. Some are money driven. But almost everyone likes cookies, or other treats.
“The best cookies are grandma’s cannoli but they’re a lot of work,” said Chris Corriere, Systems Engineer of Ecology Computing, which provides system modeling, mapping, and adaptation services. “But we have chocolate chip on lockdown and can churn them out.”
You don’t need to be a great baker (but I think everyone should try to make Pete Cheslock’s Chocolate Crinkle Cookies). The key is to show appreciation, and to show it with more than a Slack message or a $5 Starbucks gift card. Do something that demonstrates you took time to reflect and understand that your co-workers took on a great responsibility: watching over the business that you both work for during the wee nighttime hours. It’s the least you could do.
Emma Sax, Software Engineer, SportsEngine
Likewise, remember that there’s a person at the other end of that alert. Context matters for when and how you are alerted: the same failure can have a different criticality depending on the time of day. For example, the General Ledger application for your business may only be used by U.S.-based employees and only during the workday. If the system fails to respond during the workday, that is a high-urgency alert, and you should be informed immediately! But if it fails at 1 a.m., is it worth getting the on-call engineer out of bed? Likely not.
If you use PagerDuty, you can configure the urgency on an individual service based on the time of day (or other criteria like payload information). Given the example above, for the General Ledger service, we could set a higher urgency during working hours and a much lower one outside of them.
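The rule behind that kind of setup is simple enough to sketch in a few lines. The example below is only an illustration of the decision logic, not PagerDuty’s API or configuration format; the function `urgency_for` and the 9-to-5, Monday-to-Friday window are assumptions, and the timestamp is assumed to already be in the office’s local time.

```python
from datetime import datetime, time

# Assumed support hours for the hypothetical General Ledger service.
WORK_START = time(9, 0)
WORK_END = time(17, 0)


def urgency_for(local_time: datetime) -> str:
    """Return "high" during weekday working hours, "low" otherwise.

    `local_time` is a naive datetime expressed in the office's local time.
    """
    is_weekday = local_time.weekday() < 5  # Monday is 0, Sunday is 6
    in_hours = WORK_START <= local_time.time() < WORK_END
    return "high" if is_weekday and in_hours else "low"


if __name__ == "__main__":
    # A Tuesday mid-morning failure versus the same failure at 1 a.m.
    print(urgency_for(datetime(2024, 1, 9, 10, 30)))
    print(urgency_for(datetime(2024, 1, 9, 1, 0)))
```

A low-urgency incident still gets recorded and followed up; it just waits until someone is awake to deal with it.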
So Let’s Page It Forward
What is one thing you have done to improve the lives of your on-call colleagues? Let us know in the comments or tweet @pagerduty with the hashtag #pageitforward!