by Dave Bresci
December 5, 2018
“You code it, you own it” means engineers are called when the software and systems they’ve built fail in production and it’s their responsibility to get everything working again. However, managers and business stakeholders aren’t usually on-call so they don’t see or feel the pain of being paged. This can lead to work prioritization decisions that lack empathy and fail to take into account the responsibility we all have for operational resiliency. Managers push for delivery of new features and higher output over work that addresses operational pain. The engineers see problems and feel powerless to solve them. Over time this conflict results in expensive outages that hurt the team, the business, and customers.
Small issues are usually an early warning sign of more serious problems. If they’re fixed as soon as they arise, bigger problems can be avoided in the long run and your team and customers stay happy.
So, how do we get proactive and make fixing operational problems a habit? Empowering the team with effective on-call handoff sessions is a great place to start!
When our on-call team members go off duty and hand the baton to their teammates, we use this time to expose operational problems, discuss solutions, and empower the team to initiate action. Here are a few tips for effective on-call handoff sessions based on my experience of being on-call at a number of companies, including PagerDuty.
It’s easy to miss the problems engineers face on-call if the team only talks about operational issues in engineering chat rooms. We hold regular, dedicated handoff sessions to encourage reflection and create a bias for proactive action that addresses root causes. Our schedules usually change once a week, so we hold the meeting on the day of the changeover.
Being on-call and waking up to incidents can be disruptive and stressful. We include other stakeholders in the on-call handoff meeting to build a sense of camaraderie and empathy, which ultimately leads to better decision making across the organization.
Our product managers benefit from understanding the impact of operational pain on engineers and customers. Exposure to handoff sessions allows PMs to hear the impact of their prioritization decisions and to ensure that both product and technical initiatives move forward during work planning sessions.
The goal of engineering leaders is to foster a team culture where individuals are happy, motivated, creative and engaged. By observing on-call handoff sessions and carefully listening to concerns, people managers get exposure to insights that may not be uncovered in team/one-on-one meetings. Following the session, leaders can take action to provide support and resources. Encouraging engineers to take well-deserved time off or helping prioritize the team’s technical/operational recommendations are two examples.
It’s easy for teams to get accustomed to disruption when it builds up gradually over time, especially if no one is taking a holistic view and noticing worrying patterns. Reviewing metrics during the handoff session promotes a culture of observability that lets the team see the true picture of operational health: both infrastructural health and human health.
Here are metrics and tools we’ve found useful during our handoff sessions:
Team disruption statistics: PagerDuty provides valuable data and graphs showing total incidents by service, team, and user. Comparing counts at each review allows us to reflect on patterns and discuss solutions.
Chat history: By using a chat integration (Slack, HipChat, etc.), all incident notifications can be sent to a dedicated channel. Our engineers chat in the same channel as the incident notifications, so it’s easy to identify and analyze conversation threads showing trending topics and concerns.
Use PagerDuty’s public APIs to create custom reports and apps: PagerDuty’s APIs let you build reports and apps tailored to your business. For example, we’ve created an extension that gives an instant picture of how much out-of-hours disruption the on-call team members have had, based on the time of day and frequency of high-priority incidents. By sharing this view across the team in the handoff session, we see a picture of team health that motivates us to take action.
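To make the out-of-hours idea concrete, here is a minimal sketch of that kind of report. It is not the extension described above. It assumes the incidents have already been exported (for example, from PagerDuty’s API) and reduced to simple records. The field names `created_at`, `urgency`, and `responder`, the 09:00 to 18:00 weekday working window, and the sample data are all assumptions for illustration, and timestamps are treated as the responder’s local time.

```python
from collections import Counter
from datetime import datetime

# Assumed definition of working hours: 09:00-18:00 local time, Monday-Friday.
WORK_START_HOUR = 9
WORK_END_HOUR = 18


def is_out_of_hours(ts: datetime) -> bool:
    """Return True if ts falls on a weekend or outside the assumed working hours."""
    if ts.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return True
    return not (WORK_START_HOUR <= ts.hour < WORK_END_HOUR)


def out_of_hours_disruption(incidents: list[dict]) -> Counter:
    """Count high-urgency, out-of-hours incidents per responder."""
    counts = Counter()
    for incident in incidents:
        if incident["urgency"] != "high":
            continue
        # Timestamps are assumed to be naive ISO 8601 strings in local time.
        created = datetime.fromisoformat(incident["created_at"])
        if is_out_of_hours(created):
            counts[incident["responder"]] += 1
    return counts


# Hypothetical sample data for one on-call week.
sample_incidents = [
    {"created_at": "2018-12-03T02:15:00", "urgency": "high", "responder": "alice"},
    {"created_at": "2018-12-03T11:30:00", "urgency": "high", "responder": "alice"},
    {"created_at": "2018-12-04T23:05:00", "urgency": "low", "responder": "bob"},
    {"created_at": "2018-12-08T14:00:00", "urgency": "high", "responder": "bob"},
    {"created_at": "2018-12-09T03:40:00", "urgency": "high", "responder": "alice"},
]

report = out_of_hours_disruption(sample_incidents)
for responder, count in report.most_common():
    print(f"{responder}: {count} out-of-hours high-urgency incident(s)")
```

Running the same count at every handoff and comparing it with the previous rotation is one simple way to spot a responder whose nights and weekends are being interrupted more and more often.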
Areas of concern uncovered during the on-call handoff sessions must be followed up with concrete actions. PagerDuty’s Jira integration makes it easy to track unplanned work from right inside an incident. It’s then just a short step to assign this work to the on-call engineer (see the next section, “Reinforce expectations for on-call duties,” to understand how this works).
If improvements are noted and correlated back to concrete actions, it’s much more likely those improvements will happen.
Remember to review the result of changes in subsequent on-call handover sessions and adjust your approach based on what was learned.
Many teams fall into the trap of failing to set clear expectations of on-call and see it as just ‘part of the job’ rather than a dedicated, critical role. How can you stay out of this trap? We set clear expectations:
At the on-call handover session, it’s important to check in on these expectations and reinforce the message that operational improvement requires effort: humans need time and space to be able to focus on it. They also need downtime and a sustainable workload.
For more advice on best practices for being on-call, check out our On-Call Survival Guide.
Having engineers on-call is an effective way to encourage continuous improvement and system stability. However, it only works if everyone in the organization understands how to play their part in making it successful. Even if you are not an engineer, your decisions are likely to have unintended side effects on the well-being of engineers and the systems they’re building. Getting involved in on-call handoff sessions and encouraging proactive resolution of problems leads to happy teams and successful products. I encourage you to look at your own organization and reflect on ways you can build empathy across teams using similar techniques. Share your ideas and suggestions in our Community forum!