It’s rare to find a business today that doesn’t rely on digital technologies and services. Retail is one example: Whether customers are buying online or in store, completing a transaction requires a website or point-of-sale system. The entire supply chain relies on IT services to deliver goods on time and to the right locations. And just like any company today, every department, from development and marketing to HR and business services, has a critical tech stack.
This increasing reliance on technology as the engine of the business, coupled with user expectations of always-available services, means technology disruptions or degradations have an immediate impact on customers. It also means that the ability to respond to and resolve issues in real time is more important than ever. And while some organizations are set up to handle real-time work, most are not: they lack the technology to support it, their processes are designed around queued work, and the employees called upon to do this work have neither the necessary knowledge nor the authority to perform it.
In the digital era, at any moment, thousands of customers may be pushing the proverbial “buy” button. This means that every moment is a potential “moment of truth”—the moment where you succeed or fail in the eyes of your customers. All of us have had poor experiences or failed transactions, and the ramifications are huge for the affected businesses, with the Rand Group reporting that an hour of downtime costs $1 – 5 million for a third of enterprises.
From research we conducted, we found that organizations experience 22 incidents per month on average (7 major incidents and 15 minor incidents). Major incidents take 5 hours to resolve on average, and at 7 major incidents per month, this amounts to 35 hours of down or degraded services per month. Incident counts vary across organizations, but the pattern is clear: digital services do sometimes fail. How effectively an organization can predict and avoid issues, and mitigate, respond to, and handle them when they do occur, can mean the difference between having happy customers and having no customers. This is where the ability to work in real time really matters.
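The arithmetic behind that 35-hour figure can be sketched in a few lines of Python. This is only an illustration: the function name and constants are ours, and it assumes that major incidents do not overlap and that minor incidents add no downtime, neither of which the research states.

```python
def monthly_degraded_hours(major_incidents_per_month, mean_hours_to_resolve):
    """Estimate hours of down or degraded service per month.

    Assumes major incidents never overlap in time (an illustrative
    simplification, not something the research claims).
    """
    return major_incidents_per_month * mean_hours_to_resolve


# Averages quoted in the research above.
MAJOR_PER_MONTH = 7
MINOR_PER_MONTH = 15
MEAN_HOURS_PER_MAJOR = 5

total_incidents = MAJOR_PER_MONTH + MINOR_PER_MONTH
degraded_hours = monthly_degraded_hours(MAJOR_PER_MONTH, MEAN_HOURS_PER_MAJOR)

print(f"{total_incidents} incidents per month, "
      f"{degraded_hours} hours of down or degraded service")
```

Swapping in your own incident count and mean time to resolve gives a rough first estimate of how exposed your organization is.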
But what does excellence in real-time work look like? How does an organization develop that muscle? What are the benchmarks that can be achieved?
At PagerDuty, we realized there was no way for organizations to measure their real-time operations efficiency—and so we decided to build a method. We constructed the industry’s first Real-Time Operations Maturity Model, based on nine years of working with PagerDuty customers and developing our own best practices. The model lays out what excellence in real-time operations looks like, includes the metrics and behaviors with which to measure maturity, and helps organizations assess how mature they currently are. But perhaps most importantly, it details what benefits you and your customers can expect to see as your organization’s digital maturity improves.
The Real-Time Operations Maturity Model has four different levels:
To help determine what level of real-time operations maturity organizations have achieved today, we engaged with IDG to conduct a survey of 600 IT leaders and practitioners in the U.S., U.K., and the Australia – New Zealand region. Respondents represent industries ranging from technology and finance to communications and manufacturing.
So what did we find? Most organizations still have a long way to go in order to achieve real-time operations maturity and fully realize the benefits.
From the survey data, we found that mature organizations learn from past issues: incidents are automatically documented and made available, and improvements are quickly implemented. Response processes are well defined and coordinated, and they leverage automation as much as possible to reduce manual work, which lets employees spend more time on innovation.
Mature companies also complete postmortems for 77 percent of incidents and complete 78 percent of follow-up tasks, taking full advantage of the opportunity to learn from incidents and taking steps to implement improvements that help reduce the risk of the same incidents recurring.
Only a small number of survey respondents reported that they measure health at the team and organizational level, in addition to properly managing workloads. This is concerning, as data has shown that being on call can have a major negative impact on on-call responders’ happiness, both at work and in their personal lives. Responders can be interrupted by calls that wake them up at night and pull them away from important family events, and it is even worse when they find out that many of the alerts are unactionable anyway, either due to lack of information or false alarms.
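If you want to track the same two ratios for your own team, a minimal sketch might look like the following. The `Incident` record and its field names are hypothetical, invented here for illustration; they are not part of the survey or of any PagerDuty API, and the sample data is made up.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # Hypothetical record shape, for illustration only.
    postmortem_done: bool
    followups_total: int
    followups_done: int


def postmortem_rate(incidents):
    """Fraction of incidents that have a completed postmortem."""
    if not incidents:
        return 0.0
    return sum(i.postmortem_done for i in incidents) / len(incidents)


def followup_completion_rate(incidents):
    """Fraction of all follow-up tasks, across incidents, that are done."""
    total = sum(i.followups_total for i in incidents)
    if total == 0:
        return 0.0
    return sum(i.followups_done for i in incidents) / total


# Made-up sample data, not survey results.
sample = [
    Incident(postmortem_done=True, followups_total=2, followups_done=2),
    Incident(postmortem_done=True, followups_total=3, followups_done=2),
    Incident(postmortem_done=False, followups_total=0, followups_done=0),
    Incident(postmortem_done=True, followups_total=3, followups_done=2),
]

print(f"Postmortems completed: {postmortem_rate(sample):.0%}")
print(f"Follow-up tasks completed: {followup_completion_rate(sample):.0%}")
```

Comparing these numbers with the 77 and 78 percent reported for mature companies gives a quick, if rough, point of reference.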
As one of our other research reports, “The State of IT Work-Life Balance,” points out, the risk of burnout for on-call responders is very real, resulting in high turnover and the loss of highly skilled technical employees for organizations—which can mean even slower response times and more unhappy customers when incidents arise.
But there is good news: The survey results also revealed that organizations can reduce employee burnout and attrition by increasing automation and reducing the number of unactionable alerts. In fact, the data shows that more mature companies create roughly 40 percent fewer incidents for every alert, resolve 40 percent more issues with automation, and experience a 21 percent lower on-call responder employee attrition rate when compared to their less mature counterparts.
Still not convinced about the business impact of effective real-time operations? The IDG survey also found that, compared to their less mature peers, mature organizations:
As the survey data shows, there is a real and large positive impact on your business, your customers, and your bottom line when you have the right technology, process, knowledge sharing, and culture in place for managing real-time work.
To learn more about the model and how you can implement it in your organization, attend our upcoming webinar, IBM & PagerDuty: Driving Real-Time Operations Excellence, at 10 a.m. PT on Tuesday, December 4, or check out our summary of survey results, developed in conjunction with IDG.
Ready to see how mature your organization is? Take an abridged version of the survey.