PagerDuty Blog

The Cost of Operational Immaturity

What Is Digital Operational Maturity?

Digital operational maturity is defined as an organization’s effectiveness at real-time work and ability to focus on performance metrics that improve as the organization becomes more adept at responding to incidents.

Based on extensive research and nine years of industry data, in conjunction with a survey of 600+ respondents from across industries, PagerDuty developed a model that identifies the following four levels of operational maturity:

  1. Reactive organizations tend to discover most issues when customers report them, but don’t have processes around responding to these issues. First-line responders often don’t have the skills, knowledge, or authority to resolve most service issues.
  2. Responsive organizations surface some issues before they impact customers. First-line responders have the skills, and are beginning to acquire the authority, to prevent issues. However, they still lack the necessary information to do their jobs.
  3. Proactive organizations surface and resolve most issues before customers are aware and affected. Learnings from past issues are automatically documented and responders are empowered with the knowledge and authority to resolve current issues and prevent future issues.
  4. Preventative organizations surface and resolve almost all issues before they affect customers. This level is extremely difficult to achieve, and the hallmark of an organization operating at this level is a culture of continuous learning. Their responders are fully empowered with the knowledge and authority to resolve current issues and prevent future problems. These organizations heavily utilize automation throughout the real-time response process.

PagerDuty has identified five key focus areas which are critical in determining an organization’s operational maturity, shown in the figure below:

Let’s dive into more detail as to how the five areas of focus can help improve your business.

1. Improve the Customer Experience

Positioning your business to always think “customer first” is critical in today’s market. This means doing everything possible to give your customers a seamless and uninterrupted experience. Consider the following scenario:

You’re at your favorite Philz Coffee in the South of Market district in San Francisco and start craving a burrito. You open ridesharing app A on your phone, but it doesn’t load. So you wait a few more seconds. Nope, still not loading. Finally, you give up, close the app—and open competitor ridesharing app B. So that’s one lost sale for A, right?

Not quite.

The next time you need a ride, there’s a good chance you’ll remember your last experience with rideshare company A, so you open company B’s app first. That one subpar experience resulted in damage to company A’s reputation, leading to potential lost sales in the future. In fact, Gartner estimates that the cost of downtime for the average company is $5,600 per minute or approximately $300,000 per hour.1

That’s a lot of potential money lost. To retain customers and revenue, companies need to ensure their digital businesses are running nearly 100 percent of the time. However, our survey found that over 50 percent of respondents described their organizations as either Responsive or Reactive. These organizations don’t have processes in place to prevent or reduce customer-impacting downtime, which results in lost revenue and damaged customer relationships.
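To put the Gartner per-minute figure in perspective, the short Python sketch below scales it to outages of different lengths. It assumes cost grows linearly with outage duration, which is a simplification, and the function and constant names are illustrative only.

```python
# Gartner's per-minute estimate as quoted above; real costs vary widely by business.
COST_PER_MINUTE_USD = 5_600


def downtime_cost(minutes: float, cost_per_minute: float = COST_PER_MINUTE_USD) -> float:
    """Estimate the cost of an outage, assuming cost scales linearly with time."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return minutes * cost_per_minute


# Print the estimated cost for a few example outage lengths.
for minutes in (5, 30, 60):
    print(f"{minutes:>3} min outage: ${downtime_cost(minutes):,.0f}")
```

Under this assumption a full hour works out to $336,000, which is in line with the roughly $300,000 per hour quoted above.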

2. Improve Employee Health and Company Culture

In their simplest form, companies are nothing more than groups of humans working toward a common goal. (And PagerDuty has some of the best humans. Come join us!) Those humans are also a company’s most valuable assets, so taking steps to prevent burnout is critical to maintaining employee happiness and productivity, which in turn reduces turnover.

Why should organizations care about turnover? Well, aside from losing valuable institutional knowledge, a recent PagerDuty study of IT professionals found that replacing just one experienced IT responder can cost up to $300,000. With those kinds of numbers, it’s in a company’s best interest to track individual and overall team health by asking questions like:

  • How often are employees being paged after work hours?
  • What’s the annual turnover rate for IT responders on your team?

Having access to these metrics is valuable: we found that excessive non-actionable alerts and repetitive incidents lead to frustrated employees who become disengaged from their work. Our data also showed that while the impact of reducing the number of alerts isn’t immediately apparent, it makes a huge difference over time.

In fact, our study found that more mature organizations that took steps to reduce alerts saw a 21 percent lower attrition rate among their on-call responders than their less mature peers. For a company with 50 on-call responders and a 10 percent attrition rate, a 21 percent reduction means roughly one fewer departure per year, which at up to $300,000 per replacement translates into $315,000 saved per year!
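The savings figure above can be reproduced with a few lines of Python. This is a minimal sketch: the function name and parameters are illustrative, and it uses the "up to $300,000" replacement cost quoted earlier, so the result is best read as an upper bound.

```python
def annual_attrition_savings(
    responders: int,
    attrition_rate: float,
    relative_reduction: float,
    replacement_cost: float,
) -> float:
    """Estimate yearly savings from a relative drop in responder attrition."""
    # Responders expected to leave each year at the current attrition rate.
    expected_departures = responders * attrition_rate
    # Departures avoided if attrition falls by the given relative amount.
    avoided_departures = expected_departures * relative_reduction
    return avoided_departures * replacement_cost


# Figures from the text: 50 responders, 10% attrition, 21% reduction,
# and a replacement cost of up to $300,000 per responder.
savings = annual_attrition_savings(50, 0.10, 0.21, 300_000)
print(f"Estimated savings: ${savings:,.0f} per year")
```

Plugging in your own team size, attrition rate, and replacement cost gives a quick first estimate of what alert reduction might be worth.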

3. Optimize Processes to Reduce MTTR

Process optimization is critical as organizations scale. But it’s more than simply identifying and following best practices—it also means companies need to evaluate existing processes to pinpoint areas for improvement.

For example, stakeholder notification during incidents is a process that can be improved at most organizations. As we explained in an earlier blog, building a communication strategy to update people outside of the core response team enables on-call responders to spend more time resolving an incident.

Additionally, the incident response process isn’t complete just because the incident is resolved. A critical component of digital operations at mature companies includes a blameless postmortem process that allows teams to conduct root cause analyses, which help identify patterns and provide insights that can help prevent similar incidents from recurring.

The surprising thing is that only 50 percent of respondents were either Proactive or Preventative. These organizations have blameless postmortem processes in place and experience substantially fewer customer-impacting incidents. The other half were either Reactive or Responsive, spending hours trying to find the right responders and seeing incidents recur because they could not identify and address the root causes. For an e-commerce company, minutes of downtime due to an inability to notify the right responders could lead to tens of thousands of dollars of lost revenue.

4. Foster a Practice of Knowledge Sharing

Access to knowledge and the elimination of knowledge “silos” across teams are critical to building a mature digital operations practice throughout an organization. Our study found that mature companies that excel at fostering a culture of knowledge sharing experienced 14 fewer hours of downtime per month than their less mature peers (assuming 7 major incidents per month). In a market where anything less than 99 percent uptime is unacceptable, this is a critical metric to protect. Any amount of unplanned downtime can lead not only to lost revenue, but also to lost customer trust.
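To see how the 14-hour gap compares with a 99 percent uptime target, here is a small Python sketch. It assumes a 30-day month (720 hours), which the study does not specify, and the helper names are illustrative.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month


def allowed_downtime_hours(uptime_target: float, period_hours: float = HOURS_PER_MONTH) -> float:
    """Hours of downtime permitted in a period for a given uptime target (0 to 1)."""
    return period_hours * (1 - uptime_target)


def uptime_ceiling(downtime_hours: float, period_hours: float = HOURS_PER_MONTH) -> float:
    """Best possible uptime if at least `downtime_hours` of downtime occurred."""
    return 1 - downtime_hours / period_hours


gap_hours = 14        # downtime gap between mature and less mature peers
major_incidents = 7   # major incidents per month assumed in the study

print(f"Gap per incident: {gap_hours / major_incidents:.1f} hours")
print(f"Monthly budget at 99% uptime: {allowed_downtime_hours(0.99):.1f} hours")
print(f"Best-case uptime with 14 hours down: {uptime_ceiling(gap_hours):.2%}")
```

Under these assumptions, the gap works out to 2 hours per major incident, and 14 hours is roughly double the 7.2 hours of downtime that a 99 percent monthly target allows.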

5. Use the Right Technology and Tools

Many technologies and tools are available today to help organizations optimize business processes, and building a healthy and mature digital operations practice is no different. On-call employees should not be drowning in a flood of notifications and unactionable alerts. The table below outlines key findings on how three metrics corresponded to an organization’s maturity level.

Some notable takeaways:

  • The most mature respondents reported that only 25 percent of their alerts were unactionable, while the least mature respondents reported 28 percent.
  • The most mature respondents had an incident-to-alert ratio of approximately 1:3, while the least mature respondents had an incident-to-alert ratio of approximately 1:1.
  • The most mature respondents resolved 57 percent of issues with automation, while the least mature resolved only 16 percent of issues with automation.
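If you want to track the same three metrics for your own team, the Python sketch below shows one way to compute them from raw counts. The `AlertStats` fields are hypothetical and do not correspond to any PagerDuty API; the sketch also assumes that "issues resolved with automation" can be counted per incident. The sample counts are made up to match the most mature profile above.

```python
from dataclasses import dataclass


@dataclass
class AlertStats:
    """Hypothetical monthly counts for one team (field names are illustrative)."""
    alerts: int                   # total alerts fired
    unactionable_alerts: int      # alerts that required no action
    incidents: int                # incidents created from those alerts
    auto_resolved_incidents: int  # incidents resolved through automation


def summarize(stats: AlertStats) -> dict[str, float]:
    """Compute the three maturity metrics discussed above."""
    if stats.alerts <= 0 or stats.incidents <= 0:
        raise ValueError("alerts and incidents must be positive")
    return {
        "unactionable_pct": 100 * stats.unactionable_alerts / stats.alerts,
        # An incident-to-alert ratio of 1:3 means 3 alerts per incident.
        "alerts_per_incident": stats.alerts / stats.incidents,
        "automation_pct": 100 * stats.auto_resolved_incidents / stats.incidents,
    }


# Made-up counts chosen to reproduce the "most mature" figures from the list.
example = AlertStats(alerts=300, unactionable_alerts=75,
                     incidents=100, auto_resolved_incidents=57)
for name, value in summarize(example).items():
    print(f"{name}: {value:.1f}")
```

Tracking these numbers month over month is one way to see whether noise reduction and automation efforts are moving a team toward the more mature profile.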

The most common trait mature organizations share is the use of technology to learn from past incidents, whether it’s resolving incidents before they become customer-impacting or reducing alert noise. If employees are focused on reviewing unactionable alerts and manually resolving incidents that could be resolved through automation, countless personnel hours are lost, hours that could be spent innovating or working on another issue. The key is finding the right tools for your teams so your organization’s digital operations can mature from Reactive to Preventative, and focus on driving innovation to improve the customer experience.

Interested in learning more about how you can improve your organization’s operational maturity? Contact us today.