This UK-based education company had already gone digital with its products and services. The next step to remain competitive was to modernize its infrastructure, moving siloed workloads and applications from a physical data center to Amazon Web Services (AWS). An integral part of this strategy was shifting to a DevOps approach, with distributed teams adopting a customer-centric service ownership model to ensure offerings were accessible 24/7, 365 days a year.
Millions of students and teachers around the world rely on this company’s products and services as a foundational part of their education. So, when mission-critical digital resources for testing and learning incur downtime, the consequences are serious. An incident could prevent students from logging in to complete coursework or prevent a teacher from administering a high-stakes exam, costing classrooms learning time and costing the company customer faith (and, ultimately, revenue).
The development team wanted to focus on improving existing products and launching new ones but was spending most of its time responding to incidents. Innovation was being pushed to the back burner. Internal security threats, bugs, third-party 404 error codes, and network issues were some of the hurdles teams had to investigate and resolve.
In these scenarios, seconds matter. It took time for the globally distributed teams to identify the root cause and then coordinate with the right internal and external stakeholders. The damage caused by downtime wasn’t just customer-facing; the engineering teams were burnt out from navigating continuous, major issues with limited visibility. The company was beginning its cloud-first strategy with its revenue-generating services and needed a solid platform that would orchestrate the right response across disparate teams—and keep downtime-causing incidents at an absolute minimum.
The company was committed to providing their customers with an “always on” solution, so they searched for a partner who would ensure that their digital transformation included best-in-class digital operations management. To achieve these goals, the education leader selected PagerDuty.
By modeling services in PagerDuty as they exist in the company’s infrastructure, there is clarity around dependencies, business impact, and team ownership. Now, if an incident occurs—for example, when educational content isn’t available—PagerDuty automatically mobilizes the right cross-functional team in seconds so they can start working on the issue. PagerDuty provides responders with all the information they need to resolve the incident quickly, while making it easy to manage stakeholder communication so that others in the business are aware of what’s happening.
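To make the idea of modeling services with their dependencies and owners concrete, here is a minimal sketch in Python. It is not PagerDuty's API or the company's actual configuration; the service names, team names, and the helper functions `impacted_services` and `teams_to_mobilize` are all hypothetical, chosen only to show how a dependency map can identify which teams an incident touches.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    """A hypothetical service record: who owns it and what it depends on."""
    name: str
    owning_team: str
    depends_on: list[str] = field(default_factory=list)


def impacted_services(catalog: dict[str, Service], failing: str) -> list[str]:
    """Return the failing service plus every service that depends on it,
    directly or transitively, in breadth-first order."""
    impacted = [failing]
    queue = [failing]
    while queue:
        current = queue.pop(0)
        for svc in catalog.values():
            if current in svc.depends_on and svc.name not in impacted:
                impacted.append(svc.name)
                queue.append(svc.name)
    return impacted


def teams_to_mobilize(catalog: dict[str, Service], failing: str) -> list[str]:
    """Owning teams of all impacted services, deduplicated, order preserved."""
    teams = []
    for name in impacted_services(catalog, failing):
        team = catalog[name].owning_team
        if team not in teams:
            teams.append(team)
    return teams


# Illustrative catalog only; every name here is an assumption.
catalog = {
    s.name: s
    for s in [
        Service("content-store", "platform-team"),
        Service("content-api", "content-team", depends_on=["content-store"]),
        Service("student-portal", "web-team", depends_on=["content-api"]),
        Service("exam-delivery", "assessment-team", depends_on=["content-api"]),
        Service("billing", "finance-team"),
    ]
}

print(impacted_services(catalog, "content-store"))
print(teams_to_mobilize(catalog, "content-store"))
```

In this sketch, a failure in the hypothetical `content-store` service reaches the portal and exam services through `content-api`, so four teams are identified, while an unrelated service such as `billing` involves only its own team.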
Once the incident is resolved, PagerDuty’s Postmortem and Analytics capabilities help the team determine how the problem could be prevented in the future, so the company learns from each customer-impacting incident and can scale best-in-class incident management practices.
PagerDuty enables the company to more efficiently coordinate and automate digital operations, accelerating the company’s move to service ownership and enhancing its functionality in the cloud. The result? Drastically reduced downtime rates. To date, the longest-lasting incident went from over 50 days pre-PagerDuty to under 30 minutes.
Since distributed teams can get closer to their work—and are diagnosing and treating customer-facing problems more rapidly—customers can rely on quick resolutions.
PagerDuty’s ability to increase visibility across customer and service teams led them to identify, and mitigate, a major pain point: iterative code changes or new code releases that cause unexpected hiccups. With PagerDuty, teams can decipher digital signals, understand dependencies, and immediately identify the right plan of action to respond to an incident. Problem-causing updates are now rolled back much faster.
Ultimately, this evolved the company’s engineering culture from a reactive to a preventative one, increasing resource capacity for revenue-generating activities by deflecting unnecessary noise and non-critical work. The team now receives one-third of the interruptions it once did and predicts a 10% savings in labor costs due to improved user productivity with PagerDuty. As engineering teams spend less time on issue resolution, they can spend more time on launching new products, upholding the company’s competitive edge and enhancing team health.
With reduced incident duration and downtime, fewer interruptions, and increased productivity, the company expects $3 million in annual savings and a 196% return on investment (ROI) over the next three years. Further, this customer-centric approach is working. “The improvements have reflected back in our Net Promoter Score (NPS),” shared a Director of Enterprise Operations.
The company plans to expand its use of PagerDuty to additional business units and to integrate additional services with the PagerDuty platform. Engineering is looking forward to implementing PagerDuty Process Automation to operate faster and more efficiently in the future.
To learn how PagerDuty can help your team simplify and transform operations in a digital-first world, contact your account manager or try a 14-day free trial today.