SAP is the market leader in enterprise application software, with customers in over 180 countries. More than three-quarters of the world’s transaction revenue touches an SAP system.
In recent years, SAP has been on a journey to digitally transform its business and move customer-facing services to the cloud. As part of his role within the Global Cloud Services (GCS) team, Mitchell Rose, Senior Program Manager, is responsible for the global uptime of these services. “SAP’s vision is to help the world run better and improve people’s lives,” he explained, “but to do this, we need to ensure that there are fewer—and less impactful—cloud outages and incidents that affect our customers.”
The Global Cloud Services team’s vision is to help technology teams within SAP ensure that their cloud services and infrastructure remain always on through intelligent outage management. “This meant creating a major incident service that could scale at an SAP level, helping us to ensure the uptime of services such as Ariba, Concur, and Fieldglass,” said Rose.
The team knew that developing and rolling out such a service in an organization the size of SAP would be challenging. Many teams were using in-house tools customized for their own needs, but these tools did not scale across the entire organization. Over the years, SAP’s acquisitions had also introduced different tools and processes across the organization, making collaboration and cohesion difficult.
“Across teams, there were very different operating models,” explained Rose. “There were differences in operational definitions, and the word ‘priority’ had different meanings to different teams. They also had different ticketing systems, ChatOps tools, processes, and practices. To be successful, we needed a best-of-breed platform that mapped to our vision for major incident response. This is why we adopted PagerDuty.”
SAP’s Global Cloud Services team uses PagerDuty to orchestrate its major incident response. Since adopting PagerDuty, SAP has improved its major incident handling: within two months, initial response and communication times for critical incidents fell by 30% and resolution times fell by 26%.
“We have successfully reduced the impact and duration of major incidents,” shared Rose. “With PagerDuty, we’re able to engage the right people, on the right issues, at the right time. As a result, we’ve reduced the number of people needed to resolve major incidents by 25% in just two months.”
PagerDuty has also helped improve communication between teams and stakeholders. When SAP needs to triage critical, customer-impacting incidents, such as cloud service disruptions, it activates “SWAT” mode, its internal critical response procedure. The SWAT team then drives internal business communications, including with the teams responsible for customer communications.
Through PagerDuty, the SWAT team has access to real-time information about the status of an incident, allowing them to keep other stakeholders—including senior management—updated. Decisions to engage SWAT mode are made more quickly as a result, helping to reduce major incident response time from hours to minutes in many cases.
GCS has made PagerDuty a key part of its major incident framework so that it can better collaborate with Major Incident Management (MIM) teams across SAP. Now, when a major incident occurs, the relevant team (such as the SuccessFactors or Ariba MIM team) is notified to help coordinate the best response.
“PagerDuty helped us align core business and technology teams around a common operating model for major incident response,” said Rose. “By using a common framework, we have aligned on processes and criteria for severity and priority. We’re also driving clear responsibility for services during a major incident, which has been scaled at an SAP level.”
Since implementing PagerDuty, SAP’s Global Cloud Services team has improved its operational excellence.
“PagerDuty has become mission critical for SAP, enabling our teams to collaborate and rapidly respond to major incidents, and helping us to continue to provide SAP customers with world-class digital services,” concluded Rose.
SAP’s Global Cloud Services team continues to work on improving incident troubleshooting and plans to use PagerDuty postmortem reports, as well as past incidents, to help troubleshoot current issues. In addition, SAP wants to further automate its major incident response process by using PagerDuty to create automated runbooks aligned to key business impact metrics.