As a global cloud-based performance and security solution for over 6,000,000 Internet assets, Cloudflare ensures that customer websites, applications, and APIs are secure, performant, and highly available. Because Cloudflare serves over 10% of the world’s Internet traffic, it is imperative that its services remain online for customers at scale, and that the company meets its SLA uptime guarantees by identifying and resolving incidents long before operations are disrupted. Cloudflare’s Site Reliability Engineering (SRE) team, led by Michael Daly, sought an incident resolution solution that would help Cloudflare increase the stability of its operations while delivering a flawless experience for every customer.
Challenges: Visibility, Communication, and Escalation
Cloudflare faced three challenges before adopting PagerDuty. The first was visibility. “We didn’t immediately know when something was broken because the engineering team did not receive automated alerts when an incident occurred,” Michael explained.
The second challenge was managing incidents. Once a problem was discovered, the engineering team relied on manual processes to address it. Engineers spent time diagnosing the cause of the problem, and if a solution required assistance from another department, SREs had to contact the right person there by phone, text, or chat, a duty that became difficult when incidents occurred after working hours or on weekends.
The third challenge came with scale. As Cloudflare grew rapidly, from fewer than 800,000 customers in 2013 to over 6 million in 2016, it was becoming difficult for Michael’s team to separate actionable, critical incidents from the growing volume of data generated by monitoring tools. While the team refused to discard potentially useful information, they needed to group related symptoms in order to gain actionable insight. Without the dynamic event management and triage, automation, and other capabilities available from PagerDuty, Michael and his staff had to evaluate the seriousness of each incident manually, a process that was becoming too slow to serve the exponentially growing number of customers.
Increasing Stability and Improving Response Time with PagerDuty
By adopting PagerDuty, Cloudflare resolved all of these challenges. PagerDuty ensures that Michael and his team are always notified of incidents as soon as they occur and, if an incident should be handled by a different team, PagerDuty forwards the notification automatically to save time.
The Cloudflare SRE team also uses the Operations Command Console and benefits from capabilities such as the highlighting of high-urgency incidents within the Major Incidents Application. As a result, with full-stack visibility into its infrastructure and pattern and anomaly detection, the team no longer misses serious events. Michael explained, “When we adopted PagerDuty, we were able to take certain alerts and say to ourselves, this one is really important. We need to deal with it now.”
In addition, other capabilities such as PagerDuty’s HipChat integration made it easier for Cloudflare’s SRE team to streamline communication, collaborate, automate ops-related tasks with commands, learn together, and more when responding to incidents. PagerDuty also eliminated the need for SREs to manually look up contact information for the right expert, as individuals, teams, or business stakeholders can be informed and recruited into an incident in just a click. With PagerDuty, they can get in touch instantly.
Most importantly, PagerDuty reduced the time it takes Michael and his team to act on incidents to a small fraction of what it was previously. “Mean-time-to-action has dropped from minutes to seconds,” Michael said, adding that faster response time translates to greater service reliability and better customer outcomes, which is the ultimate goal and the reason Cloudflare sought out PagerDuty in the first place.
“We had several options, but we chose PagerDuty because we had to do less work to make PagerDuty work with our systems. It was very nicely formatted, the API just worked, and the output from the app was very easy to interpret.”