Following up on our previous post from yesterday, we wanted to share the actions we will be taking based on our initial root cause analysis.
As we looked through our timeline of events during the outage, we discovered that there were two issues:
As we have talked about in the past, we prefer to design our systems in a multi-master architecture, as opposed to a failover architecture, to achieve continuous availability. This approach, while requiring significant systems design investment, has several benefits: predictable capacity in degraded scenarios, forced automation, and easier and safer incremental changes. However, we did not have a multi-master architecture in place for our DNS systems. Instead, we required a manual failover to a secondary provider during the outage.
Measuring the end-to-end customer experience is always a challenge in the midst of DNS problems. After all, if a customer cannot talk to your systems, how can you tell what their experience is? We rely heavily on monitoring and alerting on every part of PagerDuty’s services. We have teams of engineers dedicated to making sure that each part of the customer experience is what our customers expect. During this outage, we were unable to properly diagnose customer-facing problems because customers could not reach our systems. This increased the resolution time for our customers.
In the coming weeks, we are looking at making several enhancements to our infrastructure, processes, and automation. These enhancements will help decrease the chance of a system-wide outage for the same root causes identified.
Our top priority, already underway, is redesigning and implementing a new DNS architecture that allows multiple DNS providers to be used in a multi-master approach. We are updating our internal tooling and automation so that our external customer-facing DNS records are served by multiple DNS providers and our internal servers rely on a similar system.
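One practical concern with serving the same zone from several providers is keeping the record sets identical. The following is a minimal sketch of a drift check, not a description of PagerDuty's actual tooling: the provider names, record names, and the `find_drift` helper are all hypothetical, and the zones are plain in-memory dictionaries rather than data fetched from any provider API.

```python
from typing import Dict, FrozenSet, Tuple

# (record name, record type) -> set of values.
# All names and addresses below are illustrative placeholders.
Zone = Dict[Tuple[str, str], FrozenSet[str]]


def find_drift(providers: Dict[str, Zone]) -> Dict[Tuple[str, str], Dict[str, FrozenSet[str]]]:
    """Return every record whose values are not identical across all providers."""
    all_keys = set()
    for zone in providers.values():
        all_keys.update(zone)

    drift = {}
    for key in sorted(all_keys):
        # A provider that lacks the record entirely is treated as an empty set.
        seen = {name: zone.get(key, frozenset()) for name, zone in providers.items()}
        if len(set(seen.values())) > 1:
            drift[key] = seen
    return drift


primary = {
    ("www.example.com.", "A"): frozenset({"192.0.2.10"}),
    ("api.example.com.", "A"): frozenset({"192.0.2.20"}),
}
secondary = {
    ("www.example.com.", "A"): frozenset({"192.0.2.10"}),
    ("api.example.com.", "A"): frozenset({"192.0.2.99"}),  # stale value
}

drift = find_drift({"provider_a": primary, "provider_b": secondary})
for (name, rtype), values in drift.items():
    print(name, rtype, {p: sorted(v) for p, v in values.items()})
```

A check like this could run after every record change so that a stale or missing record on one provider is caught before resolvers start receiving inconsistent answers.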
We have multiple endpoints that our customers use to interact with PagerDuty: our website, our APIs, and our mobile applications. To ensure a consistent experience across all of these, we will be auditing DNS TTLs for our zones, including the NS and SOA records for each zone.
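A TTL audit of this kind can be expressed as a simple policy check. The sketch below assumes a hypothetical policy (a per-type maximum TTL with a default for everything else); the thresholds, the `Record` type, and the sample records are illustrative and do not come from the post.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Record:
    name: str
    rtype: str
    ttl: int  # seconds


# Hypothetical policy values, chosen only for illustration.
MAX_TTL = {"NS": 3600, "SOA": 3600}
DEFAULT_MAX_TTL = 300


def audit_ttls(records: Iterable[Record]) -> List[Record]:
    """Return the records whose TTL exceeds the maximum allowed for their type."""
    return [r for r in records if r.ttl > MAX_TTL.get(r.rtype, DEFAULT_MAX_TTL)]


records = [
    Record("example.com.", "NS", 86400),
    Record("example.com.", "SOA", 3600),
    Record("www.example.com.", "A", 60),
    Record("api.example.com.", "A", 3600),
]

flagged = audit_ttls(records)
for r in flagged:
    print(f"{r.name} {r.rtype} ttl={r.ttl}s exceeds policy")
```

Long TTLs on NS records matter in particular because they determine how long resolvers keep sending queries to a provider that may no longer be answering.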
Many public DNS providers offer the ability to proactively flush caches when records have changed. For example, Google provides this functionality via a web interface. We will be examining which DNS providers our customers use most and determining the steps for each provider to proactively flush caches, so that up-to-date records are served faster when possible.
We leverage a combination of internal monitoring systems and external providers. During this outage, we used these monitoring systems to assess the customer impact and determine how best to prioritize resolution steps. Unfortunately, most of the internal systems are designed to provide a view from within our infrastructure, and they did not adequately describe our end-to-end user experience, especially for our customers on the east and west coasts of the US. We will invest additional resources in global monitoring that takes an external, customer-experience view of our systems and overall service offering. This includes our website, API, mobile, and notification experiences.
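To illustrate what an external, per-region view might look like, here is a minimal sketch that aggregates probe results collected from outside the infrastructure. The region names, endpoint labels, the 90% threshold, and both helper functions are assumptions made for the example; how the probes are actually executed is out of scope.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (region, endpoint, succeeded) as reported by a hypothetical external probe.
Probe = Tuple[str, str, bool]


def availability_by_region(probes: Iterable[Probe]) -> Dict[str, float]:
    """Fraction of successful probes per region."""
    totals = defaultdict(lambda: [0, 0])  # region -> [successes, attempts]
    for region, _endpoint, ok in probes:
        totals[region][0] += int(ok)
        totals[region][1] += 1
    return {region: ok / total for region, (ok, total) in totals.items()}


def degraded_regions(probes: Iterable[Probe], threshold: float = 0.9) -> List[str]:
    """Regions whose availability falls below the threshold, sorted by name."""
    return sorted(
        region
        for region, availability in availability_by_region(probes).items()
        if availability < threshold
    )


sample = [
    ("us-east", "website", False),
    ("us-east", "api", False),
    ("us-east", "mobile", True),
    ("eu-west", "website", True),
    ("eu-west", "api", True),
    ("eu-west", "mobile", True),
]

print(degraded_regions(sample))
```

Grouping by region is the point of the design: a DNS failure that affects only some resolvers can look healthy in aggregate while being a full outage for customers in one geography.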
At PagerDuty, we leverage a service-oriented architecture to support the many features that our customers use. For the majority of our customer-facing incidents, only one part of our service is affected when a disruption occurs. With a central component like DNS unavailable, multiple components of our service were impacted. When bringing our services back up in the future, we need the ability to prioritize the services that matter most to our customers.
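One way to reason about prioritized recovery is as a dependency-aware ordering: a service can only come back once the services it depends on are up, and among those that are ready, the most critical goes first. The sketch below is an assumption-laden illustration; the service names, priority numbers, dependency graph, and the `recovery_order` helper are all hypothetical and do not reflect PagerDuty's real architecture or priorities.

```python
import heapq
from typing import Dict, List, Set


def recovery_order(priority: Dict[str, int], depends_on: Dict[str, Set[str]]) -> List[str]:
    """Order services so dependencies come first; ties go to the lowest priority number."""
    # Copy the dependency sets so the caller's data is not mutated.
    remaining = {s: set(depends_on.get(s, ())) for s in priority}
    ready = [(priority[s], s) for s, deps in remaining.items() if not deps]
    heapq.heapify(ready)

    order = []
    while ready:
        _, service = heapq.heappop(ready)
        order.append(service)
        # Unblock anything that was waiting on the service just restored.
        for other, deps in remaining.items():
            if service in deps:
                deps.remove(service)
                if not deps:
                    heapq.heappush(ready, (priority[other], other))

    if len(order) != len(priority):
        raise ValueError("dependency cycle or unknown dependency")
    return order


# Hypothetical priorities (lower is more critical) and dependencies.
priority = {"dns": 0, "notifications": 1, "api": 2, "website": 3, "analytics": 4}
depends_on = {
    "notifications": {"dns"},
    "api": {"dns"},
    "website": {"dns", "api"},
    "analytics": {"api"},
}

print(recovery_order(priority, depends_on))
```

Encoding the order ahead of time means the decision about what to restore first is made calmly in advance rather than during the incident.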
As called out in the previous section, we have multiple teams on call continuously to keep PagerDuty working properly. While we leverage our own product to assist with our people orchestration efforts, we did not have all of the supporting tooling in place for certain teams involved. We plan to implement processes and improve our best practices so that each team is able to address problems in its own services effectively.
This past Friday was a difficult day for nearly every on-call engineer. At PagerDuty, we take great pride in providing a service that we know thousands of customers rely on. We did not meet the high expectations that we set for ourselves, and we are taking critical steps to continuously enhance the reliability and availability of our systems. From this experience, I am confident we will provide an even more reliable service that will be there when our customers need us the most.
As always, if you have any questions or concerns, please do not hesitate to follow up with me or our Support team at email@example.com.