Service Disruption Root Cause Analysis and Follow-up Actions from October 21st, 2016
Following up on our previous post from yesterday, we wanted to share the actions we will be taking based on our initial root cause analysis.
Primary and Secondary Root Causes
As we looked through our timeline of events during the outage, we discovered that there were two issues:
- Our failover approach to DNS problems
- The quality of monitoring to assess the end-to-end customer experience
As we have talked about in the past, we prefer to design our systems in a multi-master architecture, as opposed to a failover architecture, to achieve continuous availability. While this approach requires significant systems design investment, it has the benefits of predictable capacity in degraded scenarios, forcing increased automation, and making incremental changes easier and safer. However, we did not have a multi-master architecture in place for our DNS systems. Instead, we required a manual failover to a secondary provider during the outage.
Measuring the end-to-end customer experience is always a challenge in the midst of DNS problems. After all, if a customer cannot talk to your systems, how can you tell what their experience is? We rely heavily on monitoring and alerting on every part of PagerDuty’s services. We have teams of engineers dedicated to making sure that each part of the customer experience is what our customers expect. During this outage, we were unable to properly diagnose customer-facing problems because customers were not able to reach our systems. This led to an increased resolution time for our customers.
In the coming weeks, we are looking at making several enhancements to our infrastructure, processes, and automation. These enhancements will help decrease the chance of another system-wide outage from the root causes identified above.
Taking a Multi-Master Approach towards DNS
Our top priority, already underway, is redesigning and implementing a new DNS architecture that leverages multiple DNS providers in a multi-master approach. We are updating our internal tooling and automation to ensure that our external, customer-facing DNS records are served by multiple DNS providers, and that our internal servers leverage a similar system.
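To make the idea concrete, here is a minimal sketch of the kind of check such tooling could run: verifying that a zone's delegated NS set spans more than one provider, so the loss of any single provider does not take the zone offline. The provider names and nameserver suffixes are purely illustrative, not our actual providers.

```python
# Illustrative multi-master DNS check. Provider suffixes are
# placeholder assumptions, not real nameserver domains.
PROVIDER_SUFFIXES = {
    "provider-a": ".provider-a.example.",
    "provider-b": ".provider-b.example.",
}

def providers_serving(ns_records):
    """Return the set of known providers represented in an NS record set."""
    found = set()
    for ns in ns_records:
        for name, suffix in PROVIDER_SUFFIXES.items():
            if ns.lower().endswith(suffix):
                found.add(name)
    return found

def is_multi_master(ns_records, minimum=2):
    """A zone counts as multi-master if at least `minimum` providers serve it."""
    return len(providers_serving(ns_records)) >= minimum
```

A check like this can run continuously against each zone's live NS records and alert if the delegation ever collapses back to a single provider.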
Auditing all DNS TTLs
We have multiple endpoints that our customers use to interact with PagerDuty: Our website, our APIs, and our mobile applications. To ensure a consistent experience across all of these, we will be auditing DNS TTLs for our zones, including NS and SOA records for each zone.
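An audit like this is straightforward to automate once record TTLs have been collected (for example with `dig` or a provider API). The sketch below flags TTLs outside a policy range; the policy numbers here are assumptions for illustration, not PagerDuty's actual values.

```python
# Hypothetical TTL policy: (min_seconds, max_seconds) per record type.
TTL_POLICY = {
    "NS": (3600, 86400),    # delegation records can be longer-lived
    "SOA": (3600, 86400),
    "A": (60, 300),         # customer-facing records should move fast
    "CNAME": (60, 300),
}

def audit_ttls(records):
    """records: iterable of (name, rtype, ttl). Returns out-of-policy entries."""
    violations = []
    for name, rtype, ttl in records:
        low, high = TTL_POLICY.get(rtype, (60, 86400))
        if not (low <= ttl <= high):
            violations.append((name, rtype, ttl))
    return violations
```

Running this over every zone on a schedule turns TTL hygiene from a one-time audit into a continuously enforced invariant.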
Runbook for DNS Cache Flushing
Many public DNS providers offer the ability to proactively flush caches when records have changed. For example, Google provides this functionality via a web interface. We will be examining which DNS providers are most used by our customers, and determining the steps for each provider to proactively flush caches so that up-to-date records reach customers faster when possible.
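One way to capture the result of that research is a small runbook generator: a table of providers and their flush procedures, expanded into a per-record checklist when records change. The provider entries below are placeholders sketching the shape of such a runbook, not verified procedures.

```python
# Placeholder flush procedures; each entry would be replaced with the
# resolver operator's actual, verified steps.
FLUSH_PROCEDURES = {
    "Google Public DNS": "Submit the record via the public flush-cache web form.",
    "OpenDNS": "Use the provider's cache check/refresh tool.",
    "Corporate resolver": "Open a ticket with the operator; no self-service flush.",
}

def build_runbook(changed_records):
    """Return an ordered checklist for flushing changed records everywhere."""
    steps = []
    for record in changed_records:
        for provider, procedure in FLUSH_PROCEDURES.items():
            steps.append(f"[{provider}] {record}: {procedure}")
    return steps
```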
Improve Real User Monitoring
We leverage a combination of both internal monitoring systems and external providers. During this outage, we used these monitoring systems to assess what the customer impact was and determine how best to prioritize resolution steps. Unfortunately, most of the internal systems are designed to be a view from within our infrastructure, and did not adequately describe our end-to-end user experience, especially for our customers on the east and west coasts of the US. We will invest additional resources in global monitoring that takes an external and customer experience view of our systems and overall service offering. This includes our Website, APIs, and Mobile experiences, and our Notification experience as well.
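As a sketch of what "an external and customer experience view" means in practice, the snippet below aggregates results from hypothetical external probes into a per-region success rate and flags regions below an availability target. The probe data shape and the SLO value are assumptions for illustration.

```python
from collections import defaultdict

def region_availability(probes):
    """probes: iterable of (region, ok: bool). Returns {region: success_rate}."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for region, ok in probes:
        totals[region] += 1
        if ok:
            successes[region] += 1
    return {r: successes[r] / totals[r] for r in totals}

def degraded_regions(probes, slo=0.99):
    """Regions whose externally observed success rate falls below the SLO."""
    return sorted(r for r, rate in region_availability(probes).items() if rate < slo)
```

The key property is that the probes run from outside the infrastructure, so a DNS-level failure shows up as degraded availability in the affected regions rather than as silence.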
Improve Prioritization of Resolution Steps
At PagerDuty, we leverage a service-oriented architecture to support the many features our customers rely on. In the majority of our customer-facing incidents, only one part of our service is affected when a disruption occurs. But with a central component like DNS unavailable, multiple components of our service were impacted at once. When bringing our services back up in the future, we need the ability to prioritize the most critical services, the ones that matter most to our customers.
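The simplest form of such prioritization is a tiered service catalog consulted during recovery. The sketch below orders impacted services by tier; the service names and tier assignments are illustrative assumptions, not our actual catalog.

```python
# Hypothetical service catalog: lower tier number = restore first.
SERVICE_TIERS = {
    "notification-pipeline": 1,  # most critical: alerts must go out
    "events-api": 1,
    "web-app": 2,
    "reporting": 3,
}

def restoration_order(impacted_services):
    """Order impacted services by tier (lowest first), then alphabetically."""
    return sorted(
        impacted_services,
        key=lambda s: (SERVICE_TIERS.get(s, 99), s),  # unknown services last
    )
```

Keeping this ordering in version-controlled data, rather than in responders' heads, means the first minutes of a multi-component incident are spent restoring services instead of debating priorities.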
Improve Multi-team Response Process
As called out in the previous section, we have multiple teams on-call continuously to help keep PagerDuty working properly. While we leverage our own product for our people orchestration efforts, we did not have all of the supporting tooling in place for certain teams involved. We plan to implement processes and improve upon our best practices so that each team is able to address problems in their own services effectively.
This past Friday was a difficult day for nearly every on-call engineer. At PagerDuty, we take great pride in providing a service that we know thousands of customers rely on. We did not meet the high expectations that we set for ourselves, and we are taking critical steps to continuously enhance the reliability and availability of our systems. From this experience, I am confident we will provide an even more reliable service that will be there when our customers need us the most.
As always, if you have any questions or concerns, please do not hesitate to follow up with me or our Support team at firstname.lastname@example.org.