Why Are Availability and Reliability Crucial?

In today’s digitally connected world, people expect the consumer and enterprise applications and services at their fingertips to operate seamlessly in real time, all the time. However, the underlying technology that supports digital services is incredibly complex to manage, and failures are bound to happen. Meanwhile, the costs of downtime are growing exponentially, with some Fortune 500 retailers telling us they lose hundreds of thousands of dollars per minute due to lost revenue and productivity.

As such, IT organizations maintain service level agreements (SLAs) around application and website reliability and uptime, defining standards required to keep the business running smoothly in spite of inevitable IT disruptions. Reliability, Availability, Maintainability, and Safety (RAMS) are key system design attributes that help teams understand whether systems fulfill key requirements such as performing as intended, and being functional and maintainable. Of these, the ones that IT teams typically care most about — especially as they relate to system performance — are availability and reliability.

These two terms can be defined as:

  • Availability is a measure of the percentage of time that an IT service or component is in an operable state.
  • Reliability, on the other hand, is a measure of the probability that the system will meet defined performance standards in performing its intended function during a specified interval.  

Key Metrics

Here are some key metrics that are typically used to measure Availability and Reliability.

Availability

Availability, as a measure of uptime, can be calculated as follows:

Percentage of availability = (total elapsed time – sum of downtime)/total elapsed time × 100
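As a minimal sketch, the calculation above can be written as a small Python function. The function name and the choice of seconds as the unit are illustrative assumptions, not part of any specific monitoring tool.

```python
def availability_percent(total_elapsed_s: float, downtime_s: float) -> float:
    """Return availability as a percentage of total elapsed time."""
    if total_elapsed_s <= 0:
        raise ValueError("total elapsed time must be positive")
    if not 0 <= downtime_s <= total_elapsed_s:
        raise ValueError("downtime must be between 0 and total elapsed time")
    return (total_elapsed_s - downtime_s) / total_elapsed_s * 100


# Example: 43.2 minutes of downtime in a 30-day month.
month_s = 30 * 24 * 60 * 60
print(round(availability_percent(month_s, 43.2 * 60), 3))  # 99.9
```

Any time unit works as long as both arguments use the same one.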

Service providers often offer an availability SLA based on the availability percentage table below, committing to keep functionality up and running within the allowed unavailability window.


Allowed unavailability window, by availability level

Availability level | Per year     | Per quarter   | Per month    | Per week     | Per day      | Per hour
90%                | 36.5 days    | 9 days        | 3 days       | 16.8 hours   | 2.4 hours    | 6 minutes
95%                | 18.25 days   | 4.5 days      | 1.5 days     | 8.4 hours    | 1.2 hours    | 3 minutes
99%                | 3.65 days    | 21.6 hours    | 7.2 hours    | 1.68 hours   | 14.4 minutes | 36 seconds
99.5%              | 1.83 days    | 10.8 hours    | 3.6 hours    | 50.4 minutes | 7.20 minutes | 18 seconds
99.9%              | 8.76 hours   | 2.16 hours    | 43.2 minutes | 10.1 minutes | 1.44 minutes | 3.6 seconds
99.95%             | 4.38 hours   | 1.08 hours    | 21.6 minutes | 5.04 minutes | 43.2 seconds | 1.8 seconds
99.99%             | 52.6 minutes | 12.96 minutes | 4.32 minutes | 60.5 seconds | 8.64 seconds | 0.36 seconds
99.999%            | 5.26 minutes | 1.30 minutes  | 25.9 seconds | 6.05 seconds | 0.87 seconds | 0.04 seconds

Source: Google SRE Availability Table
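The entries in the table above follow from multiplying each window by the allowed fraction of unavailability. The sketch below shows that arithmetic in Python. The window lengths (365-day year, 90-day quarter, 30-day month) are assumptions inferred from the table's figures, and the names are illustrative.

```python
from datetime import timedelta

# Assumed window lengths: 365-day year, 90-day quarter, 30-day month.
WINDOWS = {
    "year": timedelta(days=365),
    "quarter": timedelta(days=90),
    "month": timedelta(days=30),
    "week": timedelta(days=7),
    "day": timedelta(days=1),
    "hour": timedelta(hours=1),
}


def allowed_downtime(availability_pct: float, window: timedelta) -> timedelta:
    """Return the downtime budget for an availability level over a window."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100")
    return window * (1 - availability_pct / 100)


# Print the downtime budget for a 99.9% target in every window.
for name, window in WINDOWS.items():
    print(name, allowed_downtime(99.9, window))
```

A budget computed this way can be compared against observed downtime to see how much of the SLA allowance remains in the current window.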


Reliability

Reliability helps teams understand how the service will behave under real-world conditions; in other words, it measures the frequency and impact of failures. Common metrics to measure reliability are:

Mean time between failure (MTBF) = total time in service/number of failures

Failure rate = number of failures/total time in service
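The two metrics above are reciprocals of each other, which a short Python sketch makes concrete. The function names and the use of hours are illustrative assumptions.

```python
def mtbf(total_time_in_service_h: float, failures: int) -> float:
    """Mean time between failures, in the same unit as the time argument."""
    if failures <= 0:
        raise ValueError("MTBF is undefined without at least one failure")
    return total_time_in_service_h / failures


def failure_rate(total_time_in_service_h: float, failures: int) -> float:
    """Failures per unit of time in service."""
    if total_time_in_service_h <= 0:
        raise ValueError("time in service must be positive")
    return failures / total_time_in_service_h


# Example: 3 failures over 720 hours in service.
print(mtbf(720, 3))                    # 240.0
print(round(failure_rate(720, 3), 5))  # 0.00417
```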

In determining targets for both reliability and availability, IT organizations need to make tradeoffs between costs and service levels. They need to balance the investments in infrastructure and performance that maintain high service levels against the maximum allowable downtime and failures, so that the impact on the business and user experience is minimized.

Best Practices for Availability and Reliability

Automate Across the Software Delivery Lifecycle

An important part of delivering more performant, reliable services is reducing functional silos and implementing automation across the entire software delivery lifecycle: from design, build and test, and deployment through operation, issue resolution, and improvement. Automation allows teams to scale quickly and efficiently, and also improves reliability by minimizing the risk of manual errors.

Have The Right Monitoring in Place

Teams should implement redundant monitoring on their services to proactively detect issues, and keep a close eye on important metrics such as availability and latency, with the goal of improving those metrics over time.
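As a rough illustration of tracking these two metrics, the Python sketch below derives a request-level availability ratio and a latency percentile from raw observations. Treating any 5xx status as a failure and using the nearest-rank percentile are assumptions made for this example; the sample data is hypothetical.

```python
import math


def availability_ratio(status_codes):
    """Fraction of requests that did not fail with a server error (5xx)."""
    if not status_codes:
        raise ValueError("no requests observed")
    ok = sum(1 for code in status_codes if code < 500)
    return ok / len(status_codes)


def latency_percentile(latencies_ms, pct):
    """Nearest-rank percentile of the observed latencies."""
    if not latencies_ms:
        raise ValueError("no latencies observed")
    if not 0 < pct <= 100:
        raise ValueError("percentile must be in (0, 100]")
    ordered = sorted(latencies_ms)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical sample of five requests.
statuses = [200, 200, 503, 200, 404]
latencies = [120, 95, 870, 110, 101]
print(availability_ratio(statuses))        # 0.8
print(latency_percentile(latencies, 95))   # 870
```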

Sustainable On-Call and Incident Response

To act on issues quickly, ownership of services and accountability for associated service disruptions must be well defined. As such, it’s key to implement a system that sustainably manages on-call rotations and escalations and that effectively orchestrates the right experts when a disruption arises. The goal is to move away from a culture of heroism and firefighting by giving teams the information and tools they need to manage incidents effectively and to use what they learn to build reliability into their systems and processes.
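One way to picture an escalation policy is as an ordered list of responders, each with a time limit for acknowledging an alert. The Python sketch below is a hypothetical model for illustration only; the class, function, and responder names are assumptions and do not represent any vendor's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationLevel:
    responder: str
    wait_minutes: int  # time allowed for an acknowledgement before escalating


def current_responder(levels, minutes_unacknowledged):
    """Return who should be notified after the alert has gone unacknowledged."""
    if not levels:
        raise ValueError("an escalation policy needs at least one level")
    elapsed = 0
    for level in levels:
        elapsed += level.wait_minutes
        if minutes_unacknowledged < elapsed:
            return level.responder
    # Past the final level, the alert stays with the last responder.
    return levels[-1].responder


# Hypothetical three-level policy.
policy = [
    EscalationLevel("primary on-call", 15),
    EscalationLevel("secondary on-call", 15),
    EscalationLevel("engineering manager", 30),
]
print(current_responder(policy, 20))  # secondary on-call
```

Making the policy explicit data, rather than tribal knowledge, is what allows ownership and escalation paths to be reviewed and rotated sustainably.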

Blameless Postmortems

Blameless postmortems are a crucial part of improving availability and reliability as they are intended to help teams focus on the aspects of the system and incident response processes that can be improved, to prevent recurring issues in the future.

Practice with Chaos Engineering

Chaos engineering is a practice used by many modern operations teams to identify failures before they become customer-impacting outages and to prepare for incident scenarios. By systematically injecting failure into systems, teams learn where potential vulnerabilities lie and become well rehearsed in incident response, building confidence in the system’s ability to withstand disruptions.
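Failure injection can start very small. The Python sketch below wraps a function so that it fails with a configurable probability, which lets a team observe how callers cope. It is a toy illustration under stated assumptions: the names are hypothetical, and real chaos experiments usually target infrastructure (hosts, networks, dependencies) rather than a single function.

```python
import random


class InjectedFailure(Exception):
    """Raised in place of the real call to simulate a failure."""


def with_fault_injection(func, failure_probability, rng=None):
    """Wrap func so that it raises InjectedFailure with the given probability."""
    if not 0 <= failure_probability <= 1:
        raise ValueError("failure_probability must be between 0 and 1")
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if rng.random() < failure_probability:
            raise InjectedFailure(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def fetch_status():
    # Hypothetical stand-in for a call to a downstream service.
    return "ok"


# Fail roughly 30% of calls; a seeded generator keeps the experiment repeatable.
flaky_fetch = with_fault_injection(fetch_status, 0.3, random.Random(42))
failures = 0
for _ in range(100):
    try:
        flaky_fetch()
    except InjectedFailure:
        failures += 1
print(f"{failures} of 100 calls failed")
```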

Improve Uptime Today

One of the keys to improving digital service reliability, site reliability, and system uptime is implementing a real-time operations solution that harnesses data from disparate sources, makes sense of the data, orchestrates teams, and facilitates learning and prevention. When a critical disruption occurs, it’s essential to leverage intelligence and automation to mobilize teams instantaneously, as seconds matter. The system your team relies on to stay reliable must itself maintain incredibly high SLAs around reliability. It’s important to select a vendor that is highly transparent about its uptime and downtime status, and has no scheduled maintenance windows.

That’s why at PagerDuty, reliability is at the heart of how we help our customers elevate work to the outcomes that matter. PagerDuty uses multiple data centers, hosting, and communication providers to deliver a reliable and highly available service. We offer enterprise class security and control features, and guarantee the delivery of alerts at all times to thousands of organizations across the globe. Here are some benefits and capabilities we offer as part of our platform:


  • Uninterrupted Service at Scale: Our service is distributed across multiple data centers, regions, DNS, and communications providers so that we always stay available.
  • Guaranteed Delivery: Through systematic polling and testing of providers with automatic failover, we process billions of events per year and guarantee alert delivery with a reliability SLA.
  • Global Service: Multiple communication providers including email, phone, and SMS providers enable service to 180+ countries.
  • Service Status Transparency: We provide 24/7 transparency into our uptime via our status page at https://status.pagerduty.com
  • Chaos Engineering & Reliability Best Practices: To enhance our reliability, PagerDuty runs ‘Failure Friday’ every week to test for and continuously improve our failure resilience.

Learn more

To learn more about reliability, please see the following resources:

    • Incident Response Documentation: Our own internal incident response documentation, which we open-sourced to help other teams adopt incident response best practices and improve reliability.
    • Enterprise Class capabilities: Platform web page describing PagerDuty’s capabilities around security, reliability, extensibility, and scalability.