PagerDuty Blog

7 Steps to Avoiding Downtime

Ensure High Availability for Your Applications With These 7 Steps

Several months ago, Delta experienced an IT outage that cost it over $150 million, dropping its overall profit margins by up to 3%. Customers were stranded for hours, 2,300 flights were cancelled, and Delta had to pay for thousands of hotel and travel vouchers to compensate for the extended outage. Even so, the incident very likely caused some customers to churn permanently [1].

Downtime can strike at any moment, even for the applications and services of multi-million dollar brands, and a single extended incident can cost a business hundreds of millions of dollars. But situations like this can largely be avoided if you follow these steps:

  1. Adopt a microservices architecture
    Traditionally, applications were developed in the monolithic style, with the entire app built as one piece. Today, microservices architectures are becoming increasingly popular. They involve developing, testing, and deploying an application as smaller parts that are not entirely dependent on each other. This makes maintenance much easier because the components are isolated: if one component fails, it can be targeted and fixed separately without affecting the others. In a monolithic application, if something goes wrong, the entire app experiences downtime and it’s difficult to find exactly what went wrong. A microservices approach makes your app more resilient to downtime, and is the first step to achieving high availability. Be aware, however, that microservices architectures introduce far more complexity and increase the volume of monitoring data generated, so it’s critical to be able to correlate related alerts and suppress non-actionable ones to reduce overall noise.
  2. Make releases faster, and more frequent
    The biggest benefit of a microservices architecture is that it enables faster releases: multiple times a day for web apps, and bi-weekly for mobile apps. The old model was to have a major release every quarter or so, and downtime was inevitable with every release. With the modern approach, releases are fragmented. Deployments are rolled out to only portions of the application in the background at any one time, so the platform always remains up and running. This not only reduces the risk of downtime, it also makes you more competitive, because a higher release velocity lets you deliver more cutting-edge features and value.
  3. Availability is a quality issue
    Quality and availability go together. Many organizations fail to see the importance of QA, to the point of neglecting it until the last minute. To prevent buggy software, the QA team must be involved as early as possible in the development process and closely involved in the release lifecycle. QA should focus its efforts on automation and testing strategy. A test automation framework can help minimize errors while dramatically reducing costs and saving time compared with a manual approach. Additionally, testers should not just look for bugs; they should also be proactively engaged in the requirements process to help steer development in the right direction. When QA helps the development team build the right way from the beginning, the organization is likely to accumulate less technical debt in the future. QA is about constant improvement, and your incentives should target that goal.
  4. Have a disaster recovery plan
    When core services in your app are disrupted, it is a disaster. In these situations, you need a good disaster recovery plan. With most organizations using hybrid architectures that combine public and private cloud infrastructure, it’s important to have redundancy across your servers and to keep backups with different providers. Virtualization can be really useful for making an image backup of an existing physical server, and containerization even more so because the image backups are far more lightweight and take up less space. Strategies such as these help ensure your data is available even in a time of disaster. Going further, you need to automate your backup plan end-to-end, so that it doesn’t depend on an administrator’s permission, especially when that administrator isn’t available. Automation also allows your DevOps team to easily test the disaster recovery plan and be ready for any disaster that may come their way.
  5. Employ ITSM change management
    Make sure standardized frameworks like ITIL are used for ITSM change management. Changes are essential to IT services, since without them there wouldn’t be progress, but every change must be documented. Measure change success rates and publish the results to identify which teams have a low change success rate. An ITSM tool like ServiceNow can give you more visibility and control over change management, helping you make changes quickly, efficiently, and with minimal disruption to IT services.
  6. Use an incident management tool
    When the inevitable downtime does happen, it’s critical to inform the right people on the team in real time. But teams often get so many alerts that they miss the really important ones, which increases mean time to resolution (MTTR). An incident management platform like PagerDuty helps manage and group alerts from different monitoring systems and will prove invaluable during an outage. It suppresses non-actionable alerts based on easily defined rules, groups related actionable alerts into incidents, and ensures that only high-priority incidents trigger a notification to the right people, with the right context. Further, by integrating with your existing monitoring, ticketing, ChatOps, and collaboration tools, PagerDuty equips your team to troubleshoot and resolve incidents quickly so your app is up and running as much as possible.
  7. Deliberately induce failures
    Planned failures help ensure your team is always prepared to resolve downtime. Netflix is well known for taking this approach. They use a tool called Chaos Monkey that constantly runs in the background and randomly shuts down server instances. This keeps the team prepared for real server outages, while customers continue to be served smoothly. PagerDuty also practices Failure Fridays every week, purposely injecting failure into the system to continuously improve response, ensure preparedness, and maximize reliability.
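
To make the idea of deliberately induced failure more concrete, here is a minimal Python sketch of a failure drill. It is not how Chaos Monkey or Failure Fridays are actually implemented: the `Instance` class, the `induce_failure` function, and the `min_healthy` guard are all hypothetical names invented for illustration, and the "fleet" is just an in-memory list rather than real servers.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical stand-in for a server instance."""
    name: str
    running: bool = True


def induce_failure(instances, min_healthy, rng):
    """Stop one randomly chosen running instance.

    Returns the stopped instance, or None if stopping one would leave
    fewer than `min_healthy` instances running (an assumed safety guard).
    """
    running = [inst for inst in instances if inst.running]
    if len(running) <= min_healthy:
        return None  # too risky, skip this round
    victim = rng.choice(running)
    victim.running = False
    return victim


fleet = [Instance(f"web-{n}") for n in range(4)]
rng = random.Random(7)  # seeded so a drill can be repeated

for _ in range(5):
    victim = induce_failure(fleet, min_healthy=2, rng=rng)
    print("stopped:", victim.name if victim else "nothing (guard tripped)")
```

The guard reflects the point made above: the goal is to exercise the team's response while customers keep being served, so a drill like this should never take out more capacity than the system is designed to lose.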

Although achieving perfection is impossible, focusing on the people, processes, and tools that make up your DevOps team will bring you close. There isn’t a silver bullet that will eliminate all your downtime issues, but as you follow these steps, you’ll build apps that are more reliable, and earn and keep the trust and loyalty of your customers.
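
As a rough illustration of step 6, the Python sketch below shows one way alert noise might be reduced: drop non-actionable alerts, group the rest into incidents, and notify only for high-priority incidents. The rules used here (ignore `info` alerts, group by service, page only when an incident contains a `critical` alert) and all names such as `Alert`, `is_actionable`, and `should_notify` are assumptions made for this example. They do not represent PagerDuty's actual rules or API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    severity: str  # "info", "warning", or "critical"
    summary: str


def is_actionable(alert):
    # Assumed rule: informational alerts never need a human.
    return alert.severity != "info"


def group_into_incidents(alerts):
    """Drop non-actionable alerts, then group the rest by service."""
    incidents = defaultdict(list)
    for alert in alerts:
        if is_actionable(alert):
            incidents[alert.service].append(alert)
    return dict(incidents)


def should_notify(incident_alerts):
    # Assumed rule: page someone only if the incident has a critical alert.
    return any(a.severity == "critical" for a in incident_alerts)


alerts = [
    Alert("checkout", "critical", "5xx rate above threshold"),
    Alert("checkout", "warning", "latency rising"),
    Alert("search", "warning", "cache hit ratio falling"),
    Alert("search", "info", "nightly reindex started"),
]

incidents = group_into_incidents(alerts)
to_page = [svc for svc, grouped in incidents.items() if should_notify(grouped)]
print(to_page)
```

In this example four raw alerts collapse into two incidents, and only the `checkout` incident would page anyone. Real platforms use richer correlation than a single service field, but the shape of the pipeline (suppress, group, then prioritize) is the same one described in step 6.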





[1] Gensler, Lauren. “Delta’s Computer Outage To Cost Them $150 Million.” Forbes. Forbes Magazine, 07 Sept. 2016. Web. 13 Feb. 2017.