The Cost of IT Downtime: An Overview
What is Downtime?
As cloud adoption continues to drive innovation across industries, high-performing, resilient systems have become a necessity for keeping pace with the competition and meeting internal and external service level agreements (SLAs). Customers expect services to be available: a single minute of downtime can mean thousands of dollars in lost opportunity and a damaged customer relationship.
So what exactly is downtime? Downtime is a period during which the core services of a system, device, or application, whether internal or external, are unavailable or idle. It can result from updates, maintenance, safety precautions, or unexpected outages.
Types of Downtime
When it comes to downtime, there are two distinct types: planned and unplanned.
Planned (or scheduled) downtime happens at a time that is most convenient and minimizes negative impact for users. Scheduled downtime is a proactive exercise to ensure optimal functionality of machines and services. There are two ways of scheduling planned downtime: fixed or flexible. Fixed downtime adheres to a set schedule with a specific start and stop time for the maintenance to occur. Flexible downtime is a window of time during which downtime will happen, though the exact start time is unknown.
Unplanned (or unscheduled) downtime is a lapse in operations caused by an unexpected machine error, an application or server outage, or another technical incident. Examples range from a local computer crashing to an entire service unexpectedly going offline. Unplanned downtime can happen at any time of the day or night, and can be costly for the business both financially and reputationally.
What Causes Downtime?
The cause of downtime generally falls into one of a few different categories. Human error is one of the most common. Regardless of whether a developer submitted broken code, or an administrator updated an untested package, when procedure isn’t followed or an obscure system bug isn’t accounted for, product uptime will suffer. Another cause is third-party service outages, when downtime is not caused internally but by peripheral service providers going down. Finally, there are highly unpredictable “black swan” occurrences, such as ransomware attacks, which can also have severe consequences.
Once the scope of the downtime’s impact is understood, businesses can quantify the actual dollar value of the losses. Measurable downtime costs fall into several areas. The first is employee costs with respect to loss of productivity. This can be calculated by multiplying the number of employees who can’t work by their hourly labour cost, then multiplying that by the number of hours of downtime. Additional work-related costs, such as hiring temporary workers or paying workers overtime, can also be measured.
Business costs, or opportunity costs, can also be calculated from lost sales or lost productivity, especially when compared with output under normal circumstances. Finally, there are contract and penalty costs, in which customers covered by a Service Level Agreement (SLA) must be compensated in the event of an outage. If the impact of the downtime on customers is sufficiently severe, organizations may even face lawsuits, especially in regulated industries.
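The measurable costs described above reduce to simple arithmetic. The following is a minimal sketch rather than a standard formula: the function names, the cost categories passed as parameters, and every figure in the example are assumptions chosen purely for illustration.

```python
def employee_downtime_cost(idle_employees: int,
                           hourly_labour_cost: float,
                           downtime_hours: float) -> float:
    """Productivity loss: employees who can't work x hourly cost x hours down."""
    return idle_employees * hourly_labour_cost * downtime_hours


def total_measurable_cost(productivity_loss: float,
                          extra_labour: float = 0.0,
                          lost_sales: float = 0.0,
                          sla_penalties: float = 0.0) -> float:
    """Sum the measurable cost categories for a single outage."""
    return productivity_loss + extra_labour + lost_sales + sla_penalties


if __name__ == "__main__":
    # Hypothetical outage: 50 idle employees at $40/hour for 2 hours.
    loss = employee_downtime_cost(50, 40.0, 2.0)
    print(loss)  # 4000.0

    # Hypothetical overtime, lost sales, and SLA payouts added on top.
    total = total_measurable_cost(loss, extra_labour=500.0,
                                  lost_sales=10_000.0, sla_penalties=1_500.0)
    print(total)  # 16000.0
```

Costs that aren’t inherently measurable, such as morale or reputation, are deliberately left out of this sketch.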
Perhaps some of the most important costs are those that aren’t inherently measurable. One example is damage to employee morale, as downtime can raise doubts about a company’s viability and prevent employees from getting important work done. Downtime can also create unseen costs by blocking development and IT projects, as work is disrupted and the cognitive load on technical teams increases. Finally, there is the irreversible loss of key technology-based market opportunities, as a company’s reputation depends on how effectively it can keep its systems running.
Four Methods to Prevent Downtime
So, what can companies do to prevent failures and significantly decrease the length and frequency of those that do happen? These four strategies are quickly becoming industry best practices for increasing reliability:
Inject Failure for Success
Having backups of backups and a failure-tolerant design is great, but it’s not enough. Backups that only kick in when things are breaking can hide code that fails when exposed to actual production workloads. Large companies with big budgets solve this problem by creating automated tools that test applications for failure resilience, introduce artificial latency, or shut down entire availability zones. Smaller companies can simply schedule regular times to do this manually.
At PagerDuty, we call this best practice “Failure Friday.” Injecting failure through scheduled attacks allows companies to proactively find system vulnerabilities and become adept at incident response, going beyond fixing problems to preventing them from occurring at all. When executing this drill, companies introduce attacks for a short duration and bring services back to a fully functional state between attacks. Teams should also use dashboards to better understand which metrics point to issues, and how that impacts systems.
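As a rough illustration of what injecting failure and artificial latency can look like in code, here is a minimal sketch. It is not PagerDuty’s tooling; the `inject_failure` decorator, the `InjectedFailure` exception, and the `fetch_status` function are hypothetical names invented for this example, and a real drill would typically target infrastructure rather than a single function.

```python
import random
import time
from functools import wraps


class InjectedFailure(RuntimeError):
    """Raised on purpose during a failure drill (hypothetical name)."""


def inject_failure(failure_rate=0.0, extra_latency_s=0.0,
                   rng=None, sleep=time.sleep):
    """Wrap a function so calls are delayed and fail with a given probability."""
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if extra_latency_s > 0:
                sleep(extra_latency_s)  # artificial latency
            if rng.random() < failure_rate:
                raise InjectedFailure(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_failure(failure_rate=0.3, extra_latency_s=0.1)
def fetch_status():
    """Stand-in for a call to a real dependency."""
    return "ok"


if __name__ == "__main__":
    for attempt in range(5):
        try:
            print(attempt, fetch_status())
        except InjectedFailure as exc:
            print(attempt, "failed:", exc)
```

Setting `failure_rate` back to 0 and `extra_latency_s` to 0 restores normal behaviour, which mirrors the practice of returning services to a fully functional state between attacks.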
Continuous Integration Practices
Continuous Integration (CI) is a software development practice in which team members merge their work frequently to reduce integration problems and conflicts. Each merge is verified by an automated build and test run, which helps catch bugs before they reach production. Because the tests are automated and repeatable, once a bug is found, a new test can be added to prevent that bug from being reintroduced by future changes. By using continuous integration, organizations establish a baseline of software quality that lowers the risk of every release.
There are five types of tests to consider. Semantic tests study the relationships between data. Unit tests verify individual components of the code in isolation. Functional tests check that features behave as specified from the user’s perspective. Integration tests ensure everything is working when combined with all other services, including third-party services. Finally, load tests help determine volume capabilities and where performance bottlenecks might occur.
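To make the idea of adding a test whenever a bug is found more concrete, here is a small unit-test sketch using Python’s standard `unittest` module. The `uptime_percentage` function and the bug it guards against (a zero-length measurement window) are hypothetical and exist only for this example.

```python
import unittest


def uptime_percentage(total_minutes: float, downtime_minutes: float) -> float:
    """Return uptime as a percentage of the measurement window."""
    if total_minutes <= 0:
        # Guard added after a (hypothetical) division-by-zero bug.
        raise ValueError("total_minutes must be positive")
    if not 0 <= downtime_minutes <= total_minutes:
        raise ValueError("downtime_minutes must be within the window")
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


class UptimePercentageTests(unittest.TestCase):
    def test_typical_month(self):
        # A 30-day month has 43,200 minutes; 43.2 minutes down is 99.9% uptime.
        self.assertAlmostEqual(uptime_percentage(43_200, 43.2), 99.9)

    def test_zero_length_window_is_rejected(self):
        # Regression test: keeps the old bug from being reintroduced.
        with self.assertRaises(ValueError):
            uptime_percentage(0, 0)
```

Saved as a module, these tests can be run with `python -m unittest`, and a CI system would typically run them on every merge.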
Never Deal with the Same Incident Twice
Mining historical performance data, analyzing the root cause of issues, and setting up an alert and response system will help prevent past causes of downtime from cropping up again. Follow this five-step process for success.
1. Review historical information for performance trends and drill down into specific issues, building a solid platform for preventing future issues.
2. Leverage third-party monitoring tools and centralize all information on performance metrics. This allows companies to drill down into the performance and dependencies between individual servers, websites and applications.
3. Set goals based on the needs of the business, past performance and how that performance translated into accessibility of business operations.
4. Transform goals into notification thresholds so that organizations get notified as soon as an issue begins, rather than simply waiting to be alerted when goals have been breached.
5. Aggregate actionable and related alerts into incidents, and automatically escalate incident notifications if action isn’t taken. The right management tool will allow organizations to manage all event data in one place, engage additional experts and keep all stakeholders informed.
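Steps 4 and 5 above can be sketched in a few lines. This is an illustration under stated assumptions, not a description of any particular monitoring product: the `Alert` and `Incident` types, the 80% warning fraction, the five-minute acknowledgement timeout, and grouping alerts by service name are all choices made for the example, and the metric is assumed to be one where higher values are worse (such as latency).

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    service: str
    metric: str
    value: float
    timestamp: float  # seconds since some reference point


@dataclass
class Incident:
    service: str
    alerts: list = field(default_factory=list)
    acknowledged: bool = False

    def opened_at(self) -> float:
        return min(alert.timestamp for alert in self.alerts)


def should_notify(value: float, goal: float, warn_fraction: float = 0.8) -> bool:
    """Notify once a metric reaches a fraction of the goal, before a breach."""
    return value >= goal * warn_fraction


def group_into_incidents(alerts):
    """Aggregate related alerts (here: same service) into one incident each."""
    by_service = {}
    for alert in alerts:
        incident = by_service.setdefault(alert.service, Incident(alert.service))
        incident.alerts.append(alert)
    return list(by_service.values())


def needs_escalation(incident: Incident, now: float,
                     ack_timeout_s: float = 300.0) -> bool:
    """Escalate if nobody has acknowledged the incident within the timeout."""
    return (not incident.acknowledged
            and now - incident.opened_at() >= ack_timeout_s)


if __name__ == "__main__":
    goal_ms = 500.0  # hypothetical latency goal
    print(should_notify(420.0, goal_ms))  # True: past 80% of the goal
    print(should_notify(300.0, goal_ms))  # False

    alerts = [Alert("checkout", "latency_ms", 420.0, 0.0),
              Alert("checkout", "latency_ms", 480.0, 60.0),
              Alert("search", "latency_ms", 410.0, 30.0)]
    incidents = group_into_incidents(alerts)
    print(len(incidents))  # 2
    print(needs_escalation(incidents[0], now=400.0))  # True
```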
Test Third-Party Services
Many companies rely on third parties to deliver products and services to their customers. When an outage strikes Amazon Web Services, for instance, countless other websites are impacted as well. So, it’s crucial to ensure redundancy to prevent single points of failure.
When it comes to end-to-end SMS provider testing, there are a few specific best practices to consider. For short codes, SMS tests should be sent every two minutes, with the frequency varying for less-common long codes. Additionally, internal alerts should be sent throughout the day using each of an organization’s providers. Organizations should also measure how long it takes for each message to be received, to determine whether a provider is available and how well it is performing. Providers with an SMS delivery latency of more than three minutes should be marked as degraded and replaced. Finally, organizations should suppress non-actionable alerts and group related alerts. This reduces the number of pages on-call engineers receive, minimizing alert fatigue and allowing them to focus on solving issues and improving processes.
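The latency check described above can be sketched as follows. Only the three-minute threshold comes from the text; the `SmsProbe` record, the provider names, and the decision to treat an undelivered test message as degraded are assumptions made for this example, and actually sending and receiving the test messages is left out.

```python
from dataclasses import dataclass
from typing import Optional

MAX_LATENCY_S = 180.0  # three minutes, the threshold given in the text


@dataclass
class SmsProbe:
    provider: str
    sent_at: float                 # seconds
    received_at: Optional[float]   # None if the test message never arrived


def latency_s(probe: SmsProbe) -> Optional[float]:
    """Delivery latency in seconds, or None if the message was not received."""
    if probe.received_at is None:
        return None
    return probe.received_at - probe.sent_at


def is_degraded(probe: SmsProbe, max_latency_s: float = MAX_LATENCY_S) -> bool:
    """A provider is degraded if the probe was lost or took too long."""
    latency = latency_s(probe)
    return latency is None or latency > max_latency_s


def healthy_providers(probes):
    """Return providers whose most recent probe is not degraded."""
    latest = {}
    for probe in probes:
        current = latest.get(probe.provider)
        if current is None or probe.sent_at >= current.sent_at:
            latest[probe.provider] = probe
    return [name for name, probe in latest.items() if not is_degraded(probe)]


if __name__ == "__main__":
    probes = [
        SmsProbe("provider_a", sent_at=0.0, received_at=12.0),     # 12 s
        SmsProbe("provider_b", sent_at=0.0, received_at=200.0),    # 200 s
        SmsProbe("provider_c", sent_at=0.0, received_at=None),     # lost
    ]
    print(healthy_providers(probes))  # ['provider_a']
```

Traffic would then be routed only to the providers this check returns, which is one way to avoid a single point of failure.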
Prepare for an Outage
By taking the time to implement a plan for addressing inevitable downtime, organizations stand to realize thousands, or even millions, of dollars in quantifiable cost savings. They also protect arguably even more crucial qualitative factors such as employee morale, brand reputation, and customer loyalty.
To find out how PagerDuty can help your organization manage outages and downtime, sign up for a free 14-day trial.