When Big Systems Fail
Failure is not an option: that’s what we’d like to think, but we all know the truth. The question is not if failure will happen, but when. Large, complex systems are especially prone to failure, because their infrastructures often carry years of technical debt from intricate architectures pieced together through mergers and acquisitions. Couple that with the pressure to keep up with the fast-paced evolution of digital demand across the business, and failure becomes a real cause for concern. The airline industry knows this well.
A scan of recent news headlines shows that keeping these systems running is no easy task. From Southwest to Delta to the most recent British Airways system outage, we are starting to see a tipping point in an industry desperately trying to keep pace with digital innovation. We’ve seen a major airline brought to its knees by a power issue that cascaded through its systems, resulting in thousands of canceled flights. With increasing demands for a digital-first customer experience, airline IT systems have become major liabilities. Decades of business mergers and advances in technology have led to a patchwork of inconsistent and unreliable systems. In the digital and connected age, downtime is more than an inconvenience: it spells millions of dollars in lost revenue and shaken consumer confidence.
Complexity Comes with a Price
Airlines have come a long way from the days when you would walk up to the counter or call a travel agent to purchase a ticket. Complex, automated internal and customer-facing systems all contribute to optimizing revenue by ensuring that flights are full and running on time, that equipment usage is maximized safely, and that every salted peanut is accounted for. All of this digital complexity came with a price. Airlines didn’t have the luxury of building the industry with a digital-first mindset. They didn’t get to sit around a table and discuss the mobile versus online experiences of their customers in relation to scheduling algorithms before planes were in the air. Like many other long-established industries, they had to adapt, build, refine, and patch through decades of changes in technology, passenger expectations, and business practices without disrupting service. The result is an enterprise-level house of cards, and we have recently seen it struggling in the news.
The Cost of Downtime
IT systems fail. They just do, and sometimes there is no way around it. DevOps culture has embraced failure, and as a result its practitioners have built digital companies, products, and services with the ability to innovate and react quickly in the event of downtime. Modern operations requires a sophisticated incident management process that hopes for the best but prepares for the worst. Incident management has to be a top priority and receive significant investment. Every second of downtime in today’s digital-first world translates directly into lost revenue. Southwest estimates that its outage cost $54 million, and Delta Air Lines estimates a $100 million price tag for its outage. Looking at those numbers, it makes sense for modern operations teams to invest in the right people, processes, and tools to ensure that when critical incidents do occur, they are resolved as quickly as possible.
Catching up to modern operations doesn’t happen overnight. The airline industry has come a long way in a relatively short period of time, but it has a long way to go towards meeting the demands of a digital-first society.
Addressing Outages and Downtime
To adapt and evolve with the changing times, IT operations teams need to stay up to date with best practices for what to do when an outage or service disruption occurs, and for how to restore service efficiently, reliably, and in the shortest time possible. In this day and age, having systems down or services disrupted for any period of time is unacceptable. To help prevent extended downtime, enable your team to communicate better in a crisis, monitor the IT stack more carefully, and implement a modern operations solution for incident management.
Here are three resources to help you guard against outages and respond effectively when one does occur.
- A Modern Operations Solution for Incident Management
The complex nature of today’s IT operations, IT use cases, and IT service management has made homegrown incident management solutions obsolete. Advances in technology and the growing diversity of IT environments have rendered most commercial solutions incapable of addressing availability, scalability, or reliability requirements. Download this report by The Enterprise Strategy Group (ESG) to learn why PagerDuty stands apart in the incident management market and how it can help you keep your business up and running. Download now.
Despite your best efforts to prevent outages, systems can sometimes still go down. Learn best practices for communication in the event of an outage and what types of monitoring practices are critical to establish in order to efficiently respond to events.
- Best Practices in Outage Communication
This guide provides the structure for crafting an outage communication plan for your business. When something goes awry on a large scale, teams need to discuss it quickly and effectively to avoid prolonging the outage. Strategizing communication ahead of time will help you know what to share and how to share it, both internally and externally. Download now.
- Best Practices in Monitoring: Reduce Outages and Downtime
Without good monitoring practices or a reliable incident management platform in place, your critical applications and infrastructure can stay down for unacceptably long periods, impacting your bottom line, brand, and customer loyalty. It’s worth spending the time to develop an intelligent, streamlined way to monitor system events and respond to them. You’ll save yourself serious headaches in the future, keep customers happy, and avoid lost revenue. Learn what to monitor, how to motivate your team to respond quickly, and how to avoid common monitoring mistakes. Download now.