by Matt Stratton
March 4, 2019
What does incident management mean for the travel and hospitality industry? There are times when it can mean everything.
In this post, we’ll take a look at the potential cost of data processing and IT downtime in the hospitality and travel industries, and what you can do to manage downtime incidents to minimize potential damage.
What kind of downtime are we talking about?
Data system problems in the travel and hospitality industries can be broken down into the following categories:
Client-End Booking and Scheduling
Failures during customer access or agent access to booking and scheduling systems are among the most visible — and the most annoying — types of problems. They are also the most likely to result in immediate loss of business.
If you’re booking a flight and the booking system stops working, what do you do? Likely one of two things: If you are intent on booking with a particular airline, perhaps you’ll wait.
But if you’re simply looking for a flight at the right time for a reasonable price, there’s a very good chance that instead of waiting around, you will simply book a flight with another airline.
Provider-End Booking and Scheduling
In some ways, provider-end failures can be even more damaging to the reputation of a hotel or airline than client-end booking failures.
For instance, what if you arrive at the airport and find out that you are not on the passenger list for that flight that you had scheduled? Or you arrive at your hotel after a long trip, only to find that your room reservation never went through?
Whatever you do, chances are pretty good that it will involve telling your family and friends what a terrible time you had and that they should never do business with that airline or hotel chain again.
Failures of on-site service and transaction-related software may be minor annoyances, but they add up. These failures can affect the quality of room service, transaction and credit card processing, as well as basics such as telephone and Internet access. While they may not be as destructive to customer loyalty as booking failures, they can run a close second.
It’s important to remember the significance of security, too. Often, security incidents are costlier than most, and a breach that exposes customer information is catastrophic. Along with damage to the provider’s reputation, there’s the strong possibility of lawsuits, fines, and even criminal charges.
A further category is failures of the scheduling systems for maintenance, supplies, and ongoing services. Your customers may not notice IT breakdowns of this type, but there is a good chance they will be aware of the results. It probably won’t take them long to notice if room service is short of coffee and breakfast rolls, and they will definitely notice if their room runs out of toilet paper.
Failure to schedule ongoing maintenance can have even more serious results, including civil penalties or loss of certification if crucial systems don’t pass inspection.
Needless to say, IT breakdowns have an even greater potential for damage at the new breed of Internet-based travel and hospitality companies such as Uber, Lyft, and Airbnb. Unlike more traditional hospitality companies, their entire mode of operation has depended on modern IT from the start.
But even for the most traditional travel and hospitality companies, the long-term damage resulting from the failure of key booking, scheduling, security, and maintenance systems can be enormous.
Traveling can be a stressful time, even for many people who consider themselves to be seasoned travelers. Whether it’s a vacation, a business trip, or a family reunion, travel plans that are disrupted or seriously sidetracked by IT failure on the part of the hospitality service provider can have a major and sometimes very negative impact on the lives of the people involved.
Unhappy customers don’t stay silent, and the more unhappy they are, the more likely they are to be very vocal about the problems that they encountered. In retail, a good, basic rule to keep in mind is that one highly dissatisfied customer may talk to as many as 10 other people about their experiences.
In the hospitality industry, that number is likely to be much greater, perhaps by one or two orders of magnitude, if not more. People like to talk about their travels, and travel-related problems have a way of becoming the subjects of very colorful and dramatic monologues. Worse yet, major booking and scheduling failures can find their way into the news, and turn your operation into the object of endless jokes and scornful memes on social media.
In an ideal world, you might prevent such problems by maintaining a failure-proof IT infrastructure. But in real life, there are no failure-proof systems. Of course, you can and should do everything possible to prevent failure in your crucial IT systems. But along with preventive measures, it is important to put a system in place that rapidly detects IT failures when they do occur and facilitates a best-practice response to them.
Such an incident management system will typically consist of the following basic components:
Monitoring includes real-time observation of system functions and user interactions, as well as log analysis. Monitoring and analytics can help you detect failures as soon as they occur and, ideally, even anticipate them, for example by spotting declining performance metrics and anomalous system behavior.
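As a rough illustration of spotting anomalous behavior in a performance metric, here is a minimal Python sketch that compares each new sample against a rolling baseline. The class name, window size, and threshold are all assumptions chosen for the example, not part of any particular monitoring product, and a real system would likely use something more robust than a simple deviation test.

```python
from collections import deque
from statistics import mean, pstdev


class BaselineMonitor:
    """Flags samples that deviate sharply from a rolling baseline.

    All parameters are illustrative assumptions:
    window      - how many recent samples form the baseline
    threshold   - how many standard deviations count as anomalous
    min_samples - samples required before anything is flagged
    """

    def __init__(self, window=30, threshold=3.0, min_samples=5):
        self.history = deque(maxlen=window)
        self.threshold = threshold
        self.min_samples = min_samples

    def observe(self, value):
        """Record a sample and return True if it looks anomalous."""
        anomalous = False
        if len(self.history) >= self.min_samples:
            baseline = mean(self.history)
            spread = pstdev(self.history)
            if spread > 0:
                anomalous = abs(value - baseline) / spread > self.threshold
            else:
                # A perfectly flat baseline: any change is a deviation.
                anomalous = value != baseline
        self.history.append(value)
        return anomalous
```

Feeding it, say, booking-page response times in milliseconds would return False for samples near the recent average and True for a sudden jump. What counts as a meaningful deviation depends on the metric and would need tuning.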
In a situation where time is crucial, simply detecting a problem isn’t enough. You need to have a system in place that will automatically generate alerts based on criteria such as component failures, out-of-bounds performance metrics, and anomalous user or system behavior.
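The idea of generating alerts from explicit criteria can be sketched in a few lines of Python. Everything here is hypothetical: the metric names (`components`, `booking_latency_ms`, `booking_attempts`, `booking_failures`), the thresholds, and the severity labels are assumptions made up for the example, not the schema of any real monitoring tool.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Alert:
    rule: str
    severity: str
    detail: str


@dataclass(frozen=True)
class Rule:
    name: str
    severity: str
    # Returns a detail string when the rule fires, otherwise None.
    check: Callable[[dict], Optional[str]]


def component_down(sample):
    """Component failure: any component reporting unhealthy."""
    down = [name for name, ok in sample.get("components", {}).items() if not ok]
    return f"components down: {', '.join(sorted(down))}" if down else None


def latency_out_of_bounds(sample, limit_ms=2000):
    """Out-of-bounds performance metric (limit is an assumed value)."""
    latency = sample.get("booking_latency_ms")
    if latency is not None and latency > limit_ms:
        return f"booking latency {latency} ms exceeds {limit_ms} ms"
    return None


def booking_failure_spike(sample, max_ratio=0.05):
    """Anomalous behavior: too many failed booking attempts."""
    attempts = sample.get("booking_attempts", 0)
    failures = sample.get("booking_failures", 0)
    if attempts and failures / attempts > max_ratio:
        return f"{failures}/{attempts} booking attempts failed"
    return None


RULES = [
    Rule("component-failure", "critical", component_down),
    Rule("latency", "warning", latency_out_of_bounds),
    Rule("booking-failures", "critical", booking_failure_spike),
]


def evaluate(sample, rules=RULES):
    """Run every rule against one metrics sample and collect alerts."""
    alerts = []
    for rule in rules:
        detail = rule.check(sample)
        if detail is not None:
            alerts.append(Alert(rule.name, rule.severity, detail))
    return alerts
```

Keeping each criterion as its own small function makes it easy to add or retire rules without touching the evaluation loop.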
Once alerts are generated, they need to be processed. Much (if not all) of this processing can be automated. This includes initial filtering to weed out alert noise and low-priority incidents. Alerts must also be routed to the correct responders and dispatched using the appropriate methods of communication. Depending on the priority of the incident, alerting should include backup responders in case there’s a problem in contacting the first response team.
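The filtering, routing, and backup-responder steps described above might look something like the following Python sketch. The on-call table, channel mapping, responder names, and the `notify` callback are all hypothetical placeholders. In practice these would come from your incident management platform rather than hard-coded dictionaries.

```python
from dataclasses import dataclass

PRIORITY = {"low": 0, "warning": 1, "critical": 2}


@dataclass(frozen=True)
class Alert:
    service: str
    severity: str
    summary: str


# Hypothetical on-call table: service -> (primary, backup).
ON_CALL = {
    "booking": ("booking-primary", "booking-backup"),
    "payments": ("payments-primary", "payments-backup"),
}
DEFAULT_ON_CALL = ("ops-primary", "ops-backup")

# Hypothetical communication channel per severity.
CHANNEL = {"warning": "chat", "critical": "phone"}


def filter_alerts(alerts, min_severity="warning"):
    """Drop low-priority alerts and exact duplicates, keeping order."""
    seen = set()
    kept = []
    for alert in alerts:
        if PRIORITY[alert.severity] < PRIORITY[min_severity]:
            continue  # low-priority incident
        if alert in seen:
            continue  # duplicate alert noise
        seen.add(alert)
        kept.append(alert)
    return kept


def dispatch(alert, notify):
    """Route an alert to its responders and return who acknowledged.

    notify(responder, channel, alert) is a caller-supplied function
    that returns True if the responder acknowledged. Critical alerts
    fall back to the backup responder; others try the primary only.
    Returns None if nobody acknowledged.
    """
    primary, backup = ON_CALL.get(alert.service, DEFAULT_ON_CALL)
    channel = CHANNEL.get(alert.severity, "email")
    responders = [primary, backup] if alert.severity == "critical" else [primary]
    for responder in responders:
        if notify(responder, channel, alert):
            return responder
    return None
```

Passing `notify` in as a parameter keeps the routing logic separate from the delivery mechanism, so the same code can be exercised with a stub before it is connected to real paging or messaging.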
Response teams should be trained and equipped to quickly determine the extent of the incident and contain the damage. Training should encompass standard IT response team responsibilities such as accurate triage and diagnosis, short-term repair or replacement, interim steps to get the system up and running, and recommendations for long-term remediation. In addition, they must be able to assess the extent of the immediate damage, quickly determine what customers or services may have been affected, and immediately report their findings to the damage control team.
In the travel and hospitality industry, damage control should ideally consist first of taking care of the most immediate and time-sensitive problems (such as switching to backup booking and scheduling systems, and handling near-term reservation failures), followed by a quick cleanup of any booking and reservation failures that may occur over the next few hours.
With a modern incident management system in place, even serious IT failures can be contained so that damage is limited. With real-time alerting and response orchestration, IT failures can even be turned into positive opportunities to demonstrate to the public that your organization can be counted on to swiftly and effectively take care of customers, even in the event of a system breakdown.
While serious IT incidents may be unavoidable in the hospitality industry, if you have the right tooling and process in place, they will be far less catastrophic.