PagerDuty Blog

How Retailers Can Prevent Downtime with Incident Management

In recent years, it has felt as if many major brands suffered major infrastructure failures on one of the busiest holiday shopping days, Black Friday. That’s because many of them have.

Seeing big names associated with website crashes and business disruptions may be intimidating to many admins. If large international retailers struggle to keep their infrastructure running smoothly on the busiest shopping day of the year — when they know far in advance that they’ll need all hands on deck — how can smaller companies prevent downtime on a normal day?

That’s a daunting question. Fortunately, it doesn’t mean that hope is lost for everyone. By following the right incident management procedures, even small teams can minimize the impact of inevitable disruptions to business operations.

This post explains how to do that, with a focus on the needs of retailers.

Defining Retailer Priorities

To perform effective monitoring and incident management for a retailer, admins first have to understand what a retailer’s top requirements are when it comes to infrastructure availability and uptime.

For most modern retailers, which have both brick-and-mortar and online sales outlets, the following are essential:

  • Keep customer-facing websites operating. This is difficult because, by definition, customer-facing sites are on the public Internet, where they may be subject to attacks (intrusion attempts, DDoS attacks, and so on), not to mention the threat of crashing from simple, non-malicious traffic spikes. These sites are also essential for retailers, because they fuel sales. Customers typically use websites to plan future purchases, whether those purchases are ultimately made online or in-store.
  • Keep backend systems running. Backend servers, which handle tasks like keeping track of inventory and storing transaction histories, are also vital for business operations. While they can generally be secured more easily from attackers than public-facing sites because they can run on private networks, backend systems are more vulnerable in other ways. For example, they are likely to contain highly sensitive information, which makes effective monitoring essential.
  • Ensure uptime of point-of-sale (POS) systems. Brick-and-mortar retailers can’t make sales if their POS terminals crash. Keeping these systems running requires effective management of a complex mix of variables, from local network connectivity to physical security and power supply.
  • Protect IoT assets. As retailers make greater use of the Internet of Things (IoT) to personalize and automate workflows, guaranteeing the uptime and connectivity of the devices and sensors that power retail operations is crucial. In this respect, the move toward highly automated, device-based business operations also raises new challenges for organizations in the context of monitoring.

These are retailers’ primary requirements to ensure completed transactions. Now, let’s discuss how monitoring and incident management can be used to meet key challenges.

Preventing Retailer Downtime

If you want to keep the most important parts of your retail infrastructure running smoothly, you’ll want to adhere to these guidelines:

  • Maximize visibility across the infrastructure. With so many variables in play, retailers tend to have especially complex and diverse IT infrastructure. As noted above, it includes not just public websites, but also backend systems and a variety of special-purpose devices and sensors. To keep track of infrastructure like this, organizations require across-the-board visibility. All monitoring information needs to be centralized into a single location, as that’s the only way to truly make sense of it.
  • Deploy flexible monitoring solutions. A diverse infrastructure also requires diverse monitoring tools. Retailers should make sure that they have monitoring agents installed on all the different parts of their infrastructure, and that the monitoring information they collect is forwarded and normalized within a central management platform.
  • Respond in real time. For retailers, just a few hours (or even minutes) of downtime on a sales site or POS system can have very costly repercussions. In addition to the sales lost as a direct result of the downtime, companies also suffer damage to their reputations, so the effects can last for months. To mitigate these risks, retailers need to be sure that their incident management systems and workflows enable real-time response powered by actionable insights, so that service is restored as quickly as possible.
  • Communicate effectively. One of the challenges of incident management in the retail sector is that a company’s infrastructure tends to be very large and very distributed, especially for retailers that have large networks of stores and warehouses. The admins who keep infrastructure running are likely to be distributed, too. Addressing this challenge requires an incident management system that provides seamless communication tools, and takes advantage of collaborative, shared workflows such as ChatOps. This way, a large team of admins spread out over a wide area can communicate effectively when resolving problems.
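To make the first three guidelines more concrete, here is a minimal Python sketch of what centralizing and normalizing monitoring data might look like. Everything in it is an assumption made for illustration: the payload shapes for the website, POS, and IoT sources, the `Event` schema, the severity thresholds, and the `normalize` and `triage` helpers are hypothetical and do not represent any particular product’s API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical severity ranking; a higher number means more urgent.
SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}


@dataclass(frozen=True)
class Event:
    """Shared schema that every monitoring source is mapped onto."""
    source: str       # "website", "pos", or "iot" in this sketch
    component: str    # the specific site, terminal, or device
    severity: str     # a key of SEVERITY_RANK
    message: str
    timestamp: datetime


def _utc(epoch_seconds: float) -> datetime:
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


def normalize(source: str, raw: dict) -> Event:
    """Convert a source-specific payload (assumed shapes) into an Event."""
    if source == "website":
        # Assumed payload: {"url": str, "status": int, "ts": epoch seconds}
        severity = "critical" if raw["status"] >= 500 else "info"
        return Event(source, raw["url"], severity,
                     f"HTTP {raw['status']}", _utc(raw["ts"]))
    if source == "pos":
        # Assumed payload: {"terminal_id": str, "online": bool, "ts": ...}
        severity = "info" if raw["online"] else "critical"
        state = "online" if raw["online"] else "offline"
        return Event(source, raw["terminal_id"], severity,
                     f"terminal {state}", _utc(raw["ts"]))
    if source == "iot":
        # Assumed payload: {"device": str, "battery_pct": int, "ts": ...}
        severity = "warning" if raw["battery_pct"] < 20 else "info"
        return Event(source, raw["device"], severity,
                     f"battery at {raw['battery_pct']}%", _utc(raw["ts"]))
    raise ValueError(f"unknown monitoring source: {source}")


def triage(events: list[Event]) -> list[Event]:
    """Keep actionable events, most severe first, oldest first within a level."""
    actionable = [e for e in events
                  if SEVERITY_RANK[e.severity] >= SEVERITY_RANK["warning"]]
    return sorted(actionable,
                  key=lambda e: (-SEVERITY_RANK[e.severity], e.timestamp))


if __name__ == "__main__":
    raw_feed = [
        ("website", {"url": "shop.example.com/checkout", "status": 503, "ts": 1700000100}),
        ("pos", {"terminal_id": "store-12-till-3", "online": False, "ts": 1700000000}),
        ("iot", {"device": "shelf-sensor-7", "battery_pct": 15, "ts": 1700000050}),
    ]
    for event in triage([normalize(s, r) for s, r in raw_feed]):
        print(event.severity, event.source, event.component, event.message)
```

The point of the sketch is the shape of the workflow rather than the details: each source keeps its own agent and payload format, but everything is mapped onto one schema before anyone has to make a decision, so responders see a single prioritized queue instead of several disconnected dashboards.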

It’s safe to say that the threat of downtime will never be eradicated completely. But modern monitoring and incident management solutions play a key role in helping retailers large and small avoid becoming the next headline about a major service failure.

Download our latest ebook to learn more about incident management for retailers and the impact of downtime on retailers, or contact us today.
