April 1, 2025

Protecting Customer Experience Through Operational Resiliency Planning

By Leigh Shevchik

Customer experience is a key differentiator in industries where consumers have a great deal of choice. In retail, the drive to build a competitive in-store experience has led retailers to adopt more technology.

Shoppers rarely think about the tools behind the omnichannel experiences they crave. However, with consumer loyalty slipping across the board, retailers must carefully consider each element of their tech stack.1 Every new addition brings the risk of outages, breaches, or other serious incidents.

Digital outages pose a special risk to companies that rely on technology to support in-store and other physical customer experiences. Building resiliency into your tech stack and customer service operations can help mitigate the risk of incidents that directly impact the customer experience.

Unique resiliency considerations for in-person operations

The consequences of outages are more widespread for companies that offer both in-person and digital interactions.

An outage that hits a key system disrupts online and on-site experiences at the same time.

Between retail’s high turnover rates and frequent use of temporary employees, in-store workers may not be trained for high-pressure situations. Times of high demand, like Black Friday or other holiday sales, come with the risk of overloading systems.

The scope of the fallout depends on factors including the number of legacy systems involved, dependencies, and the potential for automation to aid in the restoration process. For retailers, many of whom still rely on legacy systems due to various market pressures, the risk of an extended outage is high.2

Challenges for customer-facing employees

In-store and on-site staff are the face of any outage that affects in-person experiences. They must navigate multiple challenges while upholding the best customer experience possible.

In-store staff may be the first to notice outages, meaning they must face the repercussions before corporate support logs in to help. Even once your IT and executive teams are on the case, these employees will be the ones informing customers of the outage.

The more urgent or time-bound a customer's need, the more stressful the interaction. A customer who woke up early to get the pre-6 a.m. doorbuster doesn’t have time to wait for a solution. To uphold their experience, a sales associate will need a finely tuned message and some form of compensation to offer.

Typically, such elements are passed down from corporate. Coordinating communications among distributed teams of harried employees is not an easy task on a good day. In times of crisis, front-line employees are harder to reach and have less ability to craft the optimal response.

All customer touchpoints can be disrupted

Each element of your tech stack that affects in-store or on-site customer experiences adds another layer of risk. Not all outages harm your customer experience to the same extent, but any deviation from typical operations disturbs customers.

Payments: Outages can affect your ability to process debit and credit cards, accept payments from digital wallets, and offer Buy Now, Pay Later (BNPL) options. Nearly half of Americans (42%) don’t concern themselves with carrying cash and may be unable to complete a purchase if your systems are down.3

Personalization: When loyalty programs go down, customers may lose access to personalized discounts or be unable to earn rewards on their purchases. Businesses also miss the chance to collect customer purchase data. The 63% of Americans who make purchasing decisions based on their loyalty programs may opt out of buying if the incentive isn’t present.4 Plus, missed data collection will harm future efforts to engage customers through personalization.

Omnichannel experiences: Convenience-focused buyers have adopted Buy Online, Pick Up In Store (BOPIS) or learned to use real-time inventory availability to plan their shopping trips. Outages can disrupt orders or result in a mismatch between in-store and online inventory counts. Because BOPIS is a big driver of foot traffic, shopper numbers may dip during an incident.5

Appointments: For businesses that offer appointment-based services, scheduling software and CRMs are essential to keeping things running. Downtime can disrupt automated reminders, keep front-line workers from accessing important customer context and account information, and confuse employees who are used to relying on the system to organize their day.

Internal systems: Customer-facing tools aren’t the only place outages can cause trouble. If your staff scheduling tool or time card application goes offline, retail locations can fall into disarray due to a lack of employees on the floor.

The distributed footprint of retail operations increases recovery time because individual terminals often need to be rebooted or updated. Additionally, the nature of an outage can result in extended fallout. An appointment scheduler that loses a week’s worth of data will leave employees scrambling to catch up. Outages in supply chain or inventory management software can result in logistics snarls that take weeks to unwind.

Direct and indirect consequences of in-person service disruptions

The losses caused by a serious outage are often amplified by the expectations surrounding in-person experiences. Customers come into the store seeking certain experiences, the majority of which blend the physical and digital.6

Because customers have already chosen the less convenient purchasing method, any additional inconvenience will seem even worse. While some will stick out the hardship, others will leave empty-handed.7

A survey of retailers found that 23% estimate their hourly losses during an outage exceed $1 million.7

Each poor or missed customer interaction can also lead to indirect losses. Irritated shoppers may share their negative experiences on social media or post a bad review online. They might visit less frequently, decreasing their customer lifetime value (CLTV). And more than half of consumers will simply switch to a competitor after a single bad experience with a brand.8

Optimizing the incident management lifecycle for customer experience

Most businesses run on a complex web of vendors and systems, all of which introduce risk into your operating ecosystem. PagerDuty Operations Cloud is built for this reality and contains tools to help companies prepare for and navigate serious outages in a way that minimizes risk.

Before: Assess and minimize risk factors

The goal of operational resiliency is to minimize the risk of incidents and prepare your team to respond efficiently and effectively when problems arise. These four steps will help you gain a thorough understanding of your company’s risk factors.

  1. Increase visibility: A sprawling tech stack is hard to monitor effectively. Rather than making IT chase down bugs in various apps’ interfaces, integrate your systems for more effective monitoring. PagerDuty integrates with over 700 commonly used enterprise applications. Operations Cloud brings together the most important systems information into centralized monitoring dashboards so IT can quickly spot and respond to serious issues.
  2. Deflect noise: PagerDuty makes it easy to create automated workflows and runbooks to address frequent low-level problems. Taking care of minor annoyances automatically frees up IT to do higher-level work.
  3. Develop communications plans: When something goes wrong, everyone needs to know where to turn for information. Set up contingencies to empower employees in their customer interactions and maintain messaging consistency among distributed teams.

    A good communications plan isn’t just internal, either. Reach out to third- and fourth-party vendors now to learn their incident management and communications plans, so stakeholders at your company know how to respond when an outage starts.

  4. Assign roles: Every employee should know what will be managed at the corporate level and what will be delegated to individual locations. After making this distinction, clarify which seniority levels will be responsible for each part of your response plan. Preparing workers will make them feel more in control during the uncertainty of an outage and prevent overzealous or underconfident team members from inadvertently throwing a wrench into your operations.

These steps aren’t exercises to be completed and then put away. Stakeholders, including IT, customer-facing teams, and executives, should all know and understand the plan ahead of time. Incident management drills keep employees fresh on their responsibilities and can uncover weaknesses or gaps in your procedures.

During: Clarify, contain, and communicate

Planning incident management with an eye on customer experience is not just about fixing disruptions. It’s also about how customers are treated during the incident and whether your response efforts earn goodwill or break customer trust.

Clarify: The first step of any incident response is to understand what the problem is and where it’s happening. PagerDuty can help you identify the root cause of the problem.

PagerDuty’s remote operations management tools track phased rollouts, so they can catch errors as soon as the first wave of reboots completes. Customer Service Operations, another PagerDuty offering, connects your customer service and IT systems. If customers are the first to catch the error, their reports reach employees who can act on them.

If the incident is widespread before your team catches wind of it, PagerDuty AIOps can analyze the cascade of alerts and events to pinpoint where everything started.
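To make the idea of tracing a cascade concrete, here is a minimal sketch of one simple heuristic: among the services that are alerting, pick the earliest one whose own dependencies are healthy. This is an illustration only and not how PagerDuty AIOps works internally. The `Alert` type, the `likely_origin` function, the service names, and the dependency map are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str      # hypothetical service name
    timestamp: float  # seconds since the incident window opened


def likely_origin(alerts, depends_on):
    """Guess which alerting service started the cascade.

    depends_on maps each service to the services it relies on.
    Returns None when there are no alerts.
    """
    # Record the earliest alert time per service.
    first_seen = {}
    for alert in alerts:
        seen = first_seen.get(alert.service)
        if seen is None or alert.timestamp < seen:
            first_seen[alert.service] = alert.timestamp
    if not first_seen:
        return None

    # A service is a candidate origin if none of its dependencies are alerting.
    candidates = [
        service for service in first_seen
        if not any(dep in first_seen for dep in depends_on.get(service, ()))
    ]
    # A dependency cycle could rule out every service; fall back to all of them.
    pool = candidates or list(first_seen)
    return min(pool, key=lambda service: first_seen[service])


# Hypothetical alert stream from a retail stack.
alerts = [
    Alert("pos-terminals", 42.0),
    Alert("loyalty-api", 30.0),
    Alert("payments-gateway", 12.0),
    Alert("pos-terminals", 15.0),
]
depends_on = {
    "pos-terminals": ["payments-gateway", "loyalty-api"],
    "loyalty-api": ["payments-gateway"],
}
origin = likely_origin(alerts, depends_on)
```

In this made-up example, the point-of-sale terminals and the loyalty API both depend on the payments gateway, so the gateway is the only alerting service with no alerting dependency of its own. A real correlation engine would weigh far more signals than timestamps and a static dependency map.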

Contain: When a major incident hits, IT needs to focus on finding a fix. Modern incident management uses automation to handle lower-stakes problems that don’t need an expert’s touch.

Auto-remediation prevents smaller issues from escalating. It can decrease the blast radius by, for example, stopping your phased rollout so fewer locations are affected. PagerDuty Automation steps in so IT doesn’t have to, using alerts and error messages as triggers to deploy the runbooks you set up during your planning phase.
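As a rough illustration of limiting blast radius, the sketch below deploys an update one wave of stores at a time and halts as soon as a wave's error rate crosses a threshold. It is a generic pattern written under stated assumptions, not PagerDuty Automation's API: `run_phased_rollout`, the `deploy` and `error_rate` callables, the store names, and the 5% threshold are all hypothetical.

```python
def run_phased_rollout(waves, deploy, error_rate, max_error_rate=0.05):
    """Deploy wave by wave, halting when a wave looks unhealthy.

    waves          -- list of waves, each a list of store identifiers
    deploy         -- callable that applies the update to one store
    error_rate     -- callable returning the observed error rate for a wave
    max_error_rate -- assumed threshold above which the rollout stops

    Returns (completed_waves, halted_wave); halted_wave is None on success.
    """
    completed = []
    for wave in waves:
        for store in wave:
            deploy(store)
        # Check health after each wave so later waves are never touched
        # when an earlier one fails.
        if error_rate(wave) > max_error_rate:
            return completed, wave
        completed.append(wave)
    return completed, None


# Hypothetical usage: the second wave reports a high error rate.
deployed = []
waves = [["store-001", "store-002"], ["store-003", "store-004"], ["store-005"]]


def fake_error_rate(wave):
    # Stand-in for real monitoring data.
    return 0.5 if "store-003" in wave else 0.0


completed, halted = run_phased_rollout(waves, deployed.append, fake_error_rate)
```

Here the rollout stops after the second wave, so the third wave's store never receives the faulty update. In practice the halt step would be one action in a runbook that also notifies the responders.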

PagerDuty’s Incident Management toolset further boosts efficiency with its guided remediation capabilities. Because PagerDuty has full visibility over your system, it can surface to your IT team the issues that need to be prioritized.

Communicate: Executives, customer service teams, front-line employees, and your end customers will all want to know what’s going on. Providing play-by-play updates is a waste of time for IT, but rapidly evolving situations like outages need more than a daily status update. Slack, Teams, or email alerts sent on pre-set triggers help keep everyone on top of IT’s progress.

Front-line employees, who often can’t be reached except over in-store PA systems, need to be able to find the information on their own terms. Automated update pages are easy for them to visit via a phone or in-store terminal. PagerDuty’s Customer Service Operations tools also make it easy for customer-facing employees to coordinate across multiple locations and keep customers up to date.
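The pre-set triggers described above can be pictured as a small lookup from incident status changes to the audiences that should hear about them. The following is a minimal sketch, assuming hypothetical status names, audience names, and an `updates_for` helper. It only builds the messages and does not call Slack, Teams, email, or any PagerDuty API.

```python
# Hypothetical mapping: which audiences get an update at each status change.
TRIGGERS = {
    "acknowledged": ["executives", "customer-service"],
    "mitigated": ["executives", "customer-service", "store-status-page"],
    "resolved": ["executives", "customer-service", "store-status-page"],
}


def updates_for(previous_status, new_status, summary):
    """Return (audience, message) pairs to send for a status change.

    Nothing is sent when the status has not changed, which keeps IT from
    having to narrate every step of its work.
    """
    if new_status == previous_status:
        return []
    message = f"[{new_status.upper()}] {summary}"
    return [(audience, message) for audience in TRIGGERS.get(new_status, [])]


# Hypothetical usage when responders acknowledge an incident.
updates = updates_for(
    "triggered", "acknowledged", "Card payments failing in some stores"
)
```

Keeping the mapping in one place means the same trigger can feed a chat channel for executives and an update page that store staff check from a phone or terminal.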

After: The cycle of continuous learning

After an incident is resolved, companies that put their customers first take the time to understand what happened. Analyzing the technical problems and the shape of your company’s response allows you to improve your incident management processes.

PagerDuty automates the production of post-incident reports that trace your team’s response step-by-step. Providing shared visibility across departments and teams can spark discussions about ways to improve collaboration and communications. PagerDuty reports bring together information from across all your systems, simplifying reporting and offering deeper insights. PagerDuty’s remediation records can also help your company comply with any regulations regarding breaches, hacks, or other incidents.

As your team levels up from your postmortem, your tools should as well. PagerDuty’s machine learning capabilities analyze the incident and your response to optimize the tool’s auto-remediation and guided remediation capabilities. With your team and your incident management suite making improvements, your next incident response will be faster and more efficient.

The “after” phase then loops you back into the “before” stage as you communicate and act on what you’ve learned. Front-line employees won’t get the chance to review the incident unless individual locations do manual debriefs. Prepare them for the next one by updating your incident management plan to further clarify communications practices and the steps each person should take.

Outages happen, but bad customer experiences don’t have to

Like all good infrastructure, a well-tuned tech stack is virtually invisible as long as things are going well. Operational resiliency planning can help you maintain that invisibility and uphold the customer experience as much as possible during any outages.

Empowering customer-facing and IT teams with an incident management plan and robust tools like PagerDuty will reduce the severity of an outage or other service disruption. While you can’t prepare for every individual error or hardship you might face, an overarching strategy that guides your team during times of uncertainty will pay off when the unexpected happens.

About the Author

Leigh is a Senior Content Strategist at PagerDuty.