PagerDuty Blog

Best practices to help retailers make the grade for the holiday season

It’s hard to believe we’re already talking about the return to school, but it’s set to be a big one. In fact, this year promises to be the biggest in the last five years. The National Retail Federation expects back-to-school spending to reach $37.1B, up from $33.9B last year. Back-to-college spending is also expected to rise, reaching $71B this year.

This increase is buoyed by parents and students gearing up for their first in-person classes after a year of virtual learning. Categories predicted to see the highest demand are footwear (21%), school supplies (16%), and apparel (14%).

The last 18 months have changed retail beyond measure, and back-to-school shopping is no exception. US shoppers have also recently reported becoming more anxious about visiting stores due to rising cases of the Delta variant. This means much of this year’s back-to-school shopping will take place online rather than in store.

This will lead to increased digital traffic. Though online retailers experience periods of peak traffic each year around big events like Black Friday and Cyber Monday, the ongoing uncertainty around the pandemic adds a new dimension. Retailers should be ready to deal with more peak traffic events, and back-to-school shopping is their first test.

In a world where uptime is money, and issues can affect brand reputation and customer satisfaction, retailers need to be able to quickly identify and resolve incidents.

Enabling hypercare mode

Hypercare is a period of time where a planned and elevated level of support is available to ensure seamless adoption or operation of a system. The key word is planned. This includes major events like Black Friday or Christmas, rather than something like a DDoS attack. Here are three things retailers can do to enable hypercare mode to help them prepare for the upcoming holiday season and beyond.

Document incident response processes: On-call can be daunting, especially when traffic levels are high. It’s important to make sure that everyone who could be asked to respond to an incident knows what procedures to follow. In many organizations, this information is tribal knowledge: knowledge gained through expertise and time that is not codified. The scripts, tools, historical information, and other resources that help on-call team members get to the bottom of an incident often exist only in the heads of subject matter experts (SMEs).

To accelerate incident response, retailers should standardize incident response by capturing this tribal knowledge and automating toilsome tasks when possible. This allows teams to be more confident when responding to critical incidents. When documenting, make sure you define:

  • Incident response roles and responsibilities: When roles and responsibilities are not standardized, you may find yourself tripping over teammates during response. Or, crucial communications might fall through the cracks. By standardizing who does what, you can ensure all your bases are covered. Common roles include: incident commander, deputy, scribe, internal liaison, customer liaison, and SME. While all these roles may not be necessary for smaller incidents, it’s important to have a structure in place in case you have to resolve a complicated critical incident. If you want to read more about roles and responsibilities, check out our Incident Response Ops Guide.
  • Severity levels and escalation protocols: As noted above, you may not need all hands on deck for every single incident. But when should you alert the next tier of responders? At what point do you need to delegate communication to a specific role? Many of these decisions are based on severity levels. The higher the severity of an incident, the more escalations are likely involved, and the more team members required to resolve the issue. To learn more about how to determine severity levels, check out this resource.
  • Runbooks to leverage for common failures: In an ideal world, no incident would occur the same way twice, but in practice some issues recur. Even if they aren’t exact replicas, they are similar enough that runbooks can help teams resolve incidents faster. A runbook is an actionable process, followed when these common issues and tasks occur, that gives the operator detailed instructions for quickly and effectively solving the problem, no matter how new or experienced they are on the team. This documentation means anyone holding the pager can feel confident in their ability to resolve some of the most common issues.
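
As an illustration of the severity and escalation bullet above, here is a minimal Python sketch of how a team might write down its severity levels so that "who gets paged" and "when do we escalate" are not left to memory. The level names (SEV-1 to SEV-3), the descriptions, the role assignments, and the escalation timeouts are all hypothetical assumptions for this example; they are not PagerDuty defaults or recommendations.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class SeverityPolicy:
    """One row of a hypothetical severity table."""
    description: str
    roles: Tuple[str, ...]          # roles to page at this severity
    escalate_after_minutes: int     # escalate if unacknowledged this long


# Assumed values for illustration only; tune these to your own organization.
POLICIES: Dict[str, SeverityPolicy] = {
    "SEV-3": SeverityPolicy(
        "Minor issue, no customer impact",
        ("sme",),
        60,
    ),
    "SEV-2": SeverityPolicy(
        "Partial outage, some customers affected",
        ("incident commander", "scribe", "sme"),
        15,
    ),
    "SEV-1": SeverityPolicy(
        "Critical outage, checkout or site unavailable",
        ("incident commander", "deputy", "scribe",
         "internal liaison", "customer liaison", "sme"),
        5,
    ),
}


def roles_to_page(severity: str) -> Tuple[str, ...]:
    """Return the roles that should be paged for a severity level."""
    if severity not in POLICIES:
        raise ValueError(f"Unknown severity: {severity!r}")
    return POLICIES[severity].roles


def should_escalate(severity: str, minutes_unacknowledged: int) -> bool:
    """True once an incident has gone unacknowledged past its timeout."""
    return minutes_unacknowledged >= POLICIES[severity].escalate_after_minutes
```

Keeping the table in one place means a smaller incident pages only the SME, while a critical one brings in the full set of roles listed above.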

The processes that are most important for you to document ahead of the holiday season come from lessons learned during back-to-school. If your team had trouble knowing who owned which services, which runbooks to use, or anything similar, examine those pain points and make sure there’s documentation to address them.

Practice makes perfect: It’s one thing to have these processes documented; it’s another to be able to execute them seamlessly under pressure. While the pressure of Black Friday or Cyber Monday is hard to replicate, dry runs can still help your team understand what to do when incidents happen.

Retailers looking to make the most out of the holiday season should examine incidents that occurred during the peak back-to-school shopping period. Then, once the team has determined a few good incidents to dig into, it’s time to roleplay. To practice the steps to take when dealing with major incidents, teams can organize mock incident response scenarios or “game days.”

To make a game day successful, there are a number of steps to consider. PagerDuty customer Shopify wrote a blog post on how their team plans game days. Firstly, the team lists everything that could break. Then team members compare the estimated impact of each potential issue against how difficult it would be to simulate.

Secondly, the team creates controlled experiments by taking the list of things that could break and thinking about how each one will break and what impact it will have.

Thirdly, the team analyzes its performance to see how it works under pressure to help create protocols. To do this, the team defines the types of interactions expected during an incident, and measures how this stacks up as the incident unfolds.

Finally, the team patches any holes identified during the game day testing to ensure as much risk as possible has been mitigated.
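
The first step above, comparing the estimated impact of each potential failure against how hard it is to simulate, can be sketched in a few lines of Python. This is only one possible way to do the comparison: the 1-to-5 scales, the impact-divided-by-difficulty score, and the example scenarios are assumptions made for illustration, not Shopify's actual method.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Scenario:
    """A hypothetical failure scenario considered for a game day."""
    name: str
    impact: int      # estimated impact, 1 (low) to 5 (high)
    difficulty: int  # effort to simulate, 1 (easy) to 5 (hard)

    def __post_init__(self) -> None:
        for label, value in (("impact", self.impact),
                             ("difficulty", self.difficulty)):
            if not 1 <= value <= 5:
                raise ValueError(f"{label} must be between 1 and 5")

    @property
    def score(self) -> float:
        # Assumed heuristic: favor high impact that is cheap to simulate.
        return self.impact / self.difficulty


def prioritize(scenarios: Iterable[Scenario]) -> List[Scenario]:
    """Order scenarios from most to least worthwhile to rehearse."""
    return sorted(scenarios, key=lambda s: (-s.score, s.name))


# Example candidates (invented for this sketch).
CANDIDATES = [
    Scenario("CDN outage", impact=4, difficulty=4),
    Scenario("Checkout database failover", impact=5, difficulty=2),
    Scenario("Search latency spike", impact=3, difficulty=1),
]

for scenario in prioritize(CANDIDATES):
    print(f"{scenario.score:.1f}  {scenario.name}")
```

A simple ratio like this is easy to argue about in a planning meeting; teams that want to weight impact more heavily could swap in a different scoring function without changing the rest.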

Game days can’t totally erase the possibility of incidents occurring, but they can help teams be more prepared to face them and more aware of team strengths and weaknesses. Practicing also helps take some of the pressure off on-call team members.

If you want to learn more about how Shopify leverages PagerDuty, you can watch this short video.

Adopt a full-service ownership mindset for the long term: When a major incident strikes, it pays to be prepared to respond as soon as possible. A key tactic that can help is adopting a full-service ownership (FSO) mindset. Full-service ownership means that people take responsibility for supporting the software they deliver, at every stage of the software/service lifecycle. This can help to ensure that the right people are notified whenever trouble strikes.

Here are some of the questions teams typically ask when they start moving towards an FSO model:

  • What are my services and what do they deliver, both technical and business? This involves creating a list of all your services, defining each service, and explaining what they are responsible for.
  • What are the boundaries of each service? This involves clearly defining where a service begins and ends, including its dependencies and relationships with other services.
  • Which teams or team members are responsible for each service? Each service should be owned by the team supporting it via on-call rotation. If multiple teams share responsibility, it’s better to split the service up into separate services where possible.
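
The three questions above map naturally onto a small service catalog. The following Python sketch shows one way a team might record what each service delivers, what it depends on, and who owns it, and then flag gaps. The field names, the example services, and the audit rules are hypothetical assumptions for illustration, not a PagerDuty data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Service:
    """A hypothetical catalog entry for one service."""
    name: str
    delivers: str                       # what the service is responsible for
    owners: List[str]                   # teams on call for this service
    depends_on: List[str] = field(default_factory=list)


def audit_catalog(catalog: Dict[str, Service]) -> List[str]:
    """Return human-readable findings about ownership and boundary gaps."""
    findings: List[str] = []
    for service in catalog.values():
        if not service.owners:
            findings.append(f"{service.name}: no owning team")
        elif len(service.owners) > 1:
            teams = ", ".join(service.owners)
            findings.append(
                f"{service.name}: shared by {teams}; consider splitting"
            )
        for dependency in service.depends_on:
            if dependency not in catalog:
                findings.append(
                    f"{service.name}: depends on undocumented "
                    f"service {dependency!r}"
                )
    return findings


# Example catalog (invented for this sketch).
CATALOG = {
    "checkout": Service(
        "checkout", "Takes payment and creates orders",
        owners=["payments"], depends_on=["inventory"],
    ),
    "search": Service(
        "search", "Product search results",
        owners=["search", "platform"],
    ),
    "recommendations": Service(
        "recommendations", "Suggested products on product pages",
        owners=[],
    ),
}

for finding in audit_catalog(CATALOG):
    print(finding)
```

Running an audit like this before a peak period gives a quick list of services that would leave responders guessing about who to notify.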

While adopting a full-service ownership model might not be something you can accomplish before this holiday season, it’s a journey you can embark on that will have outsized benefits for next year and each year to come.

Preparing for the surge

Hypercare is a critical tool in a retailer’s bag of tricks to keep the business always on, ensure customers are happy, and protect the bottom line. Even though the 2021 holiday season might seem a long way off, peak traffic will be here before we know it. Use what your team and others learned during the back-to-school rush to make positive changes going forward.

These changes can seem daunting, but small, incremental steps can yield outsized rewards. One switch teams can make is adopting a digital operations management platform as a partner. UK retailer John Lewis & Partners adopted a full-service ownership model enabled by PagerDuty.

PagerDuty allowed John Lewis & Partners to prepare and scale to handle a 10x increase in traffic at peak periods around Christmas, the Summer Sale, and Black Friday—at peak, there could be more than eight orders per second and tens of thousands of page views.

If you’re ready to see how PagerDuty can help you make the grade, try a 14-day free trial today. Or, if you’re looking to learn more about hypercare, download your copy of our Hypercare Readiness Checklist.