How retailers are improving productivity, transforming incident response, and empowering teams with PagerDuty
For retailers, uptime is money, and issues can cost thousands of dollars per minute. With infrastructure comprising complex services such as payment gateways, inventory, and mobile applications, maturing digital operations is vital for ensuring services are always on and customers get the best experience.
At our 2021 Summit, representatives from retailers Hudson’s Bay Company, Bunnings, and Loblaw discussed how PagerDuty is helping them to empower teams, achieve visibility, and proactively resolve problems.
Improving Productivity and Employee Satisfaction Through Noise Suppression
Canadian retail business group Hudson’s Bay Company uses PagerDuty for end-to-end incident response across its 100 services.
It found that its teams were receiving too many alerts. Without any priority mapping, unnecessary alerts often woke people up in the night. To reduce this alert fatigue, the first step was to gather data. The company pulled a year’s worth of alert history from PagerDuty and identified which services generated the most noise. As it turned out, the top four alerting services generated 80% of alerts. Next, the team met with service owners to create strategies around alert management. They did this in four ways:
- Prioritize alerts: Service owners assigned incidents that needed acknowledgement a priority level of P1 or P2 and sent less urgent P3 or P4 alerts via email and other less disruptive channels.
- Issue warnings instead of alerts: The teams replaced very low priority alerts (such as server memory or CPU utilization) with soft warnings.
- Group alerts based on time or type: Service owners decided how many alerts were needed before issuing a notification through the PagerDuty platform for noisier services.
- Improve taxonomy: By standardizing incident language, the company reduced confusion and improved cross-team collaboration.
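As an illustration only, the analysis and prioritization steps above could be sketched in Python as follows. This is a minimal sketch, not Hudson’s Bay’s actual tooling: the alert records, field names, service names, and channel names are all hypothetical, and in practice the alert history would come from a PagerDuty export rather than a hand-built list.

```python
from collections import Counter

# Hypothetical mapping: P1/P2 page the on-call engineer,
# P3/P4 go to a less disruptive channel such as email.
PAGING_PRIORITIES = {"P1", "P2"}


def noisiest_services(alerts, share=0.8):
    """Return the smallest set of services, ranked by alert volume,
    that together account for at least `share` of all alerts."""
    counts = Counter(alert["service"] for alert in alerts)
    total = sum(counts.values())
    selected, covered = [], 0
    for service, count in counts.most_common():
        if covered / total >= share:
            break
        selected.append(service)
        covered += count
    return selected


def notification_channel(priority):
    """Map a priority label to a notification channel."""
    return "page" if priority in PAGING_PRIORITIES else "email"


# Made-up alert history for illustration only.
history = (
    [{"service": "checkout", "priority": "P1"}] * 5
    + [{"service": "inventory", "priority": "P3"}] * 3
    + [{"service": "search", "priority": "P4"}] * 1
    + [{"service": "loyalty", "priority": "P4"}] * 1
)

print(noisiest_services(history))
print(notification_channel("P2"), notification_channel("P3"))
```

Ranking services by volume first makes it easy to see where a conversation with a service owner is likely to pay off most, which mirrors the order of work described above.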
Over the six-month project, Hudson’s Bay reduced total alerts by 61%, or around 8,000 alerts. The top four services reduced their alerts by 76%, or approximately 5,500. Mean time to acknowledge (MTTA) improved by 38%, with most incidents acknowledged within three to four minutes. Teams were delighted with the results, and the company regularly reviews its data to find new ways to improve.
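For readers who want to track a metric like MTTA themselves, the calculation can be sketched as below. This is a hedged example under stated assumptions: the field names `triggered_at` and `acknowledged_at` and the sample timestamps are invented for illustration, and the figures are not Hudson’s Bay data.

```python
from datetime import datetime
from statistics import mean


def mean_time_to_acknowledge(incidents):
    """Average seconds between an incident being triggered and acknowledged.

    Incidents that were never acknowledged are skipped.
    Returns None when there is nothing to average.
    """
    durations = [
        (inc["acknowledged_at"] - inc["triggered_at"]).total_seconds()
        for inc in incidents
        if inc.get("acknowledged_at") is not None
    ]
    return mean(durations) if durations else None


def percent_improvement(before, after):
    """Relative reduction from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


# Made-up incidents for illustration only.
sample = [
    {
        "triggered_at": datetime(2021, 6, 1, 2, 0, 0),
        "acknowledged_at": datetime(2021, 6, 1, 2, 3, 0),  # 3 minutes
    },
    {
        "triggered_at": datetime(2021, 6, 2, 14, 0, 0),
        "acknowledged_at": datetime(2021, 6, 2, 14, 4, 0),  # 4 minutes
    },
    {
        "triggered_at": datetime(2021, 6, 3, 9, 0, 0),
        "acknowledged_at": None,  # never acknowledged, so excluded
    },
]

print(mean_time_to_acknowledge(sample))
```

Comparing the value for two periods with `percent_improvement` gives the kind of before-and-after figure quoted above.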
Transforming Incident Response
Australian consumer hardware chain Bunnings wanted to bring its ecommerce function in-house, and needed a digital operations management platform to support its goals. Like many retailers, it had a complex and interconnected technology ecosystem, including inventory solutions, website authentication and authorization tools, and pricing platforms.
The project was an opportunity to re-platform to alternative technologies, and Bunnings wanted a system that would enable it to easily onboard new applications. Bunnings also noticed a significant increase in incidents during marketing campaigns or following price changes, and was looking for a way to scale its incident response process. It turned to PagerDuty for help.
Using PagerDuty’s intelligent event management, Bunnings quickly implemented major improvements to its incident management process. It mapped priorities to alerts, which instantly reduced alert noise, and improved team collaboration and communication by syncing notes between incidents. PagerDuty also enabled Bunnings to improve on-call rotations and escalation policies, and to resolve complex issues faster. For example, PagerDuty helped the team correlate issues affecting interdependent applications such as location look-up, checkout, and pricing.
With PagerDuty, Bunnings now has full-system visibility and an efficient incident response process that enables it to accurately measure MTTA and mean time to resolve (MTTR), and ensures alerts reach the correct teams. Engineers are empowered to take ownership of keeping databases of technical service information up to date. This information helps the business and support team accurately determine the impact of an issue.
Optimize Customer Experiences with Real-Time Operations
The pandemic piled more pressure on retailers than ever before as traffic moved from brick-and-mortar stores to online shopping. With this heightened load, many retailers faced unprecedented challenges in ensuring uptime across platforms. As our Summit presenters have shown, working with PagerDuty can help retailers proactively resolve problems, protect the bottom line, and keep ecommerce services always on.
Empowering Teams with Self-Reliant Engineering
Loblaw is a Canadian supermarket and pharmacy chain with more than 2,400 stores countrywide. Since launching an ecommerce business in 2012, its development arm Loblaw Digital has gradually shifted to a full-service ownership model. The retailer is working with PagerDuty to redefine its incident management processes.
Previously, Loblaw’s development teams relied on centralized change management and response orchestration. As its ecommerce business scaled and teams grew, it became clear that this approach was inefficient and unsustainable. For example, if a team wanted to make a deployment, it had to submit a request days in advance and go through a lengthy approvals process. Similarly, incident alerts went out to all teams, leading to alert chaos. To work smarter, Loblaw Digital needed to become self-reliant.
Moving to a full-service ownership model radically transformed both incident management and change management processes. With the help of PagerDuty, teams could now resolve their own incidents quickly and keep colleagues up to date without the need for long group calls. Teams also had the flexibility to deploy throughout the day without having to worry about a complex chain of approvals.
When the pandemic hit in 2020, Loblaw saw a tenfold increase in online traffic, and incidents went through the roof. Loblaw Digital believes its self-reliant approach enabled teams to be more resilient and nimble in the face of changing circumstances. All the work with PagerDuty to empower teams, stabilize systems, manage alerts, and establish on-call rotations and escalation policies put them in an ideal position to handle the load.