Lessons from Virtuoso: Three Steps You Can Take to Reduce Alert Volume by up to 94% in Three Weeks
How a Customer Greatly Reduced Alert Volume and Improved Their Operations with PagerDuty and Event Enrichment
We recently sat down with Shawn Motley, Senior DevOps Engineer at Virtuoso, to talk about his experiences with PagerDuty and the Event Enrichment Platform (EEP). Virtuoso is a travel portal for high-end clients, with over 200 employees and 8 web properties. As a global luxury travel company, image is an important part of Virtuoso’s business. If one of their clients is having issues while on an expensive vacation, their travel advisors need access to Virtuoso’s websites regardless of where they are or what time it is. Their websites need to be up all the time, period.
When Virtuoso began focusing on their DevOps initiative seven months ago, they were receiving thousands of events every 24 hours, the majority of which were noise. They needed to reduce alert volume, and fast. For most organizations, suppressing alerts in each individual monitoring system is not feasible given constraints on resources, time, and operational duties. With the EEP, suppressions are managed in a central location via an intuitive web UI, which encourages active event suppression and management. By using EEP and PagerDuty, the Virtuoso DevOps team was able to rapidly reduce their daily Operations event load to just a few events per day.
Step One: Put a System in Place
When they put PagerDuty and the Event Enrichment Platform in place, there was an immediate increase in operational efficiency. With the EEP PagerDuty integration, their alerts are funneled to EEP, classified as actionable or noise, enriched with remediation information, and then sent on to PagerDuty, which handles guaranteed-delivery alerting to their Ops team.
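To make the "actionable or noise" step concrete, here is a minimal sketch of central classification. It is not how EEP is implemented; the `Event` fields, the `SUPPRESSED` rule set, and the sample events are all hypothetical and only illustrate keeping suppression rules in one place instead of in each monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str   # monitoring system that emitted the event (hypothetical field)
    check: str    # name of the check that fired (hypothetical field)
    message: str


# Hypothetical central suppression rules: (source, check) pairs judged to be noise.
SUPPRESSED = {
    ("nagios", "disk_latency_warn"),
    ("cron", "job_started"),
}


def classify(event: Event) -> str:
    """Return 'noise' for suppressed (source, check) pairs, otherwise 'actionable'."""
    if (event.source, event.check) in SUPPRESSED:
        return "noise"
    return "actionable"


def funnel(events):
    """Keep only actionable events; these are the ones that would go on to paging."""
    return [e for e in events if classify(e) == "actionable"]


# Illustrative sample data, not Virtuoso's real events.
events = [
    Event("nagios", "disk_latency_warn", "latency above threshold"),
    Event("nagios", "http_5xx", "checkout returning 503"),
    Event("cron", "job_started", "nightly export started"),
]
actionable = funnel(events)
print(len(events), "received,", len(actionable), "forwarded")
```

Because the rules live in a single set, adding one entry suppresses that event type for every monitoring source at once.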
Step Two: Enrich and Customize Your Alerts
They added enrichments, which are specific resolution steps, to the original alert so that anyone responding to an incident had the information needed to triage and address the problem. They routed these now-enriched alerts via EEP notifiers to specific PagerDuty services. The enrichment steps appeared in the PagerDuty incident, which also included a link back to the incident in the EEP with full detail.
Clicking “acknowledge” or “resolve” for an alert in EEP automatically performed that action in PagerDuty. Similarly, responding to the PagerDuty SMS or Mobile App alert would acknowledge or resolve the event in EEP.
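As a rough illustration of enriching and routing, the sketch below attaches remediation steps to an alert and builds a trigger event shaped like a PagerDuty Events API v2 request body. The lookup tables, the placeholder routing keys, the `example.com` link, and the `build_pagerduty_event` helper are assumptions for illustration; EEP's own notifier format is not described here, and the field names should be checked against the current PagerDuty documentation.

```python
import json

# Hypothetical runbook steps keyed by check name.
REMEDIATIONS = {
    "http_5xx": [
        "Check the load balancer health page",
        "Restart the web pool if most nodes are failing",
    ],
}

# Hypothetical mapping from team to PagerDuty service integration key.
ROUTING_KEYS = {
    "web": "WEB_SERVICE_INTEGRATION_KEY",
    "database": "DB_SERVICE_INTEGRATION_KEY",
}


def build_pagerduty_event(check, source, summary, team, detail_url):
    """Build (but do not send) an enriched trigger event for one PagerDuty service."""
    return {
        "routing_key": ROUTING_KEYS[team],   # selects the target service
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": "critical",
            # Enrichment: resolution steps travel with the alert.
            "custom_details": {"remediation_steps": REMEDIATIONS.get(check, [])},
        },
        # Link back to the full event detail (placeholder URL).
        "links": [{"href": detail_url, "text": "Full event detail"}],
    }


event = build_pagerduty_event(
    check="http_5xx",
    source="web-01",
    summary="checkout returning 503",
    team="web",
    detail_url="https://example.com/events/123",
)
print(json.dumps(event, indent=2))
```

The sketch stops at building the body; actually sending it would be an HTTP POST to the PagerDuty events endpoint with a real integration key.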
Step Three: Set Up Weekly Event Triages
With a dramatically reduced event load and the remaining alerts enriched, they had the information they needed, when they needed it most. They saw a great opportunity to further reduce their noise and add remediations to their alerts by setting up weekly triages. From the EEP, they could download a list of all their recent incidents and assign each one to one of two categories: noise or actionable. The platform made it easy for them to quickly suppress large clusters of unnecessary events with EEP classifications.
Within a week, they had decreased their alert volume by 82%, and within three weeks by 94%.
With their daily alert count under a hundred, they continued their weekly triages and were able to decrease their daily alerts further. Currently, Virtuoso receives just a handful of incidents per day, only some of which require escalation and engagement with other teams. PagerDuty and EEP helped Virtuoso DevOps dramatically improve their situational awareness of their infrastructure.
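A weekly triage like this can be approximated with a few lines of scripting over an exported incident list. The sketch below assumes a hypothetical export of `(check, label)` rows, where the label is the category assigned during triage; it ranks the largest clusters of noise so the biggest ones can be suppressed first. The data is illustrative, not Virtuoso's.

```python
from collections import Counter

# Hypothetical export: (check name, category assigned during triage).
incidents = [
    ("disk_latency_warn", "noise"),
    ("disk_latency_warn", "noise"),
    ("job_started", "noise"),
    ("http_5xx", "actionable"),
    ("disk_latency_warn", "noise"),
    ("job_started", "noise"),
]


def noisiest_checks(rows, top=3):
    """Count incidents labeled 'noise' per check and return the largest clusters."""
    counts = Counter(check for check, label in rows if label == "noise")
    return counts.most_common(top)


suppression_candidates = noisiest_checks(incidents, top=2)
for check, count in suppression_candidates:
    print(f"{check}: {count} noise incidents this week")
```

Sorting by cluster size means each triage session targets the suppressions that remove the most events.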
“It’s a brilliant system,” said Shawn. “It takes your business to the next level, and works with all these other partners out there.”
Now, thanks to the time that PagerDuty and EEP freed up for them, Virtuoso has significantly shortened their build, deployment, and release cycles. Focusing only on the events that matter has enabled the Operations team to be very successful in their infrastructure optimization initiatives.
“Now we can really apply DevOps philosophies to our team,” he added. “We focus on automating our infrastructure, not sorting through alerts.”
“Because we were able to remove the noise, we now have much better telemetry for our servers, which allows us to better differentiate between server and code issues,” said Shawn. “We now remediate system problems much more rapidly and escalate to developers as needed for code level issues.”