Verivox, one of Germany’s leading comparison sites for utilities, mobile, insurance, and more, serves over eight million consumers looking to compare prices and switch service providers. With so many customers relying on Verivox to provide them with accurate information, Verivox’s website must remain stable and reliable. And with competitors snapping at its heels, 13 development teams pushing out new features weekly, and its engineering teams dispersed across the country, the company needed a better way to scale and automate its digital operations in order to mitigate downtime.
In the past, Verivox relied on its site reliability engineering (SRE) team to manually review alerts and notify teams of incidents. However, the company’s alerting protocols routinely triggered invalid alerts, eating up resources and obscuring visibility into network health. Additionally, with one person on call after business hours (including weekends) for an entire week at a time, Verivox risked both staff burnout and missed alerts in the middle of the night.
From Manual to Automated
According to Waldemar Spitschak, Head of Site Reliability Engineering, “First and foremost, we needed PagerDuty to automate alerting.” With more than 200 integrations, PagerDuty made it easy for Verivox to connect the PagerDuty digital operations management platform to all of its monitoring tools (such as New Relic, Zabbix, and Amazon CloudWatch) across its entire hybrid production environment of databases, cloud applications, Windows and Linux servers, and more.
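For readers curious what connecting a monitoring tool can look like in practice, here is a minimal sketch of an alert event shaped for PagerDuty's Events API v2. It is not Verivox's setup: the routing key, host name, and summary text are hypothetical placeholders, and the helper `build_trigger_event` is invented for illustration. The block only builds and prints the JSON body; actually sending it would mean POSTing it to the Events API endpoint with a real integration key.

```python
import json

# Events API v2 endpoint that would receive the event (no request is sent in this sketch).
EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"

# Severity levels accepted by the Events API v2 payload.
ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key, summary, source, severity="critical", dedup_key=None):
    """Return a dict shaped like an Events API v2 'trigger' event.

    Hypothetical helper for illustration; all argument values are supplied by the caller.
    """
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"severity must be one of {sorted(ALLOWED_SEVERITIES)}")
    event = {
        "routing_key": routing_key,      # integration key of the target service
        "event_action": "trigger",       # open (or update) an alert
        "payload": {
            "summary": summary,          # short human-readable description
            "source": source,            # the system that observed the problem
            "severity": severity,
        },
    }
    if dedup_key is not None:
        # Events sharing a dedup_key are grouped into the same alert.
        event["dedup_key"] = dedup_key
    return event


# Placeholder values, assumed for the example only.
example_event = build_trigger_event(
    routing_key="HYPOTHETICAL_ROUTING_KEY",
    summary="Disk usage above 90% on db-01",
    source="zabbix:db-01",
    severity="warning",
    dedup_key="db-01-disk-usage",
)
print(json.dumps(example_event, indent=2))
```

Reusing one `dedup_key` per underlying problem is one way a monitoring tool can avoid opening a fresh alert for every repeated check failure.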
PagerDuty automation enabled Verivox to better define and assign on-call roles. As a result, the company can immediately route issues to people who know how to fix them rather than force an intermediary to pick up the phone and track someone down. If the on-call team needs to add more resources to assist, they can run a response play to automatically tap the right people. “By eliminating manual interactions, PagerDuty has enabled our alerting process to take a huge step forward,” commented Spitschak. “And we’re no longer losing track of incidents that affect production.”
“We’re reacting to and resolving incidents faster than ever before, which is really important since our development cycle is so short,” he added.
Automation also evens out the peaks and valleys of Verivox’s seasonal workflow by standardizing the on-call process and enabling the company to better predict costs. With PagerDuty, on-call teams now deliver the same comprehensive coverage all year round, maintaining a consistent level of expertise outside the peak Q4 season as well.
Improved Visibility Shines a Light on Digital Operations
Using PagerDuty, Verivox now has a better understanding of incidents—Spitschak’s team can see the exact number of incidents per service and how quickly they’re resolved. The data helps them determine whether the platform is performing adequately or if a particular service is impacted. With PagerDuty’s rich API functionality, Verivox can generate different reports and alert mechanisms and set automated maintenance.
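As a rough illustration of the kind of report described here, the sketch below counts incidents per service and averages how long they took to resolve. It assumes incident records have already been exported into simple dictionaries; the field names (`service`, `created_at`, `resolved_at`) are a simplified, hypothetical shape rather than the exact API response, and the sample records are invented, not Verivox data.

```python
from collections import defaultdict
from datetime import datetime


def summarize_incidents(incidents):
    """Return per-service incident counts and mean minutes to resolve.

    Each record is a dict with 'service', 'created_at', and an optional
    'resolved_at' (ISO 8601 strings). Unresolved incidents are counted
    but left out of the resolution-time average.
    """
    stats = defaultdict(lambda: {"count": 0, "resolved": 0, "total_seconds": 0.0})
    for incident in incidents:
        entry = stats[incident["service"]]
        entry["count"] += 1
        if incident.get("resolved_at"):
            created = datetime.fromisoformat(incident["created_at"])
            resolved = datetime.fromisoformat(incident["resolved_at"])
            entry["resolved"] += 1
            entry["total_seconds"] += (resolved - created).total_seconds()

    summary = {}
    for service, entry in stats.items():
        if entry["resolved"]:
            mean_minutes = entry["total_seconds"] / entry["resolved"] / 60
        else:
            mean_minutes = None  # nothing resolved yet for this service
        summary[service] = {
            "incidents": entry["count"],
            "mean_minutes_to_resolve": mean_minutes,
        }
    return summary


# Invented sample records, for illustration only.
sample_incidents = [
    {"service": "checkout", "created_at": "2024-01-10T08:00:00+00:00",
     "resolved_at": "2024-01-10T08:30:00+00:00"},
    {"service": "checkout", "created_at": "2024-01-11T22:00:00+00:00",
     "resolved_at": "2024-01-11T23:00:00+00:00"},
    {"service": "search", "created_at": "2024-01-12T03:15:00+00:00",
     "resolved_at": None},
]

report = summarize_incidents(sample_incidents)
for service, numbers in sorted(report.items()):
    print(service, numbers)
```

A per-service breakdown like this makes it easier to tell whether a slowdown is platform-wide or confined to a single service.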
“We’re getting a more holistic view with PagerDuty. Before, we had to make decisions based on a gut feeling. With PagerDuty, we have a clearer picture of what’s going on in our production environment,” said Spitschak.
The increased transparency also helps Verivox improve the quality of monitoring and alerts. Because Verivox removed invalid, legacy alerts from PagerDuty, its monitoring is now in a much better place than before. And fewer alerts mean Verivox handles fewer incidents. “In the past, our alerting system was sending 10 to 20 times more emails than the on-call person needed to act on,” Spitschak shared. “Now the ratio is more like 1:1.”
The company plans to deploy PagerDuty throughout its organization and its parent company’s subsidiaries soon. “With PagerDuty, we get a much clearer view of the health of our production environment, and we’re looking into PagerDuty’s Operations Command Console and Operational Health Management Service,” said Spitschak.
While Verivox initially selected PagerDuty for its alerting features, the company is now using it to enhance other key dimensions of its digital operations management. And since getting more bang for the buck is what helps fast-growing companies like Verivox stay ahead in a competitive market, it also plans to use PagerDuty to define and measure key performance indicators.