  • Size: 75
  • Industry: Consumer Web
  • Location: San Francisco, CA
  • Customer Since: March 2011

Benefits With PagerDuty

  • Reliable and effective alerting
  • Easy-to-manage on-call scheduling and escalation policies
  • Tracks all incidents in one central location for trend analysis
  • iOS and Android apps make it easy to respond to incidents while on the go

StumbleUpon: Helping Users Discover with PagerDuty

StumbleUpon makes it easy to discover new and interesting pages from all corners of the Internet. Over 30 million people use the service to discover a vast collection of curated content that is personalized to their specific interests. With so many users accessing the service daily, StumbleUpon needs a reliable alerting system to minimize downtime.

Early StumbleUpon Challenges

To keep attracting users and advertisers, StumbleUpon must keep its website available at all times. The company uses Nagios and Pingdom for monitoring, but both of those systems lack reliable alerting. Whenever a server failed at StumbleUpon, one of these systems would send an email or SMS alert to the on-call engineer. These alerts were easy to miss if the engineer was on the road or asleep, and missed alerts sometimes led to website outages. For StumbleUpon, downtime can have serious financial costs. “We are an ad-supported company,” said Michael Hobbs, Operations Manager at StumbleUpon, “so anytime a user can’t get to our website we aren’t fulfilling our advertising content. It can be expensive.”

StumbleUpon relied on proactive alerting to catch problems before they reached users. However, the company had no way to track alerts across the different systems. This made it difficult to spot weak areas in its IT infrastructure.

The on-call schedule had its own issues. StumbleUpon was using a manually maintained system to keep track of on-call engineers, and the schedule became a stressful mess that was difficult to manage. Every time a substitution was needed, a manager had to manually copy the updated contact information from the schedule into each monitoring system. This laborious process was prone to mistakes and consumed far too much of the managers’ time. “Previously I have used a Google Calendar or a wiki, and had to update email addresses in the monitoring systems,” Hobbs said. “It was very painful.”

How Did PagerDuty Help?

StumbleUpon turned to PagerDuty for a solution to these problems. PagerDuty’s broad range of notification methods helped improve the mean time to response. With PagerDuty, engineers can be contacted by SMS, email, phone calls to multiple numbers, and iOS or Android push notifications. Each engineer decides how they will be alerted and at what time intervals. Clear escalation policies pass the alert along if the first person is unreachable, so that every alert gets a response. “We had one guy who was a really heavy sleeper,” said Hobbs. “He couldn’t find an SMS sound that was loud enough to wake him so he would have PagerDuty call him four times to make sure he wouldn’t miss any alerts.”
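The escalation behavior described above can be pictured with a small sketch. This is an illustrative model only, not PagerDuty's API or implementation: the `Responder` class, the `escalate` function, and the contact-method labels are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Responder:
    name: str
    # Contact methods in the order this responder wants them tried
    # (hypothetical labels, not real PagerDuty settings).
    methods: list[str]


def escalate(policy: list[Responder],
             acknowledges: Callable[[Responder, str], bool]) -> Optional[str]:
    """Try each responder's methods in order; return who acknowledged first."""
    for responder in policy:
        for method in responder.methods:
            if acknowledges(responder, method):
                return responder.name
    return None  # every level was tried and nobody acknowledged


# A heavy sleeper asks for an SMS followed by four phone calls.
policy = [
    Responder("primary", ["sms", "phone", "phone", "phone", "phone"]),
    Responder("secondary", ["push", "phone"]),
]

# Assumed scenario: the primary misses everything, the secondary picks up the phone.
reached = escalate(policy, lambda r, m: r.name == "secondary" and m == "phone")
print(reached)
```

In this scenario the alert moves past the primary engineer only after every one of their preferred methods has been tried, which is the point of pairing personal notification preferences with an escalation policy.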

“PagerDuty just makes things easier.”

StumbleUpon uses PagerDuty’s iOS app to contact engineers about in-depth issues when they are on the road or without WiFi. “I really like the iOS app,” Hobbs said. “The ability to acknowledge, escalate, and resolve all incidents while away from the office is really nice.”

“PagerDuty is one of those companies that does what it does without any flaws.”

PagerDuty provides a sophisticated, proactive alerting system. “It’s amazing that PagerDuty keeps track of the root cause of an incident and has a central place collecting data from all our monitoring systems,” said Hobbs. This data gives StumbleUpon’s engineers the information they need to spot recurring trends and prevent downtime.

StumbleUpon also easily integrated its homegrown monitoring systems directly into PagerDuty. “If someone spots a problem that one of our monitoring systems didn’t pick up, they can shoot a message to our emergency email address and it instantly jumps into PagerDuty so we never miss anything,” said Hobbs.

“PagerDuty makes it so we don’t have to worry about scheduling and we can focus on other aspects of our work.”

On-call schedule changes are now a breeze for StumbleUpon’s managers. A manager enters the change once in PagerDuty, and the schedule adjusts automatically. PagerDuty’s calendar clearly displays who is on-call, how they can be reached, and the escalation policies that will be used if the original on-call engineer is unresponsive.

PagerDuty’s escalation policies allow different types of incidents to be sent to StumbleUpon’s on-call teams: DevOps, operations, and general engineers. When an incident occurs, an alert no longer needs to be sent to one person who must manually escalate it to the right engineer. Instead, the alert is sent directly to the correct person, reducing mean time to resolution and relieving StumbleUpon’s fear of missed alerts.
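As a rough illustration of this kind of routing, the sketch below maps an incident type straight to an on-call team. The team names come from the text above; the incident types, the `ROUTES` table, and the `route` function are assumptions made for the example and do not reflect StumbleUpon's actual configuration or PagerDuty's API.

```python
# Hypothetical mapping from incident type to the on-call team that should get it.
ROUTES = {
    "deploy": "DevOps",
    "server": "operations",
}

# Anything without a dedicated route goes to the general engineering rotation.
DEFAULT_TEAM = "general engineers"


def route(incident_type: str) -> str:
    """Return the team that should be alerted for this incident type."""
    return ROUTES.get(incident_type, DEFAULT_TEAM)


for incident_type in ("server", "deploy", "checkout-bug"):
    print(incident_type, "->", route(incident_type))
```

Because the lookup happens when the incident is created, no one has to receive the alert first and forward it by hand.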