Imagine the frustration you feel when you’re writing something in Google Docs and you suddenly lose Internet connection. Or the panic you experience when you’re searching through your Notes app on your phone for one very particular note you typed on your computer about elephants in Djibouti so you can win the trivia game—and can’t find it.
From meeting notes to random trivia tidbits, Evernote’s job is to help people create, assemble, nurture, and share information. Our unique search capabilities allow people to find information when they need it, no matter the format it was stored in—whether in a note, image, PDF, or voice recording. Our product is a cross-platform software-as-a-service application designed to enable people to organize, personalize, consume, and share thoughts from any device at any time. We currently have over 220 million people using our product globally, and that number increases daily.
As the SRE Manager, I lead a team of site reliability engineers who are responsible for customer happiness by ensuring that our product works as intended. This means minimal downtime, and when downtime does happen, we need to act fast and resolve the issue as soon as possible.
This is where PagerDuty comes in: When I joined Evernote in 2012, we were using PagerDuty primarily for alerts and notifications, as well as on-call rotation scheduling. In 2016, we began a major evolution of our hosting infrastructure, which centered on migrating many workloads to Google Cloud Platform. By moving to the cloud, engineers were able to iterate and build services faster than ever before.
But with this increased agility came new challenges—namely, tracking key performance indicators that tie into our service-level objectives (SLOs), which we use internally to identify which incidents have the most negative impact on the customer journey.
For example, our customers care about how long it takes to open, write, and sync a note across their devices, so when any one of those actions experiences an issue, my team needs to know immediately and resolve the incident as quickly as possible. On the other hand, if one server goes down while eight others are still running, we’ll still receive an alert. But if it doesn’t affect our customers’ experience (and our SLO), then it probably isn’t a big deal and we can plan to address it later. PagerDuty helps with this by funneling all of our alerts and grouping them together so we can figure out what to prioritize, allowing us to look at things from the top of the funnel down rather than from the bottom up. Additionally, the platform’s advanced analytics capabilities give us a single source of truth for visibility into production issues.
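To make that prioritization idea concrete, here is a minimal Python sketch of sorting alerts by whether they put a customer-facing SLO at risk. It is only an illustration: the `Alert` class, its field names, the `triage` function, and the sample alerts are all hypothetical, and none of this is PagerDuty's API or Evernote's actual tooling.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """A hypothetical alert record; the field names are illustrative only."""
    source: str
    summary: str
    breaches_slo: bool  # True if a customer-facing SLO is at risk


def triage(alerts):
    """Split alerts into those to handle now and those that can wait."""
    urgent = [a for a in alerts if a.breaches_slo]
    deferred = [a for a in alerts if not a.breaches_slo]
    return urgent, deferred


# Made-up sample data mirroring the two cases described above.
alerts = [
    Alert("note-sync", "sync latency above target", breaches_slo=True),
    Alert("app-server-9", "host down, eight peers healthy", breaches_slo=False),
]

urgent, deferred = triage(alerts)
print([a.source for a in urgent])    # ['note-sync']
print([a.source for a in deferred])  # ['app-server-9']
```

In this sketch the single down server still produces an alert, but it lands in the deferred list because it does not threaten the SLO, while the sync problem is surfaced for immediate attention.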
As we continue to grow, we plan to expand our use of PagerDuty within the company, specifically by using the available postmortem templates and incident response plays to further automate our incident response process.
Garrett Plasky is SRE Manager at Evernote. His team is responsible for running Evernote’s production service infrastructure. See the full case study to learn more about Evernote’s story.