PagerDuty Blog

Set Up for Success: Service Taxonomies in PagerDuty

It’s 2:37 a.m. on a Tuesday night and you’re asleep—but it’s also your turn to be on call. You receive a phone call from PagerDuty. Your partner hits you with a pillow in an attempt to wake you up. It works. You groggily answer the call and hear your favorite robo-guy on the other end of the line:

Robo-Guy:
“PagerDuty Alert. You have 1 triggered alert on Service: Datadog. Press 2 to acknowledge. Press 4 to Escalate.”

“PagerDuty Alert. You have 1 triggered alert on Service: Datadog. Press 2 to acknowledge. Press 4 to Escalate.”

“PagerDuty Alert. You have—”

You press 2, then get out of bed as quietly as you can so that pillow doesn’t turn into a kick.

You log into PagerDuty and click on the incident that’s assigned to you. Since the incident was triggered on a service called “Datadog,” you assume the issue is related to something that Datadog caught. But, you wonder to yourself, I haven’t worked on anything Datadog-related for months, so why am I even on call for this service? This Datadog payload doesn’t give you much information, so you log into Datadog to take a look.

Which stack is this Datadog watching? West Coast data center? East Coast? Database? API?

Deep Sigh

After a few minutes of clicking, you find what’s broken. Now, all you have to do is switch over to PagerDuty and reassign the incident to the right team and you can go back to sleep!

So you go back to PagerDuty, click “Reassign” and the option of reassigning to a User or Escalation Policy comes up. Now, Escalation Policies (EP) should be named after Services or Teams, so that’s probably a safe bet. You look through the list of EPs, and you find:

  • Lisa’s Test EP
  • Offshore 24/7
  • Sprinkles and Unicorns
  • Leadership Team
  • Batman

Another Deep Sigh

Sound familiar?

As a Digital Insights Consultant, I work with companies of all sizes and verticals that use PagerDuty, and I’ve seen this scenario time and time again. Due to the flexibility of the platform, I can be working with 10 different companies and see 12 different PagerDuty configurations. A huge part of my role is advising current users on how to maximize their incident management workflow using PagerDuty, which I provide via Expert Services packages or our Operations Health Management Service.

Set Up for Success

When I was working with a multi-billion dollar entertainment company on maximizing their PagerDuty experience, one of the issues I ran into on that engagement was that their real-life teams were not synchronized with their teams in PagerDuty. There are many reasons for this phenomenon; for example, employees migrating between teams, or temporary project-based teams that aren’t removed when no longer relevant. If teams aren’t kept up-to-date in PagerDuty, responders are at risk of being woken up in the middle of the night for something they haven’t touched in weeks, months, or even years.

Another configuration issue that I run into is PagerDuty services named after the teams, not the business application services being monitored. That approach makes sense in a small company where one small team is responsible for an entire product. It also makes sense if the team has only worked on one product and the team is static. While that option may be viable in the beginning, the one-team-to-one-product structure simply doesn’t scale.

Good Practice

Best practices require a consistent taxonomy for your PagerDuty Teams, Schedules, Escalation Policies, and Services. Why is this important? Properly named services can shave crucial minutes off of incident response time by giving the responder context around what’s broken—making it easier to escalate incidents, bring in more subject matter experts (SMEs), and, most importantly, decrease the business impact of incidents.

Additionally, your asset taxonomy should be service-centric; doing so empowers you to clearly see which component of your business-critical service is causing the most problems.

So what exactly makes a well-named service? Here are some examples of badly named services:

  • Datadog
  • DevOps
  • AWS
  • Email Integration

And here are some service-centric examples for naming your services (a small code sketch after the list shows how to keep these names consistent):

  • Business Service-Software Service-Monitoring Tool
  • (Production/QA/Dev/Stg)-Business Service-Software Service-Monitoring Tool
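
A naming convention only helps if it’s applied consistently, so it’s worth generating and checking names programmatically instead of by hand. Here’s a minimal Python sketch of the hyphen-delimited format above; the component values and the optional environment prefix are illustrative examples, not an official PagerDuty schema.

```python
import re

def service_name(business_service, software_service, monitoring_tool, environment=None):
    """Compose a service-centric name like 'Production-Checkout-Payments API-Datadog'."""
    parts = [environment] if environment else []
    parts += [business_service, software_service, monitoring_tool]
    return "-".join(parts)

# Pattern for auditing existing names against the convention:
# [Env-]Business Service-Software Service-Monitoring Tool
SERVICE_NAME_RE = re.compile(
    r"^(?:(?:Production|QA|Dev|Stg)-)?"   # optional environment prefix
    r"(?P<business>[^-]+)-"
    r"(?P<software>[^-]+)-"
    r"(?P<tool>[^-]+)$"
)

def is_well_named(name):
    """True if a service name follows the hyphen-delimited taxonomy."""
    return SERVICE_NAME_RE.match(name) is not None

if __name__ == "__main__":
    print(service_name("Checkout", "Payments API", "Datadog", environment="Production"))
    for name in ["Datadog", "DevOps", "Production-Checkout-Payments API-Datadog"]:
        print(f"{name!r}: {'OK' if is_well_named(name) else 'rename me'}")
```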

Better Practice

A feature that PagerDuty provides (and that is seldom used) is the ability to name the integrations on your service. By default, the name of the integration is the monitoring tool. But when every team in your organization has a Datadog integration, how do you know what your team’s Datadog is monitoring? To help prevent confusion, I recommend naming an integration based on what it’s monitoring. For instance, Datadog integrations can be named more meaningfully:

  • Datadog-Component
  • Datadog-Application

Another integration nomenclature could be:

  • Monitoring Tool-Application-Component

Additionally, since PagerDuty can trigger alerts from any system that sends email, correctly naming your email integration is crucial. I suggest something along the lines of the format below; the audit sketch after the list flags integrations still carrying bare tool names:

  • Component-Monitoring Tool-Email
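
To find integrations that still carry default, tool-only names, you can audit them over the REST API. The sketch below is just that, a sketch under a few assumptions: it reads a v2 REST API key from a PAGERDUTY_API_KEY environment variable, it assumes the GET /services endpoint accepts the include[]=integrations parameter (check the API reference for your account), and pagination is omitted for brevity.

```python
import os
import requests

# Assumption: a v2 REST API key stored in the PAGERDUTY_API_KEY environment variable.
API_KEY = os.environ["PAGERDUTY_API_KEY"]
HEADERS = {
    "Authorization": f"Token token={API_KEY}",
    "Accept": "application/vnd.pagerduty+json;version=2",
}

def integrations_with_default_names():
    """Yield (service name, integration summary) pairs where the integration
    name looks like a bare tool name (no hyphen-delimited context)."""
    params = {"include[]": "integrations", "limit": 100}  # pagination omitted for brevity
    resp = requests.get("https://api.pagerduty.com/services",
                        headers=HEADERS, params=params, timeout=30)
    resp.raise_for_status()
    for service in resp.json()["services"]:
        for integration in service.get("integrations", []):
            summary = integration.get("summary", "")
            if "-" not in summary:  # e.g. just "Datadog" or "Email"
                yield service["name"], summary

if __name__ == "__main__":
    for svc, integ in integrations_with_default_names():
        print(f"{svc}: integration '{integ}' should say what it monitors")
```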

Best Practice

Most companies have a Service-Level Agreement (SLA) around their services, and PagerDuty’s Escalation Policies help them meet those SLAs by speeding response time. In this case, we recommend naming your Escalation Policies with the context of the service and team they belong to. For example:

  • Team-Application-Software Service-SLA min
  • Team-Application-Software Service-Prod/Stg/Dev

Using these formats provides context around which service is causing an incident, which team that service belongs to, and how soon you can expect someone to respond—all at a glance! It also gives NOC/Support teams, who sometimes file or escalate incidents by hand, the context they need to quickly find the right team to triage.
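
Because the escalation policy name now encodes team, application, and SLA, even a simple script or runbook tool can pull that context back out. Here’s a minimal sketch that parses the Team-Application-Software Service-SLA min format above; the example values are made up for illustration, and it assumes the first three segments contain no hyphens.

```python
from typing import NamedTuple, Optional

class EPName(NamedTuple):
    team: str
    application: str
    software_service: str
    sla_minutes: Optional[int]   # None when the last segment is Prod/Stg/Dev
    environment: Optional[str]   # None when the last segment is an SLA

def parse_ep_name(name: str) -> EPName:
    """Split a name like 'Payments-Checkout-Payments API-15 min' or
    'Payments-Checkout-Payments API-Prod' into its components."""
    team, application, software_service, last = name.split("-", 3)
    digits = "".join(ch for ch in last if ch.isdigit())
    if digits:
        return EPName(team, application, software_service, int(digits), None)
    return EPName(team, application, software_service, None, last)

if __name__ == "__main__":
    print(parse_ep_name("Payments-Checkout-Payments API-15 min"))
    print(parse_ep_name("Payments-Checkout-Payments API-Prod"))
```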

Schedules are made up of users, who usually belong to teams. Depending on how your organization is set up, you can either name schedules after the SMEs for that service or teams that support that service. For example:

  • Team Name-Service Name-Primary/Secondary
  • Service Name-Primary/Secondary
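
One way to keep all of these conventions aligned is to derive every name—service, escalation policy, and schedule—from a single definition of the team and service, so a rename in one place propagates everywhere. A minimal sketch, assuming the formats described above; the team and service values are illustrative, not real data:

```python
from dataclasses import dataclass

@dataclass
class ManagedService:
    """Single source of truth for the names used across PagerDuty objects."""
    team: str
    business_service: str
    software_service: str
    monitoring_tool: str
    environment: str = "Production"
    sla_minutes: int = 15

    def service_name(self) -> str:
        # (Production/QA/Dev/Stg)-Business Service-Software Service-Monitoring Tool
        return f"{self.environment}-{self.business_service}-{self.software_service}-{self.monitoring_tool}"

    def escalation_policy_name(self) -> str:
        # Team-Application-Software Service-SLA min
        return f"{self.team}-{self.business_service}-{self.software_service}-{self.sla_minutes} min"

    def schedule_name(self, tier: str = "Primary") -> str:
        # Team Name-Service Name-Primary/Secondary
        return f"{self.team}-{self.software_service}-{tier}"

if __name__ == "__main__":
    svc = ManagedService("Payments", "Checkout", "Payments API", "Datadog")
    print(svc.service_name())              # Production-Checkout-Payments API-Datadog
    print(svc.escalation_policy_name())    # Payments-Checkout-Payments API-15 min
    print(svc.schedule_name())             # Payments-Payments API-Primary
    print(svc.schedule_name("Secondary"))  # Payments-Payments API-Secondary
```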

Success!

At the conclusion of my engagement with this multi-billion dollar entertainment company, we executed the following:

  1. Unified two PagerDuty teams into one to better reflect their reality. This removed cruft and provided a single-pane-of-glass view of their team and notifications.
  2. Teased apart the confluence of integrations that all went into one service (NOT a best practice). We also named the new services after the business application and monitoring tool. Since there is now only one integration per service, we then applied Event Intelligence to the signals sent into PagerDuty. With Event Intelligence, the Time-Based Alert Grouping feature confidently groups all alerts that come in from the same tool for the same application within a two-minute window, which helps reduce non-actionable noise from alert storms (a toy sketch of that grouping idea follows this list). Responders can then quickly pinpoint the source of the error and act on the incident.
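
Time-Based Alert Grouping is a PagerDuty product feature, so the sketch below is only an illustration of the underlying idea—bucketing alerts from the same source that arrive within a two-minute window—not PagerDuty’s implementation. The sample data and source names are made up.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=2)  # group alerts from the same source within two minutes

def group_alerts(alerts):
    """alerts: iterable of (timestamp, source) tuples, e.g. all from one
    Datadog integration. Returns a list of groups (lists of alerts)."""
    groups = []
    open_windows = {}  # source -> (group, time the window opened)
    for ts, source in sorted(alerts):
        group, opened = open_windows.get(source, (None, None))
        if group is not None and ts - opened <= WINDOW:
            # Still inside the two-minute window this group opened: fold the alert in.
            group.append((ts, source))
        else:
            # First alert from this source, or the window expired: start a new group.
            group = [(ts, source)]
            groups.append(group)
            open_windows[source] = (group, ts)
    return groups

if __name__ == "__main__":
    t0 = datetime(2019, 6, 4, 2, 37)
    sample = [
        (t0, "Datadog-Checkout"),
        (t0 + timedelta(seconds=45), "Datadog-Checkout"),
        (t0 + timedelta(minutes=5), "Datadog-Checkout"),
    ]
    print([len(g) for g in group_alerts(sample)])  # [2, 1]
```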

At 2:37 a.m., the last thing you want to do is sift through your organization’s documentation. Mature operations teams have a standard taxonomy for their hosts and servers—and so should the platform that they use to orchestrate their major incident response.