PagerDuty Blog

Intelligent Service Design

Co-authored by Chris Bonnell, PagerDuty Data Scientist VI

Hello and welcome to the fourth post in our EI Architecture series focusing on Intelligent Alert Grouping. Previously we talked about how to train Intelligent Alert Grouping using incident merges and how to configure your alert titles to improve default matching. In this post, we’re going to cover how service design can impact your experience with Intelligent Alert Grouping, as well as with the PagerDuty app in general.

A little about services

Before you can dive into how to design, or re-design, your services, it’s important to have a service definition that works in your organization. The definition needs to be specific enough to be understood, but broad enough that multiple teams share the same understanding of what a service is in the abstract. At PagerDuty we use the following definition:

A service is a discrete piece of functionality that provides value and that is wholly owned by a team.

The owning team is important to know as they are the team that will build and maintain the service, and this includes responding to any incidents. For a recap on services and ownership, please see our Full Service Ownership Ops Guide on Defining a Service.

In addition to thinking about services and who owns them, you’ll also need to be mindful of service names. You should be able to skim the Service Directory and easily understand what each service is without any additional institutional knowledge. Put succinctly, it’s the difference between a service named “Payment Service” and a service you can only identify by knowing (or looking up in internal documentation) that all transactional services are named after Greek gods, and then working out which Greek god corresponds to the service that handles payment. We go into this in detail in the Naming Services section of the Full Service Ownership Ops Guide.

One last piece of knowledge about services before we continue: in the PagerDuty app, services are distinct from business services. Everything I’ve mentioned so far applies to services. You may also see them referred to as technical services in our documentation to prevent confusion with business services. Business services are aggregates of technical services or other business services, usually organized according to your business logic and/or stakeholders. Intelligent Alert Grouping only makes use of technical services, not business services, so when I refer to services throughout this post I’m only referring to technical services.

Granularity

Figuring out how to separate services is a non-trivial task, and there is no “one size fits all” solution for it. You’re essentially trading off more granularity against less. A highly granular use of services would be, for example, taking a single monitoring tool and breaking all of its component functions down into separate services in the PagerDuty application. On the other hand, a broader, less granular use of services would be defining all mobile applications as a single service even if iOS and Android development is handled by separate teams with separate responsibilities. This latter example also drives home why there can’t be a single recommendation for how to structure your services: some organizations will have a single mobile team, some will have no mobile team, and some will have separate mobile teams divided on different criteria.

So what can we do? We can abstract out some advice to help you navigate your service definitions. The first place to start is ownership. One of the key reasons to define services in the PagerDuty application is incident response. Whoever can actionably respond to issues with a service is, in effect, the owner of that service, so knowing who owns what in your organization can guide how you define services in PagerDuty. This is important because if your existing service structure does not follow a full service ownership model, but you do know what your desired escalation paths are, you can make use of that knowledge. When Intelligent Alert Grouping then groups the incidents, it does not matter that the services are related only by their desired escalation path: that relationship is the one that matters in this context, and it is the result you will get.
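As a rough illustration of starting from escalation paths, the sketch below groups a hand-written inventory of services by the path each one should escalate to. The service names, the path names, and the `escalation_path` field are hypothetical; this is plain Python over sample data, not a call to the PagerDuty API.

```python
from collections import defaultdict


def group_by_escalation_path(services):
    """Map each escalation path to the sorted names of the services that use it."""
    groups = defaultdict(list)
    for service in services:
        groups[service["escalation_path"]].append(service["name"])
    return {path: sorted(names) for path, names in groups.items()}


# Hypothetical inventory; in practice this would come from your own records.
services = [
    {"name": "Payment Service", "escalation_path": "Payments On-Call"},
    {"name": "Refund Service", "escalation_path": "Payments On-Call"},
    {"name": "iOS App", "escalation_path": "Mobile On-Call"},
]

shared_paths = group_by_escalation_path(services)
for path, names in shared_paths.items():
    print(f"{path}: {', '.join(names)}")
```

Services that land under the same path are candidates to review together, though sharing a path is not by itself a reason to merge them.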

You should also review your current projects, see which ones are effectively a functional unit, and define each of those as a single service in PagerDuty. This ensures that projects with the same escalation paths are still grouped correctly, while preventing the scenario where multiple projects are defined as a single service only because of their escalation path, which costs you visibility. Going a bit further, if you see two or more services that consistently have incidents in tandem, they may need to be aggregated into a single service. As a specific example, let’s say you have a monitoring microservice that has separate components for metrics, heartbeats, logs, and so on. If each of these has its own PagerDuty service, then an incident will most likely span all of these services simultaneously, and they should instead be combined into one service representing the monitoring microservice itself. On the other hand, if separate projects are defined as one service because the same people work on them, but they are separate entities, the service might not be granular enough. In this scenario it would be difficult to tell which of the entities in the service needs more attention than the others, as they are all one “thing” as far as the PagerDuty app is concerned.
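One way to spot services that “consistently have incidents in tandem” is to check how often an incident on one service has a nearby incident on another. The sketch below is a minimal heuristic under stated assumptions: the incident records, the 10-minute window, and the 0.8 threshold are all hypothetical choices for illustration, and this is not how Intelligent Alert Grouping itself works.

```python
from datetime import datetime, timedelta
from itertools import combinations


def tandem_pairs(incidents, window=timedelta(minutes=10), min_ratio=0.8):
    """Return (service_a, service_b, ratio) for pairs whose incidents usually co-occur.

    `incidents` is an iterable of (service_name, started_at) tuples.
    """
    by_service = {}
    for service, started_at in incidents:
        by_service.setdefault(service, []).append(started_at)

    flagged = []
    for a, b in combinations(sorted(by_service), 2):
        # Count incidents on each service that have a nearby incident on the other.
        a_hits = sum(any(abs(t - u) <= window for u in by_service[b]) for t in by_service[a])
        b_hits = sum(any(abs(t - u) <= window for u in by_service[a]) for t in by_service[b])
        # Use the weaker direction so both services must co-occur consistently.
        ratio = min(a_hits / len(by_service[a]), b_hits / len(by_service[b]))
        if ratio >= min_ratio:
            flagged.append((a, b, ratio))
    return flagged


# Hypothetical incident history.
t0 = datetime(2021, 1, 1, 9, 0)
incidents = [
    ("monitoring-metrics", t0),
    ("monitoring-logs", t0 + timedelta(minutes=2)),
    ("monitoring-metrics", t0 + timedelta(days=1)),
    ("monitoring-logs", t0 + timedelta(days=1, minutes=3)),
    ("payment-service", t0 + timedelta(hours=5)),
]

candidates = tandem_pairs(incidents)
print(candidates)
```

Pairs flagged this way are only candidates for aggregation; whether they form one functional unit is still a judgment call for the owning team.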

Where to go from here

Start taking a look at your existing, and possibly soon-to-be-created, services and check their:

  • Escalation paths
  • Names
  • Units of functionality

Then use these to guide how you define the services in the PagerDuty application. Intelligent Alert Grouping will take it from there, by grouping alerts under correlated services. If you’re designing services from scratch, or overhauling service definitions, take a look at our Full Service Ownership Ops Guide for best practices on naming and ownership for response, as well as our documentation on (technical) services and business services. In our next, and last, post of the series we’ll recap everything we’ve covered. Use the ei-architecture-series tag to take a look at any of the pieces in this series.