March 26, 2019
Congratulations! You’ve just purchased PagerDuty, meaning you’ve decided to make an investment in your incident management process. However, in order to maximize your investment, you will need to understand all the moving pieces within PagerDuty.
Today, we’ll be setting up PagerDuty for one team: the Bikini Bottom Team. I’ll walk you through how to set up Users, Teams, Schedules, Escalation Policies, and Services, and highlight why they are important to the incident response process.
This gold-standard Bikini Bottom Team can be used as a template to roll out PagerDuty to the rest of your organization. (Please be sure to read the entire article before configuring assets in your PagerDuty account. Trust me, it’ll save a lot of tearing-your-hair-out moments.)
“You don’t need a license to drive a sandwich.”
– Patrick Star, Bikini Bottom
The first step in setting up your PagerDuty instance is to decide which users need PagerDuty access. There are two different types of user licenses: user and stakeholder.
User licenses are for the folks who will be on call, paged for incidents, and expected to post updates or resolve these incidents. Their managers will also require a user license in PagerDuty. So for the Bikini Bottom Team, SpongeBob, Patrick Star, and Squidward, who will be responding to incidents, all need user licenses. Their manager, Mr. Krabs, will also need a user license to set up the applications he wants his team to respond to.
Stakeholder licenses are read-only licenses where the individual can receive updates about incidents, but they are not expected to participate in the remediation process. For example, Gary the Snail is an executive that the Bikini Bottom team reports to, so he will only need a stakeholder license because he’s not participating in incident triage. Gary the Snail just needs visibility into the status of the incident and the actions of the Bikini Bottom Team.
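The licensing decision above boils down to two questions per person. Here is a minimal Python sketch of that rule. The `License` enum and the `license_for` helper are illustrative names of my own, not part of any PagerDuty API.

```python
from enum import Enum


class License(Enum):
    USER = "user"
    STAKEHOLDER = "stakeholder"


def license_for(responds_to_incidents: bool, manages_responders: bool) -> License:
    """Apply the rule from the text: responders and their managers get a
    user license; everyone else who only needs visibility is a stakeholder."""
    if responds_to_incidents or manages_responders:
        return License.USER
    return License.STAKEHOLDER


# (responds_to_incidents, manages_responders) for each person in the example.
people = {
    "SpongeBob": (True, False),
    "Patrick Star": (True, False),
    "Squidward": (True, False),
    "Mr. Krabs": (False, True),
    "Gary the Snail": (False, False),
}

licenses = {name: license_for(*flags) for name, flags in people.items()}
```

Running this over your own roster is a quick way to estimate how many licenses of each type you need.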
Great, we’ve determined who needs to be in PagerDuty! Now it’s time to consider what base role each user should have.
Base roles determine the access a user has to PagerDuty account-level settings.
A PagerDuty best practice is to set all users’ base roles as “Observer,” and then grant greater access according to their team roles (more about that in the next section). Those who aren’t added to a team, or haven’t requested to be added to a team, may not need PagerDuty access.
Additionally, the PagerDuty admin team should be granted “Global Admin” access so they can make changes to the account as roles change, like removing team managers who no longer need PagerDuty access.
|Role|All Teams|All Schedules|All Escalation Policies|All Services|All Incidents|All Users|
|---|---|---|---|---|---|---|
|Global Admin|Edit access|Edit access|Edit access|Edit access|Edit access|Edit access|
|Manager|Edit access|Edit access|Edit access|Edit access|Edit access|View Only|
|Responder*|View Only|View Only|View Only|View Only|Edit access|View Only|
|Observer*|View Only|View Only|View Only|View Only|View Only|View Only|
|Restricted Access*|No Access|No Access|No Access|No Access|No Access|No Access|
Sample breakdown of access rights given to various roles in an organization.
*Roles with asterisks can be granted greater access with team roles.
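If you want to reason about the table above in a script (say, while auditing a list of users), it can be encoded as a simple lookup. This is only a sketch of the sample breakdown shown here. The asset keys and the `can_edit` helper are names I made up, and it ignores any extra access granted through team roles.

```python
# Column order matches the table: one entry per asset type.
ASSETS = ("teams", "schedules", "escalation_policies", "services", "incidents", "users")

# One row per base role, copied from the sample breakdown in the table.
BASE_ROLE_ACCESS = {
    "Global Admin": ("edit", "edit", "edit", "edit", "edit", "edit"),
    "Manager": ("edit", "edit", "edit", "edit", "edit", "view"),
    "Responder": ("view", "view", "view", "view", "edit", "view"),
    "Observer": ("view", "view", "view", "view", "view", "view"),
    "Restricted Access": ("none",) * 6,
}


def access_level(base_role: str, asset: str) -> str:
    """Return 'edit', 'view', or 'none' for a base role and asset type."""
    return BASE_ROLE_ACCESS[base_role][ASSETS.index(asset)]


def can_edit(base_role: str, asset: str) -> bool:
    return access_level(base_role, asset) == "edit"
```

For example, `can_edit("Responder", "incidents")` is true while `can_edit("Manager", "users")` is false, matching the table.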
In my asset taxonomy blog post, I discussed the importance of naming teams within PagerDuty to reflect real-life teams. If your PagerDuty teams are up to date with “real-world” teams, it will prevent the wrong person from being paged. For on-call responders, waking up in the middle of the night is part of the job. But waking up for an alert that you no longer manage is how souls die. Think of it this way: every time an alert gets sent to the wrong responder, the ocean temperature increases by 1 degree. (Too soon?)
So, team roles. There are three different team roles available for each user: Observer, Responder, and Manager. Observers on private teams will have visibility (but no edit rights) to the team’s assets. Team responders will be able to acknowledge, resolve, and schedule overrides only for incidents assigned to them. Team managers will be able to add and remove users, schedules, escalation policies, and services only for the assets on their team.
For the Bikini Bottom team, Mr. Krabs will be assigned a team manager role and SpongeBob, Patrick, and Squidward will be assigned the team responder role.
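To see how an “Observer” base role combines with a team role, here is a deliberately simplified Python sketch. It assumes the more permissive of the two roles applies within that team, which is my reading of “grant greater access according to their team roles” and not a statement of PagerDuty’s exact permission logic. `effective_role` is a hypothetical helper.

```python
from typing import Optional

# Least to most permissive, for the three team roles named in the text.
RANK = {"Observer": 0, "Responder": 1, "Manager": 2}


def effective_role(base_role: str, team_role: Optional[str] = None) -> str:
    """Assumed model: on a given team, the higher of base role and team role wins."""
    if team_role is None:
        return base_role
    return max(base_role, team_role, key=RANK.__getitem__)


# Everyone starts as an Observer; team roles come from the example above.
bikini_bottom_team = {
    "Mr. Krabs": "Manager",
    "SpongeBob": "Responder",
    "Patrick": "Responder",
    "Squidward": "Responder",
}

on_team = {name: effective_role("Observer", role) for name, role in bikini_bottom_team.items()}
```

Someone with no team role at all, like a user who was never added to a team, simply keeps the “Observer” base role.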
PagerDuty’s schedule feature allows you to set up your on-call rotation and pattern once; we handle the rest. In the schedule console, you can set up which users go on call, when they go on call, in what order they go on call, and which hours of the day someone is on call for, to name a few benefits. You should name the schedule with as much context as you can, taking into account the following:
For example, Patrick and Squidward specialize in “Lettuce Maintenance,” so SpongeBob doesn’t need to be part of the “Lettuce” service schedule. This schedule would be named “Bikini Bottom Team|Krabby Patty|Lettuce Rotation,” a subset of the “Bikini Bottom Team Schedule.” The name alone tells you the schedule is specific to “Lettuce Maintenance,” and clicking into it shows you who the SMEs are for “Lettuce”-related issues.
Additionally, DevOps best practices hold that only responders who can take action on an alert should be on the on-call rotation for that particular service. So in this case, though you can put someone like Plankton (who’s not part of the Krabby Patty team) on the schedule, you shouldn’t, because he can’t take action on any alerts anyway.
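Both ideas, the naming pattern and the “only people who can act” rule, are easy to sketch in Python. The helper names and the `SKILLS` mapping below are hypothetical. SpongeBob’s “Buns” specialty in particular is an assumption for illustration.

```python
def schedule_name(team: str, service: str, specialty: str) -> str:
    """Build a name in the 'Team|Service|Specialty Rotation' pattern used above."""
    return f"{team}|{service}|{specialty} Rotation"


# Who can actually take action on which kind of alert (illustrative data).
SKILLS = {
    "SpongeBob": {"Buns"},  # assumption: not stated in the article
    "Patrick": {"Lettuce"},
    "Squidward": {"Lettuce"},
    "Plankton": set(),  # not on the Krabby Patty team, so nothing he can act on
}


def eligible_responders(candidates, specialty):
    """Keep only candidates who can take action on alerts for this specialty."""
    return [person for person in candidates if specialty in SKILLS.get(person, set())]


name = schedule_name("Bikini Bottom Team", "Krabby Patty", "Lettuce")
rotation = eligible_responders(["SpongeBob", "Patrick", "Squidward", "Plankton"], "Lettuce")
```

Here `name` comes out as the schedule name from the example, and `rotation` contains only Patrick and Squidward.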
Here’s a test scenario about schedules—pick the best response!
Sandy Cheeks is on call for Krabby Patty Service. An alert comes in, the server’s on fire, she has no idea what to do, so she wakes up Squidward to fix it. What’s your reaction?
A) This is perfectly fine—the more the merrier in incident response.
B) Sandy should not be on the rotation for this service.
C) Sandy should be trained on incident triage for the Krabby Patty Service before she’s put on the schedule.
D) Sandy should go test the ocean temperature after each notification.
If you picked C, then you are correct!
When Sandy has to call Squidward to fix an issue in your environment, then Squidward is your single point of failure. Remember: Your application stack is only as resilient as your single points of failure.
If you have SLAs or SLOs to adhere to, the Escalation Policy (EP) is your best friend. An EP determines who to notify when an alert comes into a Service.
Let’s say Mr. Krabs defines the SLA for the Krabby Patty Service where an engineer is: 1) required to respond to an incident within 30 minutes and 2) required to resolve said incident within two hours.
For the best chances of meeting his SLAs, Mr. Krabs should set up the EP as follows:
Here’s what this EP will look like in PagerDuty:
However, not all services are created equal. The Krabby Patty Service has a two-hour SLA, but other, less urgent services allow responders more time before they escalate to the next responder. In the example below for the Muscle Beach Service, Patrick has two hours to respond before the incident is escalated to Mr. Krabs.
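One way to sanity-check an EP against a response SLA is to add up the escalation delays and see when the last level gets notified. The sketch below does that in Python. The `Rule` class and both helpers are illustrative, and the 10-minute Krabby Patty delay is a made-up value. Only the 30-minute response SLA and Patrick’s two-hour window come from the examples above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Rule:
    target: str
    # Minutes to wait for an acknowledgement before escalating to the next
    # level; None for the final level, which has nowhere left to escalate.
    escalate_after_min: Optional[int] = None


def minutes_until_last_level_notified(rules: List[Rule]) -> int:
    """Sum the delays of every level except the last one."""
    return sum(rule.escalate_after_min for rule in rules[:-1])


def fits_response_sla(rules: List[Rule], response_sla_min: int) -> bool:
    """True if every level has been notified before the response SLA expires."""
    return minutes_until_last_level_notified(rules) < response_sla_min


# Hypothetical Krabby Patty EP: the 10-minute delay is an assumption.
krabby_patty_ep = [Rule("Krabby Patty on-call schedule", 10), Rule("Mr. Krabs")]

# Muscle Beach EP from the text: Patrick has two hours before Mr. Krabs is paged.
muscle_beach_ep = [Rule("Patrick", 120), Rule("Mr. Krabs")]
```

With these numbers, the Krabby Patty EP reaches Mr. Krabs after 10 minutes and fits a 30-minute response SLA, while the Muscle Beach EP would not, which is fine for a less urgent service.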
Setting Up Services in PagerDuty
Hang on tight, we’ve almost come full circle! The first thing you want to do after getting access to your PagerDuty instance is to determine what you’d like PagerDuty to send alerts on—in other words, what services do you want to be alerted for?
So why am I talking about services last?
Fair question. But there’s a method behind my madness: In order to be able to set up a Service, you need to first set up your Users, Teams, Schedules, and Escalation Policies.
In a nutshell, the Bun Burner Monitoring Tool sends events to a Service in PagerDuty (Krabby Patty|Buns), which has an Escalation Policy. The Escalation Policy’s schedule determines who to notify. In this case, Buns burn quickly, so the escalation policy has an SLA of 5 minutes, and users from Team Bikini Bottom are on the schedule, as visualized below:
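For a sense of what “sends events to a Service” looks like on the wire, here is a Python sketch that builds a trigger event in the shape used by PagerDuty’s Events API v2. The routing key, summary, and source values are placeholders, and `build_trigger_event` is my own helper. Check the current Events API reference before relying on the field list. Nothing is actually sent.

```python
import json

# Events API v2 endpoint; a monitoring tool would POST the JSON body here.
EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"
ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key, summary, source, severity="critical"):
    """Build (but do not send) a trigger event for a Service's integration."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity!r}")
    return {
        "routing_key": routing_key,  # integration key of the target Service
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
        },
    }


event = build_trigger_event(
    routing_key="YOUR_INTEGRATION_KEY",  # placeholder, not a real key
    summary="Buns are burning",  # illustrative alert text
    source="bun-burner-monitoring-tool",  # hypothetical source name
)
body = json.dumps(event)
```

The routing key is what ties the event to one specific Service, which in turn decides which Escalation Policy and schedule get involved.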
Which brings me to the next question: How do you determine which Services to set up in PagerDuty?
First, you need to figure out what’s important to you.
Here are a few options to narrow down what’s most important:
Ideally, you should break your Services down as granularly as possible. For example, if you have a Krabby Patty application, you’d want to create a Service for:
So when it’s 2 a.m. and SpongeBob receives an alert from the Buns Service, it’s much more informative than receiving an alert from the Krabby Patty Service, enabling him to respond and resolve the incident faster to meet SLAs.
Now that you’ve decided which Services you need to create in PagerDuty, you can work backwards and decide on their EPs and who to put on the schedule for each Service. This will then determine who needs a user license in PagerDuty.
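The “work backwards” step can be sketched as a walk from Services to EPs to schedules to people. In the Python below, the EP names, the “Krabby Patty|Lettuce” Service, and the exact schedule memberships are assumptions for illustration. Only the team members and the naming pattern come from the article.

```python
# Service -> Escalation Policy (EP names are hypothetical).
SERVICE_TO_EP = {
    "Krabby Patty|Buns": "Buns EP",
    "Krabby Patty|Lettuce": "Lettuce EP",
}

# EP -> ordered escalation levels, each either a schedule or a single user.
EP_LEVELS = {
    "Buns EP": [("schedule", "Bikini Bottom Team Schedule"), ("user", "Mr. Krabs")],
    "Lettuce EP": [
        ("schedule", "Bikini Bottom Team|Krabby Patty|Lettuce Rotation"),
        ("user", "Mr. Krabs"),
    ],
}

# Schedule -> people in the rotation.
SCHEDULE_MEMBERS = {
    "Bikini Bottom Team Schedule": ["SpongeBob", "Patrick", "Squidward"],
    "Bikini Bottom Team|Krabby Patty|Lettuce Rotation": ["Patrick", "Squidward"],
}


def users_needing_license(services):
    """Collect everyone who can be paged for the given Services."""
    needed = set()
    for service in services:
        for kind, name in EP_LEVELS[SERVICE_TO_EP[service]]:
            if kind == "schedule":
                needed.update(SCHEDULE_MEMBERS[name])
            else:
                needed.add(name)
    return needed


lettuce_only = users_needing_license(["Krabby Patty|Lettuce"])
everyone = users_needing_license(list(SERVICE_TO_EP))
```

Anyone who never shows up in the result is a candidate for a stakeholder license, or for no PagerDuty access at all.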
While Patrick’s mind may be an enigma, your incident management process doesn’t have to be. Just by setting up PagerDuty correctly, you’ve identified:
This is the bare minimum to set up PagerDuty. We have a plethora of features that further mature your digital operations management, such as Modern Incident Response, Event Intelligence, Analytics, and Visibility. Check back in the future for more best practices to learn how you can get the most out of your PagerDuty investment!