Guest blog post by Elle Sidell, Lukas Burkoň, and Jan Prachař of Testomato. Testomato offers easy automated testing to check website pages and forms for problems that could potentially damage a visitor’s experience.
We all know how important monitoring is for making sure our websites and applications stay error free, but that’s only one part of the equation. What do you do once an error has been found? How do you decide what steps to take next?
Rating the severity of an issue and making sure the right person is notified plays a big role in how quickly and efficiently problems get resolved. We’ve pulled together a quick guide about the importance of error severity and how to set severity levels that fit your team’s escalation policy.
In simple terms, the severity of an error indicates how serious an issue is depending on the negative impact it will have on users and the overall quality of an application or website.
For example, an error that causes a website to crash would be considered high severity, while a spelling error might (in some cases) be considered low.
Severity is important because it:
Having a severity alert process in place can help you prioritize the most crucial problems and avoid disturbing the wrong people with issues that are outside their normal area of responsibility.
On a larger scale, it makes decisions about what to fix easier for everyone.
Understanding the benefits of rating the severity of an incident is easy, but creating a severity process that works for your team can be tricky. There’s no silver bullet for this process. What works for you may not work for another team – even if it’s the same size or in the same industry.
How you choose to set up your severity levels can vary depending on your team, the project and its infrastructure, the organization of your team, and the tools you use. So where do you start?
In our experience, there are 3 main things you need to think about when creating an escalation process:
Errors with higher severity will naturally require a more reliable notification channel. For example, you might choose to send an SMS using PagerDuty for a high severity error, while a minor error may not trigger an alert at all, which helps reduce noise. Instead, you could leave it as a notification in Testomato that someone can view later.
1) Severity Structure
One of the easiest ways to set up a severity structure is to identify the most critical parts of your website or application based on their business value.
For example, the most critical parts of an e-shop would be its product catalogue and its checkout. These are the features that would severely affect business if they were to stop working, so issues with them need to be prioritized before all others.
Here’s one method we’ve found helpful for creating a severity structure:
There’s no right or wrong way to do this. The most important thing to know is how your team will classify specific incidents and make sure that everyone is on the same page.
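As a rough illustration of classifying incidents by business value, here is a minimal Python sketch for the e-shop example. The severity names, the feature names other than the product catalogue and checkout, and the default for unknown features are all assumptions made for the example, not a scheme prescribed by Testomato or PagerDuty.

```python
from enum import Enum


class Severity(Enum):
    # Level names are illustrative; use whatever levels your team agrees on.
    CRITICAL = "critical"
    WARNING = "warning"
    INFO = "info"


# Hypothetical mapping for an e-shop: the features with the highest
# business value get the highest severity when they break.
FEATURE_SEVERITY = {
    "checkout": Severity.CRITICAL,
    "product_catalogue": Severity.CRITICAL,
    "search": Severity.WARNING,  # assumed example feature
    "blog": Severity.INFO,       # assumed example feature
}


def classify(feature: str) -> Severity:
    """Return the severity for a failing feature.

    Unknown features default to WARNING so they are still seen by someone.
    """
    return FEATURE_SEVERITY.get(feature, Severity.WARNING)


print(classify("checkout").value)
```

Keeping the mapping in one place gives the whole team a single reference for how an incident will be classified.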
2) Organization Structure
The next thing you’ll want to do is take a look at the structure of your team.
Having a clear understanding of how your team is structured and automating issue communication will help you define a more efficient flow of communication later on. For instance, team members responsible for your environment should be notified about issues immediately, while a project manager may only want to be kept in the loop for critical issues so they’re aware of possible problems.
Based on what we’ve seen with the project teams at Testomato, development teams are usually structured according to the following table:
|Company/Team Size||Team Management||Project Development||Monitoring|
|freelancer||client||one-person team||none / manually|
|small team*||CEO||a few developers||none / developer / admin|
| || ||a team of developers||none / a team of testers / a team of admins|
*A small team would generally be found in a web design agency or early stage startup.
For a more detailed structure, here are a few more questions to keep in mind:
3) Communication Structure
One of the hardest parts of working with severities can be putting together a communication structure, especially if you don’t have a strong idea about how alerts should flow through your team structure.
Think of it this way:
The main goal of severity levels is to make sure the right people are aware of issues and help prioritize them. Setting a communication structure lets you connect different levels of your severity structure to roles from your organization and add more defined actions based on time sensitivity or error frequency. This way you can guarantee the right people are contacted using the proper channel that is required for the situation. If a responder is not available, there is an escalation path to ensure someone else on the team is notified.
Assigning notification channels and setting thresholds that correspond to your team organization means that problems are handled efficiently and only involve the people needed to solve them.
For example, if a critical incident occurs on your website, an admin receives a phone call immediately and an SMS is sent to the developer responsible for this feature at the same time. If the problem is not resolved after 10 minutes, the team manager will also receive a phone call.
On the other hand, a warning might only warrant an email for the team admin and any relevant developers.
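The example above can be written down as a small escalation table. The following Python sketch is only an illustration: the `Step` structure, the role and channel names, and the `due_notifications` helper are hypothetical and do not reflect how PagerDuty or Testomato store escalation policies.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Step:
    after_minutes: int  # minutes the incident has been unresolved when this step fires
    role: str
    channel: str


# Illustrative policy mirroring the example in the text.
POLICY = {
    "critical": [
        Step(0, "admin", "phone"),
        Step(0, "developer", "sms"),
        Step(10, "manager", "phone"),  # escalate if still unresolved after 10 minutes
    ],
    "warning": [
        Step(0, "admin", "email"),
        Step(0, "developer", "email"),
    ],
}


def due_notifications(severity: str, minutes_unresolved: int) -> List[Tuple[str, str]]:
    """Return the (role, channel) pairs that should have been notified by now."""
    steps = POLICY.get(severity, [])
    return [(s.role, s.channel) for s in steps if s.after_minutes <= minutes_unresolved]


print(due_notifications("critical", 0))
print(due_notifications("critical", 10))
```

Writing the policy as data makes the thresholds easy to review with the team and to adjust as roles change.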
Within PagerDuty, you can create 2 Testomato services – one general and another that is critical – and match these services to the escalation policy needed. If you have SLAs of 15 minutes for critical incidents, that escalation path will be tighter than the one for general incidents.
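To make the two-service idea concrete, here is a hedged Python sketch that builds a trigger event in the shape used by the PagerDuty Events API v2 (`routing_key`, `event_action`, and a `payload` with `summary`, `source`, and `severity`). It assumes events are sent through that API, which this post does not state. The integration keys are placeholders, `build_event` is a hypothetical helper, and the sketch only builds the event body without sending anything.

```python
import json

# Placeholder integration keys, one per PagerDuty service (hypothetical values).
ROUTING_KEYS = {
    "critical": "CRITICAL_SERVICE_INTEGRATION_KEY",
    "general": "GENERAL_SERVICE_INTEGRATION_KEY",
}


def build_event(summary: str, source: str, critical: bool) -> dict:
    """Build a trigger event addressed to the critical or the general service."""
    service = "critical" if critical else "general"
    return {
        "routing_key": ROUTING_KEYS[service],  # selects the service, and so its escalation policy
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            # Assumed mapping: general incidents are reported as warnings.
            "severity": "critical" if critical else "warning",
        },
    }


event = build_event("Checkout form returns HTTP 500", "www.example.com", critical=True)
print(json.dumps(event, indent=2))
```

Because each service has its own escalation policy, choosing the routing key is what decides how aggressively an incident escalates.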
Here’s a basic overview of how we use severity levels at Testomato using both PagerDuty notifications and our own email alerts:
Team Members: manager, 2 admins (responsible for production), and 2-3 developers.
When errors occur on a project, the team is notified using the following process:
PagerDuty – SMS and Phone Call
Testomato – Email
We hope you’ve found this post helpful! What severity alert process works best for your team?