The Importance of Severity Levels to Reduce MTTR
Guest blog post by Elle Sidell, Lukas Burkoň, and Jan Prachař of Testomato. Testomato offers easy automated testing that checks website pages and forms for problems that could damage a visitor’s experience.
We all know how important monitoring is for making sure our websites and applications stay error free, but that’s only one part of the equation. What do you do once an error has been found? How do you decide what steps to take next?
Rating the severity of an issue and making sure the right person is notified plays a big role in how quickly and efficiently problems get resolved. We’ve pulled together a quick guide about the importance of error severity and how to set severity levels that fit your team’s escalation policy.
What Are Severities and Why Are They Important?
In simple terms, the severity of an error indicates how serious an issue is depending on the negative impact it will have on users and the overall quality of an application or website.
For example, an error that causes a website to crash would be considered high severity, while a spelling error might (in some cases) be considered low.
Severity is important because it:
- Helps you reduce and control the amount of alerting noise.
- Makes the process of handling errors smoother.
- Improves how effectively and efficiently you resolve issues.
Having a severity alert process in place can help you prioritize the most crucial problems and avoid disturbing the wrong people with issues that are outside their normal area of responsibility.
On a larger scale, it makes decisions about what to fix easier for everyone.
How to Create Escalation Rules That Work for Your Team
Understanding the benefits of rating the severity of an incident is easy, but creating a severity process that works for your team can be tricky. There’s no silver bullet for this process. What works for you may not work for another team – even if it’s the same size or in the same industry.
How you choose to set up your severity levels can vary depending on how your team is organized, the project and its infrastructure, and the tools you use. So where do you start?
In our experience, there are 3 main things you need to think about when creating an escalation process:
- Severity structure
- Team organization structure
- Thresholds and their corresponding notification channel
Errors with higher severity will naturally require a more reliable notification channel. For example, you might choose to send an SMS using PagerDuty for a high severity error, while one that is considered minor may not trigger an alert to help reduce noise. Instead, you could choose to leave it as a notification from Testomato, which can be viewed by someone at a later time.
1) Severity Structure
One of the easiest ways to set up a severity structure is to identify the most critical parts of your website or application based on their business value.
For example, the most critical parts of an e-shop would be its product catalogue and its checkout. These are the features that would severely hurt the business if they stopped working, so issues affecting them need to be prioritized above all others.
Here’s one method we’ve found helpful for creating a severity structure:
- Create a list of the key features or content objects on your website or web application. (e.g. catalogue, checkout, homepage, signup, etc.). It’s a good idea to keep your list simple to make it easier to prioritize issues.
- Analyze your alert history and identify any common problems that may require a different severity level than you would normally assign (i.e. false timeouts may need to be marked as low severity, even though a timeout would be categorized somewhere higher on your scale).
- Decide on the levels you’d like to use for your scale (e.g. low, medium, high). You can add more levels depending on the size of your project and team.
- Once you have completed your list and analysis, estimate the severity level of each feature or content object, as well as any recurring errors that you found in your history.
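The result of these steps can be captured as a simple lookup table. Here is a minimal sketch in Python; the feature names, severity levels, and the reclassified `false_timeout` entry are illustrative examples, not part of any Testomato API.

```python
# A severity structure as a lookup table, built from the list of key
# features plus recurring errors reclassified after reviewing alert history.
# All names and levels here are hypothetical -- adapt them to your project.

SEVERITY = {
    # key features / content objects
    "checkout": "high",
    "catalogue": "high",
    "signup": "medium",
    "homepage": "medium",
    # recurring error found in the alert history, deliberately downgraded
    "false_timeout": "low",
}

def severity_of(item: str) -> str:
    """Return the assigned severity, defaulting to 'medium' for anything unlisted."""
    return SEVERITY.get(item, "medium")
```

Keeping the table small, as suggested above, makes it easy for the whole team to agree on what each entry means.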
There’s no right or wrong way to do this. The most important thing is to agree on how your team will classify specific incidents and make sure that everyone is on the same page.
2) Organization Structure
The next thing you’ll want to do is take a look at the structure of your team.
Having a clear understanding of how your team is structured and automating issue communication will help you define a more efficient flow of communication later on. For instance, team members responsible for your environment should be notified about issues immediately, while a project manager may only want to be kept in the loop for critical issues so they’re aware of possible problems.
Based on what we’ve seen with the project teams at Testomato, development teams are usually structured according to the following table:
| Company / Team Size | Team Management | Project Development | Monitoring |
| --- | --- | --- | --- |
| freelancer | client | one-person team | none / manually |
| small team* | CEO | a few developers | none / developer / admin |
| larger team | | a team of developers | none / a team of testers / a team of admins |
*A small team would generally be found in a web design agency or early stage startup.
For a more detailed structure, here’s a few more questions to keep in mind:
- Who needs to be part of the alert process?
- What are each person’s responsibilities when it comes to fixing an issue?
- At what point does an alert require that this role be brought into the communication loop?
3) Communication Structure
One of the hardest parts of working with severities can be putting together a communication structure, especially if you don’t have a strong idea of how alerts should flow through your team.
Think of it this way:
- Severity Structure: How serious is this problem?
- Organization Structure: Whose responsibility is it?
- Communication Structure: If X happens, how and when should team members be contacted?
The main goal of severity levels is to make sure the right people are aware of issues and help prioritize them. Setting a communication structure lets you connect different levels of your severity structure to roles from your organization and add more defined actions based on time sensitivity or error frequency. This way you can guarantee the right people are contacted using the proper channel that is required for the situation. If a responder is not available, there is an escalation path to ensure someone else on the team is notified.
Assigning notification channels and setting thresholds that correspond to your team organization means that problems are handled efficiently and only involve the people needed to solve them.
For example, if a critical incident occurs on your website, an admin receives a phone call immediately and an SMS is sent to the developer responsible for this feature at the same time. If the problem is not resolved after 10 minutes, the team manager will also receive a phone call.
On the other hand, a warning might only warrant an email for the team admin and any relevant developers.
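The escalation example above can be sketched as data plus one small function. The roles, channels, and timings below mirror the example but are purely illustrative; this is not a real PagerDuty configuration.

```python
# A sketch of a communication structure: for each severity, a list of
# (minutes_unresolved_threshold, role, channel) steps. Values are examples.

ESCALATION = {
    "critical": [
        (0, "admin", "phone"),          # immediately: phone call to the admin
        (0, "developer", "sms"),        # at the same time: SMS to the developer
        (10, "team manager", "phone"),  # after 10 unresolved minutes: call the manager
    ],
    "warning": [
        (0, "admin", "email"),
        (0, "developer", "email"),
    ],
}

def contacts_due(severity: str, minutes_unresolved: int):
    """Return the (role, channel) pairs whose time threshold has been reached."""
    return [
        (role, channel)
        for threshold, role, channel in ESCALATION.get(severity, [])
        if minutes_unresolved >= threshold
    ]
```

Encoding the structure this way makes the escalation path explicit and easy to review with the whole team.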
Within PagerDuty, you can create two Testomato services – one general and one critical – and match each service to the escalation policy it needs. If you have an SLA of 15 minutes for critical incidents, that escalation path will be tighter than the one for general incidents.
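Splitting incidents between two services comes down to choosing a routing key by severity. The sketch below builds a trigger event in the shape used by the PagerDuty Events API v2; the routing keys are placeholders, and the actual HTTP call to PagerDuty is omitted.

```python
# Route incidents to one of two PagerDuty services by severity.
# The keys below are placeholder integration keys, not real credentials.

GENERAL_KEY = "YOUR_GENERAL_ROUTING_KEY"
CRITICAL_KEY = "YOUR_CRITICAL_ROUTING_KEY"

def build_event(summary: str, severity: str) -> dict:
    """Pick the service by severity and build a PagerDuty trigger payload."""
    routing_key = CRITICAL_KEY if severity == "critical" else GENERAL_KEY
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": "testomato-monitor",  # hypothetical source name
            "severity": severity,
        },
    }
```

The critical service can then be attached to the tighter 15-minute escalation policy, while everything else flows through the general one.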
Here’s a basic overview of how we use severity levels at Testomato using both PagerDuty notifications and our own email alerts:
Team Members: manager, 2 admins (responsible for production), and 2-3 developers.
When errors occur on a project, we use the following process:
PagerDuty – SMS and Phone Call
- All errors are sent to PagerDuty.
- PagerDuty sends an SMS immediately to both admins.
- After 5 minutes, an admin is called according to the on-call schedule.
- After 15 minutes, the team manager is also called.
- Developers are not contacted by PagerDuty.
Testomato – Email
- Both errors and warnings are sent as Testomato email notifications to both admins and the developers.
- Warnings are only sent as emails.
- Developers are sent emails about both errors and warnings to stay informed about production problems.
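The two lists above can be summarized as a notification matrix: for each alert type and role, which channels are used. This is only a restatement of the process described, with illustrative names.

```python
# Notification matrix for the Testomato example above: (alert_type, role)
# maps to the channels used. PagerDuty phone escalation for admins and the
# manager happens on top of this, per the timed steps described earlier.

NOTIFICATIONS = {
    ("error", "admin"):       ["email", "sms"],
    ("error", "developer"):   ["email"],
    ("warning", "admin"):     ["email"],
    ("warning", "developer"): ["email"],
}

def channels_for(alert_type: str, role: str):
    """Return the channels a role receives for a given alert type."""
    return NOTIFICATIONS.get((alert_type, role), [])
```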
We hope you’ve found this post helpful! What severity alert process works best for your team?