Monitoring Best Practices Learned from IT Outages

by Vivian Au September 30, 2014 | 5 min read

Guest post by Alexis Lê-Quôc, co-founder and CTO of Datadog. Datadog is a monitoring service for IT, Operations and Development teams who want to turn the massive amounts of data produced by their apps, tools and services into actionable insight.

At Datadog we eat our own… dogfood. We track hundreds of thousands of metrics internally. Learning what to alert on and what to monitor has taken us some time. Not all metrics are created equal, and we have come up with a simple way to manage them, which anyone can master. Here’s how we do it.

Monitoring goals

Why would you spend time getting better monitoring?

To know about an issue before your customers or your boss
To know how your systems & applications are performing
To minimize your stress level

Classifying metrics

What kind of metrics does your monitoring tool track? Examples include CPU utilization, memory utilization, and database or web requests. That’s a lot of different types of metrics, but they can be divided into two fundamental classes: work and resource.

Work metrics
A work metric measures how much useful stuff your system or application is producing. For instance, we could look at the number of queries that a database is responding to or the number of pages that a web server is serving per second. The purpose of a database is to answer queries. The purpose of a web server is to serve pages. So these are appropriate work metrics.

Another work metric is how much money your application is producing. That’s a very useful work metric for tracking availability and understanding the effectiveness of your application and infrastructure.

Resource metrics
The other class is resource metrics. A resource is something that is used to produce work, so a resource metric measures how much of something is consumed to produce that work. When you ask the question, “how much CPU am I consuming in the database?” the answer doesn’t really say much about whether that’s useful or not. It just says, “Well, I have more CPU available” or “My CPU is completely maxed out.” The same goes for memory, disk, network and so on. In general, I’ve used resource metrics for capacity planning rather than for availability management.
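
As a rough illustration of the two classes, here is a minimal Python sketch that tags a handful of metrics as work or resource. The metric names, the `MetricKind` enum and the `names_of_kind` helper are all hypothetical; they only mirror the examples in the text and are not part of any monitoring tool’s API.

```python
from dataclasses import dataclass
from enum import Enum


class MetricKind(Enum):
    WORK = "work"          # measures useful output (queries answered, pages served)
    RESOURCE = "resource"  # measures what is consumed to produce that output


@dataclass(frozen=True)
class Metric:
    name: str
    kind: MetricKind


# Hypothetical metric names, chosen only to mirror the examples in the text.
METRICS = [
    Metric("db.queries_per_second", MetricKind.WORK),
    Metric("web.pages_served_per_second", MetricKind.WORK),
    Metric("app.revenue_per_minute", MetricKind.WORK),
    Metric("db.cpu_utilization", MetricKind.RESOURCE),
    Metric("db.memory_utilization", MetricKind.RESOURCE),
    Metric("host.disk_used_percent", MetricKind.RESOURCE),
]


def names_of_kind(metrics, kind):
    """Return the names of all metrics of the given kind, in input order."""
    return [m.name for m in metrics if m.kind is kind]


work_metric_names = names_of_kind(METRICS, MetricKind.WORK)
resource_metric_names = names_of_kind(METRICS, MetricKind.RESOURCE)
```

Keeping the classification next to the metric definition makes it easy to ask later which metrics are candidates for alerting and which belong in capacity planning.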

Optimizing your monitoring

Now that we’ve defined work and resource metrics, we can move to best practices.

1. Classify key metrics as work or resource

Look at your key metrics, specifically the ones you really care about, and figure out whether they’re work metrics or resource metrics.

2. Only alert on work metrics

Once you’ve done this classification – and it’s really important to spend time doing this – you need to identify what you want to get alerted on. You only want to get alerted on work metrics.

In other words, you want to get alerted on things that measure how useful your system is.

I should mention that it’s useful to alert on some resource metrics if they’re a leading indicator of a failure. For instance, disk space is a resource metric. However, when you run out of disk space, the whole show stops so it’s also important to alert on these metrics. But in general, alerting on resource metrics should be rare.
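
One way to express this rule is a small predicate: alert on work metrics, and on a resource metric only when it has been explicitly marked as a leading indicator of failure. This is a sketch under assumptions; the `Metric` fields, the `should_alert` function and the metric names are hypothetical and not taken from Datadog or any other tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    is_work: bool                    # True for work metrics, False for resource metrics
    leading_indicator: bool = False  # resource metric that predicts a failure


def should_alert(metric: Metric) -> bool:
    """Alert on work metrics, plus the rare resource metric that predicts failure."""
    return metric.is_work or metric.leading_indicator


# Hypothetical metrics mirroring the examples in the text.
pages_served = Metric("web.pages_served_per_second", is_work=True)
cpu = Metric("db.cpu_utilization", is_work=False)
disk = Metric("host.disk_used_percent", is_work=False, leading_indicator=True)

alertable = [m.name for m in (pages_served, cpu, disk) if should_alert(m)]
```

Here CPU utilization stays out of the alert list, while disk usage is included because running out of disk stops everything.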

3. Only alert on actionable work metrics

The tweak to the previous best practice is that you really only want to alert on actionable work metrics. In other words, you want to alert on work metrics that you can do something about.

For instance, an actionable work metric for a web server is how many webpages you serve without errors per second. That’s a work metric because if you’re serving zero pages, your website is not running at all – it’s down.

A non-actionable work metric could be how many 404s you’re serving per second. This isn’t actionable because it depends entirely on what people are doing on your site. If they are browsing to URLs that don’t exist, then you’re going to get a lot of 404s. That doesn’t mean something is wrong on your side, only that visitors are doing something unexpected. So you should not alert on non-actionable work metrics.
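
The difference can be sketched in Python by computing the actionable metric (pages served without errors per second) from per-status response counts and paging only when it drops too low. The function names, the sample counts and the threshold are hypothetical, and treating 2xx and 3xx responses as “without errors” is an assumption made for this example.

```python
def successful_pages_per_second(status_counts, window_seconds):
    """Rate of responses served without error (assumed here: 2xx and 3xx)."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    ok = sum(count for status, count in status_counts.items() if 200 <= status < 400)
    return ok / window_seconds


def should_page(status_counts, window_seconds, min_rate):
    """Page someone only when the actionable work metric falls below min_rate."""
    return successful_pages_per_second(status_counts, window_seconds) < min_rate


# Hypothetical response counts over a 60-second window.
healthy = {200: 5400, 304: 600, 404: 900}
down = {200: 0, 500: 30, 404: 900}

healthy_rate = successful_pages_per_second(healthy, 60)
down_rate = successful_pages_per_second(down, 60)
```

Both samples contain the same number of 404s, yet only the second one would page anyone with `should_page(down, 60, min_rate=10)`, because the 404 count never enters the decision.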

4. Review metrics and alerts periodically

The fourth, and maybe one of the hardest best practices, is to actually do a review and iterate on this process on a regular basis. Maybe it’s a weekly, bi-weekly or monthly thing, but you really want to carve out some time in your busy schedule and do a review with your team.
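
To give the review something concrete to look at, one possible input is a tally of how often each alert fired and how often it actually required action. The sketch below is only an illustration: the log format, the `rarely_acted_on` helper, the alert names and the 50% cutoff are all hypothetical.

```python
from collections import Counter


def rarely_acted_on(alert_log, max_action_ratio=0.5):
    """Return alert names whose share of firings that led to action is below the ratio."""
    fired = Counter()
    acted = Counter()
    for name, action_taken in alert_log:
        fired[name] += 1
        if action_taken:
            acted[name] += 1
    return sorted(name for name in fired if acted[name] / fired[name] < max_action_ratio)


# Hypothetical review data: (alert name, did someone need to act?)
log = [
    ("web.pages_served_low", True),
    ("web.pages_served_low", True),
    ("web.404_rate_high", False),
    ("web.404_rate_high", False),
    ("web.404_rate_high", True),
]

review_candidates = rarely_acted_on(log)
```

Alerts that show up in `review_candidates` are worth discussing with the team: they may be non-actionable and better suited to a dashboard than to a page.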

Back to goals

Now, let’s tie these best practices back to the initial goals of monitoring that I mentioned. Classifying key metrics as work or resource is a prerequisite for everything.

a. To know about an issue before your customers or your boss

Only alert on work metrics. You won’t be alerted on things that aren’t useful, so the alerts you do get are far more likely to point at a real issue.

b. To minimize your stress level

Only alert on actionable work metrics because you’re not going to get alerted on things over which you have no control.

c. To know how your systems & applications are performing

Review metrics and alerts periodically so you have a good sense of how your systems are performing, trending and how you can change things.

Use these best practices to improve your monitoring strategy and when you’re ready to implement, try a 14-day free trial of Datadog to graph and alert on your actionable work metrics and any other metrics and events from over 80 common infrastructure tools.
