Turn any signal into insight and action. See how PagerDuty Digital Operations Management Platform integrates machine data and human intelligence to improve visibility and agility across organizations.
Connect insights to real-time action by aligning teams through the shared language of business impact.
Check out the latest products we’ve been working on—including event intelligence, machine learning, response automation, on-call, analytics, operations health management, integrations, and more.
Digital Operations Management arms organizations with the insights needed to turn data into opportunity across every operational use case, from DevOps, ITOps, Security, Support, and beyond.
Over 300 Integrations
Discover DevOps best practices with our library of webinars, whitepapers, reports, and much more.
Learn best practices and get support help with resources from our award-winning support team.
See how PagerDuty works with our live product demo — twice a week, every week.
We've created a maturity model to assist on the journey to digital operations excellence. Take our short assessment to find out where your team falls!
Interactive, simple-to-use API and technical documentation enables users to easily try updates and extend PagerDuty.
Engage with users and PagerDuty experts from our global community of 200k+ users. Become a member, connect, and share insights for success.
Get all your PagerDuty-related questions answered by exploring our in-depth support documentation and community forums.
Have you ever worked on a team where it was a challenge to give constructive feedback or confidently share ideas? At PagerDuty Summit 2018, Patrick...
PagerDuty helps organizations transform their digital operations. Learn more about PagerDuty's mission and what we do.
Meet our experienced and passionate executive team.
We are risk-taking innovators dedicated to delivering amazing products and delighting customers. Join us and do the best work of your career.
With the PagerDuty Foundation, we are committed to doing our part in giving back to the community.
Guest post by Alexis Lê-Quôc, co-founder and CTO of Datadog. Datadog is a monitoring service for IT, Operations and Development teams who want to turn the massive amounts of data produced by their apps, tools and services into actionable insight.
At Datadog we eat our own… dogfood. We track hundreds of thousands of metrics internally. Learning what to alert on and what to monitor has taken us some time. Not all metrics are made equal, and we have come up with a simple way to manage them, which anyone can master. Here’s how we do it.
Why would you spend time getting better monitoring?
What kind of metrics does your monitoring tool track? Examples are: CPU utilization, memory utilization, database or web requests. That’s a lot of different types of metrics and they can be divided into two fundamental classifications of metrics – work and resource.
A work metric measures how much useful stuff your system or application is producing. For instance, we could look at the number of queries that a database is responding to or the number of pages that a web server is serving per second. The purpose of a database is to answer queries. The purpose of a web server is to serve pages. So these are appropriate work metrics.
Another work metric would be things like how much money is your application producing? That’s a very useful work metric to track availability and understand the effectiveness of your application and infrastructure.
The other class is resource metrics. A resource is something that is used to produce something useful. You use a resource to produce some work. So a resource metric measures how much of something is consumed to produce work. When you ask the question, “how much CPU am I consuming in the database?” it doesn’t really say much about whether that’s useful or not. It just says, “Well, I have more CPU available” or “I’m maxed out and my CPU is completely maxed out.” Same for memory, disk, network and so on. In general, I’ve used resource metrics for capacity planning rather than for availability management.
Now that we’ve defined work and resource metrics, we can move to best practices.Classify key metrics as work or resource
1. Classify key metrics as work or resource
Look at your key metrics, specifically the ones you really care about, and figure out whether they’re work metrics or resource metrics.
2. Only alert on work metrics
Once you’ve done this classification – and it’s really important to spend time doing this – you need to identify what you want to get alerted on. You only want to get alerted on work metrics.
In other words, you want to get alerted on things that measure how useful your system is.
I should mention that it’s useful to alert on some resource metrics if they’re a leading indicator of a failure. For instance, disk space is a resource metric. However, when you run out of disk space, the whole show stops so it’s also important to alert on these metrics. But in general, alerting on resource metrics should be rare.
3. Only alert on actionable work metrics
The tweak to the previous best practice is that you really only want to alert on actionable work metrics. In other words, you want to alert on work metrics that you can do something about.
For instance, an actionable work metric for a web server is how many webpages you serve without errors per second. That’s a work metrics because if you’re serving zero pages, your website is not running at all – it’s down.
A non-actionable work metric could be how many 404s I’m serving per second. This isn’t an actionable work metric because this will entirely depend on what people are doing on your site. If they are browsing to URLs that don’t exist, then you’re going to get a lot of 404s. This doesn’t mean it’s bad, but rather that they’re doing something that’s not expected. So you should not alert on non-actionable work metrics.
4. Review metrics and alerts periodically
The fourth, and maybe one of the hardest best practices, is to actually do a review and iterate on this process on a regular basis. Maybe it’s a weekly, bi-weekly or monthly thing, but you really want to carve out some time in your busy schedule and do a review with your team.
Now, let’s tie back back these best practices to the initial goals of monitoring that I mentioned. Classifying key metrics as work or resource is a prerequisite for everything.
a. To know about an issue before your customers or your boss
Only alert on work metrics so you know that you won’t be alerting on stuff that’s not useful and therefore have a much better result
b. To minimize your stress level
Only alert on actionable work metrics because you’re not going to get alerted on things over which you have no control
c. To know how your systems & applications are performing
Review metrics and alerts periodically so you have a good sense of how your systems are performing, trending and how you can change things.
Use these best practices to improve your monitoring strategy and when you’re ready to implement, try a 14-day free trial of Datadog to graph and alert on your actionable work metrics and any other metrics and events from over 80 common infrastructure tools.
This blog was co-authored by myself and Simon Darken. Once a year, PagerDuty’s SREs get together for a three-day, in-person offsite. With the team spread...
600 Townsend St., #200
San Francisco, CA 94103
905 King Street West, Suite 600
Toronto, ON, M6K 3G9, Canada
1416 NW 46th St., St. 301
Seattle, WA 98107
5 Martin Place
1 Fore St,
London EC2Y 9DT
© 2009 - 2019