Turn any signal into insight and action. See how PagerDuty Digital Operations Management Platform integrates machine data and human intelligence to improve visibility and agility across organizations.
Connect insights to real-time action by aligning teams through the shared language of business impact.
Check out the latest products we’ve been working on—including event intelligence, machine learning, response automation, on-call, analytics, operations health management, integrations, and more.
Digital Operations Management arms organizations with the insights needed to turn data into opportunity across every operational use case, from DevOps, ITOps, Security, Support, and beyond.
Over 300 Integrations
Discover DevOps best practices with our library of webinars, whitepapers, reports, and much more.
Learn best practices and get support help with resources from our award-winning support team.
See how PagerDuty works with our live product demo — twice a week, every week.
We've created a maturity model to assist on the journey to digital operations excellence. Take our short assessment to find out where your team falls!
Interactive, simple-to-use API and technical documentation enables users to easily try updates and extend PagerDuty.
Engage with users and PagerDuty experts from our global community of 200k+ users. Become a member, connect, and share insights for success.
Get all your PagerDuty-related questions answered by exploring our in-depth support documentation and community forums.
Have you ever worked on a team where it was a challenge to give constructive feedback or confidently share ideas? At PagerDuty Summit 2018, Patrick...
PagerDuty helps organizations transform their digital operations. Learn more about PagerDuty's mission and what we do.
Meet our experienced and passionate executive team.
We are risk-taking innovators dedicated to delivering amazing products and delighting customers. Join us and do the best work of your career.
With the PagerDuty Foundation, we are committed to doing our part in giving back to the community.
Guest blog post by Dave Josephsen, developer evangelist at Librato. Librato provides a complete solution for monitoring and understanding the metrics that impact your business at all levels of the stack.
The assumption underlying all monitoring systems is the existence of an entity that we cannot fully control. A thing we have created, like an airplane, or even a thing that simply exists by means of miraculous biology, like the human body. A thing that would be perfect, but for its interaction with the dirty, analog reality of meatspace; the playground of entropy, and chaos where the best engineering we can manage eventually goes sideways through age, human-error, and random happenstance.
Airplanes and bodies are systems. Not only do we expect them to operate in a particular way, but we have well-defined mental models that describe their proper operational characteristics — simple assumptions about how they should work, which map to metrics we can measure that enable us to describe these systems as ‘good’ or ‘bad’. Airplane tires should maintain 60 PSI of pressure, human hearts should beat between 40 and 100 times per minute. The JDBC client shouldn’t ever need more than a pool of 150 DB connections.
This is why we monitor: to obtain closed-loop feedback on the operational characteristics of systems we cannot fully control, to make sure they’re operating within bounds we expect. Fundamentally, monitoring is a signal processing problem, and therefore, all monitoring systems are signal processing systems. Some monitoring systems sample and generate signals based on real-world measurements, and others collect and process signal to do things like visualization, aberrant-behavior detection, and alerting and notification.
Below is an admittedly oversimplified diagram that describes how monitoring systems for things like human hearts and aircraft oil-pressure work (allowing, of course, for the omission of the more esoteric inner-workings of the human heart, which cannot be monitored). In it, we see sensors generating a signal that feeds the various components that provide operational feedback about the system to human operators.
All too often however, what we see in IT monitoring systems design looks like the figure below, where multiple sensors are employed to generate duplicate signals for each component that generates a different kind of feedback.
There are many reasons this anti-pattern can emerge, but most of these reduce to a cognitive dissonance between different IT groups, where each group believes they are monitoring for a different reason, and therefore require different analysis tools. The operations and development teams, for example might believe that monitoring OS metrics is fundamentally different from monitoring application metrics, and therefore each implements their own suite of monitoring tools to meet what they believe are exclusive requirements. For OS metrics, operations may think it requires minute-resolution state data on which they can alert, while development focuses on second-resolution performance metrics to visualize their application performance.
In reality, both teams share the same requirement: a telemetry signal that will provide closed-loop feedback from the systems they’re interested in, but because each implements a different toolchain to achieve this requirement, they wind up creating redundant signals, to feed different tools.
When different data sources are used for alerting and graphing, one source or the other might generate false positives or negatives. Each might monitor subtly different metrics under the guise of the same name, or the same metric in subtly different ways. When an engineer is awakened in the middle of the night by an alert from such a system, and the visualization feedback doesn’t agree with the event-notification feedback, an already precarious, stressful and confusing situation is made worse, and precious time is wasted vetting one monitoring system or the other.
Ultimately, the truth of which source was correct is irrelevant. Even if someone successfully undertakes the substantial forensic effort necessary to puzzle it out, there’s unlikely to be a meaningful corrective action that can be taken to synchronize the behavior of the sources in the future. The inevitable result is that engineers will begin ignoring both monitoring systems because neither can be trusted.
Fixing false-positives caused by disparate data sources isn’t a question of improving an unreliable monitoring system, but rather one of making two unreliable monitoring systems agree with each other in every case. If we were using two different EKG’s on the same patient — one for visualization and another for notification — and the result was unreliable, we would most likely redesign the system to operate on a common signal, and focus on making that signal as accurate as possible. That is to say, the easiest way to solve this problem is to simply alert on what you see.
Alerting on what you see doesn’t require that everyone in the organization use the same monitoring tool to collect the metric data that’s interesting to them, it only requires that the processing and notification systems use a common input signal.
The specific means by which you achieve a commonality of input signal depends on the tools currently in use at your organization. If you’re a Nagios/Ganglia shop for example, you could modify Nagios to alert on data collected by Ganglia, instead of collecting one stream of metrics data from Ganglia for visualization, and a different signal from Nagios for alerting.
Librato and PagerDuty are an excellent choice for centrally processing the telemetry signals from all of the data collectors currently in use at your organization. With turn-key integration into AWS and Heroku, and support for nearly 100 open-source monitoring tools, log sinks, instrumentation libraries, and daemons, it’s a cinch to configure your current tools to emit metrics to Librato.
By combining Librato and PagerDuty, all engineers from any team can easily process, visualize, and correlate events in your telemetry signals as well as send notifications and escalations that are guaranteed to reflect the data in those visualizations. Your engineers can use the tools they want, while ensuring that the signals emitted by those tools can be employed to provide effective, timely feedback to everyone in the organization. Sign up for a Librato free trial today and learn how to integrate PagerDuty with Librato .
I love writing software, but I hate dealing with bugs. They take you away from what you want to be doing and often lead you...
A few weeks ago, I had the pleasure of attending PagerDuty Summit 2018 as Zenoss was a proud partner and sponsor for the conference. It...
600 Townsend St., #200
San Francisco, CA 94103
905 King Street West, Suite 600
Toronto, ON, M6K 3G9, Canada
1416 NW 46th St., St. 301
Seattle, WA 98107
5 Martin Place
1 Fore St,
London EC2Y 9DT
© 2009 - 2019