“I need to be notified if there’s a significant event ongoing with SignalFx.” This is what I tell my team. However, despite being the CTO of a monitoring company, creating the right set of alerts for me to stay informed of incidents in progress or potential issues was harder than it seemed at first glance.
While the advent of cloud and open-source technologies has enabled us to build software much more quickly, today’s environments are significantly more complex to monitor and manage for a number of reasons, including:
The result for many of us is a storm of false-positive or redundant alerts. Alert fatigue hinders your team’s ability to find and address issues in real time, and when left unaddressed for long enough, it also destroys team morale and leads to preventable outages.
Reducing alert fatigue starts with broadening one’s focus. While fine-grained measurement of metrics is extremely useful during troubleshooting and forensic analysis, the most actionable alerts rely on a combination of signals that create higher-level indicators of application health. In particular, you should consider:
Monitoring Populations, Not Individual Instances
Define and subscribe to per-service or per-population health indicators as opposed to alerting on the status of every individual component in your environment. For example, you could track the 99th percentile latency of an API call across service instances, the average CPU utilization for a given cluster of nodes, or the sum of API errors across the group of containers that serve that API.
Aggregated system metrics across 1,436 hosts
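As a rough illustration of population-level indicators, here is a minimal Python sketch that rolls per-instance samples up into the three example signals mentioned above. The function names and the input shapes (plain dictionaries keyed by instance, node, or container) are assumptions made for this example, not part of any SignalFx or PagerDuty API.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def population_health(latencies_ms_by_instance, cpu_pct_by_node, errors_by_container):
    """Collapse per-component samples into per-population indicators.

    All three arguments are hypothetical input shapes:
      latencies_ms_by_instance: {instance_id: [latency_ms, ...]}
      cpu_pct_by_node:          {node_id: cpu_percent}
      errors_by_container:      {container_id: error_count}
    """
    # Pool the latency samples from every instance before taking the percentile.
    all_latencies = [
        sample
        for samples in latencies_ms_by_instance.values()
        for sample in samples
    ]
    return {
        "p99_latency_ms": percentile(all_latencies, 99),
        "avg_cpu_pct": mean(cpu_pct_by_node.values()),
        "total_api_errors": sum(errors_by_container.values()),
    }
```

An alert would then be defined on the returned values rather than on any single host, so one noisy instance does not page anyone unless it moves the population-level number.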
Alerting on Patterns and Trends Rather Than Fixed Numerical Thresholds
Use algorithmically generated thresholds that can adapt to changing environments. Distributed systems often behave in mysterious ways, which makes it extremely difficult to determine the “right” amount of CPU usage or number of API errors to tolerate before an alert fires.
Alerting on raw # of sessions vs. week over week change
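To make the contrast with a fixed threshold concrete, here is a small Python sketch of a week-over-week comparison. The 30% drop threshold and the function names are illustrative assumptions, not values taken from SignalFx.

```python
def week_over_week_change(current, same_time_last_week):
    """Relative change versus the same time one week earlier.

    Returns None when there is no baseline to compare against.
    """
    if same_time_last_week == 0:
        return None
    return (current - same_time_last_week) / same_time_last_week


def should_alert(current, same_time_last_week, max_drop=0.3):
    """Fire when the metric has dropped by at least max_drop (assumed 30%)."""
    change = week_over_week_change(current, same_time_last_week)
    return change is not None and change <= -max_drop
```

With this approach, 600 sessions on a Sunday night may be perfectly normal, while 600 sessions at a time that saw 1,000 a week earlier would trigger the alert.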
By accounting for regular patterns (e.g., higher weekday traffic) or predictively alerting (alert when a cluster is about to run out of disk space in the next N days), you can further differentiate between regular system behavior and something that warrants a response.
Chart displaying a metric trending to capacity
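One simple way to approximate the predictive case is to fit a straight line to recent usage and extrapolate to capacity. The sketch below assumes one sample per day and a linear trend; both are simplifications chosen for illustration, and a production detector would likely use something more robust.

```python
def days_until_full(daily_usage_pct, capacity_pct=100.0):
    """Estimate days until capacity from evenly spaced daily samples.

    Uses an ordinary least-squares line through the samples. Returns None
    when usage is flat or shrinking, since no fill date can be projected.
    """
    n = len(daily_usage_pct)
    if n < 2:
        raise ValueError("need at least two samples")
    x_mean = (n - 1) / 2
    y_mean = sum(daily_usage_pct) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(daily_usage_pct))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = y_mean - slope * x_mean
    fitted_today = intercept + slope * (n - 1)
    return max(0.0, (capacity_pct - fitted_today) / slope)


def will_fill_within(daily_usage_pct, n_days):
    """True when the projected fill date is at most n_days away."""
    remaining = days_until_full(daily_usage_pct)
    return remaining is not None and remaining <= n_days
```

For example, a disk that grew from 50% to 58% over five days at a steady two points per day is projected to fill in 21 days, so `will_fill_within(samples, 30)` would fire while `will_fill_within(samples, 14)` would not.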
Defining Overall Measures of Application Performance
Combine metrics from different microservices to derive higher-level signals and alerts. Two possibilities are the number of page loads per logged-in user or the count of API errors measured as a percentage of total API calls. One of our customers combines metrics from all their microservices to create a “health score” for deployment versions that indicates whether application performance improved as a whole.
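The two ratio-style signals mentioned above can be sketched in a few lines of Python. The dictionary inputs and function names are hypothetical; the customer's "health score" is not described in enough detail to reproduce, so it is deliberately left out.

```python
def ratio(numerator, denominator):
    """Safe division that returns None when the denominator is zero."""
    return numerator / denominator if denominator else None


def api_error_rate_pct(errors_by_service, calls_by_service):
    """API errors as a percentage of total API calls across all services.

    Both arguments are assumed to be {service_name: count} dictionaries.
    """
    r = ratio(sum(errors_by_service.values()), sum(calls_by_service.values()))
    return None if r is None else 100 * r


def page_loads_per_user(page_loads, logged_in_users):
    """Page loads per logged-in user over the same time window."""
    return ratio(page_loads, logged_in_users)
```

Expressing errors as a percentage of calls keeps the signal meaningful as traffic grows or shrinks, which a raw error count does not.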
Despite the fact that we use all of these techniques at SignalFx, I was still experiencing too many false positive alerts. Keep in mind the following:
What other signals could I measure? At SignalFx, we have a Slack channel that we’ve named #outage that’s specifically for discussing incidents. This channel also receives critical alert notifications from PagerDuty to preserve context for those discussions. Knowing that significant issues often cause multiple users to collaborate on Slack and escalate via PagerDuty, I decided to gather metrics on human activity in #outage. The result looked something like this:
Gray: “Normal” SignalFx alerting workflow
Yellow: Alerting with social signals
I used an AWS Lambda function to query and classify messages (e.g., human vs. bot-generated) and publish the results to SignalFx. Next, I created an alert detector that notified me when more than three unique human authors were posting in #outage for a period of five minutes or longer. Alerts were sent to my phone via PagerDuty and as a direct message in Slack.
Notification for potential outage in progress
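The classification and detection logic might look roughly like the Python sketch below. It is not the actual Lambda code: the message shape (`author`, `is_bot`, and `ts` in epoch seconds) is a hypothetical simplification, and the detector is reduced to one reading of the rule, namely counting distinct human authors inside a trailing five-minute window.

```python
def is_human(message):
    """Classify a message as human-authored (hypothetical `is_bot` flag)."""
    return not message.get("is_bot", False)


def unique_human_authors(messages, now, window_seconds=300):
    """Distinct human authors who posted in the trailing window ending at `now`."""
    cutoff = now - window_seconds
    return {
        m["author"]
        for m in messages
        if is_human(m) and cutoff < m["ts"] <= now
    }


def outage_suspected(messages, now, min_authors=4, window_seconds=300):
    """True when more than three distinct humans posted within the window."""
    return len(unique_human_authors(messages, now, window_seconds)) >= min_authors
```

Filtering out bot messages matters here because the channel also receives automated alert notifications, which would otherwise inflate the author count during exactly the noisy periods the detector is meant to see through.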
This worked surprisingly well. While I still received a few false positives, their number dropped to almost zero, and I was notified of every single incident of interest to me. Interestingly enough, I was also notified of a few potential incidents brewing that I had no active alerts set for, but that our engineers had uncovered as part of their general observation of the service.
I was initially disappointed at being unable to create the “perfect” alert using only application and infrastructure metrics, but this may have been a naive expectation. Crafting the right alert requires not only understanding your environment, but also how your organization responds to incidents.
Measuring human behavior was enough for my specific use case, but given how interoperable and data-agnostic many of today’s tools are, there is a wealth of other signals that we could potentially incorporate into our monitoring.
Real-time business requires real-time operational intelligence, and today’s technologies emit far more data than traditional monitoring tools can handle. SignalFx collects streaming metrics from every component in your environment to provide analytics and alerts in seconds, so you can find and address issues before they impact customers.
With SignalFx and PagerDuty, you can automatically open incidents in PagerDuty when an alert detector is triggered in SignalFx, map to different escalation policies depending on the alert, and automatically mark incidents resolved when things return to normal.
At SignalFx, we help organizations monitor all the signals that matter—in real time, at any scale—and give them the confidence to innovate faster than ever before.
Arijit Mukherji is CTO at SignalFx and passionate about monitoring. He was one of the original developers of Facebook’s metrics solution (ODS), and subsequently managed the development of Facebook’s networking tools, data visualization, and other infrastructure monitoring software. While focused on the monitoring space for more than a decade, his diverse career of over 20 years also spans IP telephony, VoIP conferencing, and network virtualization.