Monitoring is pivotal to staying proactive in your ITOps architecture. In recent years, we have seen an explosion in both the number and the types of tools classified as “monitoring” tools. While this ever-expanding tools landscape has vastly increased ITOps visibility, integrating such a vast array of tools can have the side effect of creating even more noise. This “visibility and noise” paradox has turned the monitoring landscape into a catch-22 for many IT departments, while others have streamlined their path from proactive detection to issue resolution. Let’s look at the monitoring landscape and at how to build an integrated environment that succeeds.
Monitoring breaks down to tool types:
Some tools encompass more than one monitoring type, and some organizations require just one or the other. But every organization requires incident management because, no matter what monitoring happens upstream, a system that monitors without making its findings actionable loses its value.
Below are some typical KPIs that are logged and monitored for alerts:
In any IT department, whether you work internally or as a consultant, noise is the enemy. It’s unanimous. Time is our most precious commodity, and the moment our day gets thrown into reactive firefighting is the moment we need to rethink how streamlining the monitoring and alerting process could have saved us. The first step toward this goal is to put a foundational level of monitoring in place. This includes setting up incident tracking for high-SLA, mission-critical stack services such as network traffic, server uptime, application availability, security services, and resource utilization. Once this foundational level is in place, ITOps teams gain the visibility and critical insights they need to prevent product SLA violations.
Most of the tools and systems available feature pre-built templates to help return these critical services to production. However, configuring proper thresholds and incident priorities is paramount to reducing noise and increasing visibility. It will take some finesse to configure CPU, disk, memory, and network thresholds to fit your ITOps needs. The key is to set these thresholds so that your team gets enough warning to react to issues and can identify the incidents that need a high-priority response.
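As a rough illustration of the threshold idea, here is a minimal Python sketch that maps a resource reading to a priority. The metric names, the `Threshold` class, the `classify` function, and every numeric value are assumptions made up for this example; they do not come from any particular monitoring tool and would need tuning for your environment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Threshold:
    warning: float   # percent utilization that should give the team early notice
    critical: float  # percent utilization that calls for an immediate response


# Illustrative values only (assumptions); tune them to your own services.
THRESHOLDS = {
    "cpu": Threshold(warning=80.0, critical=95.0),
    "disk": Threshold(warning=75.0, critical=90.0),
    "memory": Threshold(warning=80.0, critical=92.0),
}


def classify(metric: str, value: float) -> Optional[str]:
    """Return 'critical', 'warning', or None when the reading is healthy."""
    threshold = THRESHOLDS[metric]
    if value >= threshold.critical:
        return "critical"
    if value >= threshold.warning:
        return "warning"
    return None
```

Keeping a gap between the warning and critical levels is what buys the team time to react: a reading such as `classify("disk", 80.0)` produces a low-priority warning well before the disk reaches the critical level.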
Once the baseline monitoring framework is configured, it’s time to move on to more intelligent service monitoring such as log analysis, application insights, operational intelligence, and intrusion detection. Leveraging tools like Splunk can give ITOps and SecOps teams an immense amount of cross-platform insight. More specifically, tools like Rollbar and Errorception can help Dev teams gain visibility into application behavior by generating incidents for code errors. Furthermore, tools from Rapid7, AlienVault, and Threat Stack can help with security monitoring and threat intelligence. Whichever toolset best fits your environment, the goal remains the same: reduce alert noise and accelerate mean time to resolution (MTTR).
With a solid monitoring foundation in place, we can next turn our attention to actionability. How do we translate alerts into action, especially if we are leveraging multiple tools to give us a broader monitoring profile? This is the point where aggregating the alerts from multiple monitoring tools into an incident management platform can pay huge dividends. Incident management platforms like PagerDuty not only connect critical IT services, they also take the event data generated and immediately recruit and notify the right teams. They turn the issues that your monitoring systems have detected into alerts and incidents. Moreover, automated escalation policies help your team resolve incidents quickly and efficiently by ensuring that a responder takes action on each issue. This is the pivotal point at which you maximize the ROI on your monitoring tools.
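To make the escalation idea concrete, the following Python sketch shows one simple way an escalation policy could decide who should be notified for an unacknowledged incident. It is not PagerDuty's implementation or API; the `POLICY` levels, the timeouts, and the `current_responder` function are hypothetical names and values chosen for illustration.

```python
from typing import List, Tuple

# Each level is (responder, minutes to wait for an acknowledgement before escalating).
# Names and timeouts are assumptions for this example.
POLICY: List[Tuple[str, int]] = [
    ("primary on-call", 5),
    ("secondary on-call", 10),
    ("engineering manager", 15),
]


def current_responder(policy: List[Tuple[str, int]], minutes_unacknowledged: int) -> str:
    """Return the responder who should be notified after the given wait."""
    elapsed = 0
    for responder, timeout in policy:
        elapsed += timeout
        if minutes_unacknowledged < elapsed:
            return responder
    # Every level has timed out: keep notifying the last level.
    return policy[-1][0]
```

With this policy, an incident that sits unacknowledged for 5 minutes moves from the primary to the secondary on-call, and after 15 minutes it reaches the engineering manager, so no incident waits indefinitely on a single person.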
With a solid monitoring framework, ITOps has the tools and visibility to be proactive in their operations and faster in their response to incidents. Overall, the goal is not to overwhelm ITOps with alerts, but to generate and detect the critical alerts that need immediate action.
Noise costs money: personnel cost, productivity loss, downtime, and even lost revenue. By ensuring you have the right monitoring framework in place and an incident management platform that centralizes, classifies, and enriches events, you can avoid the “visibility and noise” paradox.
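One way to picture how centralizing events reduces noise is to group events from different tools that describe the same problem. The Python sketch below assumes each event is a dictionary with `source`, `service`, and `check` fields; that event shape and the `group_events` function are assumptions for illustration, not the format of any specific product.

```python
from typing import Dict, List, Tuple


def group_events(events: List[dict]) -> List[dict]:
    """Collapse events that share (service, check) into one incident with a count."""
    incidents: Dict[Tuple[str, str], dict] = {}
    for event in events:
        key = (event["service"], event["check"])
        if key not in incidents:
            incidents[key] = {
                "service": event["service"],
                "check": event["check"],
                "sources": [],  # monitoring tools that reported this problem
                "count": 0,     # number of raw events folded into the incident
            }
        incident = incidents[key]
        incident["count"] += 1
        if event["source"] not in incident["sources"]:
            incident["sources"].append(event["source"])
    return list(incidents.values())


# Hypothetical sample events from two monitoring tools.
sample = [
    {"source": "tool-a", "service": "checkout", "check": "cpu"},
    {"source": "tool-b", "service": "checkout", "check": "cpu"},
    {"source": "tool-a", "service": "checkout", "check": "cpu"},
    {"source": "tool-a", "service": "search", "check": "disk"},
]
incidents = group_events(sample)
```

Here four raw events become two incidents, and the responder for the checkout service sees a single incident that records both reporting tools instead of three separate pages.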