by George Miranda
June 20, 2019
Monitoring is pivotal to staying proactive across your ITOps architecture. In recent years, we have seen an explosion in both the number and the types of tools classified as “monitoring” tools. While this ever-growing tools landscape has vastly increased ITOps visibility, an occasional side effect of integrating such a vast array of tools is even more noise. This “visibility and noise” paradox has turned the monitoring landscape into a catch-22 for many IT departments, while others have managed to streamline the path from detection to issue resolution. Let’s look at the monitoring landscape and how to build an integrated environment that succeeds.
Monitoring breaks down to tool types:
Some tools cover more than one monitoring type, and some organizations need only one type or another. But every organization requires incident management: no matter what monitoring happens upstream, a system that monitors without making its findings actionable loses its value.
Below are some typical KPIs that are logged and monitored for alerts:
In any IT department, whether you work internally or as a consultant, noise is the enemy. On that point, everyone agrees. Time is our most precious commodity, and the moment our day gets thrown into reactive firefighting is the moment we need to rethink how a streamlined monitoring and alerting process could have saved us. The first step toward this goal is to establish a foundational level for your monitoring strategy. This includes setting up incident tracking for high-SLA, mission-critical stack services such as network traffic, server uptime, application availability, security services, and resource utilization. Once this foundational level is in place, ITOps teams gain the visibility and critical insights needed to prevent product SLA violations.
Most of the tools and systems available feature pre-built templates to help return these critical services to production. However, configuring proper thresholds and incident priorities is paramount to reducing noise and increasing visibility. It will take some finesse to configure CPU, disk, memory, and network thresholds to fit your ITOps needs. The key is to set these thresholds so that your team gets enough warning to react to issues and can identify the incidents that need a high-priority response.
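To make the idea of warning and critical thresholds concrete, here is a minimal Python sketch. It is not taken from any particular monitoring tool: the `Threshold` class, the `classify` and `priority_for` helpers, and every percentage value are hypothetical and only illustrate how a reading might be mapped to a severity and then to an incident priority.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Threshold:
    """Warning and critical levels for one metric, in percent utilization."""
    warning: float
    critical: float


# Hypothetical starting values (assumptions, not recommendations);
# tune them until they give your team enough warning without adding noise.
THRESHOLDS = {
    "cpu": Threshold(warning=80.0, critical=95.0),
    "disk": Threshold(warning=75.0, critical=90.0),
    "memory": Threshold(warning=80.0, critical=92.0),
    "network": Threshold(warning=70.0, critical=90.0),
}


def classify(metric: str, value: float) -> Optional[str]:
    """Return 'critical', 'warning', or None when the reading is healthy."""
    threshold = THRESHOLDS[metric]
    if value >= threshold.critical:
        return "critical"
    if value >= threshold.warning:
        return "warning"
    return None


def priority_for(severity: Optional[str]) -> Optional[str]:
    """Map a severity to an incident priority; healthy readings map to None."""
    return {"critical": "high", "warning": "low"}.get(severity)


if __name__ == "__main__":
    for metric, value in [("cpu", 97.2), ("disk", 78.0), ("memory", 41.5)]:
        severity = classify(metric, value)
        print(metric, value, severity, priority_for(severity))
```

Keeping the warning level well below the critical level is what buys reaction time: a warning can open a low-priority incident long before the same metric would page someone.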
Once the baseline monitoring framework is configured, it’s time to move on to more intelligent service monitoring such as log analysis, application insights, operational intelligence, and intrusion detection. Tools like Splunk can give ITOps and SecOps teams an immense amount of cross-platform insight. More specifically, tools like Rollbar and Errorception can help Dev teams gain visibility into their applications by generating incidents for code errors. Furthermore, tools from Rapid7, AlienVault, and Threat Stack can provide security monitoring and threat intelligence. Whichever toolset best fits your environment, the goal remains the same: reduce alert noise and shorten mean time to resolution (MTTR).
With a solid monitoring foundation in place, we can turn our attention to actionability. How do we translate alerts into action, especially if we are leveraging multiple tools to give us a broader monitoring profile? This is the point where aggregating the alerts from multiple monitoring tools into an incident management platform can pay huge dividends. Incident management platforms like PagerDuty not only connect critical IT services, they also take the event data those services generate and immediately recruit and notify the right teams. Incident management platforms turn the issues that your monitoring systems have detected into alerts and incidents. Moreover, automated escalation policies help your team resolve incidents quickly and efficiently by ensuring that a responder takes action on each issue. This is the point at which you maximize the ROI on your monitoring tools.
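The aggregation and escalation steps described above can be sketched in a few lines of Python. This is an illustrative model only and does not use or reproduce PagerDuty's API: the `Event` and `Incident` classes, the `dedup_key` grouping rule, the service names, and the escalation delays are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Event:
    source: str    # monitoring tool that emitted the event
    service: str   # affected service
    check: str     # what failed, e.g. "disk_usage"
    severity: str  # "critical" or "warning"


@dataclass
class Incident:
    key: str
    service: str
    severity: str
    events: List[Event] = field(default_factory=list)
    acknowledged: bool = False


def dedup_key(event: Event) -> str:
    """Events about the same check on the same service share one incident."""
    return f"{event.service}:{event.check}"


def aggregate(events: List[Event]) -> Dict[str, Incident]:
    """Group events from any number of tools into one incident per key."""
    incidents: Dict[str, Incident] = {}
    for event in events:
        key = dedup_key(event)
        incident = incidents.get(key)
        if incident is None:
            incident = Incident(key=key, service=event.service,
                                severity=event.severity)
            incidents[key] = incident
        incident.events.append(event)
        if event.severity == "critical":
            # The incident keeps the worst severity any tool reported.
            incident.severity = "critical"
    return incidents


# Hypothetical policy: (minutes since the incident opened, who gets notified).
ESCALATION_POLICY = [
    (0, "primary on-call"),
    (15, "secondary on-call"),
    (30, "engineering manager"),
]


def responder_to_notify(incident: Incident, minutes_open: int) -> Optional[str]:
    """Return the escalation level reached so far, or None once acknowledged."""
    if incident.acknowledged:
        return None
    current = None
    for delay, responder in ESCALATION_POLICY:
        if minutes_open >= delay:
            current = responder
    return current


if __name__ == "__main__":
    events = [
        Event("tool_a", "checkout", "disk_usage", "warning"),
        Event("tool_b", "checkout", "disk_usage", "critical"),
        Event("tool_c", "auth", "login_errors", "warning"),
    ]
    for incident in aggregate(events).values():
        print(incident.key, incident.severity, len(incident.events),
              responder_to_notify(incident, minutes_open=20))
```

Grouping by a shared key is what reduces noise here: two tools reporting the same disk problem produce one incident rather than two pages. The escalation function keeps notifying further up the chain until someone acknowledges.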
With a solid monitoring framework, ITOps has the tools and visibility to be proactive in their operations and faster in their response to incidents. Overall, the goal is not to overwhelm ITOps with alerts, but to generate and detect the critical alerts that need immediate action.
Noise costs money in terms of personnel cost, productivity loss, downtime, and even lost revenue. By ensuring you have the right monitoring framework in place and an incident management platform that centralizes, classifies, and enriches events, you can avoid the “visibility and noise” paradox.