PagerDuty Blog

Breaking Down Monitoring

Monitoring is pivotal to staying proactive in your ITOps architecture. In recent years, we have seen an explosion in both the number and types of tools classified as “monitoring” tools. While this ever-expanding tools landscape has vastly increased ITOps visibility, an occasional side effect of integrating so many tools is even more noise. The “visibility and noise” paradox has turned the monitoring landscape into a catch-22 for many IT departments, while others have streamlined their proactive approach to issue resolution. Let’s look at the monitoring landscape and build an integrated environment that succeeds.

Monitoring breaks down into the following tool types:

  • Application Performance Monitoring (APM): Looking solely at the application layer.
  • Log Analysis: Typically directed at the infrastructure layer.
  • Exception Monitoring: Setting up alerts for every exception block at the code level.
  • Artifact Monitoring: Making sure all the artifacts and components in the application are current and free of vulnerabilities.
  • Incident Management: Building into any part of the stack and making sure that you know right away when something goes wrong.

Some tools encompass more than one monitoring type, and some organizations require only one or two of them. But every organization requires incident management because, no matter what monitoring happens upstream, a system that monitors without making its findings actionable loses its value.

Below are some typical KPIs that are logged and monitored for alerts:

  • Performance (CPU, Disk, Memory, Network, Utilization)
    • This is a critical element of any monitoring stack and covers performance and utilization monitoring for critical hardware such as CPU, disk, memory, and network. Red flags to look out for are high utilization, IO errors, and predictive failures.
  • Uptime (Resource Availability, Server Availability, Network Availability)
    • These generally rank as high-priority response tickets, since they mean that one of your servers, network devices, or key resources is no longer functioning. Uptime can be monitored simply by checking network access or service availability.
  • Application / System Events (Errors, Requests, Warnings, Failures)
    • This category is where you track errors and events on key applications and systems. Examples include monitoring HTTP errors and requests on a web server, or monitoring the services that power a particular application component.
  • Security (IDS/IPS, Credential Management, Incident Detection)
    • Anything related to security and visibility. This includes monitoring your firewall, endpoints, encryption services, and other security systems. Detected intrusions and failed login attempts are two of the many security alerts you can establish.
  • Logging (Syslog Services, SNMP, Log Aggregation, Enrichment, and Notification)
    • Log aggregation and enrichment services. Examples include sending network and security notifications to a syslog server or 3rd party tool in order to enrich your log data and generate data-driven alerts.

Noise is the Enemy

In any IT department, whether you work internally or as a consultant, noise is the enemy. It’s unanimous. Time is our most precious commodity, and the moment our day gets thrown into reactive firefighting is the moment we need to rethink how streamlining the monitoring and alerting process could have saved us. The first step toward this goal is to establish the foundational level of your monitoring strategy. This includes setting up incident tracking for high-SLA, mission-critical stack services such as network traffic, server uptime, application availability, security services, and resource utilization. Once this foundational level is in place, ITOps teams gain the visibility and critical insights to prevent product SLA violations.

Most of the tools and systems available feature pre-built templates to help return these critical services to production. However, configuring proper thresholds and incident priorities is paramount to reducing noise and increasing visibility. It will take some finesse to configure CPU, disk, memory, and network thresholds to fit your ITOps needs. The key is to set these thresholds so that your team gets enough warning to react to issues and can identify the incidents that need a high-priority response.
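As a rough illustration of threshold configuration, here is a minimal Python sketch that maps a utilization reading to a severity. The `Threshold` class, the `classify` function, and every numeric limit are hypothetical values chosen for the example, not recommendations from any particular tool; the right numbers depend on your own baseline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Warning and critical limits for one metric, in percent utilization."""
    warning: float
    critical: float


# Hypothetical starting values; tune them against your own baseline.
THRESHOLDS = {
    "cpu": Threshold(warning=80.0, critical=95.0),
    "disk": Threshold(warning=75.0, critical=90.0),
    "memory": Threshold(warning=85.0, critical=95.0),
    "network": Threshold(warning=70.0, critical=90.0),
}


def classify(metric: str, value: float) -> str:
    """Return 'ok', 'warning', or 'critical' for a utilization reading."""
    limits = THRESHOLDS[metric]
    if value >= limits.critical:
        return "critical"
    if value >= limits.warning:
        return "warning"
    return "ok"
```

Keeping a gap between the warning and critical limits is what gives the team time to react: a warning can open a low-priority ticket, while a critical reading pages someone.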

Once the baseline monitoring framework is configured, it’s time to move on to more intelligent service monitoring such as log analysis, application insights, operational intelligence, and intrusion detection. Leveraging tools like Splunk can give an immense amount of cross-platform insight to ITOps and SecOps teams. More specifically, tools like Rollbar and Errorception can help Dev teams gain visibility into their applications by generating incidents for code errors. Furthermore, tools from Rapid7, AlienVault, and Threat Stack can help illuminate security monitoring and threat intelligence. Whichever toolset best fits your environment, the goal remains the same: reduce alert noise and accelerate mean time to resolution (MTTR).

Time to Get Actionable

With a solid monitoring foundation in place, we can next aim our gaze at actionability. How do we translate alerts into action, especially if we are leveraging multiple tools to give us a greater monitoring profile? This is the point where aggregating the alerts from multiple monitoring tools into an incident management platform can pay huge dividends. Incident management platforms like PagerDuty not only connect critical IT services, but also take the event data generated and immediately recruit and notify the right teams. Incident management platforms turn the issues that your monitoring systems have detected into alerts and incidents. Moreover, automated escalation policies enable your team to execute quickly and efficiently on incident resolution by ensuring that a responder takes action on the issue. This is the pivotal point at which you maximize the ROI on your monitoring tools.
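To make the idea of an automated escalation policy concrete, here is a minimal Python sketch. It is a generic model, not PagerDuty's implementation or API: `EscalationLevel`, `current_responder`, the responder names, and the timeouts are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class EscalationLevel:
    responder: str        # hypothetical responder or on-call rotation name
    timeout_minutes: int  # how long this level has to acknowledge


def current_responder(policy: List[EscalationLevel],
                      minutes_unacknowledged: int) -> Optional[str]:
    """Return who should be notified once an incident has gone
    unacknowledged for the given number of minutes, or None if
    every level in the policy has timed out."""
    elapsed = 0
    for level in policy:
        elapsed += level.timeout_minutes
        if minutes_unacknowledged < elapsed:
            return level.responder
    return None


# Hypothetical three-level policy.
policy = [
    EscalationLevel("primary-on-call", 15),
    EscalationLevel("secondary-on-call", 15),
    EscalationLevel("engineering-manager", 30),
]
```

With this policy, an incident that is still unacknowledged after 15 minutes moves from the primary to the secondary on-call, so an alert never stalls with one person.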

Steps to breaking down monitoring

  • Foundation: Baseline and advanced incident detection
  • Enrichment: Deduplication, thresholding, and prioritization
  • Actionability: Notification and alerting
  • Success: Increased operational agility and reduced MTTR
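The enrichment step above can be sketched in a few lines of Python. This is an illustrative model under stated assumptions: the `(source, check, severity)` event shape, the `Incident` class, the `enrich` function, and the severity ordering are hypothetical and do not reflect any specific platform's data model.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

# Assumed severity ordering; a lower rank means more urgent.
SEVERITY_RANK = {"critical": 0, "error": 1, "warning": 2, "info": 3}


@dataclass
class Incident:
    source: str
    check: str
    severity: str
    count: int = 1  # number of raw events folded into this incident


def enrich(events: Iterable[Tuple[str, str, str]]) -> List[Incident]:
    """Deduplicate (source, check, severity) events into one incident per
    (source, check), keep the worst severity seen, and return the
    incidents sorted with the most severe first."""
    incidents: Dict[Tuple[str, str], Incident] = {}
    for source, check, severity in events:
        key = (source, check)
        existing = incidents.get(key)
        if existing is None:
            incidents[key] = Incident(source, check, severity)
            continue
        existing.count += 1
        if SEVERITY_RANK[severity] < SEVERITY_RANK[existing.severity]:
            existing.severity = severity
    return sorted(incidents.values(), key=lambda i: SEVERITY_RANK[i.severity])
```

Three repeated CPU events from the same host collapse into a single incident here, which is the noise reduction the enrichment step is after.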

With a solid monitoring framework, ITOps has the tools and visibility to be proactive in their operations and faster in their response to incidents. Overall, the goal is not to overwhelm ITOps with alerts, but to generate and detect the critical alerts that need immediate action.

Noise costs money in terms of personnel cost, productivity loss, downtime, and even lost revenue. By ensuring you have the right monitoring framework in place and an incident management platform that centralizes, classifies, and enriches events, you can avoid the “visibility and noise” paradox.