PagerDuty Blog

Putting the “Action” in Actionable Intelligence

AIOps combines machine learning and people to deliver technical outcomes in IT operations. The promise of this capability continues to draw new contenders to the market. AIOps has become a core messaging component for all the major event management players, many of which have simply rebranded their products to highlight AIOps features. Emerging event management players have arrived and also tried to claim the AIOps space. Nearly all the observability and APM vendors have done the same, each claiming to be the AIOps tool of choice.

What Is AIOps? 

But let’s take a step back and be real for just a minute. AIOps is not a tool; it is a set of capabilities. Accordingly, AIOps as a product set is difficult to define, akin to claiming your tool is the DevOps tool of choice. Even industry-leading analysts disagree on what the core approach of AIOps should be and what its specific heuristics are. Despite these discrepancies, we, as practitioners, can safely bucket the vast majority of AIOps solutions into two core camps.

Option #1: Application Monitoring 

Application monitoring tools own the first camp. This monitoring-centric approach takes in metrics, KPIs, logs, and so on, and applies machine learning and trend analysis to make predictions, allowing for smarter alerting sooner. The upside is that by monitoring everything, you can potentially get closer to a root cause. The downside is that you end up either duplicating the monitoring you already have or ripping and replacing large portions of your current toolset to adopt the new one. In addition, instrumenting all your networking, storage, applications, performance monitoring, and more with a single tool can be costly, especially when replacing “good enough” monitoring tools.

Option #2: Event Management

The second approach is event-management led. This group of solutions maintains a domain-agnostic view by integrating disparate monitoring tools, and you end up with a centralized, NOC-style capability that focuses on a single-pane-of-glass outcome. This approach promises to centralize all the disparate information, ideally to support better decisions. However, you may end up with a bottleneck, because rules can only be updated in one centralized place. Moreover, sizing the solution can be difficult, as vendors charge on different metrics such as peak usage, average daily usage, number of nodes, or number of event sources.

What both approaches leave out is that even if you can get to the “perfect” root cause, the “now what?” is missing. How do you fix the problem? Teams that use these solutions are still left with the critical questions that matter in the actual firefight. What service is affected? Who owns that service? Who is on call for it? What diagnostics are required? What automation can be deployed?

Without these answers, restoring service can be painful.
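As a rough illustration, the questions above can be treated as fields that must be filled in before a responder can act. The sketch below is only an assumption about how such a record might look; `IncidentContext`, its field names, and `unanswered` are hypothetical and are not part of any PagerDuty API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class IncidentContext:
    """Hypothetical record of the answers a responder needs."""
    affected_service: Optional[str] = None   # What service is affected?
    owning_team: Optional[str] = None        # Who owns that service?
    on_call: Optional[str] = None            # Who is on call for it?
    diagnostics: List[str] = field(default_factory=list)   # What diagnostics are required?
    automations: List[str] = field(default_factory=list)   # What automation can be deployed?

    def unanswered(self) -> List[str]:
        """Return the names of fields that are still empty, in declaration order."""
        return [name for name, value in vars(self).items() if not value]


# A root cause alone leaves most of the record empty.
ctx = IncidentContext(affected_service="checkout", on_call="alice")
print(ctx.unanswered())
```

Any field reported by `unanswered` is a question someone still has to chase down by hand during the incident.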

A Better AIOps Solution 

PagerDuty is taking up this challenge to solve the real-time work issue most AIOps solutions ignore. We help reduce noise, create the context to isolate the root cause, and drive automation to reduce toil and restore service. With PagerDuty, teams can leverage a full-service ownership approach to help the builders and innovators drive solutions to market faster than their competitors and iterate value for their clients. Rather than any kind of rip and replace, we leverage the tools, teams, and capabilities you already have in place to quickly help you with tactical operational wins while supporting you as you build broader strategic advantages for Digital Transformation.

Automation-First Approach

Our automation-first approach can transform how your teams work today by using Rundeck, our runbook orchestration platform, as your first responder. Using Rundeck, teams can often resolve issues without ever mobilizing a team. This automated resolution can greatly improve MTTR, but just as importantly, it lets your subject matter experts stay focused on their day jobs. If automation can’t immediately resolve the issue, our automated diagnostics can create context for first responders so they can understand affected services, customer impact, and SLA implications. That way, they can gather information from logs, scripts, and procedures to guide their response. All of this creates a comprehensive audit trail that improves post-mortems and ITSM problem management, helping to avoid similar issues in the future.
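The flow described above (try automation first, fall back to diagnostics, record everything) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not how Rundeck or PagerDuty implement it: `respond`, the alert dictionary, and the example callables are all hypothetical names invented here.

```python
from typing import Callable, Dict, List

Alert = Dict[str, str]


def respond(alert: Alert,
            remediations: List[Callable[[Alert], bool]],
            diagnostics: List[Callable[[Alert], str]]) -> Dict[str, object]:
    """Try each automated fix; if none works, gather context for a human."""
    audit: List[str] = []
    for fix in remediations:
        resolved = fix(alert)
        audit.append(f"remediation {fix.__name__}: {'resolved' if resolved else 'failed'}")
        if resolved:
            # Automation acted as the first responder; nobody is paged.
            return {"resolved": True, "page_human": False, "context": [], "audit": audit}
    context: List[str] = []
    for check in diagnostics:
        context.append(check(alert))
        audit.append(f"diagnostic {check.__name__}: collected")
    # Automation could not fix it; hand the responder the gathered context.
    return {"resolved": False, "page_human": True, "context": context, "audit": audit}


# Hypothetical runbook steps, for illustration only.
def restart_service(alert: Alert) -> bool:
    return alert.get("symptom") == "hung process"


def recent_errors(alert: Alert) -> str:
    return f"recent errors for {alert['service']}"


result = respond({"service": "checkout", "symptom": "disk full"},
                 [restart_service], [recent_errors])
print(result["page_human"], result["audit"])
```

The `audit` list stands in for the audit trail: every attempted fix and every diagnostic run is recorded whether or not a human is eventually paged.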

Our platform exposes its configuration through an API, which lets larger organizations or multiple teams manage their own setup through self-service. Rather than depending on a centralized team to update rules or manage configurations, administrators can use repositories and tools such as Terraform to ensure teams quickly get the updates they need without the gridlock of centralized-only capabilities.

We believe an automation-first, data-driven, self-service approach that brings teams and machine learning together to fix problems, rather than just find the root cause, delivers on the true promise of AIOps. By building on your good-enough and best-of-breed monitoring where appropriate, this domain-agnostic approach allows you to focus on getting the right information to the right people at the right time when seconds count. By putting the action in actionable intelligence, we can reduce noise and alert fatigue, enable first responders to fix problems, reduce toil, and allow builders and innovators to deliver new capabilities instead of just chasing incidents.