PagerDuty Blog

A Closer Look at PagerDuty’s New AIOps Capabilities

Another PagerDuty Summit is in the books, and we’re still coming down from the excitement and energy our customers and community showed us over the past week. We made several big announcements over the course of the conference, but none more significant than the AIOps advancements on our digital operations platform.

We introduced a number of ways customers can apply machine learning algorithms and automation to a wide range of workflows across the platform. From noise reduction and root-cause analysis to auto-remediation and advanced analytics, our release strives to make IT teams more efficient by applying AIOps to reduce complexity and human labor at a time when organizations are trying to do more without adding resources.

PagerDuty is also moving to reduce the fatigue that inevitably ensues when managing increasingly complex IT environments—fatigue that has been accentuated by the global pandemic. AIOps is key in determining the relationship between the thousands of alerts that all the elements of an IT environment can now generate. The goal is to provide IT teams with more context and actionable intelligence.

Now, let’s dive into the details of PagerDuty’s latest AIOps capabilities.

Innovation Deep-Dive

Intelligent Recommendations

Nothing is more important than the health and sanity of your team. But in today’s fast-moving and complicated IT landscape, environmental triggers can work against this imperative by causing fatigue and burnout. Intelligent Recommendations use machine learning to suggest actions to reduce noise and improve team efficiency and health, while also providing the projected ROI results of adopting the prescribed actions.

  • Noise Reduction Recommendations automatically identify services suffering from alert noise, diagnose the cause, and give responders and service owners tailored recommendations for reducing noise that isn’t important. PagerDuty has found that, by implementing noise reduction recommendations, customers can see up to a 67% decrease in alerts and incidents on average. That’s 67% fewer false alarms and less wasted work!
  • Team Health Recommendations. Teams are stretching more than ever to keep businesses online, but fatigued responders make mistakes. Improve employee health and keep your on-call team fresh by surfacing late-night or off-hours work to the appropriate team or manager, and sending a shift override recommendation for the responder.
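
To make the idea of surfacing off-hours work concrete, here is a minimal Python sketch. It is not PagerDuty's implementation: the off-hours window, the `(responder, timestamp)` input shape, and the function names are all assumptions chosen for illustration.

```python
from collections import Counter
from datetime import datetime

# Assumed definition of "off-hours": 20:00 through 07:59 local time.
OFF_HOURS_START = 20
OFF_HOURS_END = 8


def is_off_hours(ts: datetime) -> bool:
    """True when the timestamp falls inside the assumed off-hours window."""
    return ts.hour >= OFF_HOURS_START or ts.hour < OFF_HOURS_END


def off_hours_load(pages):
    """Count off-hours pages per responder.

    `pages` is an iterable of (responder, datetime) pairs (hypothetical shape).
    """
    return Counter(responder for responder, ts in pages if is_off_hours(ts))


def override_candidates(pages, threshold=2):
    """Responders whose off-hours page count meets the (illustrative) threshold."""
    return sorted(r for r, n in off_hours_load(pages).items() if n >= threshold)
```

A manager-facing report could list the output of `override_candidates` as the responders who might benefit from a shift override.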

 

Curated Advanced Analytics and Maturity Model Planning

Expanding on the broad and powerful analytics API released by PagerDuty to open our rich data set to customer queries, our latest enhancements surface the most useful and widely used analytic insights directly to our users in an easy-to-use interface. Once specific queries and reports are identified, regular scheduling of reports can be configured to keep diverse stakeholders informed.

  • PagerDuty Analytics Lab extracts insights from PagerDuty’s deep dataset for personalized analytics to answer a myriad of questions (e.g., What was the cost of the last incident? Which incidents affected my resolution time?). In addition, we’ve codified our Maturity Model and benchmark data from over 13,000 customers to help users gauge where their businesses are in their digital journey, and to show them how they can improve their maturity by taking action on optimization recommendations.

You can automate report creation and share analytics where your team works through our Slack integration. Reports available include:

    • Critical and Most Impacting Incidents
    • Service Health and Optimization
    • Operational Cost Efficiency Opportunities
    • Team Health and Optimization
    • Business Impact Analysis

  • On-Call Readiness Analytics helps properly set up teams so they can successfully deliver on their on-call responsibilities. Use this report to improve your on-call posture and track your progress towards organizational readiness. Proper team assembly can reduce resolution times and improve your organization’s ability to respond to incidents.

Dynamic Service Dependencies

The key to PagerDuty’s service-oriented approach is that our unique, real-time Service Directory is up-to-date and accurate. We’ve made major improvements to our Service Directory by streamlining how dependency information is captured and updated. Now you can automatically surface upstream and downstream dependencies to speed up issue resolution, reduce work duplication, and prevent future incidents. We’ve also automated the chore of keeping dependency information in your Service Directory fresh with machine learning recommendations and bi-directional integrated service data from key partners so that your team can operate from one source of truth.

  • User-Defined Dependencies can be rapidly mapped and defined in a streamlined interface. Now you can track upstream and downstream technical service and business service dependencies and relations between them, with low-effort maintenance.
  • Automated dependency-awareness capabilities suggest relevant dependencies via machine learning and highlight them directly within the Incident Details page. During active incident triage, this dependency information helps you avoid dead ends, collaborate quickly, and take the right actions to resolve issues. PagerDuty has released several variations of this functionality, including:

  • ServiceNow Integration v7 helps customers strengthen their integration and get more value from their investments in ServiceNow and PagerDuty by leveraging new, bi-directional functionality like running a PagerDuty response play in ServiceNow, or posting a Call to Action from ServiceNow to the PagerDuty Incident details page to provide users with live status updates. Additionally, both business and technical service dependencies from ServiceNow’s CMDB can be shared with PagerDuty’s service directory, allowing teams to more clearly understand the impact of incidents and identify critical services.
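
As a rough illustration of how dependency information can be walked in both directions during triage, the sketch below models services as a small directed graph. The service names, the `DEPENDS_ON` map, and the function names are hypothetical; this is not how the Service Directory is implemented.

```python
from collections import deque

# Hypothetical dependency map: each service lists the services it depends on.
DEPENDS_ON = {
    "checkout": ["payments", "cart"],
    "payments": ["database"],
    "cart": ["database"],
    "database": [],
}


def _walk(start, edges):
    """Breadth-first walk returning every node reachable from `start`."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def depends_on(service, graph=DEPENDS_ON):
    """Everything `service` relies on, directly or transitively."""
    return _walk(service, graph)


def dependents_of(service, graph=DEPENDS_ON):
    """Everything that relies on `service`, directly or transitively."""
    reverse = {}
    for svc, deps in graph.items():
        for dep in deps:
            reverse.setdefault(dep, []).append(svc)
    return _walk(service, reverse)
```

With a map like this, an incident on `database` points responders at every service that may be affected, while an incident on `checkout` points at every service worth checking as a possible cause.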

Change Impact Mapping

An estimated 80% of incidents are caused by changes. That’s why PagerDuty has fully integrated change events from the software delivery process (CI/CD pipelines) and code repositories, enabling visibility across changes to better understand their impacts. Leverage this real-time context to immediately identify where changes have caused failures and predict what risk future changes could have across critical business services.

  • Change Investigation for Incident Resolution reduces resolution times by helping DevOps responders understand which changes likely caused or contributed to an issue. They can use contextual information about recent software or configuration changes to diagnose and potentially prevent a problem from getting worse, or take swift next steps to coordinate an effective response.
  • Change Events Integrations with GitHub, Puppet, and Evolven give PagerDuty customers a simpler way to ingest change events from their software delivery pipeline.
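
One simple way to picture change investigation is a time-window lookup: given an incident, list the changes that landed on the affected service shortly beforehand. The Python sketch below assumes a hypothetical record shape (`service`, `timestamp`, `summary`) and a one-hour default window. It illustrates the idea only and makes no claim about how PagerDuty ranks or correlates changes.

```python
from datetime import datetime, timedelta


def recent_changes(changes, incident_service, incident_time,
                   window=timedelta(hours=1), related_services=()):
    """Return changes that landed shortly before an incident, newest first.

    `changes` is a list of dicts with "service", "timestamp", and "summary"
    keys (a hypothetical shape). `related_services` can carry dependencies
    of the affected service so their changes are considered too.
    """
    services = {incident_service, *related_services}
    hits = [
        c for c in changes
        if c["service"] in services
        and timedelta(0) <= incident_time - c["timestamp"] <= window
    ]
    return sorted(hits, key=lambda c: c["timestamp"], reverse=True)
```

Passing the affected service's dependencies as `related_services` widens the search to changes that may have caused the failure indirectly.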

Flexible Automation Controls

Applying AI and automation to something as critical as a company’s digital operations requires complete trust. That’s why we have created flexible automation controls to safely ensure that a human is in control at all times. PagerDuty suggests where automation may help, and can ensure no humans are interrupted when they shouldn’t be, but alerts them when needed. This provides enterprises with a trustworthy way to integrate and accelerate automation across the organization. Furthering our commitment to user-friendly automation, PagerDuty announced a definitive agreement to acquire Rundeck—a leading provider of runbook automation for the enterprise—in a move that will make auto-remediation more accessible to customers looking to automate their incident response processes.

  • Paused Incident Notifications reduce operational noise by delaying triggers, giving machines a chance to auto-remediate before notifying responders. Customers can keep an audit trail of triggers and actions, regardless of whether a responder was notified.
  • Event-Triggered Webhooks give response teams a way to resolve incidents faster and with fewer resources through push-button automation, using event rules to trigger external processes and workflows. Users can also monitor and track the state of automation sequences triggered on a service.
  • Dynamic Field Enrichment and Extraction helps DevOps engineers normalize alert content to fit their unique terminology and formatting requirements. This can improve the outcome of reporting and analytics, and helps remove barriers to adopting resource-saving functionality like intelligent and content-based alert grouping.
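
To show what normalizing alert content can look like in practice, here is a small Python sketch of field extraction and enrichment. The `key=value` summary format, the rule set, and the alias table are assumptions made for the example. PagerDuty's own rule syntax is not shown here.

```python
import re

# Hypothetical rule set: target field name -> regex with one capture group.
EXTRACTION_RULES = {
    "host": re.compile(r"host=([\w.-]+)"),
    "env": re.compile(r"env=(\w+)"),
}

# Hypothetical alias table mapping team shorthand to one canonical term.
ENV_ALIASES = {"prod": "production", "prd": "production", "stg": "staging"}


def enrich(alert):
    """Return a copy of `alert` with fields extracted from its summary.

    Extracted environment names are lowercased and mapped to a canonical
    value so that later grouping and reporting see consistent terminology.
    """
    enriched = dict(alert)
    summary = alert.get("summary", "")
    for field, pattern in EXTRACTION_RULES.items():
        match = pattern.search(summary)
        if match:
            enriched[field] = match.group(1)
    if "env" in enriched:
        env = enriched["env"].lower()
        enriched["env"] = ENV_ALIASES.get(env, env)
    return enriched
```

Once alerts from different tools carry the same `host` and `env` fields, content-based grouping has consistent values to match on.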

If your team could benefit from any of these enhancements, be sure to check out our free trial or sign up to gain early access to the new features.