Dutonian Story

DataOps Team reduced noise by using intelligent grouping in AIOps

Learn how one team reduced incidents by 37% and supported the workload of 20-30 FTEs with just 10 FTEs by leveraging intelligent alert grouping and auto-incident creation.

Phase 1

The Challenge

How They Were Working

The DataOps team was inundated with PagerDuty incidents from multiple critical monitoring sources and responded to as many as it could with the limited resources it had.

Before workflow diagram

Pain Points

Incident Volume

Incident volume was very high because duplicate incidents were being created. Reporting on past incidents was inaccurate, which made learning from them harder.

Increased Time to Resolve

The team spent time diagnosing, grouping, and resolving duplicate incidents.

Data Inaccuracies

Data used by end-user teams could be inaccurate when a problem in the data sync took longer to identify and resolve.

Key Challenge

Managing a growing number of incidents from multiple monitored data sources with limited resources.

Phase 2

The Solution

What They Did

  1. Set up intelligent grouping in AIOps and auto-incident creation
  2. Configured rules to group alerts into incidents
  3. Configured rules to suppress alerts
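To make the idea of grouping and suppression rules more concrete, here is a minimal sketch in Python. It is not PagerDuty's intelligent grouping algorithm or its API: it swaps in a simple rule (same source within a time window) for illustration only. The names `Alert`, `Incident`, `group_alerts`, `suppress_terms`, and `window_seconds` are all hypothetical, as are the sample source names.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    source: str       # hypothetical monitoring source name
    summary: str      # free-text alert summary
    timestamp: float  # seconds since an arbitrary start


@dataclass
class Incident:
    key: str
    alerts: list = field(default_factory=list)


def is_suppressed(alert, suppress_terms):
    """Return True if the alert summary contains any suppression term."""
    text = alert.summary.lower()
    return any(term in text for term in suppress_terms)


def group_alerts(alerts, suppress_terms=(), window_seconds=300):
    """Group alerts from the same source into one incident when each alert
    arrives within window_seconds of the previous one in that incident.
    Suppressed alerts never create or join an incident."""
    incidents = []
    latest_by_key = {}  # grouping key -> most recent incident for that key
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if is_suppressed(alert, suppress_terms):
            continue
        key = alert.source  # assumption: group purely by source
        incident = latest_by_key.get(key)
        if incident and alert.timestamp - incident.alerts[-1].timestamp <= window_seconds:
            incident.alerts.append(alert)
        else:
            incident = Incident(key=key, alerts=[alert])
            incidents.append(incident)
            latest_by_key[key] = incident
    return incidents


sample_alerts = [
    Alert("sync-a", "Row count mismatch", 0),
    Alert("sync-a", "Row count mismatch", 60),
    Alert("sync-b", "Heartbeat info", 70),
    Alert("sync-a", "Row count mismatch", 1000),
]
sample_incidents = group_alerts(sample_alerts, suppress_terms=("heartbeat",))
```

In this sketch the two early `sync-a` alerts share one incident, the heartbeat alert is dropped, and the late `sync-a` alert opens a new incident because it falls outside the window.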

Phase 3

The Results

How They're Working Now

After workflow diagram

With intelligent grouping and alert suppression in place, the DataOps team now focuses on strategic initiatives while AIOps automatically manages incident noise and grouping.

Team Testimonials

With PagerDuty, we are able to increase workload without increasing resources

— DataOps Team Member

Wins

Productivity Increased

Allowed the team to shift its focus from reactive operational tasks to proactive, strategic development activities, including GenAI.

Risk Avoided

Proactively identified and resolved data quality issues, improving overall reliability and customer trust.

Time Saved

A team of 10 FTEs supports a workload that would normally require a minimum of 20-30 FTEs.

By The Numbers

37%

Incidents Reduced

Intelligent Grouping reduced incidents by 37% in the first 6 months of usage.

2-3x

Resource Efficiency

A team of 10 FTEs supports a workload that would normally require 20-30 FTEs.

Increased workload capacity

Able to increase workload without increasing resources.

Improved data reliability

Proactively identified and resolved data quality issues, improving customer trust.

Lessons Learned & Tips

  • Group and ungroup incidents to teach the intelligent alert grouping algorithm proper grouping behavior
  • Dedicate time every week to review past incidents and identify noisy, non-actionable events to suppress
  • Use incident urgencies to separate critical actionable events from non-critical ones so that operational work can be prioritized
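The weekly review tip above could be supported by a small script like the following. This is only a sketch under stated assumptions: it assumes past events have already been exported as dictionaries with hypothetical `source` and `summary` fields, and the function name `noisiest_events` is illustrative. It simply counts repeats so the most frequent events can be reviewed as suppression candidates.

```python
from collections import Counter


def noisiest_events(events, top_n=3):
    """Count events by (source, summary) and return the most frequent ones.

    events: iterable of dicts with hypothetical 'source' and 'summary' keys.
    """
    counts = Counter((e["source"], e["summary"]) for e in events)
    return counts.most_common(top_n)


past_events = [
    {"source": "sync-a", "summary": "Disk usage info"},
    {"source": "sync-a", "summary": "Disk usage info"},
    {"source": "sync-b", "summary": "Row count mismatch"},
    {"source": "sync-a", "summary": "Disk usage info"},
]
review_candidates = noisiest_events(past_events, top_n=2)
```

Whether a frequent event is actually non-actionable remains a judgment for the team; the count only points to where to look first.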

Ready to reduce alert noise and improve efficiency?

Start your free trial today and see the difference.
