Dutonian Story

How PagerDuty's AI Data Engineering Team Cut On-Call Incidents by 30%

Learn how one team used PagerDuty's analytics and automation features to reduce alert noise, eliminate manual snoozing, and make on-call shifts actually manageable.

Phase 1

The Challenge

How They Were Working

The team was dealing with a high volume of incidents, and many of the alerts behind them were not actionable.

[Diagram: the team's workflow before the changes]

Pain Points

High incident count

The on-call engineer received urgent notifications for many incidents. Some were actionable, but many could safely be ignored.

Manual snoozing

The on-call engineer manually snoozed incidents and then watched them to see whether they would resolve on their own.

Unscalable on-call load

The on-call engineer had to manage a growing number of incidents as new services were deployed.

Key Challenge

Managing a high alert volume while continuously adding more services and delivering on new features.

Phase 2

The Solution

What They Did

1. Pull an Insights >> Incidents report filtered by team.
2. Include the created at, resolved at, acknowledged by, TTR (time to resolve), resolved by, and auto-resolve columns.
3. Download the report as a CSV file.
4. Filter the report to surface potentially unactionable incidents:
  • Incidents that were never acknowledged or resolved by a human
  • Incidents that auto-resolved within 2-5 minutes
  • Incidents from "staging" or "test" environments
5. Implement configuration changes and measure the results.
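To make step 4 concrete, here is a minimal Python sketch that reads the exported CSV and flags rows matching the three filters. It assumes hypothetical column headers (`incident_id`, `service_name`, `title`, `created_at`, `resolved_at`, `acknowledged_by`, `resolved_by`, `auto_resolved`) and ISO 8601 timestamps; the real export's headers and formats may differ, so adjust the constants to match your file. Treating an empty `acknowledged_by` combined with an auto-resolve or an empty `resolved_by` as "no human involvement" is also an assumption.

```python
import csv
import io
import re
from datetime import datetime, timedelta

# Hypothetical column headers; match them to the header row of your own export.
ID, SERVICE, TITLE = "incident_id", "service_name", "title"
CREATED, RESOLVED = "created_at", "resolved_at"
ACKED_BY, RESOLVED_BY, AUTO_RESOLVED = "acknowledged_by", "resolved_by", "auto_resolved"

QUICK_RESOLVE = timedelta(minutes=5)  # upper end of the 2-5 minute window
# Whole-word match so that e.g. "latest" does not count as "test".
NON_PROD = re.compile(r"\b(staging|test)\b", re.IGNORECASE)

def parse_ts(value):
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' on older Pythons."""
    return datetime.fromisoformat(value.strip().replace("Z", "+00:00"))

def unactionable_reasons(row):
    """Return the step-4 filters a single incident row matches (empty if none)."""
    reasons = []
    auto = row[AUTO_RESOLVED].strip().lower() == "true"
    # Assumed signal for "no human involvement": nobody acknowledged it, and it
    # either auto-resolved or has no resolver recorded.
    if not row[ACKED_BY].strip() and (auto or not row[RESOLVED_BY].strip()):
        reasons.append("no human ack/resolve")
    if auto and row[RESOLVED].strip():
        if parse_ts(row[RESOLVED]) - parse_ts(row[CREATED]) <= QUICK_RESOLVE:
            reasons.append("auto-resolved quickly")
    if NON_PROD.search(f"{row[SERVICE]} {row[TITLE]}"):
        reasons.append("non-production")
    return reasons

def flag_unactionable(csv_file):
    """Yield (incident_id, reasons) for every row matching at least one filter."""
    for row in csv.DictReader(csv_file):
        reasons = unactionable_reasons(row)
        if reasons:
            yield row[ID], reasons

# Made-up sample rows for illustration only.
sample = io.StringIO(
    "incident_id,service_name,title,created_at,resolved_at,acknowledged_by,resolved_by,auto_resolved\n"
    "P1,etl-prod,Disk full,2024-05-01T10:00:00Z,2024-05-01T11:00:00Z,alice,alice,false\n"
    "P2,etl-prod,Lag spike,2024-05-01T12:00:00Z,2024-05-01T12:03:00Z,,,true\n"
    "P3,etl-staging,Job failed,2024-05-01T13:00:00Z,2024-05-01T14:00:00Z,bob,bob,false\n"
)
for incident_id, reasons in flag_unactionable(sample):
    print(incident_id, reasons)
# P2 ['no human ack/resolve', 'auto-resolved quickly']
# P3 ['non-production']
```

Grouping the flagged rows by service and title is then a quick way to decide which of the configuration options below fits each noisy alert.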

Configuration Options Available

Option 1: Configure an alert suppression rule if the alert should never notify the on-call but should still be kept in PagerDuty (PD) as a suppressed alert record for auditing.
Option 2: Configure an alert drop rule if the alert should never notify the on-call and its details should not be stored in PD at all.
Option 3: Configure an alert thresholding rule if the alert should notify the on-call only when it triggers a set number of times within a specific time window.
Option 4: Configure an alert pausing rule if the alert should be held back at first and only trigger an incident if it does not resolve itself within x minutes.
Option 5: Enable the service-level alert pause setting to automatically pause alerts that have historically self-resolved within a short period.
Option 6: Configure alert grouping rules to automatically group similar alerts into a single incident.
Option 7: Configure rules to lower the urgency of high-urgency incidents that do not need an immediate response.
Option 8: Adjust monitors or thresholds on the monitoring side, and bring more contextual data into alerts.
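Options 3 and 4 are easiest to reason about as timing rules. The sketch below replays a list of alert events locally to show which ones would notify under each rule. It illustrates the logic under stated assumptions and is not PagerDuty's implementation or API; PagerDuty evaluates these rules itself once they are configured. The event tuples and the `count`, `window`, and `pause` parameters are hypothetical, and times are plain seconds.

```python
from collections import defaultdict, deque

def threshold_notifications(events, count, window):
    """Option 3 sketch: notify only when one alert key triggers `count` times
    within `window` seconds (window > 0). `events` is a time-sorted list of (ts, key)."""
    recent = defaultdict(deque)  # alert key -> trigger times still inside the window
    notifications = []
    for ts, key in events:
        times = recent[key]
        times.append(ts)
        while ts - times[0] >= window:  # forget triggers that fell out of the window
            times.popleft()
        if len(times) >= count:
            notifications.append((ts, key))
            times.clear()  # start a fresh count after notifying
    return notifications

def paused_incidents(events, pause):
    """Option 4 sketch: hold each trigger for `pause` seconds and raise an incident
    only if no resolve for the same key arrives first. `events` is a time-sorted
    list of (ts, key, action) where action is "trigger" or "resolve"."""
    pending = {}  # alert key -> time its first unresolved trigger arrived
    incidents = []
    for ts, key, action in events:
        for held_key, start in list(pending.items()):
            if ts - start >= pause:  # pause expired without a resolve
                incidents.append((start + pause, held_key))
                del pending[held_key]
        if action == "trigger":
            pending.setdefault(key, ts)  # a repeat trigger keeps the original start
        elif action == "resolve":
            pending.pop(key, None)  # self-resolved while paused: no incident
    # Triggers never resolved in the log also become incidents once their pause ends.
    incidents.extend((start + pause, k) for k, start in pending.items())
    return sorted(incidents)

flappy = [(0, "cpu"), (30, "cpu"), (200, "cpu"), (210, "cpu"), (220, "cpu")]
print(threshold_notifications(flappy, count=3, window=60))  # [(220, 'cpu')]

log = [(0, "lag", "trigger"), (120, "lag", "resolve"), (400, "disk", "trigger")]
print(paused_incidents(log, pause=300))  # [(700, 'disk')]
```

In the pause example, the "lag" alert resolves itself after two minutes and never reaches the on-call, which is the manual snoozing behavior the team wanted to automate.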
Phase 3

The Results

How They're Working Now

[Diagram: the team's workflow after the changes]

With automated alert management and intelligent filtering, the team now focuses only on actionable incidents.

Team Testimonials

As someone who previously would have asked 'why does it matter how we have our alerts set up?', I will happily eat my words: being on-call no longer weighs me down throughout my shift, and responding when there is an actual incident is much easier.

— Data Engineer, On-Call Team Member

My on-call experience has been so much quieter that at first, I kept checking my phone, thinking I had to have missed an alert or something!

— Data Engineer, On-Call Team Member

After implementing these changes, I feel like I barely need to own a smartphone and my wife no longer dreads my upcoming shifts.

— Data Engineer, On-Call Team Member

Wins

Lower Incident Count

The on-call only gets urgently notified about actionable issues requiring intervention.

Automated Snoozing

Self-resolving incidents are automatically snoozed and only notify the on-call if they are not resolved within a certain period of time.

Manageable On-Call Load

Newly provisioned services inherit previously configured rules to ensure known noisy alerts are deprioritized or suppressed.

By The Numbers

30%

Fewer Incidents

Significant reduction in incident volume through intelligent alert management.

67%

Reduction in MTTA

Mean time to acknowledge dropped dramatically with better alert quality.

75%

Decrease in High-Urgency Incidents

Fewer false alarms mean responders can focus on what truly matters.

Improved Alert Quality

Better content and context in alerts helps teams respond more effectively.

Reduced Manual Work

Less time spent snoozing and monitoring alerts that resolve themselves.

Lessons Learned & Tips

  • Review monitors to ensure the right content is included in alerts
  • Rerun the analysis weekly to keep catching newly noisy incidents
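For the weekly review, one option is to track, per service and per week, what share of incidents auto-resolved within a few minutes; a service whose share stays high is a natural candidate for the pause settings in options 4 and 5. This Python sketch assumes hypothetical CSV headers (`service_name`, `created_at`, `resolved_at`, `auto_resolved`), ISO 8601 timestamps, and a 5-minute cutoff taken from the upper end of the 2-5 minute filter.

```python
import csv
import io
from collections import Counter
from datetime import datetime, timedelta

QUICK_RESOLVE = timedelta(minutes=5)  # assumed cutoff for "self-resolved quickly"

def parse_ts(value):
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z'."""
    return datetime.fromisoformat(value.strip().replace("Z", "+00:00"))

def weekly_self_resolve_rates(csv_file):
    """Map (ISO year, ISO week, service) to the share of incidents that auto-resolved quickly.

    Assumes hypothetical headers: service_name, created_at, resolved_at, auto_resolved.
    """
    totals, quick = Counter(), Counter()
    for row in csv.DictReader(csv_file):
        created = parse_ts(row["created_at"])
        year, week, _ = created.isocalendar()
        key = (year, week, row["service_name"])
        totals[key] += 1
        auto = row["auto_resolved"].strip().lower() == "true"
        if auto and row["resolved_at"].strip():
            if parse_ts(row["resolved_at"]) - created <= QUICK_RESOLVE:
                quick[key] += 1
    return {key: quick[key] / totals[key] for key in totals}

# Made-up sample rows for illustration only.
sample = io.StringIO(
    "incident_id,service_name,created_at,resolved_at,auto_resolved\n"
    "P1,etl-prod,2024-05-01T10:00:00Z,2024-05-01T11:00:00Z,false\n"
    "P2,etl-prod,2024-05-01T12:00:00Z,2024-05-01T12:03:00Z,true\n"
)
print(weekly_self_resolve_rates(sample))  # {(2024, 18, 'etl-prod'): 0.5}
```

Comparing these rates week over week shows whether newly deployed services are adding noise that the existing rules do not yet catch.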

Ready to transform your on-call experience?

Start your free trial today and see the difference.
