Dutonian Story

Notifications Management automates the process of diagnosing DLQ alerts

Learn how one team eliminated manual steps and improved work-life balance by automating DLQ message retrieval through incident workflows and AWS Lambda integration.

Phase 1

The Challenge

How They Were Working

The Notifications Management team relied on a manual, multi-step process to diagnose DLQ (Dead Letter Queue) alerts. The on-call engineer had to log in to AWS, navigate through multiple screens, and manually retrieve message details to understand and diagnose incidents.

Before workflow diagram

Pain Points

Manual toil

The on-call had to execute multiple manual steps to reach the information they needed to diagnose the incident.

Reduced work-life balance

The person on-call was required to be at their laptop to log in to AWS, causing disruption to those on-call during the evenings.

Less time spent on higher priority work

The time spent manually retrieving DLQ messages distracted the on-call from focusing on high priority work and projects.

Key Challenge

Responding to repetitive alerts with repetitive diagnostic steps without taking time away from on-call engineers.

Phase 2

The Solution

What They Did

1. Create an AWS Lambda function that retrieves DLQ messages

2. Create an incident workflow that:

  • Triggers based on the title of the incident on a specific service
  • Invokes the AWS Lambda function
  • Uses JavaScript to format the Lambda function response
  • Sends a POST API request to add the formatted response as a note on the incident

3. Configure the Slack integration to post incident notes to a Slack channel
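
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the team's actual code: it assumes the DLQ is an Amazon SQS queue read with boto3, and the queue URL, the function names (`fetch_dlq_messages`, `format_note`, `build_note_request`), and the note layout are all hypothetical. The team's workflow formats the response in JavaScript; the same logic is shown here in Python so the whole sketch stays in one language. The notes endpoint and body shape should be verified against the PagerDuty REST API reference, and authentication headers are left out.

```python
import json
from typing import Any, Dict, List

# Hypothetical placeholder; the real queue URL is not given in the story.
DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-dlq"


def fetch_dlq_messages(sqs: Any, queue_url: str, limit: int = 10) -> List[Dict[str, str]]:
    """Read up to `limit` messages from the queue without deleting them."""
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=min(limit, 10),  # SQS returns at most 10 per call
        VisibilityTimeout=0,  # messages stay visible to other readers
        WaitTimeSeconds=0,
    )
    return [
        {"id": m["MessageId"], "body": m["Body"]}
        for m in response.get("Messages", [])
    ]


def lambda_handler(event: Dict[str, Any], context: Any, sqs: Any = None) -> Dict[str, Any]:
    """Lambda entry point: return the DLQ messages as a JSON body."""
    if sqs is None:
        import boto3  # imported lazily so the sketch can be exercised without AWS

        sqs = boto3.client("sqs")
    queue_url = event.get("queue_url", DLQ_URL)
    messages = fetch_dlq_messages(sqs, queue_url)
    return {"statusCode": 200, "body": json.dumps({"messages": messages})}


def format_note(messages: List[Dict[str, str]]) -> str:
    """Turn the Lambda response into plain text suitable for an incident note."""
    if not messages:
        return "No messages found in the DLQ."
    lines = [f"{len(messages)} DLQ message(s):"]
    for m in messages:
        lines.append(f"- {m['id']}: {m['body']}")
    return "\n".join(lines)


def build_note_request(incident_id: str, note: str) -> Dict[str, Any]:
    """Describe the POST request that would add the note to the incident."""
    return {
        "method": "POST",
        "url": f"https://api.pagerduty.com/incidents/{incident_id}/notes",
        "json": {"note": {"content": note}},
    }
```

Passing the SQS client in as an optional argument keeps the retrieval logic testable with a stand-in client, which fits the advice to try code in a sandbox before rolling it out. Setting the visibility timeout to zero means the function only peeks at messages and leaves them in the queue.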

Phase 3

The Results

How They're Working Now

After workflow diagram

Wins

Reduced manual toil

The on-call spends less time following manual steps to retrieve message details to diagnose the incident.

Improved work-life balance

The on-call can quickly and easily retrieve DLQ messages from the PagerDuty or Slack app without needing to be at their laptop.

More time spent on higher value work

With less time spent on manual work, the on-call is able to focus more time on higher priority incidents and projects.

Outcomes

100%

Eliminated manual steps

Reduced from 6-8 manual steps down to 0 steps through automation.

Faster diagnosis

Instant access to DLQ messages directly in PagerDuty and Slack.

Improved on-call experience

Better work-life balance with mobile access to diagnostic information.

Lessons Learned & Tips

  • Use Dev/QA tools to write code before implementing in production
  • Test code in a sandbox environment to ensure it works as expected before rolling out to the team

Ready to automate your incident diagnosis workflow?

Start your free trial today and see the difference.
