New enhancements to PagerDuty’s SRE Agent: triage faster without waking a human
This blog post is part of PagerDuty’s ongoing series on how we’re helping customers navigate their journey towards autonomous operations. Read on to learn how the recent EA/GA enhancements to PagerDuty’s SRE Agent build towards this vision.
AI promise and AI capabilities often diverge: developers report much faster code production, but little change in how incidents are handled. When the rate of change is faster than ever but the rate of recovery from incidents isn’t moving, developers wind up stuck in firefighting mode. And when these systems fail, it’s costly. According to PagerDuty’s State of AI-First Operations, over a third of surveyed companies report losing $500K per hour of downtime. That isn’t sustainable for the company or its teams.
The way forward is fighting fire with fire and making sure that the AI tools developers use to fix what’s broken match the intelligence of the tools they’re building with. SRE agents have become a new category for teams looking to cut toil and triage times and free up more bandwidth to build.
PagerDuty is announcing enhancements to the PagerDuty Advance SRE Agent that make it even smarter and more powerful. It can automatically conduct triage, kicked off from an incident workflow as a part of a team’s automation. It will be able to use agent connectors, tools, and skills as a triage data source and provide intel to humans before they even look at the incident. And, as people work to resolve the incident, they can interact with the SRE Agent directly on the Incident Details Page. Let’s look at these enhancements and what they mean for a world moving rapidly towards autonomous operations.
Trigger autonomous investigations
During an incident, there are so many competing priorities requiring responder attention. Sometimes, the easiest actions are left undone in the chaos. It can feel like leaving a crucial teammate out of the loop.
To solve this, the SRE Agent will be able to function as a true virtual responder, intelligently triggered via Incident Workflows (available for Early Access) to conduct triage. Users can configure these workflows to automatically engage the SRE Agent the moment an incident triggers, or upon meeting criteria such as priority or severity. This eliminates the delay of waiting for a human to acknowledge an alert and begin manual discovery.
Once automatically activated, the SRE Agent will come to the incident pre-armed with triage data to jumpstart the remediation process. It uses its memory of past incidents to analyze the current state of your systems and pinpoint the root cause. And this can all happen before a responder acknowledges the incident.
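To make the trigger criteria above concrete, here is a minimal Python sketch of the kind of check a workflow could run when an incident is created. It is an illustration only, not PagerDuty’s API: the `Incident` and `TriggerRule` classes, the `should_engage_agent` function, and the P1 to P5 priority ranking are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

# Hypothetical names throughout; this is not the PagerDuty API.
# Assumed priority scheme: a lower rank means a more urgent incident.
PRIORITY_RANK = {"P1": 1, "P2": 2, "P3": 3, "P4": 4, "P5": 5}


@dataclass
class Incident:
    title: str
    priority: Optional[str] = None  # e.g. "P1"; None if no priority is set
    severity: Optional[str] = None  # e.g. "critical"; None if unknown


@dataclass
class TriggerRule:
    # With both fields left as None, the rule matches every new incident.
    min_priority: Optional[str] = None
    severities: Optional[FrozenSet[str]] = None


def should_engage_agent(incident: Incident, rule: TriggerRule) -> bool:
    """Return True when the incident meets every criterion the rule sets."""
    if rule.min_priority is not None:
        rank = PRIORITY_RANK.get(incident.priority or "")
        # Incidents with no recognized priority never satisfy a priority rule.
        if rank is None or rank > PRIORITY_RANK[rule.min_priority]:
            return False
    if rule.severities is not None and incident.severity not in rule.severities:
        return False
    return True
```

With a rule such as `TriggerRule(min_priority="P2", severities=frozenset({"critical"}))`, a P1 critical incident would engage the agent immediately, while a P3 incident would wait for a human.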
Rapid diagnosis with agent connectors, tools, and skills
A standalone AI agent is only as smart as the data it can see. Most teams lose critical time because AI-gathered insights remain siloed in different tools, forcing engineers to manually bridge the gap that agents should be able to fill in.
PagerDuty’s SRE Agent now features new configuration experiences (EA) that make it easy to extend the agent’s capabilities through connectors and tools, with skills planned for Early Access in May.
- Connectors let you plug the SRE Agent into third-party data sources like Grafana, New Relic, Honeycomb, and more via MCP or API — just enter your credentials and authorize.
- Tools enable the agent to retrieve logs and metrics from observability platforms (Splunk, Dynatrace) and pull context from knowledge bases (Confluence, GitHub).
- Skills arm the agent with custom instructions, scripts, and domain expertise, giving it specialized capabilities tailored to your environment.
Together, these allow the SRE Agent to intelligently deduce troubleshooting steps before a human even looks at the incident. Trigger triage automatically via Incident Workflows — the moment criteria are met, the agent pulls data and begins analysis. No swivel chair required.
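One way to picture how connectors, tools, and skills relate is the small Python model below. It is a sketch under stated assumptions rather than PagerDuty’s configuration schema: every class and field name is hypothetical, and the rule that each tool reads through a named connector is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# All names are hypothetical; this models the concepts only.


@dataclass
class Connector:
    name: str       # e.g. "splunk"
    transport: str  # "mcp" or "api"


@dataclass
class Tool:
    name: str
    connector: str                     # assumed: the connector this tool reads through
    fetch: Callable[[str], List[str]]  # service name -> lines of evidence


@dataclass
class Skill:
    name: str
    instructions: str  # custom guidance tailored to your environment


@dataclass
class AgentConfig:
    connectors: Dict[str, Connector] = field(default_factory=dict)
    tools: List[Tool] = field(default_factory=list)
    skills: List[Skill] = field(default_factory=list)

    def add_tool(self, tool: Tool) -> None:
        # Assumption: a tool is only usable once its connector is configured.
        if tool.connector not in self.connectors:
            raise ValueError(f"unknown connector: {tool.connector}")
        self.tools.append(tool)

    def gather_context(self, service: str) -> Dict[str, List[str]]:
        """Run every tool and collect the results in one place."""
        return {tool.name: tool.fetch(service) for tool in self.tools}
```

In this model, `gather_context("checkout")` returns a single dictionary keyed by tool name, which is the "no swivel chair" idea in miniature: evidence from several sources ends up in one view.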
Bringing it all together for humans-in-the-loop
SRE agents can’t replace human experience and problem-solving skills. In many cases where the problem is novel, major, or only partially understood, humans are still needed to bring an incident to a close. In these cases, the SRE Agent acts like a helpful assistant, bringing this triage data, as well as data from other agents like Scribe Agent and Shift Agent, to the table as soon as a responder sits down. It can even go a step further by recommending the right workflow for the incident based on key incident context, reducing the cognitive load for responders and accelerating resolution.
And now, users can access the SRE Agent directly within the Incident Details Page, in addition to interacting with the agent on Slack, Microsoft Teams (EA), or the Operations Console. Wherever a responder needs to work, the SRE Agent is there to support them.
Because the SRE Agent has gathered all this critical triage data the second an incident occurs, it has the time to analyze and suggest remediation steps. From the Incident Details Page, chat, or Operations Console, responders can see suggested remediations laid out in plain language. And to further improve outcomes, users can enhance agent memory either by interacting with the agent or by directly updating the memory via the shared memory API. The human is still in the loop, making the calls. But now humans are informed at the start with critical information from disparate sources, synthesized so that they can make the best decision under challenging circumstances.
A new anatomy of an incident
Let’s take a look at how these pieces fit together to drive real impact for developers. Previously, a spike in error rates would lead to a page, then manual triggering of the SRE Agent, plus digging through data that didn’t neatly map to complex systems.
With the enhanced SRE Agent, the flow might look like this:
- Trigger: A high-severity alert triggers an Incident Workflow.
- Context Gathering: The SRE Agent immediately uses agent connectors and tools to pull logs from Datadog. It notices a specific database query latency spike that matches a pattern from three months ago. With its configured skills, the agent can also build a comprehensive picture of service status from service dependencies in PagerDuty.
- Analysis: Within seconds, the SRE Agent posts a summary to the Incident Details Page: “Detected 15% increase in checkout errors. Correlated to recent DB migration. Found 3 similar incidents in history.”
- Recommendation: The SRE Agent presents a suggested remediation: “Run Database Optimization Workflow.”
- Remediation: The responder clicks the button after a quick gut check. The workflow executes, the latency drops, and the incident is resolved.
In this example, a human was consulted for the major decision: running the remediation. But the time spent away from value-add work was significantly reduced, and the developer was able to get back to building sooner.
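The shape of that flow, with the human approval step kept as an explicit gate, can be sketched in a few lines of Python. This is an illustration, not the SRE Agent’s implementation: the function names, the status strings, and the placeholder analysis are all assumptions, and the recommendation text is simply borrowed from the example above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Triage:
    summary: str
    recommendation: str


def triage_incident(evidence: List[str], similar_incidents: List[str]) -> Triage:
    # Placeholder for the agent's analysis; a real agent would reason over
    # the evidence instead of just counting it.
    summary = (
        f"Collected {len(evidence)} signals. "
        f"Found {len(similar_incidents)} similar incidents in history."
    )
    return Triage(summary, recommendation="Run Database Optimization Workflow")


def handle_incident(
    evidence: List[str],
    similar_incidents: List[str],
    approve: Callable[[Triage], bool],   # the responder's decision
    run_workflow: Callable[[str], None], # hypothetical hook that starts a workflow
) -> str:
    triage = triage_incident(evidence, similar_incidents)
    # Human in the loop: nothing runs until the responder approves.
    if not approve(triage):
        return "pending"
    run_workflow(triage.recommendation)
    return "remediation_started"
```

The design choice worth noting is that `run_workflow` is only ever reached through `approve`, so the agent can do all the gathering and analysis on its own while the decision to act stays with the responder.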
The path to autonomous operations
Your team shouldn’t be drowning in alerts while AI multiplies the complexity. PagerDuty’s approach to autonomous operations:
- Puts intelligent agents to work at scale—handling the noise, accelerating resolution, and keeping you in the loop where it matters
- Deepens the full incident management life cycle by empowering teams to resolve incidents faster
- Broadens the platform and ecosystem with capabilities that help teams prevent incidents from happening
PagerDuty’s SRE Agent is paving a path towards autonomous operations. By leveraging Incident Workflows for automated triggering and new agent connectors, tools, and skills to break data silos, the SRE Agent can conduct triage and diagnosis without interrupting a human. Humans can work hand-in-hand with the agent to run the right actions and resolve incidents from their surface of choice, whether that’s chat, the Operations Console, or the Incident Details Page. Together, these enhancements empower developers to reclaim high-value time, fighting fire with fire alongside an SRE Agent that has their back.
Want to try out some of the early access features you read about? Sign up here or connect with your PagerDuty account team.
This blog contains forward-looking statements, including the expected availability of new functionality. These forward-looking statements are not guarantees of future performance and involve significant risks that may cause our actual results to be different from the results expressed or implied by these forward-looking statements. For a complete description of such risks, we refer you to the Company’s most recent Form 10-K and subsequent filings with the SEC, available for review at the SEC’s website at http://www.sec.gov.