We Built an SRE Agent With Memory And It’s Transforming Incident Response
If you feel like your incidents are multiplying while your stack gets more complex by the week, you’re not alone. Event volumes keep climbing, signals live in a dozen tools, and human responders are stretched thin. That’s exactly why we built the PagerDuty SRE Agent—a vendor‑agnostic AI teammate that improves with every response to make the next one faster, smarter, and more reliable.
When we started this journey, I had been working on various AI products at PagerDuty, including alert correlation and neural networks. We knew that delivering an experience that continuously improves would matter to our customers, but we didn’t expect just how critical memory would become. And not just any memory, but the ability to connect data points across systems. That’s what makes the SRE Agent truly helpful and accurate.
Customers told us the memory feature was “make or break.” Through interviews, a pattern emerged: siloed knowledge was the hidden catalyst behind their biggest inefficiencies. Swarming incidents with multiple subject-matter experts, lost time hunting for context, and ultimately customer impact all traced back to the same root cause.
Data silos and missing documentation aren’t new problems. But somewhere in the AI gold rush, while everyone chased the next breakthrough model, we forgot something fundamental: models are only as good as the data they can access—including the institutional knowledge that’s often most valuable and the transient insights that vanish during incidents.
Tool sprawl and lost knowledge aren’t going away, but for the first time, we have AI capable of rising to this long-standing challenge. What we realized is this: capturing and consolidating knowledge across humans and tools isn’t just about making incidents faster to resolve—it’s about fundamentally changing how automated operations adapt and improve over time.
Why “memory” matters for real-world incident response
Plenty of tools can summarize or even correlate alerts. Memory is different. PagerDuty’s SRE Agent remembers what actually happens in your environment—changes, dependencies, past incidents, conversation history, and most critically, the steps human responders took to diagnose issues and restore service. That memory compounds over time and produces the following benefits:
- It sharpens triage by recognizing patterns and related incidents across services.
- It accelerates diagnosis by connecting change events with symptoms and past fixes.
- It upgrades your operations over time by generating smarter runbooks and actionable post-incident reviews.
The result is shorter incidents, fewer responders needed, and less cognitive load on the people who are on call.
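To make the idea of connecting change events with symptoms and past fixes concrete, here is a minimal sketch in Python. It is not how the SRE Agent is implemented. The `ChangeEvent` and `PastIncident` types, the `recall_context` function, the one-hour lookback window, and the sample data are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ChangeEvent:
    # Hypothetical record of a deploy or config change.
    service: str
    summary: str
    at: datetime


@dataclass
class PastIncident:
    # Hypothetical record of a resolved incident and the fix that worked.
    service: str
    symptom: str
    fix: str


def recall_context(service, symptom, started_at, changes, history,
                   window=timedelta(hours=1)):
    """Return changes on the service shortly before the incident began,
    plus fixes recorded for the same symptom on that service."""
    recent_changes = [
        c for c in changes
        if c.service == service
        and timedelta(0) <= started_at - c.at <= window
    ]
    past_fixes = [
        p.fix for p in history
        if p.service == service and p.symptom == symptom
    ]
    return recent_changes, past_fixes


# Made-up sample data, for illustration only.
changes = [
    ChangeEvent("checkout", "deploy v2.4.1", datetime(2024, 5, 1, 9, 40)),
    ChangeEvent("checkout", "deploy v2.4.0", datetime(2024, 4, 28, 15, 0)),
    ChangeEvent("search", "config change", datetime(2024, 5, 1, 9, 50)),
]
history = [
    PastIncident("checkout", "high 5xx rate", "roll back latest deploy"),
    PastIncident("checkout", "slow queries", "add missing index"),
]

recent, fixes = recall_context(
    "checkout", "high 5xx rate", datetime(2024, 5, 1, 10, 0), changes, history
)
print([c.summary for c in recent], fixes)
```

This sketch matches only on the same service and an exact symptom string. Recognizing patterns across services, as described above, would also need dependency data that is left out here.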
Built on the signal that matters most: operations data
PagerDuty has 15+ years of operational expertise guided by real incident data. That heritage powers the SRE Agent’s ability to turn messy, multi-source operations data into actionable context for responders. It doesn’t just read logs and metrics; it correlates them with service topology, recent deployments, and incident history to tell you what’s likely happening and what to do next. Here’s how it accelerates and improves the incident lifecycle:
- Detect and triage: With 700+ integrations and an open API, the SRE Agent pulls data from across your stack and separates the signal from the noise.
- Diagnose: It runs automated diagnostics, queries logs and metrics, and consults runbooks and prior incidents to present likely causes with evidence.
- Remediate: With human approval, it can execute recommended actions, validate service recovery, and record exactly what worked.
- Learn: Get context from the right incidents. The SRE Agent improves its recommendations over time and generates new or updated runbooks to prevent recurrence.
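The remediate and learn steps above can be sketched as an approval-gated loop: propose an action with its evidence, run it only after a human approves, check whether the service recovered, and record the outcome for next time. This is an illustrative sketch under stated assumptions, not PagerDuty's implementation. The `Remediation` and `Memory` types and the `remediate` function are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Remediation:
    # Hypothetical proposed action plus the evidence behind it.
    name: str
    evidence: list[str]
    run: Callable[[], None]


@dataclass
class Memory:
    # Hypothetical store of what was tried and how it turned out.
    records: list[dict] = field(default_factory=list)


def remediate(action, approve, is_healthy, memory):
    """Run the action only if a human approves, then validate and record."""
    if not approve(action):
        memory.records.append({"action": action.name, "status": "rejected"})
        return "rejected"
    action.run()
    status = "recovered" if is_healthy() else "not_recovered"
    memory.records.append({"action": action.name, "status": status})
    return status


# Made-up example: a fake service that becomes healthy after a restart.
state = {"healthy": False}
memory = Memory()
restart = Remediation(
    name="restart checkout workers",
    evidence=["5xx spike began after a deploy", "same fix worked before"],
    run=lambda: state.update(healthy=True),
)

result = remediate(
    restart,
    approve=lambda action: True,  # stands in for a human clicking "approve"
    is_healthy=lambda: state["healthy"],
    memory=memory,
)
print(result, memory.records)
```

Keeping the approval check as a separate callable means the action never runs without an explicit yes, and rejected proposals are still recorded so they can inform later recommendations.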
Using memory for action
Built to be vendor-agnostic, PagerDuty’s SRE Agent works seamlessly across observability, automation, infrastructure, and collaboration tools to provide a complete picture without forcing tool consolidation. It was designed for incident management at its core, operating directly within PagerDuty’s system of intelligence and action, where mobilization, escalation, and remediation happen. Enterprise-grade reliability is foundational, with governance and safety controls that minimize hallucinations and support compliance for high-stakes environments. Taking an automation-first path to self-healing, the agent doesn’t just suggest next steps—it executes approved remediations and verifies outcomes, paving the way for increasingly autonomous operations. And with contextual memory that spans services and incidents (not just pre-attached context or monitor-bound memory), the SRE Agent compounds institutional knowledge and improves outcomes over time.
And because modern ecosystems are collaborative, not monolithic, we’re building to connect to the places responders actually work. The SRE Agent is vendor-agnostic today, and support for MCP will connect it into a broader AI ecosystem, so it can work alongside other agents and platforms rather than compete with them.
What does this look like in practice?
In Slack or the Operations Console, the SRE Agent surfaces triage analysis the moment responders arrive, highlighting key findings, current and past related incidents, relevant change events, and recommended next steps pulled from your runbooks. It then runs targeted diagnostics automatically, retrieving logs and comparing current behavior against recent deployments, so responders never start from a blank slate.
When it proposes a remediation, it cites the signals and history behind the recommendation and, with approval, executes it. Soon it will also verify that stable operations have been restored and summarize the outcome. Afterward, it enriches the post-incident review and updates runbooks with what worked, so next time you can resolve faster with fewer people involved. Teams tell us this shifts the center of gravity in incident response: less paging everyone into a war room, more finishing fixes quickly and getting back to shipping.
How PagerDuty’s SRE Agent compares to alternatives
- Observability platforms: Great at mining their own data, but limited beyond it. Importantly, they lack incident history. The SRE Agent correlates across tools, recalls information from past incidents, and connects technical symptoms to business impact and human response patterns, the part most vendors can’t see.
- Incident management startups: Limited feature sets and integrations, unproven security and scalability. PagerDuty goes broader with automated diagnostics and remediation, and offers enterprise-grade, comprehensive governance controls to help maintain compliance and operational integrity.
- ITSM suites: Broad AI strategies, but heavy to configure and not optimized for the speed of SRE workflows. PagerDuty integrates with ITSM to keep you compliant while resolving critical, time-sensitive issues faster.
Memory that builds momentum
The SRE Agent’s memory is a foundational component; it’s the engine behind compounding operational gains. It strengthens post-incident reviews by automatically capturing what happened and why, cutting the manual effort of assembling timelines and evidence. It makes runbooks smarter by turning proven fixes into living, up-to-date procedures, so teams don’t waste time reinventing responses. It accelerates time-to-resolve by spreading the hard-won knowledge of senior responders across the team in weeks instead of years. Over time, this creates a virtuous cycle: fewer tickets, fewer escalations, and fewer late-night pings.
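One simple way to picture "turning proven fixes into living procedures" is to rank the fixes recorded for each symptom by how often they resolved past incidents. The sketch below assumes a hypothetical record format with `symptom`, `fix`, and `resolved` keys and a hypothetical `build_runbook` function. The sample records are made up, and this is not a description of how the SRE Agent generates runbooks.

```python
from collections import Counter, defaultdict


def build_runbook(records, min_successes=1):
    """Rank fixes per symptom by how often they resolved past incidents."""
    wins = defaultdict(Counter)
    for r in records:
        if r["resolved"]:  # only count fixes that actually worked
            wins[r["symptom"]][r["fix"]] += 1
    return {
        symptom: [fix for fix, n in counts.most_common() if n >= min_successes]
        for symptom, counts in wins.items()
    }


# Made-up incident records, for illustration only.
records = [
    {"symptom": "high 5xx rate", "fix": "roll back deploy", "resolved": True},
    {"symptom": "high 5xx rate", "fix": "scale up", "resolved": False},
    {"symptom": "high 5xx rate", "fix": "roll back deploy", "resolved": True},
    {"symptom": "high 5xx rate", "fix": "restart workers", "resolved": True},
    {"symptom": "slow queries", "fix": "add index", "resolved": True},
]

runbook = build_runbook(records)
print(runbook)
```

Because the ranking is recomputed from the records, each newly resolved incident can reorder the suggested steps, which is the "living" part of the idea.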
What’s available now
- Available interfaces: ChatOps experience (Slack) and Operations Console
- Integrations to bring in signals from observability tools and knowledge bases (e.g., Datadog, Confluence), with more coming online
- Agentic triage that assists side-by-side with responders
- Automated diagnostics and a governed path to automation and remediation
- Context from past incidents, runbooks, and conversations to improve recall and accelerate remediation
Ready to see how the PagerDuty SRE Agent can transform your incident response?
Your incidents aren’t slowing down. With a teammate that remembers, adapts, and acts across your entire stack, you’ll be more prepared to handle the next one. PagerDuty’s SRE Agent is here to turn chaos into action—and turn every incident into an opportunity to get better. Try SRE Agent today, or see how it works in practice in our interactive product tour.