
Context Over Cleverness: Building PagerDuty’s SRE Agent

by Micah Mayo | December 5, 2025 | 14 min read

We didn’t try to build a clever agent. We built one that shows up pre‑armed.

The lesson arrived earlier this year, as we began developing the SRE Agent, in a familiar-looking incident at 9:23 p.m. PT: consumer lag in production. Years earlier, we had documented a rare race condition in our runbook: duplicate records created through the REST API. We wrote a safe cleanup, promised ourselves we would add the proper constraint, and moved on. Two refactors and more than two years later, the same failure returned through Kafka consumers. The shape was the same; the door was different. People didn’t immediately connect it to the old notes. The response stretched to almost three hours, and it required extra responders late at night.

This incident and its lessons were fresh in our minds while we developed the SRE Agent. We saw an opportunity to break down silos, surface forgotten runbook knowledge, and draw connections between events that initially looked unrelated. We ran a small experiment, giving a prototype three things: the alert payload, one telling log line, and the service runbook. It identified the failure immediately and proposed the same cleanup, quoting the runbook text verbatim. The problem wasn’t intelligence. It was context assembly and recall.

That experience shaped the approach.

The working premise

Our first strategy was to start where effective humans start: with PagerDuty’s facts, your runbooks, and the agent’s memory.

“Memory” here is the agent’s own recall in your tenant, including service-scoped observations, incident recollections that separated signal from noise, and human‑promoted playbooks. The memory isn’t used for cross-customer learning. Each tenant’s memory stays isolated for security and efficiency. As the agent handles more incidents for a service, it builds a richer set of observations to pull from. That can include the relevant runbook section, the pattern that matched last time, or the context that mattered. And it’s efficient because it works with what is already there; no lengthy training period is needed for a new model to digest the information.

The SRE Agent reasons only from what it already has—the alert, the runbook, its memory—before reaching out to external systems. It makes an API call or queries logs only when the answer would actually change the next step, not speculatively. This keeps response time predictable and ensures every action is justified. Recommendations tie back to concrete details responders can verify at a glance.

These became the philosophical tenets the SRE Agent uses to reason and to build the context it needs to solve incidents; everything else flows from that.

The working set we compute (and why it’s not a summary)

We don’t wait for your prompt to act. When an incident opens, the agent assembles the incident context — the working set it will reason over for this incident and this service. We precompute it so the agent is immediately useful. You shouldn’t have to ask “what’s going on?” It arrives with a coherent view and a first proposal. Follow-up actions and questions from the agent are targeted enrichments—checking a particular log, asking about a recent deployment—not open-ended discovery.

We built this as the team focused on improving the signal-to-noise ratio. Years of working on Alert Grouping, Auto Pause, and intelligent incident triage taught us what PagerDuty’s event stream can tell you and what it can’t. Building on that knowledge, we realized we also needed to augment these features with context: your runbooks, and your logs or traces.

The inputs are intentional: the incident object, raw alert payloads, and a suite of intelligent triage features such as Past Incidents (with notes that provide rich historical context), Related Incidents (which help explain the blast radius), Outlier Incidents, and Related Change events, since incidents are so often caused by recent changes. This triage data is joined with relevant runbook or doc sections we already know about or have fetched via links we’ve seen. This isn’t a summary. It’s a high‑fidelity representation of the incident as PagerDuty understands it: the same fields responders trust, kept structured so the agent can reason over them and quote precisely.
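
To make that concrete, here is a rough sketch of that working set as a structured record. The field names and the `assemble_context` helper are ours for illustration, not PagerDuty’s actual schema:

```python
from dataclasses import dataclass, field


@dataclass
class IncidentContext:
    """Illustrative working set; field names are hypothetical, not PagerDuty's schema."""
    incident: dict                                               # the incident object
    alerts: list[dict] = field(default_factory=list)             # raw alert payloads
    past_incidents: list[dict] = field(default_factory=list)     # with responder notes
    related_incidents: list[dict] = field(default_factory=list)
    outlier_incidents: list[dict] = field(default_factory=list)
    related_changes: list[dict] = field(default_factory=list)
    runbook_sections: list[dict] = field(default_factory=list)   # sections we know about or fetched via seen links
    skipped_sources: list[str] = field(default_factory=list)     # sources that were slow or missing


def assemble_context(incident: dict, sources: dict[str, list[dict]]) -> IncidentContext:
    """Join triage data and runbook sections into one structured, quotable record."""
    ctx = IncidentContext(incident=incident, alerts=sources.get("alerts", []))
    for name in ("past_incidents", "related_incidents", "outlier_incidents",
                 "related_changes", "runbook_sections"):
        if name in sources:
            setattr(ctx, name, sources[name])
        else:
            ctx.skipped_sources.append(name)  # proceed, but note what was skipped
    return ctx
```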

Early on, we tried letting the model “discover” PagerDuty facts one API call at a time. It added latency and new failure paths, so we cut it.

What’s left is predictable. We rebuild and reconcile the context on trigger, when notes are added, and on resolve, with additional state changes coming next. If notes arrive in a burst, we debounce updates so the agent doesn’t operate on stale context. If a source is slow or missing, we proceed and note what was skipped.
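
A trailing-edge debounce is enough for the note-burst case. Here is a minimal sketch; the window and the callback API are illustrative, not the values we actually run with:

```python
import threading

DEBOUNCE_SECONDS = 5.0  # illustrative window; the real value is a tuning choice


class ContextRebuilder:
    """Coalesce a burst of updates (trigger, notes, resolve) into one rebuild."""

    def __init__(self, rebuild_fn):
        self._rebuild_fn = rebuild_fn  # e.g. assemble_context wired to live sources
        self._timers: dict[str, threading.Timer] = {}
        self._lock = threading.Lock()

    def on_event(self, incident_id: str) -> None:
        """Only the last event in a burst actually triggers a rebuild."""
        with self._lock:
            timer = self._timers.get(incident_id)
            if timer is not None:
                timer.cancel()
            timer = threading.Timer(DEBOUNCE_SECONDS, self._rebuild_fn, args=(incident_id,))
            self._timers[incident_id] = timer
            timer.start()
```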

Useful memory: observations and recollections

We keep two kinds of information: durable facts about the service, and incident details that actually changed decisions.

Observations are steady, service‑scoped facts: stack and framework, deploy pattern, where logs and metrics live, the dashboards people trust, the runbook anchors they use, and common dependencies. They make the first moves sane and avoid basic questions.

Recollections are incident‑scoped. They record what separated signal from noise and what was tried: the query that isolated the issue, the error fingerprint people keyed on, the dashboard slice that proved or ruled something out, the related change that shifted the hypothesis, and the steps we tried that didn’t help, plus why they didn’t. We keep actions, referenced artifacts, and outcomes, not transcripts. When a human says a pattern is definitive, we promote that recollection to a playbook: same evidence and steps, versioned, with any preconditions or guardrails.
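
As a rough sketch, the three record types might look like this; the field names are ours for illustration, not the product’s data model:

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    """Durable, service-scoped fact."""
    service_id: str
    kind: str   # e.g. "deploy_pattern", "log_location", "trusted_dashboard", "runbook_anchor"
    value: str


@dataclass
class Recollection:
    """Incident-scoped record of what changed decisions: actions, artifacts, outcomes; not transcripts."""
    service_id: str
    incident_id: str
    actions: list[str] = field(default_factory=list)    # queries run, steps tried
    artifacts: list[str] = field(default_factory=list)  # dashboards, log lines, related changes referenced
    outcomes: list[str] = field(default_factory=list)   # what helped, what didn't, and why


@dataclass
class Playbook:
    """A human-promoted recollection: same evidence and steps, versioned, with guardrails."""
    recollection: Recollection
    version: int
    preconditions: list[str] = field(default_factory=list)
    signature: str = ""  # stable signature of the normalized alert shape (see the signature sketch below)
```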

To recognize the same alert despite surface variance, we normalize the alert shape when a recollection is promoted. Think “drop volatility, keep discriminators”: remove tokens like timestamps and UUIDs; keep the fields responders actually reason about. We produce a stable signature from that shape. Later, when a live alert arrives, we compute the same kind of signature and match it to the playbook. That’s how “same problem, new IDs” lines up while we still quote the original sources.
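
A minimal sketch of that normalization and signing, with made-up field names and patterns; the real discriminator set depends on the alert source:

```python
import hashlib
import json
import re

# Volatile tokens to strip (illustrative patterns): timestamps, UUIDs, long hex or numeric IDs.
_VOLATILE = [
    re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?"),                  # ISO timestamps
    re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.I),  # UUIDs
    re.compile(r"\b[0-9a-f]{16,}\b", re.I),                                             # long hex IDs
    re.compile(r"\b\d{6,}\b"),                                                          # long numeric IDs
]

# Fields responders actually reason about (hypothetical allowlist; varies by alert source).
_DISCRIMINATORS = ("summary", "severity", "component", "error_class", "source")


def _scrub(value: str) -> str:
    for pattern in _VOLATILE:
        value = pattern.sub("<*>", value)
    return value.lower().strip()


def alert_signature(alert: dict) -> str:
    """Drop volatility, keep discriminators, hash the result into a stable signature."""
    shape = {key: _scrub(str(alert[key])) for key in _DISCRIMINATORS if key in alert}
    canonical = json.dumps(shape, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]
```

Matching a live alert then reduces to computing the same signature and looking it up against the promoted playbooks for the service.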

Past Incident Analysis is how recollections come back when there isn’t a definitive playbook. On open, the agent builds the incident context and, in parallel, pulls memories from past incidents for this service. It compares those recollections to the current context concurrently—matching by shape, error stems, and referenced artifacts—then ranks what’s most relevant now. The result isn’t a blob of history; it’s a short set of retrieved recollections with the specific queries, log lines, and outcomes that mattered last time, including what we tried that failed. Each item comes with literal quotes so responders can verify the basis at a glance.
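
The ranking itself can stay simple. Here is a sketch under the assumption that each stored recollection carries its own signature, error stems, and artifact references; the weights are illustrative, not tuned values:

```python
def rank_recollections(current: dict, recollections: list[dict], top_k: int = 3) -> list[dict]:
    """Score stored recollections against the live incident context (simplified).

    `current` holds the live alert's signature, error stems, and referenced artifacts.
    """
    def score(rec: dict) -> float:
        points = 0.0
        if rec.get("signature") == current.get("signature"):
            points += 3.0  # same normalized alert shape
        points += 1.0 * len(set(rec.get("error_stems", [])) & set(current.get("error_stems", [])))
        points += 0.5 * len(set(rec.get("artifacts", [])) & set(current.get("artifacts", [])))
        return points

    scored = sorted(((score(r), r) for r in recollections), key=lambda pair: pair[0], reverse=True)
    return [rec for points, rec in scored if points > 0][:top_k]
```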

If a playbook matches, it’s recalled automatically. If not, Past Incident Analysis supplies the next best evidence. Scope is strict: memory lives within your tenant and the specific service. We recall data when it’s relevant. Your incident history, observations, and playbooks stay isolated to your tenant. The agent improves by building a deeper catalog of what’s happened on each service, not by feeding your data into a model.

How the SRE Agent makes decisions

By the time the thread opens, the agent isn’t “going to look around.” It stands inside the incident context and only reaches out when a fetch would materially change or sharpen the next step—for example, to confirm a discriminative log line or to read the exact clause in a linked runbook that guards a cleanup step. Bias to relevance over speculation; no fetching for decoration.

For runbooks, there are two paths: 

  • Deterministic: if the alert payload uses custom_details.runbook_url with a valid GitHub or Confluence URL, we eagerly add it to context (see the sketch after this list).
  • Opportunistic: teams often tuck runbooks/SOP links elsewhere in custom_details or inside a runbook section we’ve already pulled; the agent can follow a relevant link once, or fetch on explicit user request. It doesn’t start cold and trawl—it follows cues already present in the incident.
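
The deterministic path is essentially a small validation step. A sketch, with an illustrative host allowlist rather than our real configuration:

```python
from urllib.parse import urlparse

# Hosts we accept on the deterministic path (illustrative; Confluence Cloud sites end in atlassian.net).
_RUNBOOK_HOSTS = ("github.com", "atlassian.net")


def deterministic_runbook_url(alert: dict) -> str | None:
    """Return the runbook URL to fetch eagerly, if the payload declares a valid one."""
    url = (alert.get("custom_details") or {}).get("runbook_url")
    if not isinstance(url, str):
        return None
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return None
    if not any(parsed.netloc.endswith(host) for host in _RUNBOOK_HOSTS):
        return None  # not GitHub or Confluence; leave it to the opportunistic path
    return url
```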

The agent doesn’t typically auto‑search logs. It runs a short, scoped log query only when the provider, target, and time window are already clear from context and the result is likely to change or strengthen the recommendation; otherwise, it proposes the query. On explicit request, it runs the query with the details it has and asks for anything missing rather than guessing. In practice, it formulates the query from existing definitions (runbooks, playbooks, or service memory), derives the window from the alert, and quotes only the telling line if it matters.
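
That policy is easiest to see as a gate. The field names and confidence threshold below are illustrative, not our actual plumbing:

```python
from dataclasses import dataclass


@dataclass
class LogQueryPlan:
    provider: str | None            # e.g. "datadog" or "cloudwatch", already named in context
    target: str | None              # service, index, or log group the context points at
    window: tuple[str, str] | None  # start/end derived from the alert's timestamps
    query: str | None               # formulated from runbooks, playbooks, or service memory
    confidence: float               # how likely the result changes the recommendation


def should_auto_run(plan: LogQueryPlan, threshold: float = 0.8) -> bool:
    """Run automatically only when the plan is fully grounded and the payoff is likely;
    otherwise the agent proposes the query instead of running it."""
    grounded = all([plan.provider, plan.target, plan.window, plan.query])
    return grounded and plan.confidence >= threshold
```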

Runbook text and logs are quoted directly, and we always render suggestions with proper Markdown—commands and queries in fenced code blocks, field names and identifiers as inline code, and links back to the source. The agent aims to ground every step in the available context and to show its basis plainly. When the grounding is thin, it flags what’s absent and asks for it—especially the runbook—instead of pushing a definitive claim. The bar is verifiable reasoning with evidence: assertions stay tied to what we can show.

The design choices that stuck

Precompute the Incident Context 

We faced a choice: assemble the working set lazily as the agent reasons, or build it up front when the incident opens.

Early prototypes let the model gather PagerDuty data on demand—one API call at a time. It felt more “agentic.” In practice, response times stretched past 60 seconds, the model over-weighted whatever it pulled first, and answers flipped when the same data arrived in a different order.

Precomputing the Incident Context dropped first-response latency to around 10 seconds and stabilized accuracy. It stuck because the agent could show up with a coherent first move instead of spending thirty seconds discovering what to ask.

The agent as a responder 

We faced a choice: wait for the responder to ask a question, or show up with an analysis and proposal immediately.

We picked the latter. When the incident opens, the agent posts its read and a first recommendation—unprompted. The alternative—wait to be asked—adds a round trip at the worst possible time and trains people to think of the agent as a tool they operate rather than a responder working the incident with them.

It stuck because it raised the bar: the incident context and first message had to be immediately useful, or the whole interaction fell flat.

Restrict tools to external signal

We faced a choice: give the agent access to PagerDuty APIs—services, schedules, escalation policies—or restrict it to external observability and knowledge systems.

We picked external only. The agent can query logs (Datadog, CloudWatch), traces, metrics (Grafana), and pull runbooks from Confluence or GitHub. It cannot look up on-call schedules or navigate service configuration. The inward PagerDuty data it needs is already in the precomputed Incident Context.
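
In practice that restriction is just an allowlist at the tool layer. The tool names here are made up for illustration, not real identifiers:

```python
# Only external observability and knowledge tools are exposed. PagerDuty-internal lookups
# (schedules, escalation policies, service config) are not tools; that data arrives via
# the precomputed Incident Context.
EXTERNAL_TOOLS = {
    "datadog.search_logs",
    "cloudwatch.filter_log_events",
    "grafana.query_metrics",
    "confluence.get_page",
    "github.get_file",
}


def is_allowed(tool_name: str) -> bool:
    """Reject any tool call that looks inward instead of at external signal."""
    return tool_name in EXTERNAL_TOOLS
```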

The alternative felt natural—other PagerDuty agents use those APIs to help users navigate incident management. But we were concerned the agent would look inward for answers when the signal that explains the failure lives outward in observability tools.

It stuck because it kept tool calls focused on where the technical signal actually lives.

Build memory into the system

We faced a choice: rely solely on live retrieval—runbooks, logs, traces pulled fresh each time—or build a memory layer that captures what worked and didn’t work for this service over time.

We picked memory. The agent recalls past incident patterns through Observations and Recollections, reasoning over live data and relevant history.

LLMs can’t be trained the way traditional ML models can. Memory via retrieval gives us a way for the agent to improve for a given service as it handles more incidents, without retraining. It stuck because memory acts as a context keeper—surfacing relevant history automatically, the runbook section that applies, the pattern that matched before—without requiring responders to remember every past incident or know which documentation applies.

What turned out to be hard

Drawing boundaries when everything feels possible

The hardest decisions weren’t about what we could build. They were about what we wouldn’t.

Early on, we experimented with a less tightly scoped agent—one that could answer questions about PagerDuty configuration alongside incident details. The debates started when we realized every new capability pulled focus from what we actually wanted: an agent that drives incident resolution, not one that answers general questions about your PagerDuty setup.

We landed on a test: does this directly aid resolution? On-call schedules failed. So did alert grouping config and arbitrary cross-incident queries. It’s useful information, but it doesn’t change the next technical step. PagerDuty already has other agents that handle incident management introspection—if we tried to do both, we’d dilute focus and create overlapping experiences. This agent isn’t responsible for identifying schedule gaps in your on-call rotations or knowing your team’s performance metrics over time; we have other agents that specialize in those areas.

Those boundaries felt wrong at first. Someone would propose a reasonable feature, we’d acknowledge it was useful, then say no anyway. The conversations were uncomfortable—saying “the agent can’t do that” when it technically could. But holding the line forced us to build something coherent instead of a tool that does everything poorly.

Testing and verification in a non-deterministic world

Small changes to the system prompt or inputs can yield very different results. That made testing harder than we expected.

We experimented with many prompting techniques to shape behavior: multishot prompting, response templates, step-by-step versus open-ended instructions, structured versus unstructured output, conditional and dynamic prompts. We use all of these situationally, based on conversation state.

Some techniques made a meaningful difference. Telling the agent to say when it doesn’t know and to ask for what it needs reduced hallucination—it’s not compelled to answer. Another was keeping the answer template generic and unstructured; otherwise the agent tries to force unanticipated questions into templates that don’t make sense.

We built evals to check quality and relevance—confirming the agent answers based on incident context, not speculation. We use LLM judges and human sampling, and configure both the agent and evals for deterministic output over creativity.
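
One of the mechanical checks can be sketched simply: verify that anything the agent presents as a quote actually appears in the assembled context. This is a toy version, not our eval harness; the LLM-judge and human-sampling layers sit on top of checks like it:

```python
import re


def grounding_score(answer: str, context_documents: list[str]) -> float:
    """Fraction of quoted spans in the answer that appear verbatim in the incident context."""
    spans = re.findall(r"`([^`]+)`|\"([^\"]+)\"", answer)
    quoted = [a or b for a, b in spans]
    if not quoted:
        return 1.0  # nothing presented as a quote, nothing to verify
    hits = sum(any(q in doc for doc in context_documents) for q in quoted)
    return hits / len(quoted)
```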

But evals only catch so much. We dogfooded heavily, using real incident data from our own teams and other PagerDuty engineering teams. As we released internally and into early access, we identified gaps through real use. Example: initially, we only supported alerts from our Events API. Testing demonstrated the importance of email alerts and manually created incidents, which include rich details from customers’ internal monitoring systems.

Integration testing remains a pain point. To test end-to-end, we need a real incident: monitor triggers alert, logs sent to aggregator, real runbook with good searches, agent finds logs. We’re still working on tooling to make this easier.

Current state and what we’re learning

Today, the agent proposes. It doesn’t mutate your systems, restart services, or run cleanup scripts—it shows you the command and explains why. The path to “execute with guardrails” is clear: explicit approval flows, preconditions that must pass before any action runs, and full audit trails that log what was attempted and by whom. We’re not crossing that line until those rails are in place and we’ve proven the agent’s judgment on a much larger set of incidents.

The open problems are less about the agent’s reasoning and more about integration surface area. We support Datadog, CloudWatch, Grafana, Confluence, and GitHub today. Each new observability tool means another auth flow, another query syntax, another failure mode. We’re prioritizing based on where our users’ signal actually lives, but the long tail is long.

What we’re watching closely: how often the agent’s first proposal is the right one versus how often it takes multiple rounds of back-and-forth to get useful. Early signals suggest the incident context gets us most of the way there. When it doesn’t, it’s usually because the runbook is missing or the observability integration isn’t set up. That’s fixable. The harder question is whether memory actually compounds—whether the agent gets meaningfully better at handling incidents for a service after it’s seen ten, twenty, fifty of them. We think it does, but we need more time and more services to know for sure.