
From Experiment to Essential: Making Every Engineer an AI Agent Engineer

by Ralph Bird | August 28, 2025 | 8 min read

One year ago, as AI agents started to make waves, a handful of us at PagerDuty began investigating how they might reshape the way we support incident responders. Today, we have our first agents in production and, most importantly, a company-wide shift in how we use AI. How did we scale from concept to production so quickly? By making AI agent development a first-class engineering skill. That meant equipping nearly 100 engineers from across our engineering organization (most of whom had never written production Python or production AI code) with the skills to build agents that handle complex incident response and platform management.

In this blog, we share how we built our first agents and empowered our entire engineering organization to upskill and join the agentic AI revolution.

Our Background in GenAI

PagerDuty Advance, our generative AI-powered assistant, gave users a new way to interact with our product: a chat interface in Slack and MS Teams that lets them get the information they need through a simple conversation. But we wanted to move beyond constrained flows and information retrieval to a system that can collect and process data, make decisions based on that data, and take action, all while learning and adapting its behavior over time to improve performance and meet our customers’ needs.

During 2024, with the rapid improvements in LLMs, it became clear that AI agents would offer a step change in capability. Rather than manually coding solutions for just a few tasks, we could now set a goal for an agent and let it handle the rest! This not only reduces our workload, but also gives us valuable insight into how users want to use our product. We had to explore that potential and find out how agents could help our customers in their time-critical work.

Choosing the Right Platform

While building in-house is tempting, especially in such a young field with no clear standard, we first evaluated whether we could accelerate development by using existing platforms or vendors. After researching a number of options, we selected LangGraph with Python. We were already using Python extensively within the AI teams at PagerDuty, so it was the natural language to use. When comparing frameworks, we found that, although LangGraph had a slightly steeper learning curve, we liked its low-level customizability. Writing our own code and maintaining our own infrastructure gives us control over many of the decisions; in particular, it allows us to reuse many of our existing best practices, easing the security and legal burden.
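To make that low-level customizability concrete, here is a minimal sketch of a single-node LangGraph graph. The state fields and node logic are illustrative placeholders only, not our production agent (which wires in LLM calls, tools, and persistence):

```python
from typing import TypedDict

from langgraph.graph import StateGraph, START, END


class AgentState(TypedDict):
    question: str
    answer: str


def call_model(state: AgentState) -> dict:
    # Placeholder node: a real agent would call an LLM (and tools) here.
    return {"answer": f"(model response to: {state['question']})"}


builder = StateGraph(AgentState)
builder.add_node("call_model", call_model)
builder.add_edge(START, "call_model")
builder.add_edge("call_model", END)
graph = builder.compile()

print(graph.invoke({"question": "How many incidents did we have yesterday?", "answer": ""}))
```

Because the graph is just Python, it slots into our existing CI, security tooling, and deployment practices like any other service code.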

With this decision made, we built a proof-of-concept insights agent that can tackle complex questions like “how many incidents did each of my teams have each month last quarter?”. Answering queries like this would previously have required us either to write dedicated custom flows for every possible question or to have users pull data from several pages and do the analysis themselves. The agent proved an instant hit with internal testers and demonstrated the value of agents.
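To make the kind of query concrete: once the agent’s tools have gathered the relevant incidents, the aggregation itself is simple. A rough sketch with made-up records (the field names are illustrative, not our API schema):

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical incident records; in practice the agent collects these via its tools.
incidents = [
    {"team": "Payments", "created_at": "2025-04-03T11:20:00Z"},
    {"team": "Payments", "created_at": "2025-05-17T02:05:00Z"},
    {"team": "Platform", "created_at": "2025-04-28T19:45:00Z"},
]

# Count incidents per (team, month).
counts: dict[tuple[str, str], int] = defaultdict(int)
for incident in incidents:
    created = datetime.fromisoformat(incident["created_at"].replace("Z", "+00:00"))
    counts[(incident["team"], created.strftime("%Y-%m"))] += 1

for (team, month), n in sorted(counts.items()):
    print(f"{team} {month}: {n} incidents")
```

The hard part is not this arithmetic; it is letting the agent decide which data to fetch and how to slice it for a question it has never seen before.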

Empowering Teams

This success (and the resultant decision to pursue agents further) raised a critical question: should agents be built by a small core team with agentic experience, or by feature teams who have the domain knowledge? We chose the second model, which is similar to our approach with databases. We have a database reliability engineering team that sets the overall direction, builds and maintains the core infrastructure, and manages the complex parts. Yet they do not manage all aspects of databases; all engineers work with databases in their day-to-day work without needing to be specialists.

So, how do you rapidly upskill multiple teams to follow this approach? As experienced software engineers, they didn’t need “Hello World!” examples (like most resources we found online), but they did need guidance on how we write production-ready Python and advice on how to work with the inherent variability of LLMs. To do this, we prioritized three areas:

Training

We ran a series of workshops that introduced AI agents and then broke out into small groups for a hands-on tutorial that gave participants the chance to build their own agent. By focusing on a couple of thin vertical slices, we were able to rapidly go deep into the “why” of building agents. This depth-over-breadth approach taught participants how to think when building agents and kick-started their development process.

Building a Golden Path

The workshop introduced our AI Agent Golden Path, which uses a template repo to deploy a working agent into production with everything engineers need, from the agent code through the databases that support it to roles, permissions, and the CI/CD pipeline. Much of the boilerplate code in this template is provided by an internal library, allowing us to push out updates (for example, to how we manage locks and state) without requiring every team to reinvent the wheel. This work was particularly appreciated by the teams and has been continually updated as the agents have evolved.
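As a sketch of how the template and the internal library fit together (all package and helper names here — agent_platform, run_with_lock, build_checkpointer, my_agent — are hypothetical, not our actual code):

```python
# Hypothetical template-repo entry point. Cross-cutting concerns (locks, state,
# checkpoints) come from a shared internal library, so improvements ship to
# every agent without each team re-implementing them.
from agent_platform.checkpoints import build_checkpointer  # shared state management (hypothetical)
from agent_platform.locks import run_with_lock             # shared lock handling (hypothetical)

from my_agent.graph import build_graph  # the team-owned agent logic lives here


def handle_request(thread_id: str, question: str) -> str:
    graph = build_graph(checkpointer=build_checkpointer())
    with run_with_lock(thread_id):
        result = graph.invoke(
            {"question": question},
            config={"configurable": {"thread_id": thread_id}},
        )
    return result["answer"]
```

The template also ships the surrounding infrastructure as code, so a team’s first deploy starts from a working baseline rather than a blank page.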

Governance

AI Agents present novel risks, especially for engineers who are not used to working with non-deterministic applications. To manage this risk, we worked closely with the legal and security teams to bake the necessary risk assessments and legal reviews into the standard development process, and built a common threat library that every agent had to consider. Engaging early and often with these teams ensured that everyone was aligned and there were no nasty, last-minute surprises.

From Prototype to Production

The initial rollout exceeded expectations, but then reality hit. Agent development isn’t your typical 80-20 scenario, where most of the work is done quickly and the final 20% takes 80% of the effort. Instead, it’s more like 90-10: you get nearly all the way there with very little effort, but the final 10% takes 90% of the effort. As teams moved beyond MVPs, common challenges started to surface, especially around latency, robustness, integration, and tuning agents for real-world scenarios. For instance, one team ran into trouble with the concept of “yesterday”: our agents operated in UTC, but users naturally expect results in their own time zone!
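For the “yesterday” case specifically, here is a small sketch of one way to resolve the user’s intent before querying UTC-stored data (the time zone and function name are just examples):

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo


def yesterday_bounds(user_tz: str = "America/Los_Angeles") -> tuple[datetime, datetime]:
    """Return the UTC start/end of 'yesterday' as the user experiences it."""
    now_local = datetime.now(ZoneInfo(user_tz))
    start_local = (now_local - timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    end_local = start_local + timedelta(days=1)
    # Convert back to UTC before querying, since the underlying data is stored in UTC.
    return start_local.astimezone(timezone.utc), end_local.astimezone(timezone.utc)


start_utc, end_utc = yesterday_bounds()
print(start_utc.isoformat(), "→", end_utc.isoformat())
```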

In particular, we found that teams needed guidance with:

  • LLM selection: Teams lacked guidance on the tradeoffs across vendors and models. We hadn’t fully appreciated how valuable our experience with these nuances was, or how much could be gained by playing to the differences between models.
  • Tool design: Tools needed stricter interfaces, structured outputs, and useful errors; simply wrapping an existing API was not sufficient! Sharing best practices, such as limiting response lengths or providing additional details in the response (e.g., a count of how many entities were returned), was very helpful; see the sketch after this list.
  • Moving beyond a single node: Teams outgrew simple loops and needed more advanced LangGraph features. In particular, we found that building workflows or tools for specific critical user journeys made a huge difference to performance (both answer quality and latency), allowing us to reduce the number of LLM calls or carry them out eagerly.
  • Testing and evaluation: Benchmarking is key to measuring improvement. By using both a set of golden questions with answers and online evaluations, we were able to quantify the changes we made. Guidance and examples of “good” testing equipped the teams to create their own testing frameworks. Our knowledge of the best practices was invaluable here.
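Here is a hedged sketch of the tool shape we encouraged: structured output, a hard cap on response size, and an explicit count plus note so the model knows when results were truncated. The names and the stand-in data store are hypothetical:

```python
from dataclasses import dataclass, field

MAX_RESULTS = 20  # cap the payload so the LLM context stays small


@dataclass
class IncidentSearchResult:
    """Structured tool output: totals, a truncated result list, and a clear note."""
    total_matches: int
    returned: int
    incidents: list[dict] = field(default_factory=list)
    note: str = ""


def fake_incident_store(query: str) -> list[dict]:
    # Stand-in data so the sketch runs; a real tool would call the incidents API.
    return [{"id": f"P{i:04d}", "summary": f"{query} incident {i}"} for i in range(55)]


def search_incidents(query: str) -> IncidentSearchResult:
    matches = fake_incident_store(query)
    truncated = matches[:MAX_RESULTS]
    note = (
        ""
        if len(matches) <= MAX_RESULTS
        else f"Showing first {MAX_RESULTS} of {len(matches)} matches; refine the query to narrow results."
    )
    return IncidentSearchResult(
        total_matches=len(matches),
        returned=len(truncated),
        incidents=truncated,
        note=note,
    )


result = search_incidents("database latency")
print(result.returned, "of", result.total_matches, "-", result.note)
```

Compared with a thin API wrapper, this gives the model enough signal to know whether it should refine its query or has already seen everything.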
Next Time?

If we were to do this again, what would we do differently? The main thing would be to provide more guidance on how to think about developing an agent. In particular, we have found that encouraging teams to focus on two things has really helped:

Critical User Journeys: What do you expect users to use the agent for? Design for those journeys and tune your agent accordingly.

Evaluations: Is the agent doing what you want? Is it being used in the way you expect? If you can’t answer both of these questions, you can’t make it better. Test-driven development works for agents, too! Getting your tests set up early allows you to quantify the impact of changes and accurately identify where you have issues (latency, quality, cost, etc.).
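As a minimal sketch of the golden-question approach (the questions, expected answers, and run_agent stub are placeholders for your own agent and judging method):

```python
# Hypothetical golden-question harness; real evaluations often add an LLM judge,
# latency and cost tracking, and online checks against production traffic.
GOLDEN_SET = [
    {"question": "How many P1 incidents did the Payments team have last week?", "expected": "3"},
    {"question": "Which service had the most incidents yesterday?", "expected": "checkout-api"},
]


def run_agent(question: str) -> str:
    # Placeholder: invoke your agent here.
    return "3"


def evaluate() -> float:
    correct = 0
    for case in GOLDEN_SET:
        answer = run_agent(case["question"])
        # Naive substring check for the sketch; swap in whatever scoring fits your agent.
        if case["expected"].lower() in answer.lower():
            correct += 1
    return correct / len(GOLDEN_SET)


if __name__ == "__main__":
    print(f"Golden-set accuracy: {evaluate():.0%}")
```

Running this on every change turns “the agent feels better” into a number you can track.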

Conclusion

We’re still early in this journey, but our initial wins have already produced multiple agents and sparked enthusiasm and engagement across our engineering teams, validating our bet that AI agents should be a first-class skill. We will continue to invest in infrastructure, training, and safety to make this real at scale, providing additional workshops and resources as we bring more agents into production and learn from the process.

If you’re tackling similar challenges or just starting your agent journey, let’s connect and learn from each other. The future of engineering is agentic; let’s build it together.