
The PagerDuty Vision for AI-First Operations

by PagerDuty August 14, 2025 | 9 min read

Something fundamental needs to change in the way we run operations.

Organizations are deploying AI to optimize everything from coding and deployment to resource planning and incident management. But they’re discovering that managing AI-powered systems requires a completely different operational mindset.

AI models hallucinate. Data pipelines degrade silently. Algorithms develop bias without warning. Performance isn’t just about uptime anymore—it’s about model accuracy, data freshness, and algorithmic integrity. As AI moves from experiment to infrastructure, one thing is clear: The old ways of working won’t cut it.

At PagerDuty, we’re championing an AI-first operations approach. This means equipping people with AI so they can spend less time on manual work and more time solving higher-order problems. It means understanding which kind of AI is best suited to each concrete problem and being able to measure real results from these investments. It means having a secure, experienced partner for your AI-powered, automation-led digital operations, one that doesn’t just ship features but has a comprehensive perspective on how to move into this new era.

Read on for the PagerDuty vision on this challenging (yet exciting) time and learn how we’re adjusting to this seismic shift in the way we work and operate alongside our customers.

The operational challenges of AI transformation

Integrating AI brings new challenges to the forefront:

  • First, AI failures follow completely different patterns. A machine-learning model can “work”—processing requests, returning responses, consuming resources normally—while quietly degrading in ways that won’t show up in standard dashboards for weeks. 
  • Second, there’s a talent and skills gap. It makes sense that AI skill sets are in high demand. But organizations can’t expect to bootstrap enterprise-wide AI initiatives with a couple of seasoned hires. They need both hiring and training to build these skills across their teams.
  • Third, there’s an ownership problem. When AI starts producing poor results or outright fails, is that a data science problem, an infrastructure issue, or an operational incident? Organizations need a line of sight into who handles AI-centric time-critical work.

These challenges explain why many AI initiatives get stuck in “proof-of-concept purgatory,” where AI works well in small, controlled experiments but fails to scale and drive meaningful results.

The solution may sound strange: AI is the solution to AI-related problems. It’s what’s necessary to scale human talent and keep your competitive edge. Risk is a requirement for any organization looking to succeed in this digital era. But how you mitigate it, manage it, and respond to it makes all the difference. Enter AI-first operations.

PagerDuty’s framework for AI-first operations

We’re seeing disruption across every part of the software delivery lifecycle as AI transforms how organizations build, deploy, and operate systems. Through our work with enterprise customers, we’ve identified seven principles that define successful AI-first operations.

1. Determine if AI should solve the challenge at all

Not every operational challenge needs an AI solution. Some tasks fundamentally require human judgment, creativity, or contextual understanding that AI can’t replicate.

Start by evaluating whether the challenge is even suitable for AI. AI excels at pattern recognition, data processing, and routine decision-making. Humans excel at novel problem-solving, strategic thinking, and decisions that require empathy or ethical considerations.

2. Match human involvement to problem complexity

Once you’ve determined AI can help, optimize for human strategic focus by matching the level of human involvement to how well-understood each operational scenario is.

Fully autonomous: Well-understood issues with clear patterns and established remediation steps can be handled entirely by AI.

AI-led with human oversight: Partially understood issues where patterns exist but context matters—AI leads the response while humans provide guidance.

Human-led with AI assistance: New and novel issues that haven’t been seen before require human investigation and decision-making, with AI supporting data gathering and pattern analysis.

The goal is to eliminate routine tasks so engineers can focus on architecture decisions, complex problem-solving, and innovation that drives competitive advantage.
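
One way to picture the three levels of human involvement described above is as a simple routing rule. The sketch below is illustrative only: the `Issue` fields, the `choose_mode` function, and the threshold of five prior occurrences are assumptions made for this example, not part of any PagerDuty product or API.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    FULLY_AUTONOMOUS = "fully autonomous"
    AI_LED = "AI-led with human oversight"
    HUMAN_LED = "human-led with AI assistance"


@dataclass
class Issue:
    title: str
    seen_before: int   # hypothetical: number of similar past incidents
    has_runbook: bool  # hypothetical: established remediation steps exist


def choose_mode(issue: Issue, well_understood_after: int = 5) -> Mode:
    """Pick a level of human involvement from how well-understood the issue is."""
    if issue.seen_before == 0:
        # New and novel: humans investigate, AI gathers data.
        return Mode.HUMAN_LED
    if issue.has_runbook and issue.seen_before >= well_understood_after:
        # Clear pattern plus established remediation: AI can handle it.
        return Mode.FULLY_AUTONOMOUS
    # A pattern exists but context still matters: AI leads, humans guide.
    return Mode.AI_LED


if __name__ == "__main__":
    for issue in (
        Issue("disk full on log volume", seen_before=12, has_runbook=True),
        Issue("intermittent checkout latency", seen_before=2, has_runbook=False),
        Issue("never-seen replication error", seen_before=0, has_runbook=False),
    ):
        print(f"{issue.title}: {choose_mode(issue).value}")
```

In practice the signals that make an issue "well-understood" will differ by organization; the point of the sketch is only that the decision can be made explicit and reviewed.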

3. Match operational challenges to AI capabilities

Not all operational challenges need the same type of AI. Successful organizations evaluate what problems or challenges they’re looking to solve first, and then find the right kind of AI to help them—not the other way around. The key is matching the right intelligence to the specific operational challenge.

4. Build context-aware systems that learn your environment

Generic AI fails in complex operational environments because every organization’s infrastructure, applications, and business logic are unique. Effective AI-first operations require systems that understand your specific ecosystem, data sources, team structures, operational patterns, dependencies, and business needs.

5. Automatically embed AI governance into workflows

As AI systems become business-critical, governance isn’t just about compliance—it’s about ownership and continuous improvement. AI-first operations require clear accountability for who owns what part of the AI process and how teams respond when failures occur.

This means establishing clear ownership structures and automated systems that monitor model performance, track AI-related costs, and flag potential risks during deployment. But more importantly, it means ownership structures that evolve based on what works, and improvement processes that turn every failure into organizational knowledge rather than leaving lessons on the floor.
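
As a rough illustration of what automated monitoring of model performance could look like, the sketch below tracks a rolling accuracy figure and flags when it falls below a baseline. Everything here is an assumption made for the example: the `AccuracyMonitor` class, the window size, and the tolerance are hypothetical, and real deployments would likely track more than one signal.

```python
from collections import deque
from typing import Optional


class AccuracyMonitor:
    """Flags degradation when rolling accuracy drops below baseline - tolerance."""

    def __init__(self, baseline_accuracy: float, window: int = 100,
                 tolerance: float = 0.05) -> None:
        self.baseline = baseline_accuracy
        self.tolerance = tolerance
        # Keep only the most recent `window` outcomes.
        self.outcomes = deque(maxlen=window)

    def record(self, correct: bool) -> None:
        """Record whether one prediction turned out to be correct."""
        self.outcomes.append(correct)

    def rolling_accuracy(self) -> Optional[float]:
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def is_degraded(self) -> bool:
        # Wait for a full window so a few early misses do not raise a flag.
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return self.rolling_accuracy() < self.baseline - self.tolerance


if __name__ == "__main__":
    monitor = AccuracyMonitor(baseline_accuracy=0.9, window=10)
    for correct in [True] * 9 + [False] * 4:
        monitor.record(correct)
    print(monitor.rolling_accuracy(), monitor.is_degraded())
```

A check like this only helps if someone owns the resulting flag, which is why the ownership structures described above matter as much as the monitoring itself.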

6. Measure operational impact

Success in AI-first operations is measured by business outcomes: faster incident resolution, reduced downtime, improved service reliability, and lower operational costs. The most mature organizations measure baselines and then regularly review their metrics, fine-tuning as needed to achieve the desired business results.
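
Measuring against a baseline can be as simple as comparing average resolution time before and after a change. The sketch below assumes you can export incident durations in minutes; the function names and the sample numbers are made up purely for illustration and do not represent real results.

```python
from statistics import mean
from typing import Sequence


def mean_minutes_to_resolve(durations: Sequence[float]) -> float:
    """Average time to resolve, in minutes, over a set of incidents."""
    return mean(durations)


def percent_change(baseline: float, current: float) -> float:
    """Relative change versus the baseline; negative means improvement here."""
    return (current - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Illustrative numbers only, not measured data.
    baseline_durations = [60, 90, 120]
    current_durations = [45, 60, 75]

    baseline = mean_minutes_to_resolve(baseline_durations)
    current = mean_minutes_to_resolve(current_durations)
    print(f"baseline: {baseline:.0f} min, current: {current:.0f} min, "
          f"change: {percent_change(baseline, current):.1f}%")
```

The same comparison applies to the other outcomes named above, such as downtime or operational cost, as long as the baseline is captured before the change is rolled out.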

7. Prioritize rapid experimentation and continuous learning

The most successful AI-first operations teams don’t just deploy AI—they continuously experiment, learn, and iterate. This means dedicating time and resources to exploration: regular hackweeks to test new AI tools, demo hours where teams share breakthrough discoveries, book clubs focused on emerging AI techniques, and experimental projects that push the boundaries of what’s possible in operations.

They measure success not just by current performance but by how quickly they can identify and adopt the next breakthrough that will give them an edge.

Understand the types of AI to apply them effectively

In principle 3 of the framework, we emphasized the importance of matching specific operational challenges to AI capabilities. It’s important to understand what each type of AI is designed to do and where it delivers the most impact.

Every operations team faces three persistent bottlenecks: workflow friction, knowledge silos, and reactive firefighting. The three major types of AI—Embedded, Generative, and Agentic—map directly to those challenges.

1. Embedded AI: Work smarter with built-in intelligence

The challenge: Operational work can be a sinkhole of time and money. Legacy workflows drive up operational costs and strain resources while reducing potential for revenue.

The solution: Turn every signal into intelligent action. Embedded AI capabilities cut through the noise—automatically reducing alert fatigue, preventing costly incidents, and resolving issues faster.

Practical applications:

  • Incident triage: ML learns from incidents across an environment and determines root cause, related incidents, past incidents, and other critical incident context.
  • Intelligent noise reduction: ML filters duplicate alerts and false positives before they reach engineers, reducing notification fatigue.
  • Change correlation: ML reviews recent changes in context of your operations and surfaces issues proactively.

Business impact: Operational work decreases and resolution times improve as machines do this always-on work, reducing costs.
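
To make the noise reduction idea concrete, here is a deliberately simple, rule-based stand-in: it suppresses repeat alerts for the same service and check inside a time window. This is not ML and not how PagerDuty implements noise reduction; the `Alert` fields, the `suppress_duplicates` function, and the five-minute window are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Alert:
    service: str
    check: str
    timestamp: float  # seconds since some reference point


def suppress_duplicates(alerts: Iterable[Alert],
                        window_seconds: float = 300.0) -> List[Alert]:
    """Keep an alert only if no alert with the same (service, check)
    was seen within the preceding window."""
    last_seen: Dict[Tuple[str, str], float] = {}
    kept: List[Alert] = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.service, alert.check)
        previous = last_seen.get(key)
        if previous is None or alert.timestamp - previous > window_seconds:
            kept.append(alert)
        # Update on every alert so a continuous stream stays suppressed.
        last_seen[key] = alert.timestamp
    return kept


if __name__ == "__main__":
    incoming = [
        Alert("api", "latency", 0),
        Alert("api", "latency", 100),   # duplicate within the window
        Alert("db", "disk", 120),
        Alert("api", "latency", 500),   # window has elapsed, kept again
    ]
    for alert in suppress_duplicates(incoming):
        print(alert)
```

A learned approach would group alerts by similarity rather than exact key match, but the effect on the responder is the same: fewer notifications for the same underlying problem.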

2. Generative AI: Make better decisions with immediate insights

The challenge: Legacy, manual processes slow you down and hold you back. The swivel chair exercise to find context adds precious time to every incident, increasing risk to the business.

The solution: GenAI proactively surfaces operational context, turning complex data into action—so teams resolve issues faster, work more efficiently, and make better decisions.

Practical applications:

  • AI-powered incident summaries: AI generates concise post-incident reviews that detail all the relevant incident information, including timelines, action items, chat threads, and more, so teams can easily learn from them.
  • AI-generated runbooks: AI learns from how a system operates and suggests runbooks that improve efficiency and reduce operational load.
  • Automatic status updates: GenAI readily creates status updates so stakeholders are informed of any key incident developments without the context-gathering and drafting process falling to already over-burdened responders.

Business impact: Organizations can make better decisions in the long term based on post-incident reviews. Responders can make better decisions in the heat of the moment with key context, offloading toil in the process. This means less risk to the business.

3. Agentic AI: Move faster with autonomous agents that power efficiency

The challenge: The opportunity cost of toil is innovation. Manual operations create innovation bottlenecks and slow delivery cycles. Organizations need to elevate humans above repetitive work to focus on strategy and value.

The solution: AI agents autonomously resolve routine issues so teams can focus on building, innovating, and delivering better customer experiences.

Practical applications:

  • Triage recommendations: Data is spread across surfaces. As soon as an issue is discovered, an agent can automatically pull relevant triage data, run diagnostics, draw on the embedded AI results, and more, to inform responders about what’s happening across the system.
  • Scribing incident information: Humans don’t need to waste time note-taking while trying to triage. Agents can autonomously review communications and surface relevant details to catch up other responders and keep a record of the incident.
  • Insights for improvement: People don’t want to sift through operational data to find problems. Instead, they want to be the ones brainstorming solutions. Agents perform data analysis on operations and regularly surface recommended areas for improvement.

Business impact: Incident resolution improves substantially, customer-impacting incidents decrease significantly, and teams shift focus from firefighting to strategic improvements.

The most effective AI-first operations layer all three approaches, creating a system where each type of AI handles what it does best while preserving human expertise for decisions requiring creativity, judgment, and strategic thinking.

Making the transition to AI-first operations

PagerDuty is helping companies navigate the complexity of AI-powered ecosystems while maintaining the reliability, security, and scale that business-critical systems demand. The shift to AI-first operations is happening fast, and PagerDuty is leading the charge while learning along the way.

If you’re ready to accelerate your own shift to AI-first operations, here are some additional resources. And, if you want to partner with a best-in-class platform purpose-built for critical work, we’re always ready to chat.