What NVIDIA, Okta, and Warner Bros. Discovery Learned About Scaling AI Operations Beyond the Pilot Phase
One key takeaway from AWS re:Invent 2025 was that a clear gap has emerged between teams still experimenting with AI and those seeing measurable value at scale.
In two sessions, PagerDuty customers joined us onstage to explain how they’ve scaled pilots into successful AI operations.
“Unlocking Enterprise Resilience: AI and Automation in Action,” led by our SVP of Engineering Rukmini Reddy, featured NVIDIA’s Rama Akkiraju and Okta’s Dennis Henry, who shared how they’re building the infrastructure that enables AI agents to operate reliably when it matters.
In “AI-Driven Automation for Modern Operations,” our Product Strategy and Growth lead, Nora Jones, spoke with Warner Bros. Discovery’s VP of Site Reliability Engineering, Tom Leaman, about the often overlooked “boring” foundational work that makes AI innovation possible.
Here are some of the key insights that emerged from those discussions.
Discipline and infrastructure as an operational unlock
Getting AI operations right starts with getting your house in order: documented systems, clear relationships, and structured data that both humans and machines can understand.
During the “AI-Driven Automation” session, Tom Leaman explained that when WarnerMedia and Discovery merged, his team had nine months to build Max, an entirely new streaming platform. They created an Operational Metadata Schema (OMD), a standardized approach to cataloging all services and systems throughout their software development lifecycle.
“We catalog our services and systems… so that we could easily understand from the point of a repository being created, we could understand the hierarchy of business functions… through the CI/CD pipelines to our deployed infrastructure, services, metrics, logs, and eventually incidents handled in PagerDuty,” Tom explained.
They also mapped everything to Critical User Journeys (CUJs), the functionalities that actually matter to customers, like playing videos, logging in, and browsing content. The goal was to create a common language that both humans and AI could work with.
“Structure and organization make things more efficient, not just for humans, but also for artificial intelligence and automation,” he noted.
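To make that concrete, here is a minimal sketch of what one entry in a service catalog of this kind might look like. The names (`ServiceEntry`, `CriticalUserJourney`, and the example fields) are illustrative assumptions, not Warner Bros. Discovery’s actual OMD schema:

```python
from dataclasses import dataclass, field

# Illustrative only: these names are assumptions, not WBD's actual OMD schema.

@dataclass
class CriticalUserJourney:
    name: str              # e.g. "play-video", "login", "browse-content"
    description: str

@dataclass
class ServiceEntry:
    name: str                      # canonical service name
    repository: str                # where the code lives
    owning_team: str               # who gets paged
    ci_pipeline: str               # CI/CD pipeline that deploys it
    pagerduty_service_id: str      # links incidents back to the catalog
    depends_on: list[str] = field(default_factory=list)   # upstream services
    metrics: list[str] = field(default_factory=list)      # dashboards / metric names
    critical_user_journeys: list[CriticalUserJourney] = field(default_factory=list)

# One catalog entry that an AI agent (or a human) can read to understand where a
# service sits in the hierarchy and which customer journeys it affects.
playback_api = ServiceEntry(
    name="playback-api",
    repository="github.example.com/streaming/playback-api",
    owning_team="video-platform-sre",
    ci_pipeline="deploy/playback-api",
    pagerduty_service_id="PABC123",
    depends_on=["drm-service", "cdn-edge-config"],
    metrics=["playback_start_latency_p95", "playback_error_rate"],
    critical_user_journeys=[
        CriticalUserJourney("play-video", "User presses play and video starts"),
    ],
)
```

The point is that each service carries both its lineage (repository, pipeline, owning team) and its customer-facing context (critical user journeys) in a single machine-readable record.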
Tom’s team validated AI performance through systematic testing. Before deploying their severity classification agent, they ran it against dozens of historical incidents to verify it would reach the same conclusions as their human operators.
As a result of this detailed, systematic work:
- Interrupts have been reduced by 40–50% by mapping service dependencies and intelligently grouping alerts to identify shared root causes.
- Severity classification is now handled automatically, thanks to a well-documented severity framework and validation against historical incidents.
- AI-generated status updates free operators to focus on mitigation, with user journeys mapped upfront so AI understands which services truly impact customers.
The prioritization of enablement
Enablement in the AI era applies to both people and the AI agents they work with. Organizations need to invest in training their employees while also providing agents with the context, guardrails, and infrastructure required to operate effectively.
“Companies that aren’t prioritizing teaching their people how to work with the LLMs are missing the forest through the trees,” Okta’s Dennis Henry said in the “AI and Automation in Action” session led by Rukmini.
For people, that enablement means learning to partner effectively with AI agents.
In the same session, Rama Akkiraju, who leads AI initiatives at NVIDIA IT, shared a quote from her CEO, Jensen Huang, that resonated with her: “IT departments are increasingly becoming the HR for AI agents.”
IT is responsible for onboarding agents with the right enterprise context, defining what they are authorized to access and act on, evaluating the quality of their outputs, and maintaining their access and permissions over time.
The responsibility of enablement also applies to how teams are empowered to build, use, and trust AI systems in practice. NVIDIA built what they call their “AI factory,” a self-serve platform that gives their teams the building blocks they need to work with AI effectively:
- Pre-built connectors to logs, traces, metrics, alerts, and tickets
- Secure data pipelines for both structured and unstructured information
- Agent blueprints for quick assembly of common workflows
- Natural language interfaces to the platform itself
As Rama explained, their SRE teams “are so busy keeping the systems up and operational… they don’t have the time to step back and rethink the process or build some of these agents.”
The “AI factory” addresses that constraint by making agent development repeatable and self-serve, rather than requiring individual teams to carve out time to build one-off solutions.
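As a rough illustration of the blueprint idea, here is a hedged sketch of how an agent might be assembled from shared building blocks. The interfaces (`Connector`, `build_triage_agent`, `llm`) are hypothetical; NVIDIA has not published this code:

```python
from typing import Callable, Protocol

# Hypothetical interfaces for illustration only; this is not NVIDIA's platform code.

class Connector(Protocol):
    def fetch(self, query: str) -> list[dict]: ...

def build_triage_agent(
    connectors: dict[str, Connector],   # pre-built: logs, metrics, alerts, tickets
    llm: Callable[[str], str],          # any text-completion function
) -> Callable[[str], str]:
    """Assemble a read-only triage agent from shared, reusable building blocks."""
    def triage(incident_id: str) -> str:
        context_parts = []
        for name, connector in connectors.items():
            records = connector.fetch(f"incident:{incident_id}")
            context_parts.append(f"{name}: {records[:5]}")  # keep the prompt small
        prompt = (
            "Summarize the likely cause of this incident using only the data below.\n"
            + "\n".join(context_parts)
        )
        return llm(prompt)
    return triage
```

Because the connectors and prompt assembly are shared, a team can stand up a new agent by wiring existing blocks together instead of building data access from scratch.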
Applying AI to the right situations
One of the most practical insights from our session with Tom from Warner Bros. Discovery was this simple framework for knowing where and how to use AI:
- Automate the well-understood.
- Augment the partially understood.
- Keep humans focused on the novel situations.
Well-understood situations are perfect candidates for full automation. During major incidents, operations teams face predictable but time-consuming tasks that follow established patterns. Take incident communications, for example.
“Those status updates, those are well-understood pieces of work,” Tom explained. When a critical service goes down, stakeholders need regular updates on what’s happening and when it might be resolved.
“You know that your stakeholders are going to expect messages every 15 minutes, every 30 minutes, and there’s a synthesis process associated with that.”
It’s important work, but it follows established templates and pulls from known information sources. This is exactly the kind of routine but critical task that AI can handle well, freeing engineers to focus on actually fixing the problem.
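A minimal sketch shows why this kind of work automates well: the inputs are structured and the cadence is known, so the synthesis step, whether a template or an LLM call, can be driven entirely by data the system already has. The names below (`IncidentState`, `draft_status_update`) are illustrative, not Warner Bros. Discovery’s tooling:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Illustrative sketch only: field names and the template are assumptions,
# not Warner Bros. Discovery's actual status-update workflow.

@dataclass
class IncidentState:
    title: str
    severity: str
    impacted_journeys: list[str]   # e.g. ["play-video"]
    current_action: str            # what responders are doing right now
    next_update_minutes: int = 30

def draft_status_update(state: IncidentState) -> str:
    """Synthesize a stakeholder update from known, structured incident data."""
    now = datetime.now(timezone.utc).strftime("%H:%M UTC")
    return (
        f"[{now}] {state.severity}: {state.title}\n"
        f"Customer impact: {', '.join(state.impacted_journeys)}\n"
        f"Current action: {state.current_action}\n"
        f"Next update in {state.next_update_minutes} minutes."
    )

print(draft_status_update(IncidentState(
    title="Playback errors in EU region",
    severity="SEV-2",
    impacted_journeys=["play-video"],
    current_action="Rolling back the 14:05 deploy of playback-api",
)))
```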
In partially understood scenarios, AI can shine as an augmentation tool. These situations seem familiar, but vary enough that you want human validation before taking action.
During an incident, Tom encountered alerts related to a feature called “Limited Free Experience,” which he wasn’t familiar with. Instead of spending precious minutes during an active incident researching what this feature did and how it might be affected, he asked their AI agent for context.
“I turned to my handy-dandy agent, and it immediately gave me back a report that provided information on the capability,” Tom explained. The AI quickly explained what the Limited Free Experience was, how it worked, and which services it relied on, giving Tom the context he needed.
Novel situations require full human leadership, though AI can still surface relevant context.
Dennis Henry from Okta echoed this sentiment during the “AI and Automation in Action” session.
“LLMs are great at taking history and looking at things over a very large amount of data and parsing through that. But the one thing they can’t do until we get AGI is that it cannot come up with solutions to new and novel problems,” Dennis said.
In these situations, AI can help by quickly surfacing relevant historical data, similar patterns, or related documentation, but the problem-solving, decision-making, and creative thinking must come from humans. The goal is to provide people with better information faster so they can focus on what they do best.
Governance in practice: expectations, permissions, and validation
As organizations introduce AI agents into incident response, governance becomes critical. Teams need clear standards for how agents justify decisions, strict controls over what they’re allowed to do, and validation processes that build trust before agents are used in production.
When asked in Rukmini’s “AI and Automation in Action” session how he would handle scenarios where two AI agents might disagree during a high-visibility incident, Okta’s Dennis Henry said it’s the same as when two SREs disagree: “They have to show their work.”
“I need that ‘show your work’ concept because that’s how I would handle two humans bringing me competing theories. I’m gonna expect the same from an AI to tell me and show me its data and say like, hey, I grabbed this graph from here and this RCA from here, and this trace from here, and because of these things, I think it’s X.”
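One hypothetical way to encode that expectation is to require every agent finding to carry its evidence, so competing conclusions can be compared the same way competing human theories would be. This structure is an assumption for illustration, not Okta’s implementation:

```python
from dataclasses import dataclass

# Hypothetical structure: one way to force an agent to "show its work" by
# returning cited evidence alongside its conclusion.

@dataclass
class Evidence:
    source: str      # e.g. "grafana", "past RCA", "trace"
    reference: str   # link or ID a human can open and verify
    summary: str     # what the agent took from it

@dataclass
class AgentFinding:
    conclusion: str            # "I think it's X"
    confidence: float          # the agent's own estimate, 0.0 to 1.0
    evidence: list[Evidence]   # no evidence, no finding

def is_admissible(finding: AgentFinding) -> bool:
    """Reject findings that don't cite any verifiable evidence."""
    return len(finding.evidence) > 0 and all(e.reference for e in finding.evidence)
```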
That expectation establishes accountability, but accountability alone isn’t enough. Governance also requires clear boundaries around what AI systems are allowed to do.
“We never can let go of the table stakes that should be guiding us all, and that’s the security of our systems and the security of our data,” he said.
In practice, his team defaults agents to read-only access, with explicit approval gates for any write operations—especially destructive actions like deleting files, terminating services, or rolling back deployments.
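A simple sketch of that policy, using illustrative action names rather than Okta’s actual tooling, might look like this:

```python
# Illustrative policy check, not Okta's implementation: agents default to
# read-only, and anything that writes (or destroys) requires explicit approval.

READ_ONLY_ACTIONS = {"get_logs", "query_metrics", "read_runbook", "list_services"}

def authorize(action: str, approved_by_human: bool = False) -> bool:
    """Allow reads by default; gate every write behind explicit human approval."""
    if action in READ_ONLY_ACTIONS:
        return True
    # Anything else, including unknown actions, is treated as a write.
    return approved_by_human

assert authorize("query_metrics") is True
assert authorize("rollback_deployment") is False
assert authorize("rollback_deployment", approved_by_human=True) is True
```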
In our “AI-Driven Automation” session with Warner Bros. Discovery, Tom said that before any agent goes into production, his team conducts extensive backtesting.
“We ran through a number of different incidents that were submitted by customer support, by product, by other engineers, and parsed that into the agent and then validated—did it come out with a severity that we eventually landed at?”
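In code, that backtesting loop can be as simple as replaying history and measuring agreement. The sketch below assumes hypothetical names (`classify_severity`, `historical_incidents`, a `final_severity` field) rather than Warner Bros. Discovery’s actual pipeline:

```python
# Minimal backtesting sketch under assumed names: `classify_severity` stands in
# for the agent being validated, and `historical_incidents` for past incidents
# with the severity humans eventually landed on.

def backtest(classify_severity, historical_incidents):
    """Compare agent-assigned severity against the human-decided outcome."""
    results = []
    for incident in historical_incidents:
        predicted = classify_severity(incident["description"])
        results.append({
            "incident": incident["id"],
            "predicted": predicted,
            "actual": incident["final_severity"],
            "match": predicted == incident["final_severity"],
        })
    agreement = sum(r["match"] for r in results) / len(results)
    return agreement, results

# Example: only promote the agent if it agrees with human operators often enough.
# agreement, details = backtest(my_agent.classify, past_incidents)
# if agreement < 0.9:
#     print("Keep the agent in shadow mode and review the mismatches in `details`.")
```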
Both sessions pointed to a consistent approach to AI governance at Warner Bros. Discovery and Okta: AI agents are held to the same expectations as human responders when it comes to explaining decisions, but they operate within tighter permission boundaries and are validated rigorously before being trusted in production.
A pragmatic roadmap for scaling AI operations
Across the conversations at AWS re:Invent, a consistent pattern emerged. Organizations that see real value in AI are not pursuing autonomy for its own sake. They are investing in operational discipline to enable both people and AI agents to work effectively, and applying clear frameworks to determine where automation belongs.
That means pairing structured data and repeatable processes with human judgment, enforcing governance and security from the outset, and validating AI systems before trusting them in high-stakes environments.
The lesson from re:Invent is not to move faster with AI, but to move more deliberately. Teams that align AI to the right work, apply it with guardrails, and invest in enablement are turning experimentation into durable operational advantages.
Watch the full discussions: