Your Next Incident Has Already Started. You Just Haven’t Noticed Yet.

by David Williams October 24, 2025 | 4 min read

The best way to minimize the impact of an incident is to catch it early, before small issues snowball into major disruptions. That requires maintaining healthy systems and ensuring sufficient resources are available when problems arise. But developers and IT operations pros working in large enterprises face a challenge: Complex systems operate in an inherently degraded state. In his essay “How Complex Systems Fail,” Dr. Richard Cook writes that “the complexity of these systems makes it impossible for them to run without multiple flaws being present.” 

To build resilience, tech teams must first understand the ways in which a complex system can fail. Then, they must work to proactively address failures before they turn into major incidents. 

Historically, this level of resilience has been elusive. Systems are generally too large and complex for teams to proactively monitor every aspect. With the advent of agentic AI, however, CTOs and CIOs now have a powerful tool for closing that gap.

The perfect storm: How minor changes can trigger an incident

Complex systems are often just one minor incident away from catastrophic failure. Cloudflare CIO Mike Hamilton, speaking at PagerDuty on Tour, noted that “the vast majority of major incidents that take place on our platform start with the genesis of a change that was deployed.”

On its own, rolling back a change should not be difficult. In a complex system, however, technical debt, siloed operations, and a lack of visibility into dependencies can quickly turn a minor incident into something catastrophic. 

Two real-world incidents show just how easily this can happen.

Slack disruption

On January 4, 2021, a minor network routing issue at Slack’s cloud provider caused widespread packet loss, disrupting communication between backend services.

Slack’s autoscaling systems attempted to spin up new infrastructure, but provisioning failed due to the same underlying network issue. As a result, healthy services were mistakenly marked as unhealthy, triggering cascading restarts and service removals.

Within minutes, a small routing issue became a multi-hour incident that impacted logins, messaging, and file uploads across the globe.

Fastly incident

On June 8, 2021, a configuration change surfaced a previously undiscovered software bug in Fastly’s edge infrastructure, leading to nearly an hour of downtime.

Even though internal monitoring detected the issue quickly, automated failovers and redundancies couldn’t prevent the widespread impact, and major websites like Amazon, Reddit, and CNN went dark.

Both Slack and Fastly experienced incidents set off by a single small trigger. In each case, complexity made it nearly impossible to anticipate how that trigger would ripple through the system.

What are the consequences of failure?

With social media as a signal multiplier, reputational damage following an incident can be widespread and severe. “When technology fails, customers don’t blame the technology; they blame your brand,” notes Jeff Hausman, PagerDuty’s Chief Product Development Officer. Even a minor incident can lead to expensive long-term consequences.

The fallout occurs in minutes, not hours, and the cost of the damage is enormous. In a 2024 survey conducted by Information Technology Intelligence Consulting, 41% of enterprise organizations anticipated that a significant incident would cost them $1 million to $5 million per hour.

How to build resilience and anticipate failure using agentic AI

Agentic AI helps resolve incidents in several ways. In well-understood circumstances, where the cause of degradation is known, AI agents can run an automated remediation that corrects the issue without involving an SRE. In partially understood incidents, they can suggest a runbook to the SRE and provide historical context that helps responders make decisions.
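As a rough illustration, the routing described above can be sketched as a small decision function. This is a minimal sketch, not PagerDuty's implementation: the `Diagnosis` fields, the `Action` names, and the `triage` function are all hypothetical, and the fallback to a human responder when nothing matches is an added assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    AUTO_REMEDIATE = "auto_remediate"    # agent fixes the issue on its own
    SUGGEST_RUNBOOK = "suggest_runbook"  # agent proposes, the SRE decides
    ESCALATE = "escalate"                # assumed fallback: hand off to a human


@dataclass
class Diagnosis:
    # Hypothetical fields, for illustration only.
    cause_known: bool       # the cause of the degradation is understood
    runbook: Optional[str]  # a runbook that may apply, if one exists


def triage(diagnosis: Diagnosis) -> Action:
    """Route an incident by how well it is understood."""
    if diagnosis.cause_known and diagnosis.runbook is not None:
        return Action.AUTO_REMEDIATE
    if diagnosis.runbook is not None:
        return Action.SUGGEST_RUNBOOK
    return Action.ESCALATE


if __name__ == "__main__":
    print(triage(Diagnosis(cause_known=True, runbook="restart-cache")))
    print(triage(Diagnosis(cause_known=False, runbook="restart-cache")))
    print(triage(Diagnosis(cause_known=False, runbook=None)))
```

In this sketch the agent acts alone only when both the cause and a matching runbook are known, so anything less certain still passes through a human.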

Just as importantly, agentic AI can surface early warning signals that humans wouldn’t otherwise notice. Using insights gathered from previous incidents, AI connects the dots proactively to spot well-understood issues before they lead to an incident.

AI agents are always on, always analyzing. They ingest your entire event stream, using historical and current data to spot patterns before humans can. With that data, they can suggest and even execute automation on behalf of the responder. 

The result is that false positives, transient spikes, and other noise are filtered out, manual work is reduced, and experts can focus on the incidents that require their expertise. In well-understood cases, an agentic SRE can take action to contain the problem before it impacts the customer experience.
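One simple way to picture how transient spikes get filtered out is a rule that raises an alert only when a metric stays above a threshold for several samples in a row. The sketch below is an assumption-laden illustration, far simpler than what an AI agent would do: the function name, the threshold, and the `min_consecutive` parameter are all hypothetical.

```python
from typing import Iterable, List


def sustained_breaches(
    samples: Iterable[float],
    threshold: float,
    min_consecutive: int = 3,
) -> List[int]:
    """Return the sample indices at which a breach becomes sustained.

    A breach counts as sustained once `min_consecutive` samples in a row
    exceed `threshold`. Shorter runs are treated as transient noise.
    """
    alerts: List[int] = []
    run = 0  # length of the current run of samples above the threshold
    for index, value in enumerate(samples):
        run = run + 1 if value > threshold else 0
        if run == min_consecutive:
            # Alert once per sustained breach, not on every later sample.
            alerts.append(index)
    return alerts


if __name__ == "__main__":
    # Hypothetical latency readings in milliseconds.
    latencies = [120, 900, 130, 125, 880, 910, 950, 940, 140]
    print(sustained_breaches(latencies, threshold=500))
```

With these made-up readings, the single spike at index 1 is ignored, while the run that starts at index 4 raises one alert when its third sample arrives.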

Empower experts, improve your resilience

Your experts have the deep knowledge and situational awareness that make them the best people to handle complex issues. But they’re often stuck doing manual work, and AI agents help teams break out of that cycle.

With agents absorbing the unpredictable work brought on by systems that are more complex than ever, teams spend less time on operations and more time on innovation. The result is empowered experts, resilient systems, and faster recovery when incidents occur.

Explore PagerDuty’s AI for critical operations to learn more.