PagerDuty Blog

Reduce MTTR and Take Automation to a New Level with PagerDuty Global Event Orchestration

PagerDuty’s Global Event Orchestration is now generally available. Global Event Orchestration’s powerful decision engine enriches events, controls their routing, and triggers self-healing actions based on event data. Teams can use this functionality across any or all services within PagerDuty. This feature is a continued investment in Event Orchestration, demonstrating PagerDuty’s commitment to providing customers with best-in-class automation capabilities.

Customers in our early access program are already seeing value in Global Event Orchestration, touting reduced MTTR and better standardization of incident response at scale. As Kiril Yurovnik, Technical Lead at Riskified, said, “With a growing number of events, minimizing noise and toil is imperative, especially as organizations aim to optimize their IT processes amid the current economic environment. We’ve been using PagerDuty’s Global Event Orchestration as part of the early availability program, and the results have been strong. Riskified has been able to scale noise reduction, especially from non-production environments, saving our team valuable time to spend time innovating on what’s next.” 

What are Global Event Orchestrations?

Global Event Orchestration is like Service Event Orchestration in that it allows users to define complex rules that determine what happens to an event as it is processed. The difference is that Global Event Orchestration enriches and normalizes events at ingest, before they are routed. Once the data is normalized, each event is routed to a service based on criteria the team defines. This ensures that responders have the best event data possible to begin the response process.
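To make the enrich-then-route order concrete, here is a minimal sketch in Python. It is a conceptual model only, not PagerDuty's implementation or API: the Event class, the severity aliases, the hostname convention, and the service names are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """Hypothetical event shape; not PagerDuty's actual schema."""
    summary: str
    source: str
    severity: str = "unknown"
    custom: dict = field(default_factory=dict)


# Assumed aliases that different monitoring tools might send.
SEVERITY_ALIASES = {"crit": "critical", "err": "error", "warn": "warning"}


def enrich(event: Event) -> Event:
    """Normalize severity and derive an environment field at ingest."""
    sev = event.severity.strip().lower()
    event.severity = SEVERITY_ALIASES.get(sev, sev)
    # Assumed convention: hostnames start with the environment, e.g. "prod-db-01".
    event.custom["environment"] = event.source.split("-", 1)[0]
    return event


def route(event: Event) -> str:
    """Pick a (hypothetical) service name using only normalized fields."""
    if event.custom.get("environment") != "prod":
        return "non-production-triage"
    if event.severity == "critical":
        return "database-on-call" if "db" in event.source else "platform-on-call"
    return "platform-low-urgency"


def process(event: Event) -> str:
    """Enrich first, then route, mirroring the order described above."""
    return route(enrich(event))
```

Because routing reads only the normalized fields, a "CRIT" event from one tool and a "critical" event from another land on the same service, and non-production noise can be diverted in one place.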

Global Event Orchestration has three key components that make it successful for scaling incident response. 

Global Orchestration Rules allow users to apply actions to events across services. Teams can create rules that process event data across services and use the processed data to improve event routing. This empowers organizations to establish and improve auto-remediation, so that a human doesn’t need to be involved to resolve an incident. More intelligent routing also reduces the blast radius of an incident.

Enhanced integration key management reduces the workload of managing integration keys for different monitoring tools. This allows users to combine integration keys into one event orchestration. Even better, enhanced integration key management is now available for all PagerDuty plans.

Additional APIs allow for management at scale. Teams can use REST APIs for event source and Global Orchestration Rule management. Both of these APIs have Terraform support. These APIs are in addition to the REST APIs for Event Orchestration/Service Orchestration management.

“Leveraging PagerDuty’s Global Event Orchestration has been critical to ensure that our event routing processes are efficient and scalable to optimize IT operations and spend,” said Brian Long, Cloud Infrastructure Engineer at Hyland. “With Global Event Orchestration, our organization is able to detect the ‘resolved’ condition from our notifications to execute as a resolve and reduce the number of places these conditions need to be configured by at least a factor of three. This frees up our time to focus on innovation, not configuration.”

How can Global Event Orchestration help my team?

With Global Event Orchestration, teams will see:

  • Codified incident response processes: democratize and distribute well-understood incident responses across distributed teams
  • Fewer incidents: use contextual event data from all services within your ecosystem to improve suppression accuracy
  • Faster resolution: apply automation across teams and enable automated diagnostics at scale with standardized enrichment and data normalization

How teams use Global Event Orchestration may vary based on organizational structure. Capabilities align with two groups: operations teams (ITOps, SRE, and NOC) and developer teams.

ITOps teams will be able to capitalize on the event normalization capabilities, ensuring that all events look the same as they come in.

SRE teams can create and extend automation across any or all services within a technical ecosystem. This makes scaling and standardizing automation across an organization easier than ever.

For L1 response teams such as NOCs, Global Event Orchestration helps them handle the massive incoming wave of events. Events can be routed to the NOC if they meet certain criteria. And, as the event passes through levels of rules and nested rules, automation can deliver diagnostics to the L1 responder. If the fix for an incident is well-known, organizations can create auto-remediation.
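The idea of an event passing through levels of rules and nested rules can be sketched as a small rule tree in Python. This is an illustrative model under stated assumptions, not PagerDuty's rule engine: the Rule class, the first-match-per-level evaluation, and the action strings such as "route:noc" are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Rule:
    """Hypothetical rule: a condition, actions to apply, and nested rules."""
    name: str
    matches: Callable[[dict], bool]
    actions: list = field(default_factory=list)
    children: list = field(default_factory=list)


def evaluate(rules: list, event: dict) -> list:
    """Collect actions from the first matching rule at each level.

    Assumption: within a level, rules are checked in order and the first
    match wins; evaluation then descends into that rule's nested rules.
    """
    for rule in rules:
        if rule.matches(event):
            return rule.actions + evaluate(rule.children, event)
    return []


# Example tree: diagnostics for every disk alert, auto-remediation when the
# fix is well known, otherwise hand the event to the NOC.
RULES = [
    Rule(
        "disk alerts",
        lambda e: e.get("class") == "disk",
        actions=["attach_diagnostics:disk_usage"],
        children=[
            Rule("known /tmp fill", lambda e: "/tmp" in e.get("summary", ""),
                 actions=["auto_remediate:clean_tmp"]),
            Rule("other disk alerts", lambda e: True, actions=["route:noc"]),
        ],
    ),
    Rule("catch-all", lambda e: True, actions=["route:default-service"]),
]
```

With this tree, a disk alert mentioning /tmp collects the diagnostics action and then the auto-remediation action, while any other disk alert collects diagnostics and is routed to the NOC, so the L1 responder starts with diagnostic data already attached.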

Developer teams will see fewer incidents and faster resolution. With auto-remediation, incidents can be resolved before they even hit the services that the developer teams are on call for. And, with in-depth routing criteria, incidents don’t bounce from team to team. If neither automation nor L1 responders such as the NOC can resolve an incident, it goes to the subject matter expert (SME). By the time the SME begins to work on the incident, diagnostic information is already available, reducing resolution time.

How can I get started today?

Global Event Orchestration is generally available for all PagerDuty AIOps customers. To see it in action, join us on Twitch Friday, April 14. 

PagerDuty AIOps helps teams experience fewer incidents, faster resolution, and greater productivity without long implementations or heavy ongoing maintenance. To try PagerDuty AIOps, you can request a trial here or take our product tour. If you want to talk to sales, contact us through this form.

To learn more about Global Event Orchestration, register for this webinar. If you’re a PagerDuty AIOps customer looking to create your first Global Event Orchestration, this knowledge base article can show you how to get started.