From Chaos to Actionable Insights with PagerDuty Integrations and Automation
It’s 2023. In today’s world, every company and individual, regardless of their industry, relies on software to increase productivity. Our users expect our technology to be available and reliable at all times. If your software serves businesses within a single country during regular working hours, they expect it to be available throughout that time. Easy, right?
However, if your software serves customers worldwide, 24/7, with a need for low latency, you will have to run your services in multiple regions and have teams supporting customers in different locations.
While the latter scenario may seem more complex, the same principles apply. Inevitably, something will fail unexpectedly, and chaos will rise during times of stress, such as incidents and service outages. So, be prepared.
Making Sense Out of Chaos
Our services today are distributed and leverage different platforms, hardware and software components, some of which we don’t even manage ourselves. Whenever something breaks, we find ourselves in “solving the mystery” mode. Although I grew up reading Sherlock Holmes’s adventures, I don’t enjoy playing detective under pressure. It’s time to fix this!
PagerDuty Operations Cloud serves as a central hub for all events coming from any tool you already use. It doesn’t require you to change the CI/CD platform, ITSM or monitoring tools you are using. You simply integrate them with PagerDuty by taking advantage of our 700+ built-in integrations or by creating your custom integration using our Events or REST APIs.
Once you enable integrations on your services, PagerDuty’s AIOps capabilities will intelligently process and aggregate events and associate them with the target services. This reduces the number of incidents created and enriches existing ones with relevant information that will help you identify the root cause of the issue.
From an incident responder’s perspective, you want to be notified as soon as a problem is identified and to have access to all the information about what happened before and after the incident was triggered. The PagerDuty integration with Amazon CloudWatch is one example: it notifies you once your resources enter an alarm condition. Alarms triggered in AWS generate alerts in PagerDuty, which might result in incidents.
Another example is to have GitHub send all changes made to the code base into PagerDuty, so the incident responder knows when something new was deployed and can analyze the potential impact of those changes.
Using the APIs
For integrations that require higher frequency, such as monitoring or observability tools, we recommend using the Events API due to its higher rate limits and reliability. However, it is important to be aware of API response codes and approaches to retry your requests in case of errors.
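As a rough illustration of the retry advice above, here is a minimal sketch of exponential backoff around an event submission. It assumes that 429 (rate limited) and 5xx responses are worth retrying while other 4xx responses are not, which is how I read the documented response codes, so please check the current API documentation before relying on it. The function names and default values are hypothetical, and `send` stands in for whatever function actually posts your event and returns the HTTP status code.

```python
import time
from typing import Callable, List


def should_retry(status: int) -> bool:
    """Assumption: retry on rate limiting (429) and server errors (5xx) only."""
    return status == 429 or 500 <= status <= 599


def backoff_delays(max_attempts: int, base: float = 1.0, cap: float = 60.0) -> List[float]:
    """Delays between attempts: base, 2*base, 4*base, ... each capped at `cap`."""
    return [min(cap, base * 2 ** i) for i in range(max_attempts - 1)]


def send_with_retry(
    send: Callable[[], int],
    max_attempts: int = 5,
    base: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `send` until it returns a non-retryable status or attempts run out.

    Returns the last HTTP status code seen.
    """
    status = send()
    for delay in backoff_delays(max_attempts, base):
        if not should_retry(status):
            break
        sleep(delay)  # wait before the next attempt
        status = send()
    return status
```

In a real integration you would probably add some random jitter to the delays so that many clients do not retry at the same moment.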
Events sent via the API are directed to a PagerDuty service and processed. They can result in the creation of a new alert and/or incident, or the update or resolution of an existing one.
The Events API supports two types of events:
- Alert Events – Monitoring tools send an event to PagerDuty to report a new problem or to update an ongoing one, depending on the event action (trigger, acknowledge or resolve).
- Change Events – The Change Events API allows you to send informational events about recent changes, such as code deployments and system configuration changes, from any system that can make an outbound HTTP connection. These events do not create incidents or send notifications, but they are displayed in the context of incidents on the same PagerDuty service.
To route your events effectively, the Events API uses two different endpoints: https://events[.eu].pagerduty.com/v2/enqueue for Alert events, and https://events[.eu].pagerduty.com/v2/change/enqueue for Change events. The optional .eu segment applies to accounts in the EU service region. Once you add the Events API v2 integration to your service, you will receive the URLs for your account along with an Integration Key for your service (refer to the image below).
With this, you can integrate virtually any service, tool or platform with PagerDuty Operations Cloud without depending on the native integrations provided by PagerDuty.
Dormain Drewitz, PagerDuty’s VP of Platform Advocacy, recently had a conversation with Nakul Bhagat from the Product team on PagerDuty’s APIs. It is worth watching if you are looking for more details on how to use them.
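To make the two endpoints more concrete, here is a minimal sketch of how a custom integration might build and send both kinds of events using only the Python standard library. The field names (routing_key, event_action, dedup_key, payload, summary, source, severity) reflect my understanding of the Events API v2 format, and the helper names are hypothetical, so treat this as a starting point rather than a reference implementation. The routing key is the Integration Key you receive for your service.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone
from typing import Optional


def events_url(change: bool = False, eu: bool = False) -> str:
    """Pick the Alert or Change endpoint, for the default or the EU region."""
    host = "events.eu.pagerduty.com" if eu else "events.pagerduty.com"
    path = "/v2/change/enqueue" if change else "/v2/enqueue"
    return f"https://{host}{path}"


def build_trigger_event(routing_key: str, summary: str, source: str,
                        severity: str = "error",
                        dedup_key: Optional[str] = None) -> dict:
    """Alert event that reports a new problem or updates an ongoing one."""
    event = {
        "routing_key": routing_key,  # the service's Integration Key
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }
    if dedup_key:
        event["dedup_key"] = dedup_key  # groups later events with this one
    return event


def build_resolve_event(routing_key: str, dedup_key: str) -> dict:
    """Alert event that resolves the problem identified by dedup_key."""
    return {"routing_key": routing_key, "event_action": "resolve",
            "dedup_key": dedup_key}


def build_change_event(routing_key: str, summary: str,
                       source: Optional[str] = None,
                       timestamp: Optional[str] = None) -> dict:
    """Informational Change event, e.g. a deployment; carries no event_action."""
    payload = {
        "summary": summary,
        "timestamp": timestamp or datetime.now(timezone.utc).isoformat(),
    }
    if source:
        payload["source"] = source
    return {"routing_key": routing_key, "payload": payload}


def post_event(url: str, event: dict, timeout: float = 10.0) -> int:
    """POST the event as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 400 for a malformed event, 429 when rate limited
```

For example, a deployment script could call post_event(events_url(change=True), build_change_event("YOUR_INTEGRATION_KEY", "Deployed checkout v1.2.3")), where the key and the summary are placeholders for your own values.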
The Right People, At the Right Time
Now that you have all data insights flowing into PagerDuty Operations Cloud and routed to the correct services, you need to consider involving the right people at the right time. For incident responders, this is achieved by defining Teams, On-call schedules and Escalation Policies. For other stakeholders, this is typically done through the use of Status Pages, Status Updates or by adding them as subscribers to an active incident.
In addition to the notification mechanisms configured on user accounts, PagerDuty offers a range of integrations and extensions that allow teams to remain within the tools they already use on a daily basis, therefore reducing the need for context switching and facilitating easier adoption.
As an example, when you enable PagerDuty integrations for existing communication platforms such as Slack or Microsoft Teams, you allow every person in the organization to be notified of, contribute to and stay aware of what is happening with a specific incident that may be causing issues in different parts of the business. The creation of incident channels and the addition of relevant responders and stakeholders to them can be automated with Incident Workflows.
By following these steps, you will be well-prepared to provide proper service support. As a result, your customers will be happier than before. But can you take it even further?
Saving Time with Automation
Automation plays a significant role in PagerDuty Operations Cloud as it allows you to automate repetitive tasks and securely provide limited capabilities to others in a self-service manner (see examples here). By incorporating automation into your workflow, you can reduce the likelihood of errors and improve the efficiency of the engineers who use it.
When running services on a Cloud platform, there are multiple potential points of failure even before reaching the application. You can automate platform diagnostics with Process Automation or Runbook Automation within Incident Workflows. Instead of dumping the full logs, you can output these diagnostics to your incident timeline in a readable format.
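To give an idea of what "a readable format" could look like, here is a small sketch that turns the results of a few diagnostic checks into a short summary instead of raw logs. Everything in it is hypothetical: the CheckResult type, the check names and the output layout are illustrations of the idea, not something produced by Process Automation or Runbook Automation themselves.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CheckResult:
    """Outcome of one diagnostic check (hypothetical structure)."""
    name: str
    ok: bool
    detail: str = ""


def format_diagnostics(results: List[CheckResult]) -> str:
    """Render check results as a short, human-readable summary."""
    passed = sum(1 for r in results if r.ok)
    lines = [f"Diagnostics: {passed}/{len(results)} checks passed"]
    for r in results:
        status = "OK" if r.ok else "FAIL"
        line = f"[{status}] {r.name}"
        if r.detail:
            line += f": {r.detail}"  # keep only the detail a responder needs
        lines.append(line)
    return "\n".join(lines)
```

The resulting text could then be attached to the incident, for instance as a note, so that it shows up alongside the other incident activity.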
This not only lets incident responders quickly understand where the issues are, but also keeps other stakeholders aware of the work being done to resolve the incident.
One Step at a Time
In this blog post, you have learned about the capabilities of PagerDuty Operations Cloud in reducing noise and enabling effective incident resolution. When implemented correctly, having a strategy for handling incidents and being on-call can bring significant benefits. Your customers will be happier, your business will thrive, and your teams will be more satisfied with their work and the knowledge they acquire.
However, it is important not to rely on tools alone. Start with small steps, gather insights, involve others, and focus on what is relevant for your customers and your business.
Let Us Know What You Think!
Have you started using our REST or Events APIs? Let us know by filling in this short survey!