Over the past few years, PagerDuty has alerted thousands of users, letting them know when their systems are down. It’s what we do, and we’re proud to be seen as an integral part of their monitoring solution. Every once in a while we come across a customer that is using PagerDuty in such a way that it even makes us say COOL! One such customer is Cascadeo. Below is their story.
Cascadeo is an IT operations company that provides long-term DevOps infrastructure and operations support for a wide variety of clients. With a staff of top-talent systems and network engineers, project managers, and its own worldwide 24/7 NOC, Cascadeo gives its customers access to a highly experienced DevOps team. Customers can focus on their applications, from initial development through growth, while the team at Cascadeo supports the critical infrastructure on which those applications run.
At the core of Cascadeo’s product offering is a platform they developed called “The Cascadeo Operations Support System”, also known as COSS. COSS is a clustered application that runs across multiple regions on Amazon Web Services (AWS). COSS uses Amazon RDS for its database backend, and its function is to integrate all operations support systems into a cohesive ecosystem. Cascadeo uses a wide variety of SaaS tools in their operations, including PagerDuty (escalations/guaranteed-delivery messaging), Zendesk (workflow), Harvest (time tracking), and a number of others. A critical requirement for each tool is that it offer a rich set of REST APIs for integration. COSS acts as the routing bus for all of these systems, either by calling their APIs or by exposing REST endpoints that other systems use to access COSS (e.g. the COSS Alerts API).
Cascadeo provides each of their customers with an instant messaging operations room. That room holds the entire Cascadeo team dedicated to the customer (NOC, PMs, Lead Engineers, Sys Admins, etc.) and everyone from the customer’s IT group (engineers, managers, etc.). All communications, including maintenance windows, status updates, and requests for assistance, happen in this room. A virtual ops room gives full transparency into all IT service requests and issue resolution, and it keeps a log of all communications, which is crucial for audit purposes.
Cascadeo uses PagerDuty in two ways: as an escalation platform to activate on-call teams and as a guaranteed-delivery messaging platform. Taking things to a new level, Cascadeo has integrated PagerDuty into their COSS platform through PagerDuty’s API. Each Cascadeo team member has a service assigned to him or her within PagerDuty. Using a set of commands in the COSS instant messaging room, any Cascadeo employee can issue notifications at normal, urgent, or emergency priority. These requests are sent to PagerDuty, which alerts the team member who can address the customer’s needs. Once that team member has been notified, all responses (ACKs, resolutions, or escalations) are captured back in the instant messaging room.
Here is a sample transcript:
Cascadeo NOC D Ludy:
Device: x7.acme.com
IP Address: 10.25.0.13
Time: 2012/11/04 13:53:53.000
CRITICAL: 10.25.0.13 did return a CNAME record
== Escalation notes ==
=== Tier notes ===
=== Device notes ===
* Post in Cascadeo and Acme chatroom and indicate as an URGENT issue.
* If the event happens to be a non-urgent (as confirmed by the engineer), remind the engineer to put a transform for Event escalation notes.
* Create an URGENT ticket for Cascadeo SE and post ticket # in Acme chatroom. Note in the ticket that the Acme contact is: Fred Flintstone/Barney Rubble.
* How to access x7.acme.com — https://sites.google.com/a/acme/ops/runbook/incident-response-recipes#TOC-Responding-to-issue-with-x7.acme.com
* Create an URGENT ticket for Cascadeo SE and post ticket # in Acme chatroom. Note in the ticket that the Acme contact is:Fred Flintstone / Barney Rubble.
* How to access x7.acme.com — https://sites.google.com/a/acme.com/ops/runbook/incident-response-recipes#TOC-Responding-to-issue-with-x7.acme.com
.oncall [casc/int/systems, 32962] Pls check 32962 re Acme : Systems : fisheye.acme.com (172.30.0.14) CRITICAL: 10.25.0.13 did return a CNAME record/Time: 2012/11/04 13:53:53.000
NOC, I’ve successfully sent event coss_102-32962 to COSS oncall macro – Systems via PagerDuty.
NOC, PagerDuty is reporting that incident coss_102-32962 (for COSS oncall macro – Systems) was acknowledged by Romel Emperado on 2012-Nov-4 06:01AM PST.
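To make the `.oncall` line in the transcript more concrete, here is a minimal Python sketch of how a chat bot might pull the queue, ticket number, and message out of such a command. The grammar is only inferred from the single sample line above, and the names `ONCALL_RE`, `OncallRequest`, `parse_oncall`, and `event_id` are hypothetical; none of this is COSS’s actual code.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed grammar, inferred from the one sample line in the transcript:
#   .oncall [<queue>, <ticket number>] <free-text message>
ONCALL_RE = re.compile(
    r"^\.oncall\s+\[\s*([^,\]]+?)\s*,\s*(\d+)\s*\]\s*(.+)$",
    re.DOTALL,
)


@dataclass(frozen=True)
class OncallRequest:
    queue: str    # e.g. "casc/int/systems"
    ticket: int   # e.g. 32962
    message: str  # free text shown to the on-call engineer


def parse_oncall(line: str) -> Optional[OncallRequest]:
    """Return the parsed request, or None if the line is not an .oncall command."""
    match = ONCALL_RE.match(line.strip())
    if match is None:
        return None
    queue, ticket, message = match.groups()
    return OncallRequest(queue=queue, ticket=int(ticket), message=message.strip())


def event_id(request: OncallRequest, prefix: str) -> str:
    """Build an ID shaped like the transcript's "coss_102-32962".

    What the prefix means inside COSS is not stated in the post, so the
    caller has to supply it.
    """
    return f"{prefix}-{request.ticket}"
```

Fed the command from the transcript, `parse_oncall` would yield the queue `casc/int/systems` and ticket `32962`, which a bot could then hand to whatever code talks to PagerDuty.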
In addition to creating a dedicated service for each team member, Cascadeo creates on-call queues in PagerDuty that are associated with each customer. This allows the NOC, the lead engineer, or the project manager associated with the client to send alerts to the client’s escalation contacts. PagerDuty’s flexibility in defining escalation policies is very useful in complying with specific customer alerting requirements.
“We live and breathe 24×7 Operations, both in the cloud and in the data center,” says Ophir Ronen, a principal at Cascadeo. “For us, PagerDuty is a key tool in our handling of mission critical operations.”
In addition to leveraging PagerDuty, Cascadeo deploys Zenoss to all of their clients. As part of the provisioning process, they spend a significant amount of time tuning Zenoss to increase the signal-to-noise ratio. When a network management system (NMS) such as Zenoss is first installed, it generates an enormous amount of noise (events which are not actionable).
Cascadeo conducts a series of triage sessions, typically twice per week, where they work with their clients to categorize the top 10 noisiest events as either actionable or noise. Actionable events require remediation and escalation data, which is collated and embedded into the event itself. That way, when the issue recurs, Cascadeo’s NOC and on-call engineers have the remediation/escalation information immediately at hand, which dramatically reduces mean time to repair.
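The "top 10 noisiest events" list for a triage session can be sketched as a simple frequency count. The snippet below is only an illustration: the event shape (dicts with `device` and `summary` keys) and the function name `noisiest_events` are assumptions, not how Zenoss or COSS actually represent events.

```python
from collections import Counter


def noisiest_events(events, top_n=10):
    """Rank (device, summary) pairs by how often they fired.

    `events` is an iterable of dicts with "device" and "summary" keys;
    this shape is assumed for the example.
    """
    counts = Counter((e["device"], e["summary"]) for e in events)
    return counts.most_common(top_n)


if __name__ == "__main__":
    sample = [
        {"device": "x7.acme.com", "summary": "DNS check failed"},
        {"device": "x7.acme.com", "summary": "DNS check failed"},
        {"device": "db1.acme.com", "summary": "Disk 80% full"},
    ]
    for (device, summary), count in noisiest_events(sample):
        print(f"{count:>4}  {device}  {summary}")
```

Each pair that comes out of the ranking would then be labeled actionable or noise during the session.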
The COSS solution lives in the world of web services. Not only does it reach out to PagerDuty’s APIs, but it exposes APIs of its own. For example, through the COSS Alerts API, Cascadeo can receive specific alerts from Zenoss, buffer them and pass them through the COSS platform for tracking, and then forward them to PagerDuty to trigger an incident.
Cascadeo is a rapidly growing company of more than 80 people, distributed across six time zones. They are able to offer extremely high levels of service delivery thanks to their talented teams and the COSS platform. According to Ophir, “Our integration with PagerDuty via our COSS platform allows us to easily activate our resources, distributed around the world, to help our clients. Guaranteed delivery and multi-cloud redundancy is key for us which is why we selected PagerDuty as the tool to handle the critical alert and messaging functions of our OSS.”
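As a rough picture of the last hop in that pipeline, here is a minimal sketch of turning a buffered alert into a PagerDuty trigger event. It targets PagerDuty’s current Events API v2, which is not necessarily the API COSS called at the time. The alert dict shape and the function names `build_trigger_event` and `send_event` are assumptions made for the example.

```python
import json
import urllib.request

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key, alert):
    """Map an alert dict (assumed keys: id, device, summary, severity)
    onto an Events API v2 trigger payload."""
    severity = str(alert.get("severity", "")).lower()
    if severity not in ALLOWED_SEVERITIES:
        severity = "error"  # fallback chosen for this sketch only
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        # Reusing the alert ID as dedup_key lets repeats of the same
        # alert collapse into one incident instead of paging again.
        "dedup_key": str(alert["id"]),
        "payload": {
            "summary": alert["summary"][:1024],  # summary is capped at 1024 chars
            "source": alert["device"],
            "severity": severity,
        },
    }


def send_event(event, timeout=10):
    """POST the event to PagerDuty and return the decoded JSON response."""
    request = urllib.request.Request(
        EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Keeping payload construction separate from the network call means the mapping can be checked without a PagerDuty account; only `send_event` needs a real integration key.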