Recently, while putting together training material for our upcoming track on “Owning Incident Response” at PagerDuty University, I listened to recordings of incident calls from many years of PagerDuty history. Several hours of hearing my coworkers at 2x speed prompted two observations: first, I should go find my copy of Christmas with the Chipmunks; and second, the evolution of our incident processes took time, effort, and focus. Any company, regardless of the size of its teams and infrastructure, can have a great incident response process, but it doesn’t happen by accident, and it doesn’t happen overnight.
Many years ago, PagerDuty internally used the simple-but-painful process of “page everyone with a generic alarm, and have everyone join a phone bridge.” This resulted in plenty of chaos, even with seasoned Ops people. Tasks were done without coordination, there was often confusion about the scope of the customer impact, and so on.
One of the very first things we chose to improve was the language used on the call whenever someone was providing information or making a request of someone else. By taking the time to build a shared vocabulary, with phrases like “Is there any strong objection?”, we began to shorten our incident responses and reduce the time customers were impacted.
Another large improvement came when we started using roles modeled on the Incident Command System. We agreed ahead of time on who would take care of the problem itself (subject matter experts) and who would handle the process of managing the incident (incident commanders and related roles).
This also let us scope down the initial response to only the engineers who needed to be on the call. Gone are the days of confusion and of people joining the bridge asking “what’s wrong?”. Along the way, we’ve come up with our own workarounds for anti-patterns in incident response, such as removing disruptive and non-contributing folks from the call, even if they’re the CEO.
So much “Operations” or “Site Reliability” information is spread via tribal knowledge or oral storytelling. Getting to a well-prepared, comprehensive, and humane incident response process shouldn’t have to be that hard. Companies shouldn’t have to figure out each part of a great incident response on their own, but to improve overall, everyone does have to make it an area of focus.
You can find out more about incident response, chipmunk audio of my coworkers excluded, at PagerDuty University on September 6th, the day before PagerDuty Summit.