About a year ago, some technical difficulties at Citi temporarily shut off a few hundred thousand cards and a swath of ATMs at the same time. The result: Citi’s newly launched Costco Anywhere cards received a “flood of complaints.”
The Internet phrase for something on this scale is “tire fire.”
Incidents that escalate to tire fire status usually involve everyone in the organization, from leadership to users to the support desk. PR or marketing sounds the alarm and handles external communications, and the technical team is left to figure it out.
This means writing an internal post-mortem and an SLA-governed mea culpa to the outside world. These are often written as “root cause” analyses, focusing on blaming and correcting the people, processes, and technology involved in the incident.
Technical leaders can and should do better than blame in these situations. Yes, teams should move as fast as they can to triage and return service to normal. But in the process of measuring causes of the incident, the effectiveness of the response, and the impact, the goal should not be to focus solely on “root causes.”
The point-and-shoot approach assigns blame and asks for a budget. A portfolio approach shows how the current investments returned specific results, and how a reallocation might change those results. It helps the rest of the organization see how to invest in DevOps, support, and service teams.
For example, internal tools like ServiceNow, PagerDuty, and Slack are investments in speed and breadth: they help people respond to issues across your entire infrastructure stack much faster. Building them out further might require tighter integrations with development-specific tools, more on-call staff, or a system for alerting users on mobile or in-app. These investments should not be presented ad hoc after an incident. Rather, incident management and incident resolution metrics should show how those investments are currently configured, and where people, process, and tools could be added to improve resolution outcomes.
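To make the alerting piece concrete, here is a minimal Python sketch of building a trigger event for the PagerDuty Events API v2, the kind of payload an integration sends to page on-call staff. The integration key, summary, and source values below are placeholders, and the actual POST is left as a comment:

```python
import json

# Events API v2 endpoint; a real integration would POST the JSON body here.
PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

def build_trigger_event(routing_key, summary, source, severity="critical"):
    """Build an Events API v2 'trigger' payload for a new incident."""
    return {
        "routing_key": routing_key,   # integration key from a PagerDuty service
        "event_action": "trigger",
        "payload": {
            "summary": summary,       # what the on-call responder sees first
            "source": source,         # the affected host or service
            "severity": severity,     # critical | error | warning | info
        },
    }

event = build_trigger_event(
    "YOUR_INTEGRATION_KEY",           # placeholder, not a real key
    "Checkout cart app down",
    "cart-prod-01",
)
print(json.dumps(event, indent=2))
```

POSTing that JSON to the endpoint is what actually pages the responder; the payload-building step is separated out here so it can be tested without a network call.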
This should also be communicated in clear, “business-ready” language, since incidents necessarily force DevOps, TechOps, support, and the service organization to talk to the business side.
- Internal incident notifications (e.g., a change management ticket) are sent immediately to on-call staff via PagerDuty and Slack. The SLA requires same-day management communication with the account owner.
- Internal incident notifications (e.g., the checkout cart app is down) are sent immediately to on-call staff, the management team, and support. The SLA requires management communication with the incident commander within one hour of the notification.
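A notification policy like the ones above is easy to check mechanically. The sketch below is a minimal Python example, with hypothetical timestamps and the one-hour SLA, of testing whether a management communication landed inside the SLA window:

```python
from datetime import datetime, timedelta

def sla_met(notified_at, communicated_at, sla=timedelta(hours=1)):
    """True if management was contacted within the SLA window."""
    return communicated_at - notified_at <= sla

# Hypothetical incident: notification at 9:00, management contacted at 9:45.
notified = datetime(2018, 3, 5, 9, 0)
communicated = datetime(2018, 3, 5, 9, 45)
print(sla_met(notified, communicated))  # prints True: 45 minutes is inside the hour
```

Running the same check across a quarter's worth of incidents gives you an SLA compliance rate, which is exactly the kind of number the business side can act on.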
This template can be used internally for incident responders and business stakeholders, as well as externally for customers and prospects. With no technical knowledge, the business side can understand incident history and resolution time. This data is an asset that the technical team can maintain, directly tying incident resolution and DevOps processes to the bottom line.
While the above will help you have the right conversation at the business level, an internal post-mortem is more introspective for DevOps and service teams. Ask: Are these processes correct? Is our infrastructure resilient enough? If not, how would we know, and what would we change?
There are many more metrics you could track depending on what makes the most sense for your specific team, but these will give you a head start on answering those inevitable questions, and they don't require much process reinvention. Just use modern ticketing, monitoring, incident resolution, collaboration, and customer satisfaction tools, many of which have analytics built in.
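As one concrete starting point, two widely used incident resolution metrics, mean time to acknowledge (MTTA) and mean time to resolve (MTTR), fall straight out of the timestamps those tools already record. The incident records below are made up for illustration:

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records, as exported from a ticketing/monitoring tool.
incidents = [
    {"created": datetime(2018, 3, 1, 9, 0),
     "acknowledged": datetime(2018, 3, 1, 9, 4),
     "resolved": datetime(2018, 3, 1, 10, 30)},
    {"created": datetime(2018, 3, 2, 14, 0),
     "acknowledged": datetime(2018, 3, 2, 14, 10),
     "resolved": datetime(2018, 3, 2, 14, 50)},
]

def minutes(start, end):
    """Elapsed time between two timestamps, in minutes."""
    return (end - start).total_seconds() / 60

mtta = mean(minutes(i["created"], i["acknowledged"]) for i in incidents)
mttr = mean(minutes(i["created"], i["resolved"]) for i in incidents)
print(f"MTTA: {mtta:.0f} min, MTTR: {mttr:.0f} min")  # prints "MTTA: 7 min, MTTR: 70 min"
```

Trend these per team or per service over time and you have a simple, defensible way to show whether process and tooling changes are actually improving outcomes.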
The above-mentioned PagerDuty and Slack are standard tools for incident resolution and collaboration. ServiceNow and the Atlassian suite are great for connecting incident and asset management. The important thing to keep in mind, above all, is that effectively resolving and preventing incidents relies not just on tools but on having a well-defined process that helps people use tools in an effective, integrated, and self-service way.
Never include “Other”, “Misc”, or any other catch-all categories when evaluating the effectiveness of your tools, process, and people; that's like building a trap door into all of your metrics. And while a template can be a good place to start, your team will get much more out of its reporting if it goes beyond simply copying a template. Instead, begin with your team's intuition:
Don’t boil the ocean. Remember, you’re on the same team and it’s not a deposition.
Focus these questions on how your team handles these incidents (timeline, personnel, usage of tools, etc.) and sketch out priorities based on that. If you have the basic categories of incident resolution tooling and process covered, and metrics that track how the business can continue investing for improvements, then you are in great shape — even in the case of tire fires.
Photo: Springfield Tire Yard, via Simpsons Wiki