Organizations today manage increasingly complex technology portfolios and are pressured to deliver on innovation, all while facing far higher stakes than ever before when it comes to maintaining service performance and reliability. While these demands may seem like a paradox, many organizations have successfully implemented processes that let them balance agility and risk. In this post, I’ll touch on the importance of integrating incident response with your ITSM tool and walk you through the steps to balance agility and risk effectively.
You can’t add minutes during an outage, so prioritizing your planned work outside of an incident effectively is key—and part of that is using an incident resolution platform like PagerDuty to manage and tie your unplanned work back to the planned work that’s tracked in your ITSM tool like Jira, ServiceNow, or Remedy.
How does that help? First, information flows from ITSM into PagerDuty so that responders know what has changed and who is reporting an impact. Next, follow-up items from PagerDuty are sent back into ITSM, including outcomes of the postmortem that need to be prioritized.
A given employee may have dozens of prioritized tickets in an ITSM tool, but they should only ever have one (or ideally zero) assigned to them in PagerDuty at a given time so they can focus on customer-impacting issues that require immediate responses. Similarly, the concept of unassigned incidents doesn’t exist in PagerDuty: if there’s a problem, someone is responsible for that problem.
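To make those two rules concrete, here is a minimal sketch of an assignment routine. It is not how PagerDuty implements assignment; the function name, the escalation order, and the fallback behavior are all assumptions made for illustration. It prefers a responder with nothing on their plate, and it never leaves an incident without an owner.

```python
def assign_incident(incident_id, escalation_order, active):
    """Assign an incident to one responder and return that responder's name.

    escalation_order: hypothetical ordered list of responder names.
    active: dict mapping responder name -> list of open incident ids.
    """
    if not escalation_order:
        raise ValueError("an incident must always have a possible owner")

    # Prefer the first responder who currently has no open incident.
    for responder in escalation_order:
        if not active.get(responder):
            active.setdefault(responder, []).append(incident_id)
            return responder

    # Everyone is busy: still assign it (assumed fallback: last escalation
    # level) rather than leave the incident unowned.
    fallback = escalation_order[-1]
    active[fallback].append(incident_id)
    return fallback


open_incidents = {}
first = assign_incident("INC-1", ["alice", "bob"], open_incidents)
second = assign_incident("INC-2", ["alice", "bob"], open_incidents)
```

In this sketch the second incident goes to a different responder than the first, because the first responder already has an open incident.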
Simply put, the easiest way to speed up your response is to start it earlier. The best way to do this is to track what affects your customers, not just what affects your machines. Organizations that use Real User Monitoring can track whether users are able to successfully load, download, or buy their tools. Monitoring the underlying infrastructure is equally important: it can detect problems before they affect users (at the cost of some false positives), and it helps identify the cause of a customer-facing problem.
Automation also plays a role in speeding up incident response, and your monitoring tool should automatically assign problems to an owner. Along those same lines, to prevent an issue from affecting your revenue, the monitoring tool should also assign and immediately notify someone about all issues above a certain priority using that person’s preferred communication method (phone, email, SMS, etc.).
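As a rough illustration of that rule, the sketch below assigns every issue to an owner and flags anything at or above a priority threshold for immediate notification on the owner's preferred channel. The priority scale, the ownership table, and all names here are hypothetical, and a real monitoring tool would expose this as configuration rather than code.

```python
from dataclasses import dataclass

# Assumed priority scale: lower number = more urgent (P1 is the most severe).
NOTIFY_AT_OR_ABOVE = 2


@dataclass
class Responder:
    name: str
    preferred_channel: str  # e.g. "phone", "email", "sms"


@dataclass
class Issue:
    service: str
    summary: str
    priority: int


# Hypothetical ownership table; every service maps to exactly one owner.
SERVICE_OWNERS = {
    "shopping-cart": Responder("alice", "phone"),
    "search": Responder("bob", "sms"),
}
DEFAULT_OWNER = Responder("on-call-lead", "phone")


def route(issue):
    """Always assign an owner; notify immediately only for urgent issues."""
    owner = SERVICE_OWNERS.get(issue.service, DEFAULT_OWNER)
    notify_now = issue.priority <= NOTIFY_AT_OR_ABOVE
    return {
        "assignee": owner.name,
        "notify_now": notify_now,
        "channel": owner.preferred_channel if notify_now else None,
    }


urgent = route(Issue("shopping-cart", "Checkout is failing", priority=1))
minor = route(Issue("reporting", "Nightly report is late", priority=4))
```

Note that the low-priority issue still gets an owner through the default; it simply doesn't page anyone right away.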
To make automation easier, PagerDuty integrates with hundreds of monitoring tools. So, for example, if your monitoring tool detects that your shopping cart has gone from slow to completely non-responsive, PagerDuty can automatically create an incident with the correct priority to ensure the responder has all the information.
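For the shopping cart example, here is a minimal sketch of what a monitoring tool might send. It builds a trigger event in the shape used by PagerDuty's Events API v2 (`routing_key`, `event_action`, `dedup_key`, and a `payload` with `summary`, `source`, and `severity`). The state-to-severity mapping, the service name, the placeholder integration key, and the detail values are assumptions for illustration. Mapping an event's severity to an incident priority is configured on the PagerDuty side and isn't shown here.

```python
import json

EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"

# Hypothetical mapping from a monitoring tool's states to event severities.
STATE_TO_SEVERITY = {"slow": "warning", "non_responsive": "critical"}


def build_trigger_event(routing_key, service, state, details):
    """Build (but do not send) a trigger event for the given service state."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        # Reusing one key per service groups repeat alerts into one incident.
        "dedup_key": f"{service}-health",
        "payload": {
            "summary": f"{service} is {state.replace('_', ' ')}",
            "source": service,
            "severity": STATE_TO_SEVERITY[state],
            # Extra context for the responder; values here are illustrative.
            "custom_details": details,
        },
    }


event = build_trigger_event(
    routing_key="YOUR_INTEGRATION_KEY",  # placeholder, not a real key
    service="shopping-cart",
    state="non_responsive",
    details={"last_state": "slow", "failed_health_checks": 3},
)
body = json.dumps(event).encode("utf-8")
```

The resulting `body` is what you would POST as JSON to `EVENTS_API_URL`. The sketch stops short of sending it so that it runs without a real integration key.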
In the same vein, create automated workflows whenever possible. If a Sev1 needs to pull in executive stakeholders, automate that response play.
Remove ambiguity, confusion, and wasted time during a response by defining your process and clarifying the different roles involved. We recommend including the following roles: Incident Commander + Deputy, Scribe, Customer Liaison, and Subject Matter Experts. (For more details as to what each role means, visit https://response.pagerduty.com/before/different_roles/.)
During an outage, things can become a bit of a madhouse and the organizational hierarchy takes a backseat to the response roles. When executives start to disrupt the defined process, you need to remove them from the process and communicate clearly and concisely why certain processes are followed. If the CEO wants to change the process on the fly, they can decide to become the Incident Commander.
To help everyone keep it together, remember the following:
It’s important to define a process around communication to people outside of the core response team as well. Depending on the type of incident, you could be dealing with internal customers (we often call them stakeholders), external customers, and even the market at large. For instance, when responding to a security incident, you may need to loop in the legal department in addition to other executives.
These groups all need to be kept up to speed on an as-needed basis, but the wrong place to do that is where the responders are working. The last thing you want is someone joining the call and asking for a status update as this disrupts the people trying to discuss fixes during the call. To my point earlier, you don’t want an executive getting on a call and demanding that the team fix the outage in 10 minutes. This implies the team is not already working as quickly as they can. It’s demotivating and doesn’t contribute anything helpful for the response. This is where the Customer Liaison comes in—using a feature like PagerDuty’s Stakeholder Engagement, the Customer Liaison can provide streamlined, real-time updates to relevant stakeholders across the business.
Here are a few other ways to improve real-time communications:
Postmortems are how you fix a long-term problem. They give closure to people after a particularly stressful event and help ensure that your team takes well-thought-out, productive action on the immediate patches you made in the heat of the moment to solve a problem.
So what does an effective postmortem look like? It should:
We post all of our postmortems internally using our postmortem tool. We view postmortems not only as learning for our team, but also as an input to our best practices training, where we share our experiences and learnings with our customers.
For more postmortem tips, download our detailed e-book.
You can’t expect your incident response process to be fantastic if you only use it every once in a while. Not every service fails often and some people get more practice than others. But everyone should be practiced so that when something does happen, you and your team are ready.
The less time you spend fixing unplanned outages, the better your services are and the happier your customers will be. Customer-impacting incidents are among the worst things that can happen to a business: they damage brand reputation, cause huge losses in customers and revenue, inhibit employee productivity, and lower morale, among other things. If you can get to a point where you are as efficient as possible and can respond to major incidents without chaos and stress, with the attitude that you will learn and improve from each one, you will achieve a winning and empowering culture that stands to delight both your customers and employees.
Interested in learning more about incident response? Check out our incident response documentation page.
© 2009 - 2018