Turn any signal into insight and action. See how PagerDuty Digital Operations Management Platform integrates machine data and human intelligence to improve visibility and agility across organizations.
Check out the latest features we've been working on — from event intelligence, machine learning, response automation, on-call, analytics, integrations, and more.
Digital Operations Management arms organizations with the insights needed to turn data into opportunity across every operational use case, from DevOps, ITOps, Security, Support, and beyond.
Over 200 Integrations
Discover DevOps best practices with our library of webinars, whitepapers, reports, and much more.
Learn best practices and get support help with resources from our award-winning support team.
See how PagerDuty works with our live product demo — twice a week, every week.
Join live and on-demand webinars for product deep dives, industry trends, configuration training, and use case-specific best practices.
Interactive, simple-to-use API and technical documentation enables users to easily try updates and extend PagerDuty.
Engage with users and PagerDuty experts from our global community of 200k+ users. Become a member, connect, and share insights for success.
Get all your PagerDuty-related questions answered by exploring our in-depth support documentation and community forums.
Many IT organizations have come to learn that leveraging cloud infrastructure is not just unavoidable, it’s one of the most effective paths for IT organizations to become more responsive to business needs. Yet with the...
PagerDuty helps organizations transform their digital operations. Learn more about PagerDuty's mission and what we do.
Meet our experienced and passionate executive team.
We are risk-taking innovators dedicated to delivering amazing products and delighting customers. Join us and do the best work of your career.
With the PagerDuty Foundation, we are committed to doing our part in giving back to the community.
Incident response (IR) is a process used by ITOps, DevOps, and dev teams to address and manage any sort of major incident that may arise. The main goal of IT incident response is to organize an approach that limits damage and reduces recovery time and costs — and prevents it from happening again. Incident response generally includes an outline of processes that need to be executed upon in the event of an IT incident.
An incident response process is something you hope to never need, but when you do, it’s critical that it encompasses all the steps necessary for the response to go smoothly and seamlessly. Normally the knowledge of how to handle incidents within your company or organization is built up over time and gets better with each incident. Many times, the knowledge of how to conduct thorough incident response is lost when a team member leaves, making it ever more crucial to have a documented process.
Nailing your incident response and learning how to deal with major incidents in a way which leads to the fastest possible recovery time is vital to the success of any team. Generally, your incident response documentation will outline not only how to prepare for an incident, but what to do during and after an incident. It is intended to be used by on-call practitioners and those involved in an operational incident response process.
For successful incident response, you must not only have a holistic view into the health of your IT infrastructure, you have to prepare your team to know just how to respond and what roles they must take on — allowing you to orchestrate the right response to resolve incidents faster and reduce your mean-time-to-resolution (MTTR).
Monitoring your IT infrastructure health by implementing different monitoring tools to appropriately monitor disparate and new systems, you can gain full-stack visibility. There needs to be a way to normalize, de-dupe, correlate, and gain actionable insights from all this data, and all the events generated by these monitoring tools must be centralized in a single hub, from which they can be triaged and routed to the right on-call engineer.
Before all else, it’s crucial for your team to have established guidelines for what to do when a major incident occurs. Incident response documentation that outlines a process for going on-call, what to do when an incident arises, how to communicate with teams, and what post-mortem process to follow after an incident is crucial. If you need help getting started with establishing your own incident response process, check out PagerDuty’s incident response documentation for guidance.
All this sets the stage for being able to streamline the incident response process when an incident does occur. When a major incident does occur, be sure you:
When a major incident does occur, assess the situation and call in the right stakeholders as needed. Collaborate with subject matter experts if need be, otherwise work with your incident commander, deputy, and customer liaison to assess the damage.
Once a plan of attack has been formulated, incident resolution begins. Determine what needs to be shared with the public, employees, and customers.
Learn is arguably the most important step in the incident response process. It’s in the aftermath that your team is able to look and see what went well or what didn’t go so well, and what you can do to prevent things from happening again. Incident post-mortems are a great way for teams to continuously learn and serves as a way to iteratively improve your infrastructure and incident response process. Check out our incident post-mortem template and handbook to get started.
Organizations are investing in many monitoring solutions to get visibility into their IT
infrastructure so they can better deliver on rising customer demands. Making sense of the event data and taking action by automating the incident response lifecycle for your environment—from assess, to resolve, and learn — is critical. Knowing what do when a major incident does occur is vital to the success of your team and your organization,
Learn more about incident response and the incident response lifecycle, which encompasses everything from assess, triage, and resolve – to learning and prevention to support developers as they move towards owning their code in production.
Sign up for a free 14-day trial of PagerDuty and get started on streamlining the incident response process.