It’s critical to have the right tools in place before a firefight happens. A lack of proper tooling makes it significantly more difficult to recognize, organize, fight, and resolve a major outage. This is especially true when teams are busy fighting rather than communicating with internal and external stakeholders. If best practices have been established ahead of time, a difficult incident can be handled much more smoothly.
The following is not an exhaustive list of domains to plan for before an outage, but addressing them will greatly improve your organization’s ability to coordinate and be prepared for any issue.
Internal communication commonly takes place over email, which is problematic for several reasons. Email is a one-to-one medium. It is closed by default, meaning a message is readable only by its sender and recipients, and threads are inherently bulky and difficult to parse when quick status information is needed. Persistent collaboration environments like Slack and HipChat provide an externally hosted location with public, opt-in, topical channels for disseminating information. During a critical incident, status updates (or a message that the issue is already known and being worked on) can reach key staff such as support and leadership in near real time.
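As a rough illustration of a channel status update, the sketch below builds the JSON body that a Slack incoming webhook accepts (an object with a `text` field). The function name, the message layout, and the sample incident are assumptions made for this example, not something the tools above prescribe.

```python
from datetime import datetime, timezone


def format_status_update(incident, status, summary, when):
    """Build a Slack incoming-webhook body for an incident status update.

    `when` must be a timezone-aware datetime; it is rendered in UTC so
    readers in different offices see the same timestamp.
    """
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    text = f"[{status.upper()}] {incident}: {summary} (as of {stamp})"
    # Incoming webhooks accept a JSON object with a "text" field.
    return {"text": text}


# Hypothetical example values.
update = format_status_update(
    incident="Checkout errors",
    status="investigating",
    summary="Team is aware and working on it",
    when=datetime(2019, 1, 1, 12, 30, tzinfo=timezone.utc),
)
print(update["text"])
```

Posting the resulting dictionary as JSON to the webhook URL configured for an incident channel would publish the update there. Keeping the format fixed makes updates easy to scan during a firefight.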
Ideally, the team will know there is an issue with an application before the customer does. Application and infrastructure monitoring technology (New Relic for application monitoring and AWS CloudWatch are two examples) can help ensure this is the case, and in the midst of an outage it can show whether a fix or update is working as it should. It is also important to monitor both application performance and infrastructure performance and, ideally, to link the two with a solution such as PagerDuty, which consolidates all service health data into a single view and notifies the on-call responder if any issue requires urgent action. It is much easier to troubleshoot an issue and identify the root cause when you have visibility into both layers.
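One way to connect a monitoring check to on-call notification is to send a trigger event to the PagerDuty Events API v2 when a metric crosses a threshold. The sketch below is a minimal example under stated assumptions: the latency metric, the threshold, the helper names, and the placeholder routing key are all hypothetical, and the network call is defined but never executed here.

```python
import json
import urllib.request

# Public endpoint of the PagerDuty Events API v2.
EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_trigger_event(routing_key, summary, source, severity="critical"):
    """Build the JSON body for an Events API v2 'trigger' event."""
    allowed = {"critical", "error", "warning", "info"}
    if severity not in allowed:
        raise ValueError(f"severity must be one of {sorted(allowed)}")
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }


def check_latency(latency_ms, threshold_ms, routing_key, source):
    """Return a trigger event if latency exceeds the threshold, else None."""
    if latency_ms <= threshold_ms:
        return None
    summary = f"Latency {latency_ms} ms exceeds {threshold_ms} ms on {source}"
    return build_trigger_event(routing_key, summary, source)


def send_event(event):
    """POST the event to PagerDuty (network call; not run in this sketch)."""
    request = urllib.request.Request(
        EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Hypothetical reading from an application monitor.
event = check_latency(900, 500, "YOUR_ROUTING_KEY", "web-01")
```

In practice most monitoring tools ship their own PagerDuty integration, so a hand-written check like this is mainly useful for custom metrics that no existing integration covers.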
When there is a performance issue, support teams will be inundated with requests for updates. Key ways to mitigate this influx are to post on Twitter, to maintain a status page, and to engage business stakeholders with a product like PagerDuty. These channels are separate from your primary infrastructure and should be resilient to even site-wide outages. On Twitter, users can easily look for pinned tweets and recent replies if they are having an issue. Users can also check statusapp.com for any “yellow” or “red” statuses. An easy-to-read status page like the one from statuspage.io is a critical component for disseminating information to your customers during an outage. Users will build trust in the page if it is accurate and includes updates for minor disruptions, and in that way they also build more trust in your business. The page should also show when an issue is undergoing troubleshooting and include a status for each major subcomponent. These updates should be available within minutes, for complete visibility. Finally, with capabilities like PagerDuty’s Stakeholder Engagement, any incident responder can easily send out a status update that reaches predefined groups of business stakeholders via any preferred notification channel: phone, SMS, email, or push notification. Stakeholders can also subscribe to incident status updates to get real-time information on any customer-impacting issue.
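To make the per-subcomponent idea concrete, here is a small sketch of how a status page might roll component statuses up into one overall status. The green/yellow/red scale mirrors the statuses mentioned above; the component names and the functions themselves are hypothetical and do not reflect how any particular status page product works.

```python
# Higher number means worse status.
SEVERITY_ORDER = {"green": 0, "yellow": 1, "red": 2}


def overall_status(components):
    """Return the worst status among all components ('green' if none)."""
    if not components:
        return "green"
    # An unknown status raises KeyError rather than being silently ignored.
    return max(components.values(), key=SEVERITY_ORDER.__getitem__)


def render_status(components, note=""):
    """Render a plain-text status summary with one line per component."""
    lines = [f"Overall: {overall_status(components)}"]
    for name in sorted(components):
        lines.append(f"- {name}: {components[name]}")
    if note:
        lines.append(f"Update: {note}")
    return "\n".join(lines)


# Hypothetical components for illustration.
page = render_status(
    {"api": "green", "checkout": "yellow", "website": "green"},
    note="Investigating slow checkout responses.",
)
print(page)
```

Reporting the worst component as the overall status is a deliberately conservative choice: a page that shows minor disruptions is the kind of accuracy that builds trust.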
A ticketing solution like Zendesk is absolutely critical to managing an outage. A significant outage can be highly disruptive and cost substantial goodwill. A ticket management system will help to identify intermittent issues that an application monitor may have missed. It will also track and disseminate information related to an influx of support requests. Workflows for issue escalation will raise potential issues more quickly than relying on individual judgement, especially on larger support teams. Ready-made message templates will help keep messaging consistent and accurate during an outage, and “related to” tags will make it easier to debrief an issue once it has been resolved.
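An escalation workflow of the kind described above can be as simple as a rule that fires when several recent tickets share a tag. The sketch below assumes tickets are available as (created_at, tags) pairs; that shape, the 15-minute window, and the threshold of three are illustrative assumptions, not the behavior of any specific ticketing product.

```python
from collections import Counter
from datetime import datetime, timedelta


def tags_to_escalate(tickets, now, window=timedelta(minutes=15), threshold=3):
    """Return tags that appear on at least `threshold` tickets in the window.

    `tickets` is an iterable of (created_at, tags) pairs.
    """
    cutoff = now - window
    counts = Counter(
        tag
        for created_at, tags in tickets
        if created_at >= cutoff
        for tag in set(tags)  # count each tag once per ticket
    )
    return sorted(tag for tag, n in counts.items() if n >= threshold)


# Hypothetical tickets: three recent "login" reports and one old "billing" one.
now = datetime(2019, 1, 1, 12, 0)
tickets = [(now - timedelta(minutes=m), ["login"]) for m in (1, 5, 10)]
tickets.append((now - timedelta(minutes=60), ["billing"]))
escalated = tags_to_escalate(tickets, now)
print(escalated)
```

A rule like this surfaces a cluster of related reports without relying on any one agent noticing the pattern, which is the advantage over individual judgement on larger teams.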
With proper procedures in place, an organization can anticipate issues that are likely to arise from its applications. These scenarios should be documented ahead of time, and troubleshooting, mitigation, and remediation information should be surfaced for the team. The procedure can also include a checklist of duties that lays out who does what and includes emergency numbers and who is on call. If resources are available, a tabletop exercise of a mock outage is extremely helpful in identifying gaps before a major outage occurs. After a firefight has occurred, debrief with the team in a post-mortem and improve your procedures. There will be another outage, and any additional information you can add to your process will speed recovery. As with the items above, it is possible your local architecture will become unavailable, so storing these procedures in an externally hosted repository, or automating them with a solution such as PagerDuty, is preferred.
These tools are only an initial list. They are only as effective in an outage as the time spent configuring and understanding them ahead of time. Communicating with both internal and external stakeholders is key in any firefight, as much within IT as in any other function or industry.
© 2009 - 2019