When a customer outage occurs, its impact is felt across the organization. While the technical response is underway, stakeholders from public relations, customer support, legal, and the executive team must also be engaged and kept informed. But as teams become more global and distributed, coordinating streamlined internal and external communications and response only gets harder.
You need a well-defined plan and processes in place to ensure effective messaging during an outage. This minimizes time wasted when every minute counts, and maximizes transparency and order in the face of stressful, major outages.
Today, outage communication is often manual as well as ad hoc. Unfortunately, this creates several challenges.
Traditional outage messaging is often done via email distribution lists, conference calls, and chat in multiple, non-consolidated streams. If that process isn’t managed well, it can be hugely costly in losses from service degradation and impaired productivity. There is a pressing need for standardized processes around incident communication, and for centralized information that gets everyone across the business on the same page.
Here are a few best practices that will enable you to simplify your outage communication plan:
During an outage, 100% of your attention needs to be focused on solving the issue at hand. That leaves no time to waste toggling between four or five tools to execute mission-critical tasks like collaborating, logging status, and making sure people outside the team also know what’s going on.
This is where doing some pre-planning makes a world of difference in reducing chaos in a war room situation. Don’t exhaust mental energy during an incident trying to remember names of people you need to contact (Mary from the Infrastructure team? John from Support? What’s the name of that Director of Compliance again!?) and figuring out how to get in touch with them. There are great tools out there, like PagerDuty, that enable you to predefine groups of stakeholders that must know about various types of issues. When an incident strikes, automatically notifying all the right individuals with their preferred contact methods can be as easy as pushing a button.
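As a rough illustration of predefining stakeholder groups, the sketch below maps incident types to the people who must be told, along with each person's preferred contact method. Everything here is hypothetical: the `Stakeholder` class, the `STAKEHOLDER_GROUPS` mapping, the incident type names, and `build_notifications` are invented for this example and are not PagerDuty's API. It only shows the shape of the pre-planning, not how any particular tool implements it.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Stakeholder:
    name: str
    team: str
    preferred_contact: str  # hypothetical values, e.g. "sms", "email", "phone"


# Hypothetical groups, decided before any incident happens.
STAKEHOLDER_GROUPS: Dict[str, List[Stakeholder]] = {
    "customer_outage": [
        Stakeholder("Mary", "Infrastructure", "phone"),
        Stakeholder("John", "Support", "sms"),
    ],
    "compliance_issue": [
        Stakeholder("Director of Compliance", "Compliance", "email"),
    ],
}


def build_notifications(incident_type: str, summary: str) -> List[Tuple[str, str, str]]:
    """Return (name, contact method, message) for each stakeholder in the group.

    Unknown incident types yield an empty list rather than raising, so a
    missing mapping never blocks the technical response.
    """
    group = STAKEHOLDER_GROUPS.get(incident_type, [])
    message = f"[{incident_type}] {summary}"
    return [(s.name, s.preferred_contact, message) for s in group]
```

Calling `build_notifications("customer_outage", "Checkout API is down")` would produce one entry per predefined stakeholder, ready to hand to whatever delivery mechanism you use. Nobody has to recall names under pressure.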
For the most part, systems of record are not where people do the bulk of communicating during the incident response. You’re much more likely to find that information dispersed across multiple places, like ChatOps tools. But to make sure system and process failures aren’t repeated, there needs to be a way to piece together everything that happened chronologically, and prioritize learnings and action items with a post-mortem. Streamlining the post-mortem with templates and easy timeline building is key to learning faster.
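To make the idea of piecing events together chronologically concrete, here is a minimal sketch that merges entries from several sources (for example a chat export and monitoring alerts) into one ordered timeline. The `Event` type, the source labels, and both helper functions are assumptions made for illustration; real tools will export data in their own formats.

```python
from datetime import datetime
from typing import Iterable, List, NamedTuple


class Event(NamedTuple):
    timestamp: datetime
    source: str  # hypothetical label such as "chat" or "monitoring"
    note: str


def build_timeline(*sources: Iterable[Event]) -> List[Event]:
    """Merge events from any number of sources and sort them by time."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: event.timestamp)


def format_timeline(events: Iterable[Event]) -> str:
    """Render one line per event, suitable for pasting into a post-mortem."""
    return "\n".join(
        f"{event.timestamp:%H:%M} [{event.source}] {event.note}" for event in events
    )
```

Because the merge only depends on timestamps, adding another source later (a ticketing system, a deploy log) means passing one more iterable to `build_timeline`.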
The best way to get good at responding to and communicating about an outage is to practice failure testing regularly. While it’s crucial to do so in a way that doesn’t impact customers, try out different scenarios to expose potential vulnerabilities. The ensuing response is an important opportunity to get more efficient at getting on top of unplanned issues, and at resolving them fast while remembering to keep the right people engaged.
Teams must effectively coordinate incident response across subject matter experts and front-line responders. It’s important to have an efficient way to sound the alarm.
Appoint an Incident Commander who is the point person for getting all the right people from the respective teams on the line, tracking the incident, and coordinating the response. For more information on the role and best practices of being an Incident Commander, check out this webinar.
You want to minimize the number of channels that you’re using to communicate with the response team, as tool toggling wastes time. Which channel is right depends not only on the severity and scope of the incident, but also on your team culture and work location. The main thing that matters here is making it easy to get the right people immediately engaged.
ChatOps tools are a fantastic resource for the incident response team. Having a simultaneous discussion in a chat client provides actionable, searchable, time-stamped data on who is doing what, and on which services. Even better, you can automate certain tasks and bring important information (like monitoring graphs) into a shared view, which helps drive down resolution times.
IT outage management isn’t confined to IT. Because outages potentially affect the entire business and its bottom line, organizations should also have a plan for how teams like Support, Legal, Marketing, and Sales are kept in the loop. Have an idea of what to share, set up a place where colleagues can easily get information, and determine who will get updates and how often.
To keep things streamlined, the response team should only share key, high-level updates: How severe is the outage? What is its likely duration? What’s being done, and when can the team expect the next update?
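One way to keep stakeholder updates consistent is a fixed template that answers exactly those questions. The sketch below is only an assumption about how such a template might look; the `StatusUpdate` class and its field names are invented for this example.

```python
from dataclasses import dataclass


@dataclass
class StatusUpdate:
    """A high-level stakeholder update; all field names are hypothetical."""

    severity: str           # How severe is the outage?
    expected_duration: str  # What is its likely duration?
    actions_underway: str   # What's being done?
    next_update: str        # When can the team expect the next update?

    def render(self) -> str:
        """Return the update as four short lines, one per question."""
        return (
            f"Severity: {self.severity}\n"
            f"Expected duration: {self.expected_duration}\n"
            f"What's being done: {self.actions_underway}\n"
            f"Next update: {self.next_update}"
        )
```

Since every field is required, an update can't be sent without stating when the next one will arrive, which keeps stakeholders from interrupting responders to ask.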
A solution like PagerDuty’s Stakeholder Engagement enables you to automatically notify individuals or groups of stakeholders via preferred contact methods. No more need to try and remember names of people to look up and contact during an outage. Stakeholders can also subscribe to incident status pages to check up on progress.
If colleagues have further questions, they shouldn’t distract individual members of the response team that are heads-down on the incident. To strike a balance between keeping things moving and providing additional context as needed, funnel questions and asks through the Incident Commander.
According to Inc. magazine, it’s 30 times cheaper to keep an existing customer than it is to get a new one. Being proactive in communicating an outage to users helps you control the story about your outage, and makes it clear that your company makes transparent communication a priority.
Let end users know that you are aware of the issue and at work on a solution. The outage notification can take many forms: a maintenance page on your website, social media post or update to your status page, or perhaps just an internal communication to your customer support team.
Provide updates at regular intervals, and give customers short, practical information about how the issue affects them.
A representative from support should always be immediately notified when a major outage takes place. This helps the support team stay on top of communicating the right messaging, updating your status page and support channels in real time, and reaching out to customers both during and after the issue.
PagerDuty supports better outage communication by enabling you to automate the best practice response. With PagerDuty’s Stakeholder Engagement, you can automatically engage the right stakeholders with real time updates via their preferred communication channels, and orchestrate the right business-wide response to customer-impacting issues.
Try out PagerDuty incident resolution, automate stakeholder communications, streamline and learn from postmortems, and more — all according to best practice. Get started with a free 14-day trial.
Be sure to check out our ebook, Best Practices in Outage Communication, if you’d like to dive deeper into the best practices mentioned above. Our Incident Commander training is a great resource for building up Incident Commanders who can drive clarity in both internal and external communications during a response.