Best Practices in Outage Communication
When a customer outage occurs, its impact is felt across the organization. While the technical response is underway, stakeholders from public relations, customer support, legal, and the executive team must also be engaged and kept informed. But as teams become more global and distributed, coordinating streamlined internal and external communications and response only gets harder.
You need a well-defined plan and processes in place to ensure effective messaging during an outage. This minimizes time wasted when every minute counts, and maximizes transparency and order in the face of stressful, major outages.
What are some of the main challenges around outage communication?
Today, outage communication is often manual as well as ad hoc. Unfortunately, this creates several challenges.
- Managing updates across several siloed channels places additional burdens on the IT Team when they need it least, as they work to put out fires. This risks increasing the time it takes to achieve resolution.
- Business and internal stakeholders also find themselves frustrated because they don’t know where to go for the latest, relevant updates. They especially don’t want to be hearing about major issues from the customer instead of the team.
Traditional outage messaging is often done via email distribution lists, conferencing, and chat in multiple, non-consolidated streams. But if the process isn’t managed well, it can be hugely costly in terms of losses from service degradation and impaired productivity. There is a dire need for standardized processes around incident communication, and for centralized information that gets everyone across the business on the same page.
What are the best practices for communicating an outage?
Here are a few best practices that will enable you to simplify your outage communication plan:
- Establish a single source of truth
- Have predefined lists of stakeholders to automatically notify
- Streamline postmortems to improve future response
- Practice, practice, practice!
During an outage, 100% of your attention needs to be focused on solving the issue at hand. This leaves no time to waste toggling between four or five tools to execute mission-critical tasks like collaborating, logging status, and making sure people outside the team also know what’s going on.
This is where doing some pre-planning makes a world of difference in reducing chaos in a war room situation. Don’t exhaust mental energy during an incident trying to remember names of people you need to contact (Mary from the Infrastructure team? John from Support? What’s the name of that Director of Compliance again!?) and figuring out how to get in touch with them. There are great tools out there, like PagerDuty, that enable you to predefine groups of stakeholders that must know about various types of issues. When an incident strikes, automatically notifying all the right individuals with their preferred contact methods can be as easy as pushing a button.
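As a rough illustration of what predefining stakeholder groups could look like, here is a minimal Python sketch. It is not PagerDuty’s API: the `Stakeholder` class, the `STAKEHOLDER_GROUPS` mapping, the `build_notifications` helper, and all contact details are hypothetical names and placeholder data invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stakeholder:
    name: str
    team: str
    preferred_channel: str  # e.g. "sms", "email", "chat"
    address: str            # placeholder contact detail


# Hypothetical groups, defined ahead of time and keyed by incident type.
STAKEHOLDER_GROUPS = {
    "customer_outage": [
        Stakeholder("Mary", "Infrastructure", "sms", "+1-555-0100"),
        Stakeholder("John", "Support", "chat", "@john"),
    ],
    "compliance": [
        Stakeholder("Director of Compliance", "Legal", "email",
                    "compliance@example.com"),
    ],
}


def build_notifications(incident_type, summary):
    """Return one (channel, address, message) tuple per stakeholder in the group."""
    recipients = STAKEHOLDER_GROUPS.get(incident_type, [])
    message = f"[{incident_type}] {summary}"
    return [(s.preferred_channel, s.address, message) for s in recipients]


notifications = build_notifications("customer_outage", "Checkout API is down")
```

The point of the sketch is that the lookup happens in data prepared before the incident, so nobody has to recall names or contact methods under pressure. Actually delivering each message is left out, since that depends on the tools you use.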
For the most part, systems of record are not where people do the bulk of communicating during the incident response. You’re much more likely to find that information dispersed across multiple places, like ChatOps tools. But to make sure system and process failures aren’t repeated, there needs to be a way to piece together everything that happened chronologically, and prioritize learnings and action items with a post-mortem. Streamlining the post-mortem with templates and easy timeline building is key to learning faster.
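To make the idea of timeline building concrete, the sketch below merges timestamped events from two sources into one chronological list. It assumes each source can be exported as `(timestamp, source, text)` tuples with timezone-aware UTC timestamps. The event data and the `build_timeline` and `render_timeline` helpers are made up for illustration.

```python
from datetime import datetime, timezone


def build_timeline(*sources):
    """Merge events from several sources into one list, oldest first."""
    events = [event for source in sources for event in source]
    return sorted(events, key=lambda event: event[0])


def render_timeline(events):
    """Format the merged events as one line per event for a postmortem draft."""
    return "\n".join(
        f"{ts.strftime('%H:%M')} UTC [{source}] {text}"
        for ts, source, text in events
    )


def utc(hour, minute):
    # Illustrative date only; all sample events share the same day.
    return datetime(2020, 1, 15, hour, minute, tzinfo=timezone.utc)


# Hypothetical exports from a chat tool and a monitoring system.
chat_log = [
    (utc(14, 5), "chat", "Rolling back the latest deploy"),
    (utc(14, 20), "chat", "Error rate back to normal"),
]
alert_log = [
    (utc(14, 0), "monitoring", "Error rate above threshold"),
]

timeline = build_timeline(chat_log, alert_log)
print(render_timeline(timeline))
```

Sorting on the timestamp alone keeps the sketch simple. A real postmortem tool would also need to handle differing time zones and duplicate events across sources.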
The best way to get good at responding to and communicating an outage is to regularly practice failure testing. While it’s crucial to do so in a way that doesn’t impact customers, try out different failure scenarios to expose potential vulnerabilities. The ensuing response is an important opportunity to get more efficient at handling unplanned issues, and at resolving them fast while remembering to keep the right people engaged.
How do you communicate with the incident response team?
Teams must effectively coordinate incident response across subject matter experts and front-line responders. It’s important to have an efficient way to sound the alarm.
Get the right people involved
Appoint an Incident Commander who is the point person for getting all the right people from the respective teams on the line, tracking the incident, and coordinating the response. For more information on the role and best practices of being an Incident Commander, check out this webinar.
Pick your communication channel
You want to minimize the number of channels that you’re using to communicate with the response team, as tool toggling wastes time. The right channel depends not only on the severity and scope of the incident, but also on your team culture and work location. The main thing that matters here is making it easy to get the right people immediately engaged.
ChatOps tools are a fantastic fit for the incident response team. Having a simultaneous discussion in a chat client provides actionable, searchable, time-stamped data on who is doing what, and on which services. Even better, you can automate certain tasks and bring important information (like monitoring graphs) into a shared view, which helps drive down resolution times.
How do you communicate with business stakeholders?
IT outage management isn’t solely the concern of IT. Because outages potentially affect the entire business and the bottom line, organizations should also have a plan for how teams like Support, Legal, Marketing, and Sales are kept in the loop. Have an idea of what to share, set up a place where colleagues can easily get information, and determine who will get updates and how often.
Decide what to share
To keep things streamlined, the response team should only share key, high-level updates: How severe is the outage? What is its likely duration? What’s being done, and when can the team expect the next update?
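One way to keep updates consistent is to fill in the same small template every time. The sketch below covers the questions above. The `StatusUpdate` class, its field names, and the sample values are assumptions made for this example, not a prescribed format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class StatusUpdate:
    severity: str                # how severe is the outage?
    expected_duration: str       # what is its likely duration?
    current_action: str          # what's being done?
    sent_at: datetime            # assumed to be timezone-aware UTC
    update_interval: timedelta   # how long until the next update

    def render(self) -> str:
        """Build the high-level message sent to business stakeholders."""
        next_update = self.sent_at + self.update_interval
        return "\n".join([
            f"Severity: {self.severity}",
            f"Expected duration: {self.expected_duration}",
            f"What we are doing: {self.current_action}",
            f"Next update by: {next_update.strftime('%H:%M')} UTC",
        ])


# Hypothetical sample values for illustration.
update = StatusUpdate(
    severity="Major",
    expected_duration="about 1 hour",
    current_action="Rolling back the latest deploy",
    sent_at=datetime(2020, 1, 15, 14, 0, tzinfo=timezone.utc),
    update_interval=timedelta(minutes=30),
)
print(update.render())
```

Stating when the next update is due, rather than only the current status, tells stakeholders when to check back instead of interrupting the response team.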
Automate when you can
A solution like PagerDuty’s Stakeholder Engagement enables you to automatically notify individuals or groups of stakeholders via preferred contact methods. No more need to try and remember names of people to look up and contact during an outage. Stakeholders can also subscribe to incident status pages to check up on progress.
If colleagues have further questions, they shouldn’t distract individual members of the response team who are heads-down on the incident. To strike a balance between keeping things moving and providing additional context as needed, funnel questions and requests through the Incident Commander.
How do you communicate externally?
According to Inc. magazine, it’s 30 times cheaper to keep an existing customer than it is to get a new one. Being proactive in communicating an outage to users helps you control the story about your outage, and makes it clear that your company makes transparent communication a priority.
Be transparent with public updates
Let end users know that you are aware of the issue and at work on a solution. The outage notification can take many forms: a maintenance page on your website, a social media post, an update to your status page, or perhaps just an internal communication to your customer support team.
Craft your message
Provide short, to-the-point updates at regular intervals, and give customers practical information about how the issue affects them.
Enable your support team
A representative from support should always be immediately notified when a major outage takes place. This helps the support team stay on top of communicating the right messaging, updating your status page and support channels in real time, and reaching out to customers both during and after the issue.
How does PagerDuty support better outage communication?
PagerDuty supports better outage communication by enabling you to automate the best practice response. With PagerDuty’s Stakeholder Engagement, you can automatically engage the right stakeholders with real-time updates via their preferred communication channels, and orchestrate the right business-wide response to customer-impacting issues.
How to become great at outage communication
Try out PagerDuty incident resolution, automate stakeholder communications, streamline and learn from postmortems, and more — all according to best practice. Get started with a free 14-day trial.
Be sure to check out our ebook, Best Practices in Outage Communication, if you’d like to dive deeper into the best practices mentioned above. Our Incident Commander training is a great resource for building up Incident Commanders who can drive clarity in both internal and external communications during a response.