When you’re in the middle of an outage, the last thing you want is people from all over the company constantly asking when it’s going to be fixed. You’re busy enough without having to play translator and communication whiz on top of everything else. But your outage affects people outside your team. You can’t neglect communicating with internal stakeholders like your manager, or your CTO, or your CEO, or your marketing department, or your customer support team. You see where I’m going with this. So how do you keep your internal stakeholders informed in a timely, efficient fashion?
Your internal stakeholders are, for the most part, not interested in the long story of how your system went down and what methods you’re using to fight the fire. They are only interested in a couple of key pieces of information: How severe is the outage, and what is its likely duration? What’s being done, and who’s working on it? Making sure this information is clear will help your colleagues on other teams do their own jobs.
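One way to keep every update focused on those questions is a fixed template. Here is a minimal sketch in Python; the `StatusUpdate` fields, the severity labels, and the wording of each line are illustrative assumptions, not a prescribed format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StatusUpdate:
    """Hypothetical container for the key facts stakeholders ask about."""
    severity: str                      # e.g. "SEV-1"; the label scheme is an assumption
    impact: str                        # one plain-language sentence on what is affected
    action: str                        # what is being done right now
    responders: List[str]              # who is working on it
    eta_minutes: Optional[int] = None  # None when the duration is not yet known


def format_update(update: StatusUpdate) -> str:
    """Render the update as a short plain-text message."""
    if update.eta_minutes is not None:
        eta = f"about {update.eta_minutes} minutes"
    else:
        eta = "unknown, next update to follow"
    return "\n".join([
        f"Severity: {update.severity}",
        f"Impact: {update.impact}",
        f"Expected duration: {eta}",
        f"What's being done: {update.action}",
        f"Who's working on it: {', '.join(update.responders)}",
    ])
```

Calling `format_update(StatusUpdate("SEV-1", "Checkout is returning errors", "Rolling back the last deploy", ["Ana", "Raj"], 30))` produces a five-line message that answers each question in the same order every time, which makes successive updates easy to compare.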
Blake Gentry, in his Heavybit talk on Incident Communication at Heroku, suggested appointing an Incident Commander who would be responsible for issuing hourly situation reports, or “sitreps,” to the entire company. This is a great idea, because it pushes key information out to stakeholders without them having to seek it out themselves. Which brings us to:
You don’t have time to keep regurgitating the same message over and over; you’ve got incidents to fix! So find a way to make this information available internally to whoever wants it, whenever they want it. You can either issue regular, company-wide updates, or use a status page, like statuspage.io, to give your internal stakeholders a central place to check for updates on their own schedule.
You can also use our amazing, fabulous API to create a custom dashboard displaying information like the number of incidents open, their severity, and contact information for your on-call engineers. You can also send out a periodic email update to key internal stakeholders. The key is to be proactive, no matter what method that takes.
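As a rough illustration of the dashboard idea, the sketch below reduces a list of incident records to the three figures mentioned above. The record shape (`severity`, `status`, and `on_call` keys) and the `"resolved"` status value are hypothetical, not the actual API response format; fetching the records is left out entirely.

```python
from collections import Counter


def summarize_incidents(incidents):
    """Summarize incident records for an internal dashboard.

    Each record is assumed to be a dict with hypothetical keys
    "severity", "status", and "on_call".
    """
    # Anything not marked resolved counts as open.
    open_incidents = [i for i in incidents if i["status"] != "resolved"]
    # Count open incidents per severity level.
    by_severity = Counter(i["severity"] for i in open_incidents)
    # De-duplicate and sort contacts so the display is stable.
    contacts = sorted({i["on_call"] for i in open_incidents})
    return {
        "open_count": len(open_incidents),
        "by_severity": dict(by_severity),
        "on_call_contacts": contacts,
    }
```

The returned dictionary can feed a web page, a chat bot, or the periodic email, so every channel reports the same numbers.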
You should have your communication plan in place beforehand. Don’t wait until you have a big incident to set up a dashboard or create an email distribution list. It’s better to have these things and never need them than to need them and not have them. You should also know who might need more detailed information. This list will usually include your managers and any other engineering teams that might need to jump in. But don’t leave out a customer support point person. Your customer service team has its own agenda, plan, and priorities for getting information about the outage out to customers. If they don’t know how to get updates, or if you don’t know whom you should be giving information to, then you might have a very early breakdown in communication, and one that you could have avoided.
Even with automation, your colleagues might have further questions about the outage. It’s important to find a balance here between keeping your priorities straight and not going radio silent. Appointing an Incident Commander to take responsibility for communication will help unburden the rest of your team and keep them focused on the problem. Remember that your whole company is invested in solving the same issue, so your Incident Commander should be sure to answer stakeholders’ questions quickly.
This will help you out in the long run. Giving stakeholders the information they need means having more people who can fix related issues elsewhere, and spread the word that your team is busy. It’s hard for a colleague who isn’t actually fighting the fire to understand that you’re being unresponsive or curt for a very good reason.