This is a guest post by Ilan Rabinovitch, Director of Product Management at Datadog. The convergence of rapid feature development, automation, continuous delivery, and the shifting...
by Ilan Rabinovitch
August 24, 2017
You’ve just realized that something has gone critically wrong, and you can’t fix it yourself. Particularly if you work within a collaborative DevOps environment, it’s better to get by with a little help from your friends. Effectively coordinating the incident response across subject matter experts and front-line responders is a secret to operational success that differentiates top teams. So it’s important that you have an effective and efficient way to sound the alarm and to make sure that your conversations are recorded and actionable.
The first step to effective outage communication within your incident response team is making sure the right people get involved. You should have clear processes in place for identifying subject matter experts, for contacting them, and for bringing them together in a single place. You should also have a designated team for handling external communication, and, if the outage is severe enough, you should loop them in right away so they can get a head start on alerting your customers. A system like PagerDuty can help on both counts by automatically storing on-call schedules and people’s preferred contact methods.
Second, documentation matters. During the course of the incident, the response team will uncover many pieces of information and make quick decisions about the best way to contain the issue. Documenting in the moment ensures that nothing is forgotten or missed. Fortunately, there are tools and processes that can help here.
First, you’re going to need to talk it out. Create a fixed conference line; no one should be wasting time setting up a bridge manually for every call. Everyone on the team should know the dial-in details, or where to find them. It’s a good idea to include the details in the PagerDuty event or in the service description, so they can be easily accessed when they’re needed. It’s also a good idea to record your calls in case you want to debug your process later.
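As a rough illustration of including the dial-in details in the event itself, here is a minimal Python sketch that builds a trigger event body carrying the bridge information. The field names follow the PagerDuty Events API v2 format as we understand it, so check them against the current documentation before relying on them. The bridge number, PIN, URL, routing key, and the build_trigger_event helper are all hypothetical placeholders. The sketch only builds the body; it does not send anything.

```python
import json

# Hypothetical bridge details; replace with your team's fixed conference line.
BRIDGE = {
    "dial_in": "+1-555-0100",
    "pin": "424242",
    "url": "https://example.com/bridge",
}


def build_trigger_event(routing_key, summary, source, severity="critical", bridge=BRIDGE):
    """Build a trigger event body that carries the conference bridge details.

    The layout is assumed to match the Events API v2 event format:
    custom_details holds the dial-in data, links holds a clickable join URL.
    """
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
            "custom_details": {
                "conference_dial_in": bridge["dial_in"],
                "conference_pin": bridge["pin"],
            },
        },
        "links": [{"href": bridge["url"], "text": "Join the incident bridge"}],
    }


if __name__ == "__main__":
    # Print the body so it can be inspected before wiring it to a real sender.
    event = build_trigger_event(
        "YOUR_ROUTING_KEY", "Artemis: elevated 5xx rate", "artemis-web-01"
    )
    print(json.dumps(event, indent=2))
```

Keeping the bridge details in one constant means every alert carries the same, current dial-in information, and responders never have to hunt for it.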
Phone calls are great for real-time conversations and discussion. But phone calls are flawed: a call only provides low-fidelity data on the outage and on how tasks are being distributed. There is no text to hold people accountable for the decisions made over the course of the conversation, so how can you track your conversations?
The answer is ChatOps. Having a simultaneous discussion in a chat client provides actionable, searchable, time-stamped data of who is doing what, and on what services. And make sure you name your services. Here at PagerDuty, our services are named after Greek deities. This way, our entire team can understand what we’re talking about when we refer to Artemis.
To make incident response even easier, you can connect your tools to your chat client. Pipe in PagerDuty incidents, and use plugins to customize and make the most of your chat service. For example, you can use a chat bot to post server updates to the chat, or you can have Datadog graphs bring analytics into the chat window. You can also issue commands to your tools from the chat, and bots can take actions or capture follow-up tasks.
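To make the idea of a bot capturing follow-up tasks concrete, here is a minimal Python sketch of a command handler. It is not tied to any particular chat service: the IncidentBot class and the !todo and !todos commands are hypothetical names chosen for illustration, and a real bot would receive messages through your chat client's own plugin or API.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentBot:
    """Toy chat bot that records follow-up tasks raised during an incident."""

    followups: list = field(default_factory=list)

    def handle(self, author, text):
        """Return the bot's reply, or None if the message is not a command."""
        if not text.startswith("!"):
            return None  # ordinary conversation, nothing for the bot to do

        command, _, argument = text[1:].partition(" ")
        argument = argument.strip()

        if command == "todo":
            if not argument:
                return "Usage: !todo <task>"
            self.followups.append((author, argument))
            return f"Captured follow-up #{len(self.followups)} from {author}: {argument}"

        if command == "todos":
            if not self.followups:
                return "No follow-ups captured yet."
            return "\n".join(
                f"{i}. {task} ({who})"
                for i, (who, task) in enumerate(self.followups, 1)
            )

        return f"Unknown command: !{command}"
```

During an incident, a responder could type "!todo add alert on replica lag" and keep working; "!todos" lists everything captured so far, ready to be turned into tickets after the incident.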
Save your chat transcript in a CMS or in PagerDuty notes, so that it can be cited later. This can be a great teaching instrument for post-resolution learning, and can help your team become more efficient in the future by learning from how they handled issues in the past. And that timestamped, searchable discussion that came in handy when you were solving your incident also makes it far easier to write a post-mortem.
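As a sketch of how a timestamped chat log can feed a post-mortem, the Python below turns exported chat lines into a timeline measured from the first message. The export format ("<ISO timestamp> <author>: <message>") and the sample messages are assumptions made for illustration; adapt the parser to whatever your chat client actually exports.

```python
from datetime import datetime


def parse_line(line):
    """Split '<ISO timestamp> <author>: <message>' into its three parts."""
    stamp, rest = line.split(" ", 1)
    author, message = rest.split(": ", 1)
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ"), author, message


def build_timeline(lines):
    """Return chat lines in time order, labeled with minutes since the first one."""
    entries = sorted(parse_line(line) for line in lines if line.strip())
    if not entries:
        return []
    start = entries[0][0]
    return [
        f"T+{int((ts - start).total_seconds() // 60):02d}m {author}: {message}"
        for ts, author, message in entries
    ]


if __name__ == "__main__":
    # Hypothetical transcript lines, deliberately out of order.
    transcript = [
        "2017-08-24T10:05:00Z bo: rolled back deploy",
        "2017-08-24T10:00:00Z ana: paged for Artemis 5xx",
    ]
    for entry in build_timeline(transcript):
        print(entry)
```

Expressing each entry as an offset from the first message makes it easy to see how long each step of the response took.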
One added benefit to ChatOps seems obvious, but is worth mentioning: written communication is generally higher in quality than spoken communication. Your team has more time to organize their thoughts than on a conference call or face-to-face, and they can more easily reference what other team members have said over the course of the conversation to create a clear plan of action.
Effective communication during an incident makes your life easier when you’re training new team members, too. You don’t have to reverse engineer your past experiences into a future plan of action, or a runbook. You’re writing training materials and action plans in real time, ready to use from the moment you’re done documenting and solving the incident.
How does your team communicate outages internally? Let us know in the comments section.
For further reading, check out Best Practices in Outage Communication: Customers.