(This blog post is inspired by the talk that I will be giving at DevOps Talks Conference Melbourne and DevOps Talks Conference Auckland. Hope to...
By Matt Stratton
March 4, 2019
It’s critical to have the right tools in place before a firefight happens. A lack of proper tooling makes it significantly more difficult to recognize, organize, fight, and resolve a major outage. This is especially true when teams are busy fighting rather than communicating with internal and external stakeholders. If best practices have been established ahead of time, a difficult incident can be handled much more smoothly.
The following is not an exhaustive list of areas to plan for before an outage, but addressing them will greatly improve your organization’s ability to coordinate and be prepared for any issue.
Internal communication commonly takes place over email, which is problematic for a number of reasons. Email is a point-to-point medium: it defaults closed, meaning a message is readable only by its sender and recipients, and long threads are bulky and difficult to parse when quick status information is needed. Persistent collaboration environments like Slack and HipChat provide an externally hosted place to share information, organized into public, opt-in, topical channels. During a critical incident, status updates (or a note that the issue is already known and being worked on) can reach key staff such as support and leadership in near real time.
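As a rough illustration of the near-real-time status update described above, here is a minimal sketch that formats an update and posts it to a chat channel through a Slack incoming webhook, which accepts a JSON body with a "text" field. The webhook URL is a placeholder, and the message format, function names, and incident wording are assumptions made for this example rather than anything prescribed here.

```python
import json
import urllib.request
from datetime import datetime, timezone
from typing import Optional

# Placeholder only: a real incoming-webhook URL is issued by Slack per channel.
WEBHOOK_URL = "https://hooks.slack.com/services/EXAMPLE/EXAMPLE/EXAMPLE"


def build_status_update(incident: str, status: str, detail: str,
                        now: Optional[datetime] = None) -> dict:
    """Build the JSON body for a Slack incoming webhook (a single "text" field)."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%H:%M UTC")
    return {"text": f"[{stamp}] {incident}: {status.upper()}. {detail}"}


def post_status_update(payload: dict, url: str = WEBHOOK_URL) -> int:
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


# Example: build (but do not send) an update for a hypothetical incident.
update = build_status_update(
    "Checkout errors", "investigating",
    "Known issue; next update in 15 minutes.",
)
print(update["text"])
```

Keeping the message builder separate from the network call makes it easy to reuse the same wording in other channels and to test the format before an incident.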
Ideally, the team will know there is an issue with an application before the customer does. Application and infrastructure monitoring tools (New Relic for application monitoring and AWS CloudWatch are two examples) can help ensure this is the case, and during an outage they show whether a fix or update is working as it should. It is also important to monitor both application performance and infrastructure performance and, ideally, to link the two with a solution such as PagerDuty, which consolidates service health data into a single view and notifies the on-call responder if an issue requires urgent action. It is much easier to troubleshoot an issue and identify the root cause when you have visibility into both layers.
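To make the idea of linking application and infrastructure signals concrete, the sketch below combines failing checks from both layers into one alert shaped like a PagerDuty Events API v2 "trigger" event. The check names, the severity rule (critical only when both layers are failing), the source label, and the routing key are all assumptions for illustration; the monitoring tools themselves would supply the real check results.

```python
import json
import urllib.request
from typing import Dict, Optional

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = "YOUR_INTEGRATION_KEY"  # placeholder, not a real key


def correlate(app_checks: Dict[str, bool],
              infra_checks: Dict[str, bool]) -> Optional[dict]:
    """Return one trigger event covering both layers, or None if all is healthy.

    Each dict maps a check name to True when that check is healthy.
    """
    failing_app = sorted(name for name, ok in app_checks.items() if not ok)
    failing_infra = sorted(name for name, ok in infra_checks.items() if not ok)
    if not failing_app and not failing_infra:
        return None
    # Assumed rule: failures in both layers at once are treated as critical.
    severity = "critical" if failing_app and failing_infra else "warning"
    return {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "payload": {
            "summary": "Failing checks: " + ", ".join(failing_app + failing_infra),
            "source": "health-aggregator",  # hypothetical source name
            "severity": severity,
            "custom_details": {
                "application": failing_app,
                "infrastructure": failing_infra,
            },
        },
    }


def send(event: dict) -> int:
    """POST the event to the Events API and return the HTTP status code."""
    request = urllib.request.Request(
        EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


event = correlate({"checkout_latency": False}, {"db_cpu": False, "disk": True})
print(json.dumps(event["payload"], indent=2))
```

Carrying both lists in `custom_details` means the on-call responder sees the application symptom and the infrastructure symptom side by side, which is the point of linking the two layers.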
When there is a performance issue, support teams will be inundated with requests for updates. Key ways to mitigate this influx are Twitter, a status page, and a stakeholder engagement product like PagerDuty. These channels are separate from your primary infrastructure and should be resilient to even site-wide outages. On Twitter, users who are having an issue can easily look for pinned tweets and recent replies. Users can also check statusapp.com for any “yellow” or “red” statuses.
An easy-to-read status page like the one from statuspage.io is a critical channel for getting information to your customers during an outage. Users will come to trust the page if it is accurate and includes updates even for minor disruptions, and in that way they also build more trust in your business. The page should show when an issue is being investigated, include a status for each major subcomponent, and be updated within minutes of any change.
Finally, with capabilities like PagerDuty’s Stakeholder Engagement, any incident responder can send a status update to predefined groups of business stakeholders over their preferred notification channel: phone, SMS, email, or push notification. Stakeholders can also subscribe to incident status updates to get real-time information on any customer-impacting issue.
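The per-component statuses and "yellow" or "red" levels mentioned for a status page can be modelled very simply. The following is a sketch only: the three-level scale, the component names, and the worst-status-wins rollup are assumptions for illustration, not how any particular status page product works.

```python
from enum import IntEnum
from typing import Dict


class Status(IntEnum):
    """Assumed three-level scale; higher values are worse."""
    GREEN = 0
    YELLOW = 1
    RED = 2


def overall(components: Dict[str, Status]) -> Status:
    """Roll up to the worst status of any component (GREEN if there are none)."""
    return max(components.values(), default=Status.GREEN)


def render(components: Dict[str, Status]) -> str:
    """Render a plain-text status page with one line per subcomponent."""
    lines = [f"Overall: {overall(components).name}"]
    for name in sorted(components):
        lines.append(f"  {name}: {components[name].name}")
    return "\n".join(lines)


# Hypothetical subcomponents of a service.
page = {"API": Status.GREEN, "Dashboard": Status.YELLOW, "Billing": Status.GREEN}
print(render(page))
```

Reporting each subcomponent separately lets a customer see that, for example, only the dashboard is degraded, rather than guessing from a single site-wide indicator.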
A ticketing solution like Zendesk is critical to managing an outage. A significant outage can be highly disruptive and cost substantial goodwill. A ticket management system helps identify intermittent issues that an application monitor may have missed, and it tracks and organizes the information that comes with an influx of support requests. Escalation workflows raise potential issues more quickly than relying on individual judgement, especially on larger support teams. Ready-made message templates keep messaging consistent and accurate during an outage, and “related to” tags make it easier to debrief an issue once it has been resolved.
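An escalation workflow of the kind described above can be as simple as a rule that fires when enough tickets with the same tag arrive in a short window. The sketch below shows one such rule; the 15-minute window, the threshold of five tickets, the tag names, and the tuple-based ticket format are assumptions chosen for the example and do not reflect any specific ticketing product.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

Ticket = Tuple[datetime, str]  # (created_at, tag)


def tags_to_escalate(tickets: Iterable[Ticket], now: datetime,
                     window: timedelta = timedelta(minutes=15),
                     threshold: int = 5) -> List[str]:
    """Return tags that reached the threshold within the recent window."""
    recent = Counter(
        tag for created_at, tag in tickets
        if timedelta(0) <= now - created_at <= window
    )
    return sorted(tag for tag, count in recent.items() if count >= threshold)


# Hypothetical ticket stream: a burst of login failures, plus older billing tickets.
now = datetime(2019, 3, 4, 12, 0)
tickets = [(now - timedelta(minutes=i), "login-failure") for i in range(5)]
tickets += [(now - timedelta(minutes=40), "billing")] * 5
tickets.append((now - timedelta(minutes=1), "billing"))
print(tags_to_escalate(tickets, now))
```

A rule like this flags the burst without depending on any one support agent noticing the pattern, which matters most on larger teams where related tickets land with different people.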
With proper procedures in place, an organization can anticipate the issues that are likely to arise from its applications. These scenarios should be documented ahead of time, with troubleshooting, mitigation, and remediation information surfaced for the team. The procedure can also include a checklist of duties that lays out who does what and includes emergency numbers and who is on call. If resources are available, a tabletop exercise of a mock outage is extremely helpful in identifying gaps before a major outage occurs. After a firefight, debrief with the team in a post-mortem and improve your procedures. There will be another outage, and any information you add to your process will speed recovery. As with the items above, your local architecture may become unavailable, so storing these procedures in an externally hosted repository, or automating them with a solution such as PagerDuty, is preferred.
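One way to keep such a procedure reviewable is to store it as structured data that can be checked for gaps and exported to an external repository. The following is a minimal sketch under that assumption; the field names, the gap rules, and the sample scenario are hypothetical and would need to match your own process.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Step:
    duty: str
    owner: str = ""  # role responsible; empty means unassigned


@dataclass
class Runbook:
    scenario: str
    on_call: str
    emergency_numbers: List[str] = field(default_factory=list)
    steps: List[Step] = field(default_factory=list)

    def gaps(self) -> List[str]:
        """List missing pieces, the kind of thing a tabletop exercise surfaces."""
        problems = [f"no owner for: {s.duty}" for s in self.steps if not s.owner]
        if not self.on_call:
            problems.append("no on-call contact")
        if not self.emergency_numbers:
            problems.append("no emergency numbers")
        return problems

    def to_json(self) -> str:
        """Serialize for storage in an externally hosted repository."""
        return json.dumps(asdict(self), indent=2)


# Hypothetical runbook with deliberate gaps.
runbook = Runbook(
    scenario="Database failover",
    on_call="",
    steps=[Step("Declare incident", "Incident commander"),
           Step("Post status update")],
)
for problem in runbook.gaps():
    print(problem)
```

Running the gap check as part of a mock outage, and again after each post-mortem, gives a quick signal that the checklist still names an owner for every duty.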
These tools are only an initial list. They are only as effective in an outage as the time spent configuring and understanding them ahead of time. Communicating with both internal and external stakeholders is key in any firefight, in IT as much as in any other function or industry.