(This blog post is inspired by the talk that I will be giving at DevOps Talks Conference Melbourne and DevOps Talks Conference Auckland.)
by Matt Stratton
March 4, 2019
About a year ago, some technical difficulties at Citi temporarily shut off a few hundred thousand cards and a swath of ATMs at the same time. The result: Citi’s newly launched Costco Anywhere cards received a “flood of complaints.”
The Internet phrase for something on this scale is “tire fire.”
Incidents that escalate to tire fire status usually involve everyone in the organization, from leadership to users to the support desk. PR or marketing sounds the alarm and handles external communications, and the technical team is left to figure it out.
This means writing an internal post-mortem and an SLA-governed mea culpa to the outside world. These are often written as “root cause” analyses, focusing on blaming and correcting the people, processes, and technology involved in the incident.
Technical leaders can and should do better than blame in these situations. Yes, teams should move as fast as they can to triage and return service to normal. But in the process of measuring causes of the incident, the effectiveness of the response, and the impact, the goal should not be to focus solely on “root causes.”
A point-and-shoot approach assigns blame and asks for a budget. A portfolio approach shows how the current investments returned specific results, and how a reallocation might change those results. It helps the rest of the organization see how to invest in DevOps, support, and service teams.
For example, internal tools like ServiceNow, PagerDuty, and Slack are investments in speed and breadth — they help people get on issues across your entire infrastructure stack much faster. Building them out more might require tighter integrations with development-specific tools, more on-call staff, or a system for alerting users on mobile or in-app. These investments should not be presented ad hoc after an incident. Rather, the incident management and incident resolution metrics should be a way of showing how they are currently configured, and where people, process, and tools might be added to improve incident resolution outcomes.
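To make the "alerting users faster" investment concrete, here is a minimal sketch of triggering a PagerDuty incident through its Events API v2. The routing key and service names are placeholders, not details from this post; treat this as an illustration of how little glue code such an integration needs, not as a production implementation.

```python
import json
import urllib.request

# Placeholder -- substitute the integration key from your PagerDuty service.
ROUTING_KEY = "YOUR_INTEGRATION_KEY"

def build_trigger_event(summary: str, source: str, severity: str = "critical") -> dict:
    """Build a PagerDuty Events API v2 'trigger' payload."""
    return {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
        },
    }

def send_event(event: dict) -> bytes:
    """POST the event to the Events API v2 endpoint (requires a valid key)."""
    req = urllib.request.Request(
        "https://events.pagerduty.com/v2/enqueue",
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Build (but do not send) an example event for a hypothetical service.
event = build_trigger_event("Checkout cart app is down", "checkout-prod")
print(json.dumps(event, indent=2))
```

The same payload could just as easily be emitted by a monitoring check or a ChatOps command in Slack, which is what "tighter integrations" tends to mean in practice.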
This information should also be communicated in clear, "business-ready" language, since incidents necessarily force DevOps, TechOps, support, and the service organization to talk to the business side. For example:
Internal incident notifications (e.g. a change management ticket) are sent immediately to on-call staff via PagerDuty and Slack. The SLA requires same-day management communication with the account owner.
Internal incident notifications (e.g. the checkout cart app is down) are sent immediately to on-call staff, the management team, and support. The SLA requires management communication with the incident commander within one hour of the notification.
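Notification rules like these can be encoded so that they are checkable rather than tribal knowledge. Here is a minimal sketch in Python; the policy names and deadline values are invented for illustration and should be adapted to your own SLAs.

```python
from datetime import datetime, timedelta

# Hypothetical policy table mirroring the two notification rules above.
# Names and deadlines are illustrative, not taken from any real tool.
SLA_POLICIES = {
    "change_management": {
        "notify": ["on-call"],
        "comms_target": "account owner",
        "comms_deadline": timedelta(hours=24),  # "same day"
    },
    "customer_facing_outage": {
        "notify": ["on-call", "management", "support"],
        "comms_target": "incident commander",
        "comms_deadline": timedelta(hours=1),
    },
}

def comms_due_by(incident_type: str, opened_at: datetime) -> datetime:
    """Latest time management communication can happen under the SLA."""
    return opened_at + SLA_POLICIES[incident_type]["comms_deadline"]

opened = datetime(2019, 3, 4, 9, 30)
print(comms_due_by("customer_facing_outage", opened))  # 2019-03-04 10:30:00
```

A table like this is also the kind of artifact the business side can read directly: who gets paged, who must be told, and by when.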
This template can be used internally for incident responders and business stakeholders, as well as externally for customers and prospects. With no technical knowledge, the business side can understand incident history and resolution time. This data is an asset that the technical team can maintain, directly tying incident resolution and DevOps processes to the bottom line.
While the above will help you have the right conversation at the business level, an internal post-mortem is more introspective for DevOps and service teams. Ask: Are these processes correct? Is our infrastructure resilient enough? If not, how would we measure that, how would we know, and what would we change?
The most useful metrics depend on what makes sense for your specific team to analyze, but basic measures such as time to acknowledge, time to resolve, and incident frequency can give you a head start in answering those inevitable questions, and they don't require much process reinvention to get started. Just use modern ticketing, monitoring, incident resolution, collaboration, and customer satisfaction tools, many of which have analytics built in.
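As a sketch of how little tooling such metrics require, here is a hypothetical computation of mean time to acknowledge (MTTA) and mean time to resolve (MTTR) from incident timestamps of the kind any ticketing tool can export. The records below are invented for illustration.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records, as exported from a ticketing tool.
incidents = [
    {"opened": datetime(2019, 3, 1, 10, 0),
     "acknowledged": datetime(2019, 3, 1, 10, 5),
     "resolved": datetime(2019, 3, 1, 11, 0)},
    {"opened": datetime(2019, 3, 2, 14, 0),
     "acknowledged": datetime(2019, 3, 2, 14, 15),
     "resolved": datetime(2019, 3, 2, 14, 30)},
]

def minutes(start: datetime, end: datetime) -> float:
    """Elapsed time between two timestamps, in minutes."""
    return (end - start).total_seconds() / 60

# Mean time to acknowledge and mean time to resolve, in minutes.
mtta = mean(minutes(i["opened"], i["acknowledged"]) for i in incidents)
mttr = mean(minutes(i["opened"], i["resolved"]) for i in incidents)

print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")  # MTTA: 10.0 min, MTTR: 45.0 min
```

Tracked over time and segmented by service or incident category, these two numbers alone can show the business whether its incident-response investments are paying off.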
The above-mentioned PagerDuty and Slack are standard tools for incident resolution and collaboration. ServiceNow and the Atlassian suite are great for connecting incident and asset management. The important thing to keep in mind, above all, is that effectively resolving and preventing incidents relies not just on tools but on having a well-defined process that helps people use tools in an effective, integrated, and self-service way.
Never include "Other", "Misc", or any other catch-all categories when evaluating the effectiveness of your tools, process, and people—that's like building a trap door into all of your metrics. And while a template can be a good place to start, the team will get much more out of your reporting if you go beyond just copying from a template. Instead, begin with your team's intuition:
Don’t boil the ocean. Remember, you’re on the same team and it’s not a deposition.
Focus these questions on how your team handles these incidents (timeline, personnel, usage of tools, etc.) and sketch out priorities based on that. If you have the basic categories of incident resolution tooling and process covered, and metrics that track how the business can continue investing for improvements, then you are in great shape — even in the case of tire fires.
Photo: Springfield Tire Yard, from the Simpsons Wiki