About a year ago, some technical difficulties at Citi temporarily shut off a few hundred thousand cards and a swath of ATMs at the same time. The result: Citi’s newly launched Costco Anywhere cards received a “flood of complaints.”
The Internet phrase for something on this scale is “tire fire.”
Incidents that escalate to tire-fire status pull in everyone in the organization, from leadership to users to the support desk. PR or marketing sounds the alarm and handles external communications, while the technical team is left to figure out what went wrong.
This means writing an internal post-mortem and an SLA-governed mea culpa to the outside world. These are often written as “root cause” analyses, focusing on blaming and correcting the people, processes, and technology involved in the incident.
Technical leaders can and should do better than blame in these situations. Yes, teams should move as fast as they can to triage and return service to normal. But in the process of measuring causes of the incident, the effectiveness of the response, and the impact, the goal should not be to focus solely on “root causes.”
The point-and-shoot approach assigns blame and asks for a budget. A portfolio approach shows how current investments returned specific results, and how a reallocation might change those results. It helps the rest of the organization see how to invest in DevOps, support, and service teams.
Talk Business to Me!
For example, internal tools like ServiceNow, PagerDuty, and Slack are investments in speed and breadth — they help people get on issues across your entire infrastructure stack much faster. Building them out more might require tighter integrations with development-specific tools, more on-call staff, or a system for alerting users on mobile or in-app. These investments should not be presented ad hoc after an incident. Rather, the incident management and incident resolution metrics should be a way of showing how they are currently configured, and where people, process, and tools might be added to improve incident resolution outcomes.
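As a sketch of the kind of integration work this might involve, here is a minimal example of pushing an incident notification into a chat channel via a Slack incoming webhook. The webhook URL, incident fields, and message format are all hypothetical, standing in for whatever your alerting pipeline produces:

```python
import json
from urllib import request

def build_incident_message(incident_id: str, priority: int, summary: str) -> dict:
    """Build a Slack incoming-webhook payload for an incident alert.
    The field layout here is illustrative, not a fixed schema."""
    return {
        "text": f":rotating_light: [P{priority}] {incident_id}: {summary}"
    }

def notify_slack(webhook_url: str, payload: dict) -> None:
    """POST the payload to a Slack incoming webhook (hypothetical URL)."""
    req = request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    request.urlopen(req)  # fires the alert
```

Wiring the same payload into a second channel (say, an on-call pager) is then a matter of adding another `notify_*` function, which is exactly the kind of incremental investment the paragraph above describes.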
All of this should be communicated in clear, “business-ready” language, since incidents necessarily force DevOps, TechOps, support, and the service organization to talk to the business side.
Here’s a very basic example framework for communicating on incidents:
Priority 3 example: Internal incident notifications (e.g. a change management ticket) are sent immediately to on-call staff (via PagerDuty and Slack). The SLA requires same-day management communication with the account owner.

- (Historical percentage)% of Priority 3 incidents resolved within the SLA-agreed target
- (Percentage)% of all incidents in the relevant time frame classified as Priority 3

Priority 1 example: Internal incident notifications (e.g. the checkout cart app is down) are sent immediately to on-call staff, the management team, and support. The SLA requires management communication with the incident commander within one hour of this notification.

- (Historical percentage)% of Priority 1 incidents resolved within the SLA-agreed target
- (Percentage)% of all incidents in the relevant time frame classified as Priority 1
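Filling in those placeholder percentages is a simple aggregation over your ticket data. This sketch assumes a hypothetical ticket export of `(priority, time-to-resolve)` pairs and per-priority SLA targets; your ticketing tool's fields and your actual targets will differ:

```python
from datetime import timedelta

# Hypothetical ticket export: (priority, time-to-resolve) pairs
tickets = [
    (1, timedelta(minutes=45)),
    (1, timedelta(hours=3)),
    (3, timedelta(hours=20)),
    (3, timedelta(days=2)),
]

# Assumed SLA targets per priority (yours will differ)
sla_targets = {1: timedelta(hours=1), 3: timedelta(days=1)}

def pct_within_sla(tickets, sla_targets, priority):
    """Percentage of tickets at this priority resolved within the SLA target."""
    at_priority = [t for p, t in tickets if p == priority]
    if not at_priority:
        return 0.0
    within = sum(1 for t in at_priority if t <= sla_targets[priority])
    return 100.0 * within / len(at_priority)
```

Run over a real export, the same function produces the historical numbers the template above asks for, one call per priority level.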
This template can be used internally for incident responders and business stakeholders, as well as externally for customers and prospects. With no technical knowledge, the business side can understand incident history and resolution time. This data is an asset that the technical team can maintain, directly tying incident resolution and DevOps processes to the bottom line.
While the above will help you have the right conversation at the business level, an internal post-mortem is more introspective for DevOps and service teams. Ask: Are these processes correct? Is our infrastructure resilient enough? If not, how would we know, and what would we change?
Here are some example metrics to consider when determining how your team is doing:
- We prioritize incidents appropriately, based on their impact and urgency:
  - Number of tickets where the priority was changed after logging
  - Number of additional tickets created due to complaints or escalations
  - Number and tier of personnel assigned to each priority of ticket
- We communicate well, so customers and users understand what is happening and when they can expect their incidents to be resolved:
  - Percentage of incidents where the customer contacted the service desk to ask for an update
- Customers and users are satisfied with the way we handle incidents:
  - Percentage of users giving a score of 4 or 5 on the post-incident satisfaction survey
  - Increased satisfaction with incident resolution on the annual customer satisfaction survey
- We recognize repeating incidents and explain problems on the public forum to help reduce future negative impact:
  - Number of problems logged by the service desk that were uncovered in the forum
  - Number of tickets redirected to the forum
  - Number of tickets generated by the forum
- We make efficient use of our incident resolution investments and tooling:
  - Percentage of incidents logged via email/forum/application
  - Percentage of incidents detected and resolved with self-service tools
  - Average cost to resolve an incident (by priority)
  - Mean time to resolve incidents since investing in tools
  - Percentage reduction in the number of incidents since investing in tools
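Most of these numbers fall out of simple aggregations over exported ticket data. As an illustration, here is a sketch of two of the metrics above; the record fields are hypothetical stand-ins for whatever your ticketing tool exports:

```python
from statistics import mean

# Hypothetical ticket records exported from a ticketing tool
tickets = [
    {"priority": 1, "priority_changed": True,  "hours_to_resolve": 2.0},
    {"priority": 1, "priority_changed": False, "hours_to_resolve": 0.5},
    {"priority": 3, "priority_changed": False, "hours_to_resolve": 30.0},
]

def priority_changes(tickets):
    """Number of tickets where the priority was changed after logging."""
    return sum(1 for t in tickets if t["priority_changed"])

def mean_time_to_resolve(tickets, priority):
    """Mean hours to resolve, for a given priority."""
    times = [t["hours_to_resolve"] for t in tickets if t["priority"] == priority]
    return mean(times) if times else None
```

Computed monthly or quarterly, these become the trend lines that show whether tooling and process investments are actually paying off.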
There are many more metrics, depending on what makes the most sense to analyze for your specific team, but these can give you a head start on answering those inevitable questions, and they don’t require much process reinvention. Just use modern ticketing, monitoring, incident resolution, collaboration, and customer satisfaction tools, many of which have analytics built in.
The above-mentioned PagerDuty and Slack are standard tools for incident resolution and collaboration. ServiceNow and the Atlassian suite are great for connecting incident and asset management. The important thing to keep in mind, above all, is that effectively resolving and preventing incidents relies not just on tools but on having a well-defined process that helps people use tools in an effective, integrated, and self-service way.
Never include “Other”, “Misc”, or any other catch-all categories when evaluating the effectiveness of your tools, process, and people; that’s like building a trap door into all of your metrics. Also, while a template can be a good place to start, the team will get much more out of your reporting if you go beyond just copying it. Instead, begin with your team’s intuition:
- Is a billing module error categorized as P1 or P2 for your service?
- For which customers would it be P1?
- Are there customers for whom everything is P1?
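One way to capture that intuition is to write the triage rules down as a small lookup table, so they can be reviewed, argued about, and versioned alongside the runbook. The incident types, customer tiers, and priority assignments below are all hypothetical:

```python
# Hypothetical triage rules: (incident_type, customer_tier) -> priority
PRIORITY_RULES = {
    ("billing_error", "enterprise"): 1,   # for some customers, billing is P1
    ("billing_error", "standard"):   2,
    ("checkout_down", "enterprise"): 1,
    ("checkout_down", "standard"):   1,   # checkout down is P1 for everyone
}

def triage(incident_type: str, customer_tier: str, default: int = 3) -> int:
    """Look up the agreed priority for an incident, falling back to a default."""
    return PRIORITY_RULES.get((incident_type, customer_tier), default)
```

The point is not the code itself but that the answers to the questions above end up recorded somewhere explicit, instead of living in one responder's head.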
Don’t boil the ocean. Remember, you’re on the same team and it’s not a deposition.
Focus these questions on how your team handles these incidents (timeline, personnel, usage of tools, etc.) and sketch out priorities based on that. If you have the basic categories of incident resolution tooling and process covered, and metrics that track how the business can continue investing for improvements, then you are in great shape — even in the case of tire fires.
Photo in Springfield Tire Yard on Simpsons Wiki