Improve Incident Response by Getting Control of Your (Unintelligent) Swarm
Incidents happen. Things go wrong. Systems fail. Sometimes they fail in unexpected and dramatic ways that create Major Incidents. PagerDuty makes a very specific distinction between an incident and an Incident. Your organization may also make such a distinction.
Determining if an incident is major or not can come down to a number of factors, or a specific combination of factors, like the number of services affected, the customer impact, and the duration of the incident.
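As a rough illustration, the factors above could be combined into a simple rule. This is a minimal sketch, not PagerDuty's definition of a major incident: the field names, the thresholds, and the rule itself are hypothetical, and each organization would tune its own.

```python
from dataclasses import dataclass


@dataclass
class IncidentSnapshot:
    services_affected: int    # number of services currently degraded
    customers_impacted: bool  # is customer-facing functionality affected?
    duration_minutes: int     # minutes since the incident was opened


def is_major(incident: IncidentSnapshot,
             max_services: int = 3,
             max_minutes: int = 30) -> bool:
    """Apply a hypothetical major-incident rule.

    Assumed rule: wide impact is major on its own, and customer impact
    becomes major once it has lasted at least max_minutes.
    """
    if incident.services_affected >= max_services:
        return True
    return incident.customers_impacted and incident.duration_minutes >= max_minutes
```

Note that every input here depends on the baseline telemetry and service relationships discussed next; without them, the function has nothing reliable to evaluate.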
These factors require your organization to have at least some baseline telemetry and a handle on the relationships among the services that make up your technical ecosystem. Without this baseline, it’s hard to know real impacts and where to start when triaging an incident.
What happens when an organization is missing key data? Without answers to the following questions, an organization will have difficulty responding to incidents:
- Which services are impacted?
- How much are they impacted?
- Who owns those services?
In the absence of this data, some organizations choose to use a swarm approach to their incident response.
Swarming vs Intelligent Swarming
Swarming is an approach to incident response that alerts everyone in the organization that there is a problem and opens a large war room or conference call for everyone to join, regardless of their potential to help resolve the issue. To reduce the impact of an incident, it is crucial that the right people are mobilized at the right time. Swarming is the opposite of getting the right people in the right place at the right time – it’s just everyone, for the entire time.
The term intelligent swarming is used to refer to a workflow for dealing with customer service issues, especially for VIPs, which we’ve talked about earlier this month. It’s a somewhat different approach, specifying that the team member who first picked up the case should see it through to the resolution and has the ability to pull in resources from around the organization to help solve the problem. While it’s related to a general response swarm, the focus of an intelligent swarm is usually a single customer and centering their experience.
Swarming for a general technical incident response is more like hearing the fire alarm in a building: everyone is on high alert and is expected to respond. Essentially, an alert is sent to anyone who might have any knowledge of anything at all, asking them to join the incident, and then the laborious process of figuring out who can triage and remediate begins.
Organizations often swarm because they don’t have enough information about their services and ecosystem or they don’t have strong communication practices to keep stakeholders informed. When something happens, no one is sure what the problem might be, where it might be happening, or who might know how to fix it, so everyone is mobilized in case they might have some vital knowledge to add. This makes swarming incredibly expensive. Work is disrupted, tasks and meetings are derailed, and resources are stranded in a place where they aren’t effective. Hundreds of people might be mobilized to respond to an incident that only a handful can actually deal with, when the rest could have continued their work uninterrupted and received appropriate updates.
Swarming is also hard. Large calls with a lot of responders can be noisy and confusing. Swarming slows down the recovery process for incidents because there is no clear coordination or path of responsibility. Information is coming from all kinds of directions with no central organization or decision-making authority. Teams might be attempting to remediate their own services without fully understanding the impacts to other services. Swarming is one of the reasons we have an explicit incident command practice – to cut down on confusion and facilitate resolution of the incident as quickly as possible without making things worse.
Swarming can feel comfortable, in that the team believes they will always have everyone they could possibly need on an incident call from the first alert, instead of bringing folks on when it’s determined that their systems are impacted or implicated. Improving your on-call behaviors will alleviate the fears that folks won’t be available to remediate. Having an explicit on-call rotation with agreed-upon responsibilities is less stressful for responders than worrying that an all-call page might come at any time. If responders know they will have an on-call shift during certain days and hours, they can plan ahead. In a swarming scenario, there’s still a chance that the person you’ll need is unavailable – they can’t be on-call 24x7x365.
Moving on from the Swarm
Improving your process from swarming requires changing how your team thinks about services and the teams that own them. At PagerDuty, we refer to this practice as “Full Service Ownership”, and you can read more about it in our Ops Guide. In the context of a coordinated incident response, ownership of a service means a few things:
- A single team has full responsibility for the service, including its performance in the production environment.
- That team has a documented process for being notified of an issue on that service. Generally, this is your on-call schedule.
- The dependencies consumed by the service are documented.
Your organization might have services that don’t currently have a clear owner. They might be mature or legacy projects that no longer require active development or attention. They might be commercial off-the-shelf (COTS) products that are maintained in conjunction with the vendor, or be SaaS solutions, or even internal services that were orphaned by organizational changes. If services are in your production ecosystem, teams should be assigned to keep an eye on them, even if that only requires subscribing the team email alias to the vendor’s updates to start. Every service running in your environment should have a team that is explicitly responsible – these services can still be involved in incidents or need work like security updates. Some organizations have legacy engineering teams, or platform engineering teams, that will be responsible for these services.
Assigning services to a single team reduces confusion over who owns what in the environment. Teams can train new members on the services they own and manage them to the service SLOs that are most impactful. Creating a service directory with a complementary team ownership structure that lists who to notify provides everyone in the organization with a resource to consult when they see an issue. We accomplish this in PagerDuty with teams and escalation policies attached to services.
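To make the idea of a service directory concrete, here is a minimal in-memory sketch. It is not the PagerDuty data model or API: the `Service` and `ServiceDirectory` classes, their fields, and the example names are all hypothetical, chosen only to show one owner per service plus a "who to notify" lookup.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    owning_team: str            # exactly one team owns the service
    on_call_contacts: list[str]  # ordered, first responder first


class ServiceDirectory:
    """Hypothetical directory mapping each service to its single owner."""

    def __init__(self) -> None:
        self._services: dict[str, Service] = {}

    def register(self, service: Service) -> None:
        # Refuse a second registration so ownership stays unambiguous.
        if service.name in self._services:
            raise ValueError(f"{service.name} already has an owner")
        if not service.on_call_contacts:
            raise ValueError(f"{service.name} needs at least one contact")
        self._services[service.name] = service

    def who_to_notify(self, service_name: str) -> tuple[str, str]:
        """Return (owning team, first contact) for a service."""
        service = self._services[service_name]
        return service.owning_team, service.on_call_contacts[0]


directory = ServiceDirectory()
directory.register(Service("checkout", "payments-team", ["alice", "bob"]))
team, contact = directory.who_to_notify("checkout")
```

Rejecting duplicate registrations is a design choice in this sketch: it forces the question of ownership to be settled once, when the service is added, rather than during an incident.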
The escalation policy sets the guidelines for who is expected to be available to respond to incidents on a service. The responder in this case should then be someone who is knowledgeable about the impacted service with the appropriate access to triage and fix the issue.
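One common way to model an escalation policy is as an ordered list of levels, each with a window in which the responder is expected to acknowledge. The sketch below assumes that model; `EscalationLevel`, `current_responder`, and the timeouts are hypothetical and do not describe how PagerDuty implements escalation.

```python
from dataclasses import dataclass


@dataclass
class EscalationLevel:
    responder: str
    timeout_minutes: int  # how long this level has to acknowledge


def current_responder(levels: list[EscalationLevel],
                      minutes_unacknowledged: int) -> str:
    """Return who should be paged after a given unacknowledged time."""
    if not levels:
        raise ValueError("an escalation policy needs at least one level")
    elapsed = 0
    for level in levels:
        elapsed += level.timeout_minutes
        if minutes_unacknowledged < elapsed:
            return level.responder
    # Past every timeout: stay with the last level rather than paging everyone.
    return levels[-1].responder


policy = [
    EscalationLevel("primary on-call", 15),
    EscalationLevel("secondary on-call", 15),
    EscalationLevel("engineering manager", 30),
]
```

With this policy, an incident that has gone unacknowledged for 20 minutes would be with the secondary on-call, which is a much narrower escalation path than an all-call page.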
A clear dependency model establishes the relationships among services so responders, support, and stakeholders have a clear picture of how an incident on one service might be impacting other services in the environment. PagerDuty goes one step further and offers business services, which link technical services not only to each other but also to the customer-facing functionality that they contribute to. All of the technical and business services appear on the service graph, along with a handy link to the team member who is currently on-call for that service.
Building up this infrastructure data, particularly the dependency model, can be a lot of work if it hasn’t been kept up to date for a service. Knowing the full impact of an incident on a backend service is impossible, though, if the team doesn’t know what other services are consuming the service with the issue.
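Once dependencies are documented, finding the services affected by a backend failure becomes a graph traversal. The sketch below assumes the dependency model is a plain mapping from each service to the services it consumes; the function name and the example services are hypothetical.

```python
from collections import deque


def impacted_services(depends_on: dict[str, list[str]],
                      failing: str) -> set[str]:
    """Return every service that directly or transitively consumes `failing`."""
    # Invert the edges: for each dependency, record which services consume it.
    consumers: dict[str, set[str]] = {}
    for service, dependencies in depends_on.items():
        for dependency in dependencies:
            consumers.setdefault(dependency, set()).add(service)

    # Breadth-first walk upstream from the failing service.
    impacted: set[str] = set()
    queue = deque([failing])
    while queue:
        current = queue.popleft()
        for consumer in consumers.get(current, ()):
            if consumer not in impacted:
                impacted.add(consumer)
                queue.append(consumer)

    impacted.discard(failing)  # only relevant if the graph has a cycle
    return impacted


dependencies = {
    "checkout": ["payments", "cart"],
    "payments": ["database"],
    "cart": ["database"],
    "search": ["index"],
}
affected = impacted_services(dependencies, "database")
```

Here a database incident reaches payments, cart, and checkout but not search, so only the owners of those services need to be informed or mobilized.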
Customer Support teams will benefit from this work as well. Intelligent swarming depends on having all of this information at the fingertips of your support team. When your customers need a solution, your team needs to be able to find all the correct information and mobilize the right people.
Improving Incident Communications
Incident response isn’t really a spectator sport – resolving an incident is often not particularly exciting. There can be long periods of waiting for checks and processes to run, tracking down error messages, or waiting for restarts. While this work goes on, not much changes. However, while these tasks are proceeding, folks who aren’t directly involved in the remediation still want to know what’s going on. The lack of a strong incident communications plan is another reason teams resort to swarming. If someone wants to know what’s going on, the only way to find out is to join the call and listen, no matter how long the resolution takes.
Having a strong pre-determined communications plan for major incidents has two functions: helping internal users stay up-to-date about what’s going on and keeping external users informed. In our incident response guide we specify two roles for communicating during an incident: the customer liaison and the internal liaison. It’s expected that you’ll have different updates for these two groups. Depending on your organization, what you release publicly about an incident might have to be reviewed or use specific language, so creating templates and assigning specific team members to the role of communications liaison will facilitate that. Your internal communications will likely have more details so other teams can determine if their services might see an impact.
The best plans are based on keeping all stakeholders informed on a regular cadence. Communicating early and often lets everyone know that the situation is being worked on and that they’ll be informed when things are fixed.
You don’t have to swarm with a NOC
It is possible to move toward a modern incident response model when your first line of responders is a general purpose NOC team. Explicit service ownership means the NOC can escalate complex issues to service teams when they aren’t able to resolve an incident. It gives the NOC a direct line for who to call when an issue needs additional support from the subject matter experts – paging the on-call responder for the team that owns Service A is much easier than gathering up a wide variety of people from across the organization.
Modernizing your response methods saves your organization time and resources. PagerDuty’s customers like SAP are reaping the benefits of mobilizing only the responders that are needed, when they are needed, to focus on providing the most effective response.
If your team is looking for a way to reduce time to resolve and limit the need for those huge swarm calls, check out our resources in our Incident Response Ops Guide. Not sure what all might be required for full service ownership? Check out our video and stop by our community forums to chat with like-minded folks.