Why Your Engineering Teams Need Incident Commanders
In any fast-paced engineering environment, unexpected incidents can arise and escalate without warning. Effective leadership is key when this happens since coordination and decision-making across teams and business functions are urgent and challenging. Without strong leadership, you get chaotic, stressful, and tiring situations that waste valuable engineering time, slow down resolution, and most importantly, impact your customers.
At PagerDuty, we have a mature, proven incident response system led by Incident Commanders. Incident Commanders provide the leadership needed to help stabilize major incidents fast. Read on to learn about the role from a member of our team and discover ways you can implement a similar system in your organization.
“I’m James, and I’m the Incident Commander.”
That’s me announcing myself on a major incident response call. Everyone on the call now knows that there’s an Incident Commander (IC) present who will support the response team and assertively direct the incident towards stabilization and resolution.
But I’m not the only IC. We have a diverse and supportive team of 14 women and men from a variety of backgrounds who are a mix of engineers, engineering managers, and product managers with varying lengths of tenure at PagerDuty. Anyone is welcome to join us as long as they have strong, assertive communication skills and an ability to remain calm under pressure.
We don’t dig into logs, check servers, make code changes, or deploy patches. That’s the job of Subject Matter Experts (SMEs—engineers, in this case) who are paged into the call and follow the IC’s directions. A representative from Customer Support is also a member of the response team and helps write external communication under the guidance of the IC.
How does the incident command system work?
Prior to becoming an engineer 15 years ago, I worked at an airline as a flight service manager. This gave me exposure to rigorous incident response practice. Roles, drills, checklists, and standard operating procedures are memorized and implemented using a defined command structure. This allows the flight and cabin crew to remain calm under high-stress emergency situations and enables them to take action quickly to avoid catastrophic problems.
Similarly, our incident response process, which was adapted from the United States National Incident Management System (NIMS), is well defined, documented, and rehearsed. ICs use checklists to guide actions so that proven practices are followed and roles and responsibilities are clear. This leads to an organized, efficient response.
Why is the Incident Commander role so critical?
The Incident Commander helps the response team maintain focus, bringing a sense of order to unpredictable, fast-paced situations. They do this by:
- Getting the right people on the call, and ensuring responders are supported.
- Assigning someone as “scribe” so that discussions, decisions, and actions are recorded in a Slack channel for everyone to see.
- Asking questions assertively in a methodical, direct, unambiguous manner so that options are weighed up and clearly understood.
- Gaining rapid consensus by asking the response team if there are “any strong objections” to proposed actions.
- Delegating actions to named individuals, getting their acknowledgement, and having them report back on progress and discoveries at agreed time intervals.
- Helping coordinate clear, regular outward communication to stakeholders and customers in collaboration with communication experts from the Customer Support team.
- Encouraging people to leave the call if they’re tired or no longer needed.
- Ensuring the team explores and takes action on new lines of investigation if the situation isn’t improving.
- Scaling down the response and resolving the incident quickly once the problem is stabilized to ensure everyone gets back to bed or their regular activities.
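For teams that like to see a process in code, the delegation habit in the list above (named owner, acknowledgement, agreed check-in interval) can be sketched roughly as follows. This is only an illustrative model, not PagerDuty tooling: the `DelegatedAction` class, its fields, and the `actions_to_chase` helper are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DelegatedAction:
    """One action the IC has handed to a named responder (hypothetical model)."""
    owner: str                  # the named individual, never "someone"
    description: str            # what they were asked to do
    assigned_at: datetime       # when the IC delegated the action
    check_in_every: timedelta   # the agreed reporting interval
    acknowledged: bool = False  # has the owner confirmed they have it?

    def next_check_in(self) -> datetime:
        # The first report is due one agreed interval after assignment.
        return self.assigned_at + self.check_in_every

    def needs_follow_up(self, now: datetime) -> bool:
        # Chase if the owner never acknowledged, or if the agreed interval has elapsed.
        return not self.acknowledged or now >= self.next_check_in()


def actions_to_chase(actions, now):
    """Return the actions the IC should ask about right now, in their original order."""
    return [action for action in actions if action.needs_follow_up(now)]
```

A scribe or a chat bot could keep a list of these records and remind the IC whenever `actions_to_chase` returns something, so nobody has to track the timers in their head during the call.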
Once the incident is resolved, the Incident Commander directs discussion and follow-up activity in an “incident follow-up” Slack channel and instructs the team to schedule a postmortem meeting where the response team learns from the incident timeline. Follow-up tasks are created in order to prevent similar problems from happening in the future.
How an Incident Commander gets trained
Before going solo on an incident call, we follow a series of important steps:
- Join four sessions where we receive training, carry out role-play, and discuss prior incident calls.
- Buddy up with an experienced IC to receive support and mentorship.
- Practice being the IC at incident simulation sessions (we call them “Failure Fridays”).
- Join a “shadow” incident commander schedule for a month or so, joining calls to scribe or simply listen in, before stepping up to the incident commander role.
What’s it like when you first join a call as Incident Commander?
Being an IC is a challenging role that requires concentration and patience. When we arrive on an incident call, it takes time to determine what’s going on. People are worried and uncomfortable. Also, periods of silence are necessary since it takes time and concentration for SMEs to investigate, take action, and report back.
Learning to project calm during stressful situations is important. Keeping calm under pressure is a learned skill. People echo back and respond to the environment they are in, so it’s up to the IC to keep the response focused and calm enough for responders to be at their most effective.
In situations where we need to work fast with a team under emergency conditions, it’s also easy to make mistakes both in terms of communication and decision-making. Our process helps with this, but it’s still important to get candid feedback and address learnings in the follow-up session. Adopting a growth mindset and a blameless culture ensures everyone looks at ways to learn and improve as a team.
How to learn more and improve how your team responds to major incidents
My experience at PagerDuty has taught me just how valuable the role of the IC is to the business, our engineers, and, of course, to our customers. It’s also been an incredible learning experience for me personally, helping me build confidence and comfort with ambiguity.
If you’re interested in learning more and want to implement your own incident command system, we’ve fully documented and open sourced our training and process at https://response.pagerduty.com.
You can also join PagerDuty University (PDU) Training, which we’ll be running this year at PagerDuty Summit on September 23. I’ll be there with fellow Incident Commander Jon Grieman. PagerDuty Summit will also feature workshops and talks on many other aspects of incident response, as well as ways to automate workflows using PagerDuty. Register today to save your spot.
Do you have best practices or thoughts to share about Incident Command or incident response in general? Join us today in our community forum discussions!