Lessons in Distributed Communication From Incident Response
As reported cases of the novel coronavirus disease (COVID-19) continue to rise around the world, many companies are shifting to remote work to minimize exposure for their workforce. But even among companies that have been remote-friendly in the past, many are struggling to figure out how to make their operations entirely remote.
With so many companies trying to become distributed organizations overnight, we can learn many lessons by looking at incident response patterns.
The Shift to Remote Work
As companies have increasingly embraced remote work, those in IT and engineering positions have been at the forefront of this change.
Twenty years ago, it was the norm for engineering teams to be in the same physical location, have an on-premises server room running their production applications, and have a private intranet on which all work happened. IT and engineering teams were onsite because responding to a production incident meant the operations team wheeling a crash cart into the server room to investigate, while development teams and managers gathered in a conference room that was the designated incident “war room.” Major incidents might be so impactful that a manager would use their hip-mounted Nextel cell phone to radio an engineer who was out that day and ask them to VPN in so they could assist with troubleshooting.
In the last decade, the shift to cloud infrastructure and applications has meant that IT and engineering teams can access their production applications from anywhere in the world. Today’s norm is for these teams to operate in a distributed fashion. As a result, IT and engineering teams have been at the forefront of developing effective practices for working remotely.
In many organizations, on-site servers, intranets, and physical incident war rooms have largely been phased out in favor of more modern solutions. Examining how these solutions and workflows come together can help any organization struggling to figure out how to make a shift to distributed work.
Lessons From a Decade of Managing Real-Time Operations
PagerDuty has helped thousands of organizations manage their real-time operations for over a decade. Our lives have become increasingly connected to a digital-first experience, and that means the world is always on. Customers demand perfection, and organizations have mere seconds, not hours, to solve digital problems when they occur. Managing real-time operations effectively is about coordinating responses and communication between the right people, at the right time, when every second matters. That means ensuring that every team and team member, department, and leader is involved, informed, and aligned around actions that are happening in real time, regardless of where around the globe they happen to be.
PagerDuty is widely recognized as a leader in incident response, so an obvious place to start is with the lessons we can share about managing effective communication for remote teams. Our teams respond to incidents using not just our own platform but also several other remote productivity tools (we use Slack and Zoom) to manage real-time work effectively, regardless of where our teams are located.
When major incidents occur, our people use the PagerDuty platform to bring in the right subject-matter experts from across various teams, as needed, while working toward resolution. The physical “war room” has been replaced with a combination of a video conference bridge (that has a backup dial-in option, if needed), plus a dedicated chat room in which all critical communication is captured.
Several communication practices are key when working remotely:
- Informal communication channels should be replaced by formal communication channels
- Rather than relying on verbal explanations, you should favor writing down and recording knowledge
- Rather than restricting information on a need-to-know basis, you should favor sharing information internally
Instead of relying on an ad-hoc communication channel, our teams use a well-known and documented communication channel when incidents occur. Responders whose participation is requested during an incident should already know which communication channels to join. However, just in case they don’t, the PagerDuty platform sends notifications that contain embedded links they can use to join those channels with a single click.
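To make the idea of a notification with embedded join links concrete, here is a minimal sketch in Python. It is an illustration only, not how the PagerDuty platform builds its notifications: the `IncidentChannels` type, the `build_responder_notification` function, and the example URLs and phone number are all hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncidentChannels:
    """Well-known channels for incident response (hypothetical structure)."""
    video_bridge_url: str
    dial_in_number: str
    chat_channel_url: str


def build_responder_notification(incident_title: str, channels: IncidentChannels) -> str:
    """Build a plain-text message that tells a responder exactly where to go."""
    lines = [
        f"You have been requested on incident: {incident_title}",
        f"Join video bridge: {channels.video_bridge_url}",
        f"Backup dial-in: {channels.dial_in_number}",
        f"Join chat channel: {channels.chat_channel_url}",
    ]
    return "\n".join(lines)


# Placeholder values; a real setup would use the organization's documented channels.
channels = IncidentChannels(
    video_bridge_url="https://video.example.com/incident-bridge",
    dial_in_number="+1-555-0100",
    chat_channel_url="https://chat.example.com/channels/incident-room",
)
message = build_responder_notification("Checkout errors elevated", channels)
print(message)
```

The point of the sketch is that the channel details live in one documented place and are attached to every request for help, so a responder never has to ask where the conversation is happening.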
Managing incidents can be fast-paced and stressful work. A lot of the communication necessary to coordinate that work happens verbally on the video bridge. To ensure that knowledge is also written down and recorded, every incident call has an assigned Scribe whose job is to create a timeline of key events during an incident by documenting important facts and actions taken, and tracking follow-up items to be addressed. Our video conferencing solution allows us to create automatic transcriptions of the call. However, the notes created by the Scribe are more useful as a quick reference for anyone who wants to get up to speed on events that occurred.
The Scribe documents the timeline in the dedicated chat channel. By doing so, other responders can quickly refer to the timeline to catch up on anything they’ve missed when they join the call (either as necessary responders or just observers). Observers are encouraged to join the dedicated chat channel or video call (in listen-only mode) if they would like to better understand the situation as it unfolds.
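The Scribe's timeline described above can be thought of as a small, ordered log of facts, actions, and follow-up items. The following Python sketch shows one way to model it. It is a hypothetical illustration rather than a PagerDuty feature or API: `EntryKind`, `TimelineEntry`, `ScribeTimeline`, and `format_entry` are names invented for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List, Optional


class EntryKind(Enum):
    FACT = "fact"            # something observed or confirmed
    ACTION = "action"        # something a responder did
    FOLLOW_UP = "follow-up"  # something to address after the incident


@dataclass(frozen=True)
class TimelineEntry:
    timestamp: datetime
    kind: EntryKind
    text: str


@dataclass
class ScribeTimeline:
    entries: List[TimelineEntry] = field(default_factory=list)

    def record(self, kind: EntryKind, text: str,
               timestamp: Optional[datetime] = None) -> TimelineEntry:
        """Add an entry, stamping it with the current UTC time if none is given."""
        entry = TimelineEntry(timestamp or datetime.now(timezone.utc), kind, text)
        self.entries.append(entry)
        return entry

    def catch_up(self, since: Optional[datetime] = None) -> List[TimelineEntry]:
        """Return entries in chronological order, optionally only those at or after `since`."""
        ordered = sorted(self.entries, key=lambda e: e.timestamp)
        return [e for e in ordered if since is None or e.timestamp >= since]

    def follow_ups(self) -> List[str]:
        """Return the text of every follow-up item, in the order recorded."""
        return [e.text for e in self.entries if e.kind is EntryKind.FOLLOW_UP]


def format_entry(entry: TimelineEntry) -> str:
    """Render an entry as a single chat-friendly line."""
    return f"[{entry.timestamp:%H:%M} UTC] {entry.kind.value.upper()}: {entry.text}"
```

A responder who joins late could call `catch_up()` to read everything so far, or pass the time they last checked to see only what is new. The `follow_ups()` list is the kind of material that later feeds a post-incident review.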
During an incident, our teams also send updates to both internal and external stakeholders to keep them apprised of current events. Internal stakeholders usually include executives, business owners, and customer-facing teams, while external stakeholders are usually customers. Those notifications are managed by the PagerDuty platform. But the decisions leading up to sending a notification, including coming to a shared agreement on what is communicated, are captured as part of the Scribe’s timeline that is also recorded in the dedicated chat channel.
This balance of verbal and recorded communication helps ensure that distributed teams both work quickly and communicate effectively to the broader organization. The added benefit of recording the Scribe’s timeline into a dedicated chat channel is that it can be automatically incorporated into a post-incident review using an existing PagerDuty integration.
After an incident, we use the Postmortems feature of PagerDuty to help us run a blameless postmortem, where we summarize the events leading up to resolution of the incident, identify contributing factors, and document agreed-upon action items that may help mitigate this type of incident in the future. Those postmortem reports are then shared internally so that any team can better understand the event, regardless of their physical location.
This is just one example of how we’re able to take a task that was formerly relegated to in-person war rooms and instead manage it among distributed teams in a highly effective manner.
Shifting to Remote Work to Minimize COVID-19 Exposure
As organizations shift to enabling more of their workforce to work from home, understanding how to quickly shift to effective remote communication practices will be critical to ensure minimal disruption of company operations. The world is always on, and our customers will continue to expect perfection from our digital world, which is our responsibility to deliver, especially as everyone works to minimize exposure to the novel coronavirus.
Managing the balance between verbal and written communication is just one of the many challenges organizations face in the early stages of mitigating this ongoing public health crisis. Using the PagerDuty platform in tandem with other remote productivity tools and well-defined practices can help organizations maintain effective communication between the right people at the right time as they shift to doing more remote work.