PagerDuty Blog

Virtualizing a Network Operations Center

A Network Operations Center (NOC) is a location from which IT support technicians can supervise, monitor, and maintain client networks and infrastructure. Because they act as a central nervous system for many organizations, NOCs are typically located in a central physical location.

The global coronavirus (COVID-19) pandemic is an unprecedented situation that is creating new challenges for everyone—and that includes NOCs. As the pandemic forces more companies to adopt a distributed remote-work model, regardless of their readiness, PagerDuty can help NOCs make a smoother transition that helps keep business continuity on track.

The Network Operations Center

Today’s NOCs are highly sophisticated and complex operations centers. Network operations analysts and engineers provide 24x7x365 supervision, monitoring, and management of (among other things) a company’s networks, servers, databases, firewalls, devices, and external digital services. That infrastructure environment typically includes a wide variety of heterogeneous on-premises and cloud-based systems.

Supervising, monitoring, and managing these systems encompasses the tasks you would expect, such as performance optimization, troubleshooting, patch management, firewall management, and incident response. But growth in the variety, location, and scale of those managed systems has also pushed NOCs to keep finding more innovative ways to manage their workloads, such as automating as much as possible or developing site-to-site disaster recovery plans. NOCs are typically organized as centralized teams so that all of the distributed systems they manage can route through one manageable point of communication and coordination. Similarly, their disaster recovery plans have typically involved relocating those centralized operations from one physical location to another centralized physical location.

However, the COVID-19 pandemic has shut down many physical offices, and those closures are happening across the world at once. As a result, NOCs can no longer rely on relocating operations to another physical location. Instead, with the shift to remote work, NOCs now face the challenge of learning how to manage distributed systems from distributed locations.

Distributed Work and the Challenge for NOCs

Centralized network operations have operated on the principle that distributed systems can be more easily monitored by having one clear egress and ingress point for all system operations. It’s not uncommon for that central management point to have large monitors displaying several dashboards so that anyone can make quick visual correlations across many systems when something goes wrong. Teammates are able to easily and quickly communicate because they’re sitting next to one another. Workspaces are often set up in highly customized ways that optimize for speed when troubleshooting. Major incidents that require an all-hands-on-deck approach can be readily coordinated by shuffling the right folks into a dedicated conference room. There are many advantages to having everyone in the same physical space.

But that’s no longer normal.

With today’s new normal, there are a few key challenges that NOCs must learn to address in order to ensure they’re able to continue to keep their businesses running smoothly.

System visibility. NOC engineers and analysts need to access a variety of systems in different places. Some may be in a datacenter with restricted network access that requires using a jumpbox to reach them. Others may be on a private VLAN that isn’t accessible from the outside world. And yet others may be in the cloud or distributed across multiple cloud providers. Getting to these systems quickly, no matter where they are, is critical when troubleshooting issues. Troubleshooting complex issues typically requires correlating behavior across multiple systems that may be located in different places. NOCs need to have system-wide visibility so they can quickly understand where and how failures might be occurring.

System access. Because those systems may have different access requirements, each with unique constraints, it’s not uncommon for NOC engineers to have multiple desktop systems, VMs, or dedicated machines to access different parts of the distributed systems they manage. Their office workstations may be optimized for that, but when working from home, that same access may not exist. It might be possible to replicate access somehow (e.g., via Remote Desktop sessions from home to their workstations), but even in best-case scenarios, that access is kludgy, slow, and potentially unreliable. Teams that need to move quickly must minimize their dependence on such workarounds. NOC teams need the ability to quickly understand system state across their distributed landscape without struggling with system access.

Mobilizing response teams. When complex issues arise, it’s essential for a NOC to reach the right people at the right time to mobilize a response. While forming a quick huddle or asking a question across cubicles is possible when you’re in the same office, it can be a little more challenging when working remotely. Remote productivity tools for chat and video conferencing make distributed communication faster and easier. But NOC teams also need the ability to proactively raise alerts that page the necessary responders to join those shared communication channels when every second matters.
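To make the idea of proactively raising an alert more concrete, here is a minimal Python sketch that builds a "trigger" event in the shape accepted by the PagerDuty Events API v2. Treat it as an illustration under stated assumptions rather than a recommended integration: the helper names `build_trigger_event` and `send_event` are hypothetical, and the routing key, summary, and source values in the usage note are placeholders you would replace with values from your own service integration.

```python
import json
import urllib.request
from datetime import datetime, timezone

# Public endpoint for the PagerDuty Events API v2.
EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"

# Severity values accepted by the Events API v2 payload.
ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key, summary, source,
                        severity="critical", dedup_key=None):
    """Build the JSON body for a 'trigger' event as a plain dict.

    Hypothetical helper: it only assembles the event, it does not send it.
    """
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"severity must be one of {sorted(ALLOWED_SEVERITIES)}")
    event = {
        "routing_key": routing_key,   # integration key for the target service
        "event_action": "trigger",    # open (or re-trigger) an alert
        "payload": {
            "summary": summary,       # short human-readable description
            "source": source,         # the system reporting the problem
            "severity": severity,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        },
    }
    if dedup_key is not None:
        # Events sharing a dedup_key are grouped into the same alert.
        event["dedup_key"] = dedup_key
    return event


def send_event(event, timeout=10):
    """POST the event to the Events API and return the HTTP status code."""
    request = urllib.request.Request(
        EVENTS_API_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A remote NOC engineer's script might call `build_trigger_event("YOUR_ROUTING_KEY", "Core switch unreachable", "noc-monitor-01", dedup_key="core-switch-01")` and pass the result to `send_event`. Keeping the builder separate from the network call makes the event easy to inspect or test before anyone is actually paged.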

Managing Distributed Operations in Real Time

When running a NOC that manages distributed systems from distributed remote locations, a key challenge for teams is collaborating quickly in real time. Some of the best ways to help NOCs continue to move quickly in that remote-working environment are to provide system-wide visibility into all of the various systems they manage, remove the need to access each system individually to gather and correlate troubleshooting data, and enable them to mobilize the right teams at the right times.
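As a rough illustration of what gathering and correlating troubleshooting data in one place can look like, the Python sketch below groups signals from different systems when they arrive close together in time. It is a deliberately simple heuristic written for this example, not a description of how PagerDuty or any particular product correlates events. The `Signal` type, the `correlate` function, and the five-minute window are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Signal:
    """One event reported by a monitored system (hypothetical shape)."""
    system: str
    message: str
    at: datetime


def correlate(signals, window=timedelta(minutes=5)):
    """Group signals into bursts by time proximity.

    Signals are sorted by timestamp. A signal joins the current group if it
    arrived within `window` of the previous signal in that group; otherwise
    it starts a new group.
    """
    groups = []
    for signal in sorted(signals, key=lambda s: s.at):
        if groups and signal.at - groups[-1][-1].at <= window:
            groups[-1].append(signal)
        else:
            groups.append([signal])
    return groups


def systems_involved(group):
    """Return the distinct systems in a group, in first-seen order."""
    seen = []
    for signal in group:
        if signal.system not in seen:
            seen.append(signal.system)
    return seen
```

With signals from, say, a firewall, a database, and a cloud load balancer fed into `correlate`, each resulting group is a candidate incident, and `systems_involved` lists where to look first. Real correlation usually also considers topology and alert content, so time proximity alone should be read as a starting point.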

PagerDuty is known as the platform for driving real-time work. Our customers leverage the core PagerDuty platform for coordinating incident response processes when operational issues occur. Many customers also use our visibility features to act as a central nervous system through which digital signals from their various distributed systems can be routed, correlated, and surfaced automatically. Having one central egress and ingress point gives remote NOC teams a way to see into their distributed systems that helps mitigate the challenges they face when shifting to working from home.

You can see these solutions in action by contacting us to learn how you can virtualize your NOC with PagerDuty.

You can also find an assortment of other resources to help your teams adapt to the COVID-19 pandemic.