PagerDuty Blog

Optimizing Incident Management for Hybrid Infrastructure

It’s 2016, and your infrastructure is probably hybrid. That means your incident management solutions need to be ready for hybrid environments, too. If you had only on-premises servers to manage, with no virtual networks or microservices in the mix, incident management would be much simpler.

But then you’d be living in the past, and Windows Server 2003 would be the newest big thing in IT Ops.

Today, almost all infrastructure is hybrid in one way or another. On-premises servers and devices blend seamlessly with public and/or private cloud services. Networking is abstracted from the physical layer. Storage is scaled out and distributed across many servers, sometimes even between data centers.

So what’s an admin to do? The short answer is to adopt a hybrid-ready incident management solution. The long answer is spelled out below, where I offer tips for optimizing incident management for today’s hybrid infrastructures.

Incident Management Challenges With Hybrid Infrastructure

Let’s start by outlining the special challenges hybrid infrastructure poses for incident management.

  • Your incident management team doesn’t always have physical access to all of the infrastructure. If your infrastructure spans multiple data centers and/or includes clouds, the admin who receives an alert may not be in the same location as the device that triggered it.
  • You don’t have full control over all of the infrastructure. Public or private clouds could be hosted on someone else’s servers (over which you have only limited control).
  • Physical devices are abstracted from the infrastructure. As a result, it becomes harder to tell whether alerts are caused by software problems, hardware problems or both. For example, the source of alerts about file system problems on a virtual server could be failing disk hardware on the host, software file system errors on the guest or a combination.
  • Your infrastructure is not fixed in size. It’s changing constantly as devices are added or removed, storage expands, containers spin up and down, and so on.

Solving the Incident Management Challenge With Hybrid Infrastructure

Now that we’ve outlined the challenges, here are a few suggestions to consider when planning a hybrid infrastructure incident management strategy:

  • Adopt an incident management platform (like PagerDuty) that is intelligent enough to route alerts according to the source of the problem. That way, an alert generated in one data center reaches the admins who have control over that data center, instead of a team in a different location.
  • Deploy an incident management platform that offers flexible monitoring and alerting configurations and integrates easily with your existing environment. By this, I mean that you should be able to use different tools in different parts of your infrastructure, depending on what works best for each part. On your public cloud servers, you might use AWS CloudWatch, while Nagios handles your on-premises servers and Snort or OSSEC monitors network events. PagerDuty, for example, has 150+ integrations out of the box that connect to your existing hybrid infrastructure.
  • Send all alerts to a central hub. If you have multiple monitoring platforms, you want to make sure alerts can be viewed together in one place. Otherwise they’ll become difficult to manage, and deriving a link between potentially related problems becomes nearly impossible. A platform like PagerDuty solves this by providing a centralized hub for receiving and normalizing all of the different alerts from across your hybrid infrastructure.
  • Make sure your incident management solution scales. Since the size of your infrastructure is not constant, you want a platform that can receive and store a changing volume of alerts.
  • Be vendor-agnostic. Incident management solutions that only support certain operating systems or service providers won’t work for hybrid infrastructures. Hybrid environments are usually composed of diverse hardware and software components, and are managed by admins who want to be able to swap parts quickly. A solution like PagerDuty comes in handy for this purpose, too, since it can integrate with vendor-specific monitoring software, then translate those alerts for use through a flexible, integrated incident management interface.
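
To make the central hub and source-based routing ideas above a little more concrete, here is a minimal Python sketch. It is not PagerDuty’s API or implementation. The payload fields, location names and team names are all hypothetical, chosen only to illustrate how alerts from different tools might be normalized into one shape and then routed by where they came from.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Common shape that every incoming alert is normalized into."""
    source: str     # monitoring tool that raised the alert
    location: str   # data center or cloud region the alert came from
    host: str
    summary: str
    severity: str   # "critical", "warning" or "info"


# The payload layouts below are assumptions for illustration,
# not the real output formats of Nagios or CloudWatch.
def from_nagios(payload: dict) -> Alert:
    severity = {"CRITICAL": "critical", "WARNING": "warning"}.get(
        payload["state"], "info")
    return Alert("nagios", payload["datacenter"], payload["hostname"],
                 payload["output"], severity)


def from_cloudwatch(payload: dict) -> Alert:
    severity = "critical" if payload["state"] == "ALARM" else "info"
    return Alert("cloudwatch", payload["region"], payload["resource"],
                 payload["alarm_name"], severity)


# One normalizer per monitoring tool; adding a tool means adding an entry.
NORMALIZERS = {"nagios": from_nagios, "cloudwatch": from_cloudwatch}

# Hypothetical mapping from alert location to the team that controls it.
ROUTES = {"dc-east": "east-ops", "us-west-2": "cloud-ops"}
DEFAULT_TEAM = "central-ops"


def normalize(tool: str, payload: dict) -> Alert:
    """Convert a tool-specific payload into the common Alert shape."""
    return NORMALIZERS[tool](payload)


def route(alert: Alert) -> str:
    """Pick the team responsible for the location the alert came from."""
    return ROUTES.get(alert.location, DEFAULT_TEAM)


if __name__ == "__main__":
    incoming = [
        ("nagios", {"state": "CRITICAL", "datacenter": "dc-east",
                    "hostname": "db01", "output": "Disk latency high"}),
        ("cloudwatch", {"state": "ALARM", "region": "us-west-2",
                        "resource": "web-asg", "alarm_name": "CPU above 90%"}),
    ]
    for tool, payload in incoming:
        alert = normalize(tool, payload)
        print(route(alert), alert.severity, alert.summary)
```

Keeping the per-tool normalizers separate from the routing table is what keeps a setup like this vendor-agnostic: swapping one monitoring tool for another changes a single normalizer, while the routing rules and everything downstream stay the same.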

Some of the challenges I’ve outlined might not seem that important for your organization right now. Fair enough: some environments are not yet as hybrid as others.

But the clear trend is toward hybrid infrastructure. The sooner you adapt your incident management solutions to prepare for this future, the better positioned you’ll be to migrate fully to hybrid environments without impacting your ability to monitor your infrastructure.