PagerDuty Blog

NASA's Juno Mission and IT Operations

I’ve always wanted to be a starship pilot traveling the stars. While there is only a slim chance of interstellar travel happening in my lifetime, we are entering a fascinating era. We’re doing incredible things like landing on comets, testing ion engines, and even exploring EM drives. What’s especially exciting right now is that humanity is placing a probe in one of the most intense environments in the solar system: orbit around Jupiter.

Harsh Environments

The Juno spacecraft has to deal with an incredibly harsh environment. The biggest challenge is the intense radiation, 20,000 times greater than Earth’s. Juno will not survive it in the long run; it only has to withstand it for a brief time. “Once these electrons hit a spacecraft, they immediately begin to ricochet and release energy, creating secondary photons and particles, which then ricochet,” Heidi Becker, leader of Juno’s radiation-monitoring team, said during a news conference last month. “It’s like a spray of radiation bullets.”

Why am I bringing up the Jupiter mission in the context of IT Operations? How does all this relate to the human problems of operating an ITOps environment? The answer is simple: both are harsh environments that require planning, well-defined processes, and appropriate tooling in order to endure and thrive. The IT Operations version of a spray of radiation bullets is the sometimes overwhelming flood of actionable and non-actionable alerts flowing in from your various management systems.

In the past, we called these non-actionable alerts “noise,” but we’re moving away from that nomenclature as we discover golden nuggets of leading- and trailing-edge indicators in the sea of IT Operations alert data.

Alert Suppression

When my former company, Event Enrichment HQ, was acquired by PagerDuty late last year, the expectation was that we would augment PagerDuty’s excellent existing array of incident response capabilities with enhancements focused on event management. We began by creating the PagerDuty common event format (PD-CEF), with which we normalize and structure alerts from your management systems. That set the stage for building new and powerful tools to help you accelerate incident response. Building on that foundation of normalized event data, our new event rules engine allows you to classify groups of alerts and act on them, starting with event and alert suppression. Suppression matters because our philosophy for dealing with the enormous load of alerts generated by today’s infrastructure is not to drop them but to suppress them.
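To make the idea concrete, here is a minimal sketch of normalizing alerts into a common structure and then suppressing, rather than dropping, the ones a rule matches. Everything in it is an assumption made for illustration: the Event fields, the raw payload keys, and the SuppressionRule and apply_rules names are hypothetical, and none of it reflects the actual PD-CEF schema or the PagerDuty rules engine.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Event:
    """A normalized alert. Field names are illustrative, not the PD-CEF schema."""
    source: str
    summary: str
    severity: str  # e.g. "info", "warning", "critical"
    suppressed: bool = False


def normalize(raw: dict) -> Event:
    """Map a tool-specific payload (hypothetical keys) onto the common structure."""
    return Event(
        source=raw.get("host", "unknown"),
        summary=raw.get("message", ""),
        severity=str(raw.get("level", "info")).lower(),
    )


@dataclass
class SuppressionRule:
    """A named predicate that decides whether an event should be suppressed."""
    name: str
    matches: Callable[[Event], bool]


def apply_rules(events, rules):
    """Return (to_notify, retained). Suppressed events are flagged, never dropped."""
    to_notify = []
    for event in events:
        event.suppressed = any(rule.matches(event) for rule in rules)
        if not event.suppressed:
            to_notify.append(event)
    return to_notify, list(events)


# Made-up sample alerts from an imaginary monitoring tool.
raw_alerts = [
    {"host": "db-1", "message": "disk latency elevated", "level": "INFO"},
    {"host": "db-1", "message": "replication stopped", "level": "CRITICAL"},
]
rules = [SuppressionRule("low-severity", lambda e: e.severity == "info")]

events = [normalize(r) for r in raw_alerts]
to_notify, retained = apply_rules(events, rules)
```

In this sketch only the critical alert would reach a responder, while both alerts stay in the retained list, so the suppressed one is still available for later analysis.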

Why suppress alerts, you ask? Our research has shown that many of those so-called “noise” alerts are leading-edge indicators of much more severe issues. By sending more events, rather than fewer, to PagerDuty, you will gain a much deeper understanding of the event flows and alert clusters in your IT infrastructure using our new IT Operations visualization tools.
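As a rough illustration of why retained alerts are useful, the sketch below looks back from an incident and pulls out the suppressed alerts that fired shortly beforehand. The leading_indicators function, the 30-minute window, and the sample data are all invented for this example; it is not how PagerDuty's visualization tools work.

```python
from datetime import datetime, timedelta


def leading_indicators(suppressed, incident_time, window=timedelta(minutes=30)):
    """Return suppressed alerts that fired within `window` before the incident."""
    start = incident_time - window
    return [a for a in suppressed if start <= a["time"] < incident_time]


# Made-up suppressed alerts and a made-up incident time.
suppressed = [
    {"time": datetime(2016, 9, 1, 9, 50), "summary": "disk latency elevated"},
    {"time": datetime(2016, 9, 1, 8, 0), "summary": "cert expires in 30 days"},
]
incident = datetime(2016, 9, 1, 10, 0)

hits = leading_indicators(suppressed, incident)
```

Had the low-severity alerts been dropped at the source, there would be nothing to look back through.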

The Future

As you will see at PagerDuty Summit, these enhancements to PagerDuty’s core offering go far beyond what you have seen from us thus far. We are intensely focused on providing you with tooling that gives you a deeper understanding of, and specific context for, the issues and incidents that impact your company.

Now, a year in after the acquisition, I’m excited to report that PagerDuty has taken an evolutionary leap into the future. We have always embraced, and will continue to embrace, lean and agile methodology, as per Tim’s earlier post; we’re focused on learning and empathy, as described by Jonny; and we’re creating a profound fusion of event management (data) and incident management (people) capabilities. These are heady times here at PagerDuty.

We’re now T-1 week from PagerDuty Summit, where we’ll kick off this wild ride and introduce you to all of these new capabilities. If you join us at The Village on Sept 13th, you will get to experience it firsthand. I’m looking forward to seeing you there!

 

Referenced articles: