This is a guest post by Ilan Rabinovitch, Director of Product Management at Datadog. The convergence of rapid feature development, automation, continuous delivery, and the shifting...by Ilan Rabinovitch
August 24, 2017
Suppression. According to the thesaurus, this word is synonymous with terms like deletion, elimination, and annihilation.
Yet within the context of incident management, suppression means something quite different. It’s not about getting rid of data forever. It serves instead as a way of making sure that admins focus on the right alerts at the right time by mitigating noise.
Here’s a look at how suppression significantly helps streamline incident management.
Why is suppression useful in incident management? Simply put, it’s because modern infrastructure generates a huge volume of alerts, and admins can’t reasonably be expected to review every one of them. If they try, they will soon succumb to alert fatigue: overwhelmed and burned out, they begin ignoring potentially important alerts. And if they stop paying attention to alerts, then the entire incident management process breaks down.
Alert suppression is a way of avoiding this issue. By suppressing alerts of certain types, admins can ensure that actionable, high-priority alerts receive the greatest attention. They can also reduce the overall number of alerts that appear on their dashboards, which lowers the risk of alert fatigue.
As an example, consider an organization whose workstations reboot once a week overnight after updates are installed. The reboot would generate a series of alerts as workstations go offline and come back up. Adding these to the incidents dashboard that admins see wouldn’t be helpful, because the alerts in this case reflect a routine procedural event that does not require action. To keep this noise off their dashboards, admins can configure their incident management software to suppress alerts related to a workstation rebooting.
An important point to understand about alert suppression is that suppressing alerts is not an either/or proposition. In other words, admins’ options are not limited simply to enabling all alerts of a certain type or permanently suppressing all of them.
They can instead take a more nuanced approach to suppression. Alert suppression could be configured in such a way that alerts of a given type are suppressed unless they occur repeatedly within a certain period of time, for example. Alerts could also be configured so that they are reported if they occur during a certain time of day, but are suppressed during other times. Similarly, admins might want to suppress alerts of a particular type if they occur on a certain kind of device, but not others.
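To make the idea of conditional suppression concrete, here is a minimal Python sketch of two of the conditions mentioned above: device kind and time of day. It does not reflect the configuration format of any particular incident management product; the `Alert` and `SuppressionRule` types, their field names, and the 22:00 to 06:00 maintenance window are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import List, Optional


@dataclass(frozen=True)
class Alert:
    # Hypothetical alert shape; real tools carry many more fields.
    alert_type: str      # e.g. "host_reboot"
    device_kind: str     # e.g. "workstation" or "server"
    timestamp: datetime


def in_window(t: time, start: time, end: time) -> bool:
    """True if t falls in [start, end), allowing windows that cross midnight."""
    if start <= end:
        return start <= t < end
    return t >= start or t < end


@dataclass(frozen=True)
class SuppressionRule:
    alert_type: str
    device_kind: Optional[str] = None     # None means "any device"
    window_start: Optional[time] = None   # None means "any time of day"
    window_end: Optional[time] = None

    def matches(self, alert: Alert) -> bool:
        if alert.alert_type != self.alert_type:
            return False
        if self.device_kind is not None and alert.device_kind != self.device_kind:
            return False
        if self.window_start is not None and self.window_end is not None:
            return in_window(alert.timestamp.time(), self.window_start, self.window_end)
        return True


def is_suppressed(alert: Alert, rules: List[SuppressionRule]) -> bool:
    """An alert is suppressed if any rule matches it."""
    return any(rule.matches(alert) for rule in rules)


# Assumed policy: suppress workstation reboots during an overnight window.
RULES = [
    SuppressionRule(
        alert_type="host_reboot",
        device_kind="workstation",
        window_start=time(22, 0),
        window_end=time(6, 0),
    )
]
```

With these rules, a workstation reboot at 02:30 is suppressed, while the same alert at 14:00, or a server reboot at any hour, still reaches the dashboard.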
This flexibility is important because it ensures that admins can maximize the effectiveness of alerts. Instead of applying broad, blunt suppression policies, they can tweak suppression settings in order to maximize the visibility of important events without adding unnecessary noise to the incident management system.
Nuanced suppression could be helpful in the example above. As I noted, admins generally don’t want to receive alerts when a workstation reboots in the middle of the night following a software update. But if the incident management software detects a workstation that reboots multiple times during the same period, that could signal a problem (like a flawed software update) that admins will want to know about. In this situation, configuring suppression so that only recurring reboots generate incidents in the central dashboard would make incident management more effective.
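The recurring-reboot case can be sketched as a sliding-window counter per host. The Python below is an illustration only: the `RebootEscalator` class, the threshold of two reboots, and the six-hour window are assumptions rather than settings from any specific tool, and the sketch assumes events for a given host arrive in chronological order.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


class RebootEscalator:
    """Suppress isolated reboots; escalate once a host reboots repeatedly."""

    def __init__(self, threshold: int = 2, window: timedelta = timedelta(hours=6)):
        self.threshold = threshold          # reboots needed to escalate (assumed)
        self.window = window                # how far back to look (assumed)
        self._seen = defaultdict(deque)     # host -> recent reboot timestamps

    def should_escalate(self, host: str, when: datetime) -> bool:
        recent = self._seen[host]
        recent.append(when)
        # Drop reboots that are older than the look-back window.
        while recent and when - recent[0] > self.window:
            recent.popleft()
        return len(recent) >= self.threshold
```

A single overnight reboot of `ws-01` returns `False` and stays suppressed. A second reboot of the same host within six hours returns `True`, which is the point at which an incident would be raised.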
It’s also worth emphasizing that suppression in the context of incident management does not mean that suppressed alerts disappear forever. On the contrary, suppressed alerts still happen, and data related to them should be saved. The only difference between a suppressed alert and a non-suppressed one is that the former is not sent to priority dashboards in the incident management system.
This is important to understand because it means that admins retain the ability to look up suppressed alerts to gain insight into an incident if they need to. This also helps them better tune their alerting thresholds. In addition, suppressed alerts still figure into historical incident management data, which can be used to reveal trends in infrastructure efficiency and health.
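The distinction between recording an alert and surfacing it can be sketched in a few lines of Python. Everything here is hypothetical: `AlertRouter`, its in-memory `history` and `dashboard` lists, and the dictionary-shaped alerts stand in for whatever storage and routing a real incident management system provides.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class AlertRouter:
    # Predicate deciding whether an alert is suppressed (supplied by the caller).
    should_suppress: Callable[[Dict], bool]
    history: List[Dict] = field(default_factory=list)    # every alert, always kept
    dashboard: List[Dict] = field(default_factory=list)  # only non-suppressed alerts

    def handle(self, alert: Dict) -> None:
        suppressed = self.should_suppress(alert)
        # Record the alert regardless, tagging whether it was suppressed.
        self.history.append({**alert, "suppressed": suppressed})
        if not suppressed:
            self.dashboard.append(alert)

    def lookup_suppressed(self, host: str) -> List[Dict]:
        """Retrieve suppressed alerts for a host during an investigation."""
        return [a for a in self.history
                if a["suppressed"] and a.get("host") == host]
```

Because suppression only affects what reaches `dashboard`, the full `history` remains available for later lookups and for trend analysis.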
With suppression, then, you get to have your alerts and eat them, too—or something like that.
Suppressed alerts can be leveraged in any way admins need to help identify and respond to incidents, but they don’t clutter dashboards with non-actionable information that gets in the way of resolving higher-priority incidents. Moreover, suppression can be tuned so that alerts are suppressed only under exactly the right circumstances, while still being recorded so you retain full visibility into your infrastructure.