This is a guest post by Ilan Rabinovitch, Director of Product Management at Datadog. The convergence of rapid feature development, automation, continuous delivery, and the shifting...
by Ilan Rabinovitch
August 24, 2017
Monitoring systems can help you better manage your uptime, but even though you may spend a lot of time configuring checks and thresholds to identify problems early, your alerts are only as good as your incident response processes. One of the biggest challenges we’ve seen when talking with customers is getting bogged down in email alerts. Despite the increasing disarray of our inboxes, many monitoring systems and IT Operations teams still rely on email for alerting, even though most agree that email is messy and that alerts sent this way are too easy to miss. Looking to improve email alerts? Look again. Here are 5 reasons why you should ditch email alerts if you’re still using them:
“Hey did you see this latest cat video my friend emailed to me?”
Even if you’re staring at your email inbox constantly, it’s not hard to imagine a critical alert getting buried by other alerts or work-related emails. For this reason, top Operations teams typically use at least two notification channels where one is a phone call or SMS message. Having an audible sound with the alert definitely helps it get noticed.
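As a rough illustration of that rule of thumb, the sketch below checks a notification setup for at least two distinct channels, one of which interrupts you (a phone call or SMS). The channel names and the `validate_channels` helper are assumptions made up for this example, not part of any particular product.

```python
# Hypothetical channel names; "phone" and "sms" stand in for channels
# that make noise and are hard to overlook.
INTERRUPTING = {"phone", "sms"}


def validate_channels(channels):
    """Return a list of problems with a notification setup (empty if it is fine)."""
    unique = set(channels)
    problems = []
    if len(unique) < 2:
        problems.append("use at least two distinct channels")
    if not unique & INTERRUPTING:
        problems.append("include a phone call or SMS")
    return problems
```

An email-only setup fails both checks, while email plus SMS passes.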
“Um, is someone on this?”
Time is critical during a severe incident and you don’t want your team wondering about who’s on point for addressing it. If your alerts are getting emailed to multiple people, there’s no way to know for sure who on the team should respond first. Has someone else already seen the email and are they already working on it? Am I really the best person to respond, or should I wait for someone with more experience to take it? Top Operations teams with a strong culture of response make sure each incident is automatically assigned to the person responsible for fixing it. Incident management tools and ticketing systems can enforce this workflow by automatically assigning an incident to the engineer on-call and by tracking assignee status for each open incident.
In PagerDuty, we use your on-call schedules to determine who’s on point right now, and assign the incident accordingly.
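To make the idea concrete, here is a minimal sketch of schedule-based assignment. It is not PagerDuty’s implementation or API; the `Shift` type, the `on_call_at` and `assign_incident` helpers, and the incident fields are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Sequence


@dataclass(frozen=True)
class Shift:
    engineer: str
    start: datetime  # inclusive
    end: datetime    # exclusive


def on_call_at(schedule: Sequence[Shift], when: datetime) -> Optional[str]:
    """Return the engineer whose shift covers `when`, or None if there is a gap."""
    for shift in schedule:
        if shift.start <= when < shift.end:
            return shift.engineer
    return None


def assign_incident(incident: dict, schedule: Sequence[Shift], when: datetime) -> dict:
    """Return a copy of the incident with exactly one explicit assignee."""
    assignee = on_call_at(schedule, when)
    if assignee is None:
        # A gap in the schedule is itself a problem worth surfacing loudly.
        raise LookupError("no one is on call at this time")
    return {**incident, "assignee": assignee, "status": "triggered"}
```

The point of the design is that every incident leaves this function with one named owner, so nobody has to guess whether someone else has already picked it up.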
“Will it ever stop?”
Alert storms suck. When stuff really goes wrong, all of your monitoring systems will be sending alerts, multiple times per minute. Those alerts can quickly flood your inbox, making it virtually unusable. PagerDuty aggregates alerts for a single incident and bundles alerts for multiple incidents (after the first notification for each), so repeated alerts notify you only once. Dashboards are helpful here too, so you can get a quick picture of how many incidents are open and where they’re coming from.
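The general technique behind this kind of aggregation can be sketched in a few lines. This is an illustration under assumed field names (`service`, `check`, `ts`) and an assumed grouping key, not a description of how PagerDuty groups alerts internally.

```python
def group_alerts(alerts):
    """Collapse raw alerts into incidents keyed by (service, check).

    Returns (incidents, notifications). Only the first alert for a key
    produces a notification; repeats just update the open incident.
    """
    incidents = {}
    notifications = []
    for alert in alerts:
        key = (alert["service"], alert["check"])
        if key not in incidents:
            incidents[key] = {
                "count": 1,
                "first_seen": alert["ts"],
                "last_seen": alert["ts"],
            }
            notifications.append(key)  # notify once per incident
        else:
            incidents[key]["count"] += 1
            incidents[key]["last_seen"] = alert["ts"]
    return incidents, notifications
```

With this approach, a storm of hundreds of repeated alerts for the same failing check becomes one incident with a counter, and the on-call engineer hears about it once.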
“What’s the latest status?”
It’s hard to tell from email who’s working on an incident, how long it has been open, and what the latest status is. This information is useful not only to your team, but also to your management and other business stakeholders. It’s annoying to be pinged constantly by people wanting an update on the issue when you’re trying to fix it. By moving your incidents into a system like PagerDuty, you can get all of this information in a single dashboard view that’s accessible to management as well as everyone on your team. We can’t promise that the CEO and CTO won’t still ask, but at least there’s a place you can direct them to where they can get the information for themselves.
“How are we doing?”
Top Operations teams track metrics to continually measure, evaluate, and improve their performance. We’ve blogged before about what metrics you should track, and all of them would be incredibly difficult to measure from emails. Tracking when an incident is opened, how long it takes for the first person to notice and respond, and ultimately how long it takes your team to resolve it is critical for proactively managing your uptime. With this data, you can create dashboards on team performance and weekly reports to facilitate conversations within your team and company.
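Once incidents carry timestamps, these numbers are simple to compute. The sketch below derives mean time to acknowledge (MTTA) and mean time to resolve (MTTR) from a list of incident records; the field names `opened_at`, `acknowledged_at`, and `resolved_at` are assumptions for this example, and it expects a non-empty list of resolved incidents.

```python
from datetime import timedelta
from statistics import mean


def response_metrics(incidents):
    """Compute MTTA and MTTR for a non-empty list of resolved incidents."""
    # Seconds from open to first acknowledgement, per incident.
    ack = [(i["acknowledged_at"] - i["opened_at"]).total_seconds() for i in incidents]
    # Seconds from open to resolution, per incident.
    res = [(i["resolved_at"] - i["opened_at"]).total_seconds() for i in incidents]
    return {
        "mtta": timedelta(seconds=mean(ack)),
        "mttr": timedelta(seconds=mean(res)),
    }
```

Neither timestamp is reliably recoverable from an inbox, which is why this kind of reporting usually depends on an incident tracking system rather than email.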
Want to learn more about incident resolution best practices and how IT stacks up today? Email alerts may be only one challenge you’re facing, but you’re not alone. Learn more about the key facets of an intelligent incident resolution strategy and common challenges in a commissioned study conducted by Forrester Consulting on behalf of PagerDuty. Download the study to read more.