August 24, 2017
Last fall, when we launched User Reporting as a part of our Advanced Analytics suite, we talked about using team member stats to up your coaching game. Now, almost six months later, we’ve nearly doubled the computing power of the original report with three additional dimensions to measure your incident response at the individual user level. The original parameters helped gauge user workload and its effect on team efficiency; with the new ones, you can get a much more accurate read on workload and how it’s affecting the performance of your team as a whole.
Time on Call measures the total time that a responder is on call during a specific timeframe. Long on-call hours often overburden team members and deeply affect their quality of life. In fact, research shows that on-call engineers who feel overworked are much more likely to start looking to be on call elsewhere, so it’s crucial to keep your finger on the pulse of how many late nights your engineers are working. Armed with the updated User Report, managers now have visibility into how total on-call hours are distributed across the team.
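To make the definition concrete, here is a minimal sketch of how a Time on Call total could be computed by hand. It is not how the User Report calculates the figure internally; the `time_on_call` function, the `(user, start, end)` shift tuples, and the sample names and dates are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def time_on_call(shifts, window_start, window_end):
    """Sum each responder's on-call time that falls inside the reporting window.

    `shifts` is an assumed format: an iterable of (user, shift_start, shift_end).
    """
    totals = defaultdict(timedelta)
    for user, start, end in shifts:
        # Clip the shift to the reporting window so only the overlap counts.
        overlap_start = max(start, window_start)
        overlap_end = min(end, window_end)
        if overlap_end > overlap_start:
            totals[user] += overlap_end - overlap_start
    return dict(totals)


# Hypothetical schedule: (user, shift start, shift end)
shifts = [
    ("alice", datetime(2017, 3, 6, 9), datetime(2017, 3, 7, 9)),
    ("bob", datetime(2017, 3, 7, 9), datetime(2017, 3, 8, 9)),
    ("alice", datetime(2017, 3, 8, 9), datetime(2017, 3, 9, 9)),
]

# Report on the three days from March 6 up to (not including) March 9.
report = time_on_call(shifts, datetime(2017, 3, 6), datetime(2017, 3, 9))
for user, total in sorted(report.items()):
    print(user, total)
```

Clipping each shift to the window means a shift that runs past the end of the timeframe contributes only the hours that fall inside it, which keeps totals comparable across responders.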
Similar to Time on Call, the new Reassignments parameter helps measure individual workload when responders are too heads-down to handle incoming alerts. The Reassignments metric measures the number of times incidents are reassigned from a user to someone else. A manager may recognize that an on-call responder is too busy to address an issue quickly enough, for example, and need to pass the incident off so the original assignee can stay focused on their work. Other times, incidents are simply assigned to the wrong engineer, and a reassignment means respectfully delegating the issue to someone else who can fix it.
We’ve also added what is quickly becoming the gold standard of all on-call metrics: Mean Time-to-Acknowledge. Often shortened to just MTTA, it measures the average time between when an incident is first assigned to a user and when that user acknowledges the incident. The uniqueness of each incident makes comparing resolution times across incidents much like comparing apples to, say, door hinges; however, the time it takes to acknowledge an incident offers a clean view into how quickly a user springs into action. Low MTTAs usually reflect especially attentive on-call responders, who deserve recognition for their ongoing contributions to the business. High MTTAs may simply point to improperly set up notification rules, which is an easy fix to help drive down response times.
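As a rough illustration of the MTTA definition, the sketch below averages the gap between assignment and acknowledgment for each user. The `mean_time_to_acknowledge` function and the record format are hypothetical, and skipping incidents that were never acknowledged is an assumption of this sketch, not a statement about how the User Report treats them.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def mean_time_to_acknowledge(records):
    """Return each user's average time from assignment to acknowledgment.

    `records` is an assumed format: (user, assigned_at, acknowledged_at),
    where acknowledged_at is None if the user never acknowledged.
    """
    waits = defaultdict(list)
    for user, assigned_at, acknowledged_at in records:
        if acknowledged_at is None:
            # Assumption: unacknowledged incidents are left out of the average.
            continue
        waits[user].append(acknowledged_at - assigned_at)
    return {
        user: sum(durations, timedelta()) / len(durations)
        for user, durations in waits.items()
    }


# Hypothetical incident log: (user, assigned at, acknowledged at)
records = [
    ("alice", datetime(2017, 3, 6, 2, 0), datetime(2017, 3, 6, 2, 2)),
    ("alice", datetime(2017, 3, 6, 14, 0), datetime(2017, 3, 6, 14, 4)),
    ("bob", datetime(2017, 3, 7, 3, 0), datetime(2017, 3, 7, 3, 10)),
    ("bob", datetime(2017, 3, 7, 5, 0), None),
]

mtta = mean_time_to_acknowledge(records)
for user, average in sorted(mtta.items()):
    print(user, average)
```

Because the clock starts when the incident is assigned to that user, a reassigned incident would count against the new assignee only from the moment of reassignment.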
As with the other parameters in the User Report, higher MTTAs also point back to workload. Even the best on-call engineers can be slow on the draw when they’re busy working on several issues in flight. By identifying the bottlenecks in incident response and spreading workload evenly across the team, you can get the most out of your team as a whole.
User Reporting is a part of PagerDuty’s Advanced Analytics suite, available on the Standard and Enterprise plans.
Read more here about how to get started.