PagerDuty’s July Hack Day produced another batch of amazing projects from our staff. One project in particular has a lot of potential to give our customers helpful insights into their own response times and those of others.
Our data guru Kyle Napierkowski did some analysis on the longest and shortest mean time to response (MTTR) and median time to response across our customer base, and visualized it.
As PagerDuty is used by thousands of customers around the world, we’re in a pretty cool position to provide insights to our customers about trends in incident response times. This preliminary data is a starting point.
The graph below shows the median time to response, measured from the moment PagerDuty sends an alert to the moment the incident is resolved. As you can see, the majority of median times across our customer base are 20 minutes or less, with a fairly quick drop-off.
As a comparison, the graph below shows the distribution of mean time to response. It has a slower drop-off, with more customers in the tail, indicating that customers tend to have many short-resolution incidents (0-10 minutes) but also a handful of incidents with very long resolution times that skew the mean.
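To make the skew concrete, here is a minimal sketch of how a single long-running incident pulls the mean away from the median. The numbers are made up for illustration and are not PagerDuty customer data; the variable names are our own.

```python
from statistics import mean, median

# Hypothetical resolution times (in minutes) for one account.
# Nine incidents resolve within 10 minutes; one drags on for 10 hours.
resolution_minutes = [2, 3, 4, 5, 5, 6, 8, 9, 10, 600]

account_median = median(resolution_minutes)  # middle of the sorted values
account_mean = mean(resolution_minutes)      # sum divided by count

print(f"median: {account_median} min")
print(f"mean:   {account_mean} min")
```

With these values the median stays at 5.5 minutes while the mean jumps to 65.2 minutes, which is the same pattern that produces the longer tail in the mean distribution.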
Kyle also looked at the customers with the highest MTTR. The graphs below show response time distribution for accounts that had the highest median and mean times to resolve an incident (customer names have been removed). For each account, the median or mean value is flagged, and a heatmap bar shows the response time for individual incidents. Brighter green = more incidents that took that amount of time to resolve.
The mean time to resolution shows much higher times—again, a result of a handful of incidents with exceptionally long resolution times.
Kyle’s project provides some interesting initial insights. It’s just a preliminary exploration of customer-average metrics, but it lays the foundation for some exciting future ideas. For example, one option could be providing mean response time segmented by industry, so you can better benchmark yourself against your peers.
What metrics/analysis do you use to evaluate your response times, and what are you curious about that you’ve never been able to dig into? Let us know—perhaps next hack day we can use your input to dig a little deeper.