SendGrid is a proven cloud-based customer communication platform that delivers over 25 billion emails each month for Internet and mobile-based customers. The company is headquartered in Colorado and has over 300 employees, 23 of them on the operations team and approximately 84 in the development group. Mary Moore-Simmons, Engineering Operations Manager, manages the infrastructure at SendGrid, which includes servers and data centers, the network behind them, virtualization stacks, and backend systems. Given the high volume of email sent through SendGrid, a multitude of incident alerts is generated every day. Finding a scalable, enterprise-grade solution to streamline and simplify the manual incident alert process was a top initiative for the company.
Replacing the previous alerting tool and overcoming scalability challenges
SendGrid receives up to two thousand incident alerts in a typical day and tens of thousands per minute during technical incidents or outages. With such a large volume, the company must address alerts quickly and efficiently. Before moving to PagerDuty, SendGrid used a different vendor for alerting but realized they needed a full-scale incident management solution to support their high volume of incidents. “When you have a tool in place, you want it to work, especially when there is an outage; that’s when you expect it to work,” said Moore-Simmons. Faced with scalability challenges, SendGrid decided to move to a reliable and scalable incident management solution.
Accelerating MTTA and MTTR by switching to a new incident management platform
SendGrid implemented PagerDuty as their new incident management solution and uses the platform for collaboration, scheduling, escalation, and reporting. An on-call user can acknowledge an incident alert, escalate it if needed, or resolve the issue at hand, and then move directly to the next incident without delay. The main dashboard, which reports all incidents, is another critical benefit for SendGrid. “The way PagerDuty’s incident management dashboard’s UI is designed allows you to see what’s going on and what kind of alerts you are receiving. This is super helpful for us – no more having a list of alerts moving around at all times and losing focus on them,” said Moore-Simmons.
Moore-Simmons finds PagerDuty’s reporting feature to be the most important asset for her role. Reporting on metrics gives her insight into the number of alerts per day, per week, per month, and per year. “We had an estimate of 78,000 alerts happen this year and the company’s goal was to reduce the number of alerts by 50% compared to 2015. So far, we are on track with this metric, thanks to the support of PagerDuty,” stated Moore-Simmons. She also determined that the team’s mean time to repair (MTTR) is 19 minutes, while its mean time to acknowledge (MTTA) is only 2 minutes. Gathering this type of information helps Moore-Simmons and the other engineering managers identify what’s working, what’s not, and how to fix the problem.
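To make the two metrics concrete, the following is a minimal Python sketch of how MTTA and MTTR could be derived from incident timestamps, along with the percentage reduction in alert volume. It does not use PagerDuty's API or reporting code. The `Incident` record, its field names, the helper functions, and the sample timestamps are all hypothetical and chosen only for illustration, and it assumes MTTR is measured from the moment an alert triggers.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record; field names are illustrative, not a PagerDuty schema.
    triggered_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime


def mtta_minutes(incidents):
    """Mean time from trigger to acknowledgement, in minutes."""
    return mean(
        (i.acknowledged_at - i.triggered_at).total_seconds() / 60 for i in incidents
    )


def mttr_minutes(incidents):
    """Mean time from trigger to resolution, in minutes (assumed definition)."""
    return mean(
        (i.resolved_at - i.triggered_at).total_seconds() / 60 for i in incidents
    )


def percent_reduction(baseline_count, current_count):
    """Percentage drop in alert volume relative to a baseline period."""
    return 100 * (baseline_count - current_count) / baseline_count


# Invented sample data, not SendGrid's incidents.
start = datetime(2016, 1, 1, 12, 0)
sample = [
    Incident(start, start + timedelta(minutes=1), start + timedelta(minutes=10)),
    Incident(start, start + timedelta(minutes=3), start + timedelta(minutes=28)),
]

print(mtta_minutes(sample))         # 2.0
print(mttr_minutes(sample))         # 19.0
print(percent_reduction(200, 100))  # 50.0 (hypothetical counts)
```

Tracking these values per week or per month, as described above, would amount to grouping incidents by period before calling the same helpers.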
The biggest benefit to SendGrid is that their operations and development teams can now resolve outages quickly and prevent them from recurring, thanks to reliable and rapid incident notifications. Every minute of an outage costs the company thousands of dollars and results in poor customer experience and customer churn; with fewer outages, there has been less churn. Moreover, the team has been more satisfied and productive since switching to PagerDuty.
Enhancing employee productivity and improving scalability
SendGrid can rely on PagerDuty as a trustworthy solution to support their use cases, critical alerts, and scheduling. “We have confidence in PagerDuty and no longer have to worry about unnecessarily long outages and revenue loss. Everyone on-call at SendGrid uses PagerDuty and knows the solution as an established provider,” said Moore-Simmons. Employees are happy and productive, which is important to the business. Overall, the company has seen many advantages after switching to PagerDuty, including faster resolution times for outages, increased employee productivity and happiness, and impressive bottom-line metrics that attest to the company’s operational efficiency.
“PagerDuty helps us respond faster to the alerts that we receive. We’re able to diagnose outages faster, which in turn improves the experience of our customers and reduces downtime as well as any associated costs.”