Founded in 2007, ecobee is a Canadian home automation company that builds Wi-Fi-enabled thermostats for residential and commercial applications to help users maximize comfort, reduce their carbon footprint, and save money. Behind this easy-to-use product are continuous deployments of mission-critical applications and services, a regionally distributed infrastructure, and self-healing server clusters that keep services online for ecobee’s global customer base.
Jordan Christensen, VP of Technology at ecobee, is responsible for the company’s platform infrastructure, including automation, self-healing, and end-to-end service delivery and availability. “My team’s overall mission is to build reliable, fault-tolerant infrastructure, and PagerDuty really is the critical platform we use to measure and monitor this reliability,” he explained.
Because ecobee’s premier product is responsible for temperature control in millions of residential and commercial buildings, their services need to be online and available for users at all times. Even a minor blip or application failure can lead to lost revenue, so minutes matter when it comes to detecting and responding to potential incidents or outages before they impact customers.
To provide the best experience for their users, ecobee needed to approach incident management proactively and preventatively. To do this, their engineering teams needed a platform that would provide real-time visibility across their infrastructure and services.
Jordan’s platform team relies heavily on PagerDuty’s Terraform integration to manage their PagerDuty instance as part of their broader infrastructure as code. Because PagerDuty is managed through Terraform, teams can better understand the real-time health of their infrastructure and have full visibility into on-call rotations and schedules, since these are all defined as code within the Terraform environment.
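To illustrate the general idea of an on-call rotation defined as code, here is a minimal Python sketch. It is not Terraform and not ecobee’s actual configuration: the `Rotation` class, the rotation name, the engineer names, and the one-week shift length are all hypothetical, chosen only to show how a rotation declared as data can be reviewed and queried like any other code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Tuple


@dataclass(frozen=True)
class Rotation:
    """A hypothetical round-robin on-call rotation, declared as data."""

    name: str
    engineers: Tuple[str, ...]
    start: datetime          # when the first shift begins
    shift_length: timedelta  # how long each engineer stays on call

    def on_call_at(self, when: datetime) -> str:
        """Return who is on call at `when` (must not precede `start`)."""
        if when < self.start:
            raise ValueError("rotation has not started yet")
        # Count the whole shifts elapsed, then wrap around the engineer list.
        shifts_elapsed = (when - self.start) // self.shift_length
        return self.engineers[shifts_elapsed % len(self.engineers)]


# Illustrative values only; none of these come from ecobee.
platform_rotation = Rotation(
    name="platform-primary",
    engineers=("alice", "bob", "carol"),
    start=datetime(2024, 1, 1, tzinfo=timezone.utc),
    shift_length=timedelta(weeks=1),
)
```

Because the rotation lives in version control, a change to who is on call becomes a reviewable diff rather than a manual edit in a separate interface.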
This technique enabled teams to cut out the manual work of on-call management and created opportunities to automate the maintenance of on-call rotations and schedules across different applications and services. “Having PagerDuty embedded into the infrastructure as code rather than a disparate interface makes it a central piece of the infrastructure rather than hanging off as an ancillary service,” explained Jordan. This improved visibility, along with the ability to manage the configuration within their own codebase, empowers his teams to truly understand the health of their infrastructure when incidents inevitably occur. With the help of this integration, the ecobee team is gradually working towards four 9s in terms of uptime and availability.
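For context, “four 9s” conventionally means 99.99% availability, which leaves a downtime budget of roughly 52.6 minutes per year. The short Python sketch below works through that arithmetic; the function name is illustrative, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year allowed at a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Compare three 9s, four 9s, and five 9s.
    for target in (99.9, 99.99, 99.999):
        budget = downtime_budget_minutes(target)
        print(f"{target}% availability allows about {budget:.1f} minutes of downtime per year")
```

Each additional 9 cuts the allowable downtime by a factor of ten, which is why response time measured in minutes matters at this level.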
With PagerDuty, ecobee is able to work incidents proactively and collaboratively, with full context on the incident at hand. “The insights are pointed and specific, not generic,” recalls Jordan. Centralizing the signals from every container, server, application, and microservice in PagerDuty makes it easy for his teams to diagnose issues and automatically engage the right people to remediate them before they impact customers.
Jordan’s team has seen several benefits from PagerDuty, including:
With PagerDuty implemented across the entire Engineering organization, along with other key business units and stakeholders, Jordan noted that leadership has been able to put a strong focus on team health, work-life balance, and creating opportunities for growth among junior engineers. “With PagerDuty, employees feel safe being on call because they know they can escalate issues to senior developers to provide guidance and walk through the issue to solve it,” explained Jordan. Minor incidents often turn into learning opportunities, which boosts morale and team health across the organization.
“If we didn’t have PagerDuty, it would be extremely difficult to execute proper incident management and response as a company.”
– Jordan Christensen, VP of Technology
Ecobee plans to continue its use and expansion of PagerDuty across the greater organization. Specifically, the engineering teams want to learn to better leverage PagerDuty Modern Incident Response so they can implement response plays for particular services and automate certain tasks within a response action. The teams also plan to leverage PagerDuty’s Slack integration to centralize communications and improve collaboration across teams during major incidents.
Additionally, ecobee would like to formalize a postmortem process within their PagerDuty instance in order to centralize the entire incident lifecycle on one platform. Jordan’s team is also looking to harness the full capabilities of the PagerDuty REST API to encourage automation and build business efficiencies across the rest of the organization. “We haven’t even begun to scratch the surface of what we can truly accomplish with PagerDuty,” explained Jordan.