Xero Leverages PagerDuty and ChatOps to Improve Incident Response and Digital Operations

Size: 1,001-5,000 employees

Industry: Computer Software

Location: Wellington, New Zealand

Key Integrations:

Sumo Logic
Datadog
Zendesk
Slack
JIRA

Xero is a global small business platform for accountants, bookkeepers, and small businesses. Founded in 2006, the platform offers small business owners and their advisors automatic bank and credit card account feeds, invoicing, accounts payable, and standard business and management reporting.

Xero has an easy-to-use, intuitive interface so that even small business owners with little bookkeeping experience can accurately account for their transactions. A comprehensive education portal and award-winning customer service further support small business owners if they have questions. For its active community of accounting partners, Xero offers additional functionality, such as a practice manager, advisory tools, and an app marketplace.

With offices in the U.S., U.K., Asia, Australia, and New Zealand, Xero has more than 1.2 million subscribers in over 180 countries who rely on its software to help run their businesses. It’s therefore very important for Xero’s platform to be dependable—a responsibility that falls on the company’s developers and site reliability engineers.

Challenges

Anthony Angell, one of the Site Reliability Engineer Team Leads, explained that when he joined the company a few years ago, Xero was already using PagerDuty to manage two schedules. The production environment was supported by Operations teams located in Auckland, New Zealand, and Denver, Colorado. However, as Xero continued to grow rapidly, it became increasingly challenging for the Operations teams to scale and coordinate schedules and escalation policies across the two sites.

In 2016, Xero implemented a DevOps approach incorporating Site Reliability Engineering (SRE) to manage the production environment and overhauled its incident management processes. Rather than having the operations teams oversee the entire production environment, this new incident management framework relied on the teams that built the software to be available and on-call in the event of an incident—regardless of whether they were a developer or a QA engineer.

This meant many more people and teams were added to on-call schedules, and Xero needed a way to manage and scale the on-call groups, which is where PagerDuty came in. “[PagerDuty] helped us to be able to scale the on-call groups within the business quite easily,” Angell shared. “It has also given us and the business a better support structure.”

Business Impact

With PagerDuty, the Site Reliability Engineering team was also able to educate many other teams about incident management and how alerting works on the platform. The result? Customers are seeing quicker resolution times because the people who developed, built, and continue to maintain the code are also the first responders should something go wrong. “The ability to get a hold of our responders in a timely fashion via different methods adds a lot of business value,” said Angell.

To further automate and scale the incident management process, Xero’s Site Reliability Engineering team leverages ChatOps to support hundreds of employees around the world. Xero’s homegrown chatbot, “Multivac,” is integrated into the company’s Slack account and leverages PagerDuty’s API to automate several critical activities within Xero’s incident management framework.

Using Multivac, Xero can onboard a new team and on-call schedule into PagerDuty by sending a request to Xero’s GitHub repository to automatically enable the configuration. Incident managers can use Multivac to notify the right team members, initiate the incident response process within PagerDuty, and create a unique Slack channel for the incident. Users can also request status updates on recent production releases or active alerts from Multivac, which provides the context needed to troubleshoot incidents more quickly. By offloading many activities to Multivac and PagerDuty, Xero has been able to respond to and resolve incidents much faster.
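The case study does not describe how Multivac is built, so the following is only a minimal Python sketch of the kind of automation described above, under stated assumptions. It assumes the bot raises incidents through PagerDuty's Events API v2 (the source says only "PagerDuty's API") and derives a per-incident Slack channel name from the incident summary. The function names, the `inc-` prefix, and the sample values are hypothetical. The sketch only builds the request body and the channel name; it sends nothing over the network.

```python
import json
import re

# Public endpoint of the PagerDuty Events API v2 (shown for context only;
# this sketch never sends a request).
PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key, summary, source, severity="critical"):
    """Build the JSON body for an Events API v2 'trigger' event.

    Hypothetical helper: a chatbot could POST this body to
    PAGERDUTY_EVENTS_URL to page the on-call responders.
    """
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity}")
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
        },
    }


def incident_channel_name(incident_id, summary, max_len=80):
    """Derive a Slack-friendly channel name for an incident.

    Assumes Slack's naming limits: lowercase letters, digits, hyphens,
    and at most 80 characters. The 'inc-' prefix is a made-up convention.
    """
    slug = re.sub(r"[^a-z0-9]+", "-", summary.lower()).strip("-")
    name = f"inc-{incident_id}-{slug}"
    return name[:max_len].rstrip("-")


if __name__ == "__main__":
    event = build_trigger_event(
        routing_key="EXAMPLE_ROUTING_KEY",  # placeholder, not a real key
        summary="Bank feeds: API errors!",
        source="multivac-sketch",
    )
    print(json.dumps(event, indent=2))
    print(incident_channel_name(42, "Bank feeds: API errors!"))
    # prints: inc-42-bank-feeds-api-errors
```

Keeping payload construction separate from the network call makes each step easy to test on its own, which matters when a single chat command fans out to several services.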

“In a one-year span, from January 2017 to January 2018, PagerDuty analytics showed us that we saw a 40 percent reduction in high-urgency alerts. Not only that, but MTTR for high-urgency alerts, the highest urgency level, is down 74 percent.”

#PeopleFirst: Improved Work-Life Balance With PagerDuty

One of Xero’s core values is “human,” which puts a strong emphasis on people, and the company expanded its use of the PagerDuty platform by leveraging analytics capabilities to gain insight into team health. “The analytics insight is helpful for our managers, particularly those on other teams, because they can see from the data how many alerts their team received over a specific time period,” explained Angell. “This is useful for when we need to take a closer look at the reasons for engineer fatigue. For example, we want to know if on-call responders received an unusually high number of alerts in a short time period, as that could put them at risk of burnout.”
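The check Angell describes, spotting responders who received an unusually high number of alerts in a short period, can be illustrated with a small sliding-window count. This is a sketch under stated assumptions, not PagerDuty's analytics implementation: the function name, the 24-hour window, the threshold of five alerts, and the sample data are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def flag_alert_bursts(alerts, window=timedelta(hours=24), threshold=5):
    """Return {responder: peak_count} for responders whose busiest
    `window`-long stretch contained more than `threshold` alerts.

    `alerts` is an iterable of (responder, timestamp) pairs.
    """
    by_responder = defaultdict(list)
    for responder, timestamp in alerts:
        by_responder[responder].append(timestamp)

    flagged = {}
    for responder, times in by_responder.items():
        times.sort()
        start = 0
        peak = 0
        for end, current in enumerate(times):
            # Advance the left edge until the span fits inside the window.
            while current - times[start] > window:
                start += 1
            peak = max(peak, end - start + 1)
        if peak > threshold:
            flagged[responder] = peak
    return flagged


if __name__ == "__main__":
    base = datetime(2018, 1, 15, 22, 0)
    sample = [("alice", base + timedelta(hours=h)) for h in range(6)]
    sample += [("bob", base), ("bob", base + timedelta(days=3))]
    print(flag_alert_bursts(sample))  # prints: {'alice': 6}
```

A sliding window is used rather than fixed calendar days so that a burst spanning midnight still counts as one busy stretch.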

Additionally, Angell’s favorite part about PagerDuty is how it gives teams flexibility and ownership when it comes to on-call scheduling. Instead of having one team overseeing everything as before, Xero now has a number of distributed teams empowered to manage their own on-call schedules. “We’ve educated a lot of teams around incident management and how alerting and PagerDuty works, and it’s actually given the business a better MTTR,” said Angell.

What’s Next

Xero is expanding its use of the PagerDuty Digital Operations Management platform across a broader range of users and use cases. The company has already taken some steps to evaluate team health on its own, and it hopes to gain more in-depth insight into how its teams are performing by adopting PagerDuty’s Operational Health Management Service (OHMS).