Summit EMEA: How Vodafone Is Enabling Immutable Telemetry

by PagerDuty July 21, 2020 | 4 min read

In June, we were delighted to host our first ever virtual PagerDuty Summit EMEA! Llywelyn Griffith-Swain, SRE Manager, and David Jambor, Head of Systems Engineering at Vodafone, were among our speakers. They outlined Vodafone’s approach to achieving immutable telemetry.

David opened the session by defining Vodafone’s strategic goals. “Our vision is to create an engineering-driven culture,” he explained. “We want to empower development teams to be self-sufficient. Therefore, we’re putting them at the center of everything we do, but we want to challenge them—their code needs to reach production within four hours.”

To do this, Vodafone is building self-service capabilities, with development teams given the power to say what tools and capabilities they need and how they want to use them. The end goal is to have observability and alerting capabilities that tell development teams what happens to code and how it behaves as it moves into production.

“We’re building a lot of tooling around this,” David shared. “We’re building true continuous CI/CD, with a focus on continuous deployment that enables us to move code from a sandbox into the production environment. But this cannot be achieved without immutable infrastructure, which will enable us to provide immutable observability and alerting for development teams.”

Why Is Immutable Telemetry Important?

To explain how immutable observability can be defined, David gave us a great analogy using Formula 1.

Imagine you’re leading the race and your tire gets a puncture, forcing you to come in for a pit stop. What do you expect your engineers to do—repair or replace the tire? You of course want them to replace it because you want to get back to the race as soon as possible. Immutability is about throwing away what is broken and replacing it quickly, instead of spending time trying to repair it.

“Immutable infrastructure in IT really means that you shouldn’t change things if something is broken; it is much quicker to replace it with something new,” David explained. “Immutable observability leverages this approach to provide an on-demand, out-of-the-box capability to monitor and alert everything, end to end, in an immutable fashion.”

How Vodafone Is Enabling Immutable Telemetry

The immutable approach to telemetry would see Site Reliability Engineering (SRE) teams develop new monitoring approaches on demand. Llywelyn gave us an example where three development teams are all using a threshold error rate monitor.

But what happens if one team decides it wants an anomaly detection error rate monitor? Instead of replacing the existing monitor and upsetting the other teams, the SRE team would develop the new monitor. Once ready, the development team that requested it would use the new monitor, while the others carry on using the existing monitor.
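The pattern described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Vodafone's implementation: the `MonitorDefinition` class, the `publish` helper, and the monitor and team names are all hypothetical. Published monitors are never edited; a new variant is added under a new name, and only the requesting team changes what it points at.

```python
from dataclasses import dataclass
from types import MappingProxyType


@dataclass(frozen=True)
class MonitorDefinition:
    """A published monitor. Frozen, so it cannot be edited after creation."""
    name: str
    kind: str  # e.g. "threshold" or "anomaly" (hypothetical labels)


def publish(catalog, monitor):
    """Return a new read-only catalog that also contains `monitor`.

    Existing entries are never modified or overwritten.
    """
    if monitor.name in catalog:
        raise ValueError(f"{monitor.name} already exists; publish under a new name")
    updated = dict(catalog)
    updated[monitor.name] = monitor
    return MappingProxyType(updated)


# Three teams start out on the same threshold monitor.
catalog = publish(MappingProxyType({}),
                  MonitorDefinition("error-rate-threshold", "threshold"))
subscriptions = {team: "error-rate-threshold"
                 for team in ("team-a", "team-b", "team-c")}

# One team asks for anomaly detection: the SRE team publishes a new monitor
# next to the old one, and only that team's subscription changes.
catalog = publish(catalog, MonitorDefinition("error-rate-anomaly", "anomaly"))
subscriptions = {**subscriptions, "team-c": "error-rate-anomaly"}
```

The other two teams are unaffected because nothing they depend on was changed in place.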

Llywelyn also talked about the challenges Vodafone faced in implementing immutable telemetry. “We have 150+ developers and are following the DevOps approach, where developers need to own the code whether it’s in production or lower environments, including subsequent monitoring and alerting,” he shared. “We also need to give an immediate view of our production status to all stakeholders to enable visibility across digital.”

He also explained that the solution they build needs to be in line with the SRE principle of reducing toil. Because the solution is also meant for developers, all modules and monitors need to be available as code and implemented via a CI pipeline, which allows developers to add them quickly as needed and also allows Vodafone to recover should an incident arise.

The SRE team dreamed of a developer never having to leave the release pipeline to set up monitoring and alerting; instead, developers can simply call up modules that the SRE team has built. In practice at Vodafone, this sees the SRE team developing configurations for Datadog monitors and PagerDuty callouts, which can be called up in Terraform to set up monitoring and alerting. In the future, should developers want new monitors, they would request them from the SRE team, who would develop them and make them available, and developers could then call them up through Terraform.
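To give a feel for what calling up a module means for a developer, here is a rough Python sketch. Vodafone's modules are Terraform configurations for Datadog and PagerDuty; the function below is a hypothetical stand-in written in Python purely to show the call pattern. The field names, the `error_rate_alerting` function, and the service and team names are assumptions and do not correspond to either vendor's API.

```python
import json


def error_rate_alerting(service: str, team: str, threshold_pct: float = 5.0) -> dict:
    """Hypothetical SRE-maintained module.

    One call returns both the monitor definition and the callout that
    fires when the monitor triggers, so developers configure neither by hand.
    """
    if not 0 < threshold_pct <= 100:
        raise ValueError("threshold_pct must be greater than 0 and at most 100")
    monitor_name = f"{service} error rate"
    return {
        "monitor": {
            "name": monitor_name,
            "kind": "threshold",
            "threshold_pct": threshold_pct,
        },
        "callout": {
            "triggered_by": monitor_name,
            "escalation_target": team,
        },
    }


# What a release pipeline step might contain: a single call, no manual setup.
plan = error_rate_alerting("checkout-api", team="payments")
print(json.dumps(plan, indent=2))
```

Because the module lives in version control and runs from the pipeline, the same call can be replayed to rebuild monitoring and alerting after an incident.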

David wrapped up the session by explaining how PagerDuty fits into SRE’s strategy. “SRE’s goal is to eliminate toil to allow time to be spent on more valuable tasks, like engineering solutions that make tomorrow a better place. Automation of tasks is vital here, and PagerDuty is the best tool for the job because it brings development teams closer to their code and empowers ownership.”

Interested in watching the full session? Register today to check it out on demand (for free!), along with other customer sessions, including incident management at Form3 and how to drive operational efficiency with Auto Trader UK and Gousto.