Building and Scaling Your SRE Team
Building Site Reliability Engineering (SRE) teams is hard! There are so many articles and explanations of what SRE means that it’s easy to get lost. The real challenge is going beyond understanding the individual SRE role to building and scaling a team of SREs. It’s important to find the right information that will help you take your SRE team to the next level.
In a recent Page it to the Limit podcast episode with Gremlin’s Principal SRE Tammy Bryant, we discussed the importance of SRE and how to build teams with the right culture. Building upon some of the best practices that Tammy shared, this article will go beyond just defining the role of an SRE and dive into practical ways to build and scale your SRE teams.
What is an SRE?
The goal here is not to reinvent the definition of Site Reliability Engineer, or SRE. The term SRE has been defined in multiple places (for the most comprehensive information, check out Google’s SRE book). One of my favorite explanations of what an SRE is comes from Tammy Bryant herself: “They [SREs] work with all teams across an organization to ensure reliability goals are constantly achieved. They are engineers, they are educators, they are mentors, they are ‘automaters,’ they are data-driven, and they put the customer first.”
“One of the most important missions of an SRE is to protect, provide for, and progress the software and systems behind really important services that people use and rely on every day. So you have to have this ever watchful eye on durability, availability, latency, performance, and capacity.”
– Tammy Bryant, Principal SRE, Gremlin
In general, SREs look to bridge the gap between development and operations teams to ensure the reliability of systems. They are responsible for availability, latency, performance, efficiency, change management, and monitoring.
In the world of DevOps, DevOps and SRE are often mistaken for the same thing. While there is crossover, DevOps is more about “what” needs to be done, while Site Reliability Engineering focuses more on “how” it can be done.
SRE Skills & Responsibilities
SRE is a critical role/team in today’s digital world. As Tammy puts it, “If your system and services aren’t up and running, then your customers can’t even use your services.”
As previously mentioned, SRE roles and responsibilities are focused on performance and reliability. An SRE is not just “an ops person who codes”; they have skills geared towards automation, deployment, configuration management, and monitoring, as well as analytics and metrics. Great SREs look to partner with engineering stakeholders to design and deliver a reliable, scalable, secure, and performant platform. Beyond partnership, SREs and SRE teams look for ways to improve the customer experience and stay on top of technical trends to find innovative tools and approaches to solving problems.
Given the overarching responsibilities of automation, customer experience, and reliability, SREs should have the skills to solve problems by writing code to automate manual processes. SREs are often responsible for running critical services that customers (both internal and external) depend on. It’s important that SREs understand the impact that operational optimization can have on a product and the positive ripple effect it can have on an entire organization. SREs should also be empathetic and responsive to others, and be able to take opinions and suggestions and translate them into opportunities to reach technical solutions quickly.
Establishing an SRE Team
When building an SRE team, it is important to set out guidelines that represent the team’s goals. At PagerDuty, our SREs have a set of guidelines that aid the SRE team through the decision-making process. Dave Bresci, Site Reliability Engineering Manager at PagerDuty, shared those guidelines with us, listed here:
- Ensure our work connects to organizational goals.
- Partner with Engineering stakeholders to define a supportable and performant service architecture (paved road).
- Continuously strive to improve the customer experience: Full lifecycle support (creation, development, deployment, retirement), observability, flexible connectivity, and monitoring.
- Favor managed, commercially supported, or industry-accepted solutions over systems built in-house.
- Proactively notify the organization of any significant infrastructure changes.
- Measure success through adoption.
- Revisit design choices and components that are rendered obsolete and see what can be replaced with managed or off-the-shelf parts, or substantially simplified.
- Share SRE expertise in service to the entire PagerDuty organization.
- Factor operational costs in architectural and platform decision-making.
Having goals clearly stated, written down, and visible to the entire organization elevates the organizational culture through transparency, clarity, and information sharing. An example of a specific team goal is the one the PagerDuty SRE Delivery Team has: Empowering service owners by providing tooling, patterns, and partnership to enable them to rapidly build reliable, operable, and performant services at scale. In addition to these overarching SRE goals at PagerDuty, each of our SRE teams has its own goals written down.
Thinking through ways to share information such as progress and goals is a key practice when building and scaling SRE. Whatever your goals are, clearly define them, share them with the organization, and make your team’s vision and mission known far and wide. At PagerDuty, we use an internal wiki open to everyone to share goals with each other, and we review these goals in our Product All Hands as well.
The ways in which an SRE team operates will vary depending on the organization. SRE roles can be fully embedded within a team, shared between teams, or shared with a standalone team. Understanding where you are in the organizational transformation process and what overarching goal you are trying to achieve with an SRE team will help you determine how you want your team to be structured.
Scaling Your SRE Team
As with scaling any team, understanding that hiring and onboarding take a long time is the first step. It can take new folks 3 to 12 months to learn new systems, new ways of working, and the cultural dynamic of new organizations and teams.
Always keep in mind that transformation doesn’t happen overnight, and the same is true of creating and scaling teams. One way to stay on track is to keep your eye on the horizon and look to what is coming next. Set team goals for what you want to be true 2-3 years from now. Remember that SRE teams are not on an island: the SRE group is responsible not just for building for the future but also for supporting the existing environment. You never know when the current systems will break or when a pandemic will suddenly break out and you have to wildly scale your environment.
Scaling and fixing SRE teams can put current optimization projects on hold, and communicating that to the team is key to keeping them focused on the needs of the business and customer. After all, teams will get frustrated dealing with all the legacy systems. They will want to blow it all up and fix it, which is why it is important to remind teams that it takes time to migrate and move forward, and that this progress doesn’t always show up in the day-to-day tackling of issues. One way to combat this is to regularly remind the team of the small steps that have added up. Ask questions like “Do you remember where we were 6 months ago?” Using data to back up the rate of progress is great. You can look at things like “Here’s our adoption rate of this new tool over time,” or “One year ago we had five containerized services and now we have over 100.”
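If you want to put numbers behind an “adoption rate over time” conversation, a minimal sketch in Python might look like the following. Everything here is an assumption for illustration: the snapshot figures are made up, and the function names adoption_rate and adoption_trend are hypothetical helpers, not part of any PagerDuty tooling.

```python
from datetime import date

# Made-up illustrative snapshots: (date, services using the new tool, total services).
# Replace these with counts from your own service catalog or inventory.
snapshots = [
    (date(2020, 7, 1), 48, 125),
    (date(2020, 1, 1), 5, 120),
    (date(2021, 1, 1), 104, 130),
]


def adoption_rate(adopted: int, total: int) -> float:
    """Return the share of services that have adopted the tool, as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * adopted / total


def adoption_trend(rows):
    """Return (date, rate) pairs in chronological order."""
    return [(day, adoption_rate(adopted, total)) for day, adopted, total in sorted(rows)]


# Print one line per snapshot so the team can see the trend at a glance.
for day, rate in adoption_trend(snapshots):
    print(f"{day.isoformat()}: {rate:.1f}% of services adopted")
```

Even a small report like this, shared regularly, can make slow and steady progress visible to a team that is mostly focused on day-to-day issues.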
At the end of the day, building and scaling teams is not easy. Hopefully this article gave you and your organization some things to think about as you embark on your SRE journey. We would love to hear your stories, tips, and thoughts as well.
Continue the conversation with us here at community.pagerduty.com. If you would like more information on building and scaling your SRE teams, check out Tammy Bryant’s talk from PagerDuty Summit and the Page it to the Limit podcast episode on SRE.