by Dave Bresci
December 5, 2018
Achieving scale — that is, the ability to meet application demand at any level — is essential if you want your business and user base to grow, or if you hope to be able to handle the vicissitudes of modern software deployment.
Yet scaling is no easy feat. Most legacy applications struggle to support thousands of users. An unexpected traffic spike will simply knock over an application not designed for it, and countless customers and dollars are lost while the ITOps team struggles to spin up VMs or rack-and-stack servers to handle the load.
And even if you run your app in the cloud, scalability is not guaranteed. A poorly designed cloud app will experience bottlenecks that render it unusable.
Given how much suboptimal digital services cost in productivity, lost opportunities, and more, scalability is mission-critical to any organization today. And it is possible! It requires implementing the right tools and processes, the right team, and the right communication lines within that team. Below, I explain how to achieve scalability in order to avoid derailing your software and organization.
When preparing for scalability, flexibility and the ability to match deployed infrastructure to the load are key. This can also drive cost efficiencies in deploying an application. Understanding your traffic patterns, average usage, and the standard deviation will help you properly size your environment, and planning to scale rapidly for an exceedingly rare (but possible) event can save a lot of headaches when the application goes viral. If an application is deployed regionally, idle cycles can often be found in the middle of the night. Weekday load and weekend load can vary significantly. Many businesses are seasonal, and usage of the application can be several times, or even orders of magnitude, lower at one time of year than at another.
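As a rough illustration of sizing from the average and the standard deviation, here is a minimal Python sketch. The function name `size_fleet`, the sample traffic numbers, the per-instance capacity, and the choice of three standard deviations of headroom are all assumptions made for this example, not recommendations.

```python
import math
import statistics


def size_fleet(requests_per_second, per_instance_capacity, sigmas=3.0, min_instances=2):
    """Return an instance count that covers mean load plus `sigmas` standard deviations.

    All parameters are hypothetical: `requests_per_second` is a list of observed
    load samples and `per_instance_capacity` is what one instance can serve.
    """
    mean = statistics.mean(requests_per_second)
    stdev = statistics.pstdev(requests_per_second)  # population standard deviation
    target_load = mean + sigmas * stdev
    # Never drop below the floor, even when traffic is idle overnight.
    return max(min_instances, math.ceil(target_load / per_instance_capacity))


# Made-up hourly samples (requests per second) for one weekday.
weekday = [120, 90, 60, 45, 50, 110, 300, 520, 610, 640, 630, 600]
print(size_fleet(weekday, per_instance_capacity=100))
```

Running the same function over weekday, weekend, and seasonal samples separately gives a sense of how much capacity sits idle if you size for the peak all year.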
Scale also involves ensuring the reproducibility of your artifacts, which in turn forces consistency in production deployments. The service artifacts can then be scaled independently as application needs change and grow. This method requires a strong understanding of DevOps, with a durable continuous integration and continuous deployment pipeline at its core.
First and foremost, application source code needs to be checked into a version control system. Rather than building a bespoke server stack around this well-structured output, you also need to transform the server stack itself into code. It can be a painful process at first, but the only way to scale an infrastructure consistently, every time, is to stop relying on an ITOps staff member clicking the “next” button or typing commands into the console on every server deployed to dev, test, and production.
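To make the idea of a server stack expressed as code more concrete, here is a small Python sketch. It is not tied to any particular Infrastructure-as-Code tool; `ServerSpec`, `plan`, and the field names are hypothetical. The point is only that a declared, version-controlled description can be compared against what is running, so the same actions come out every time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerSpec:
    """Declarative description of one server (hypothetical fields)."""
    name: str
    image: str
    packages: tuple


def plan(desired, current):
    """Return the actions needed to move `current` to `desired`.

    Both arguments map a server name to its ServerSpec. The result is a list
    of (action, name) tuples; an empty list means nothing needs to change.
    """
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in current:
            actions.append(("create", name))
        elif current[name] != spec:
            actions.append(("update", name))
    for name in sorted(current.keys() - desired.keys()):
        actions.append(("delete", name))
    return actions


desired = {"web": ServerSpec("web", "base-image:1", ("nginx",))}
print(plan(desired, current={}))
print(plan(desired, current=desired))
```

Because the plan is derived from the declared state rather than from someone's memory of console commands, dev, test, and production can all be built from the same definition.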
Once your infrastructure and code are both well-defined, you can write integration tests to ensure they function as they should in a fully built environment. To take these to the next level of sophistication, containers can be used as infrastructure building blocks. Those blocks then have consistent “downward”-facing hooks to the infrastructure. A cloud container management platform, combined with manifest files that describe how the services fit together and should scale, turns these consistent artifacts into a highly resilient and scalable application.
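As a sketch of how a manifest can describe scaling, the Python below models a manifest as a plain dictionary and applies a simple proportional rule, similar in spirit to what container platforms commonly do. The service names, field names, and thresholds are invented for illustration; a real platform has its own manifest format and scaling logic.

```python
import math

# Hypothetical manifest: per-service replica bounds and a target CPU percentage.
MANIFEST = {
    "web": {"min_replicas": 2, "max_replicas": 20, "target_cpu_percent": 60},
    "worker": {"min_replicas": 1, "max_replicas": 10, "target_cpu_percent": 75},
}


def desired_replicas(service, current_replicas, observed_cpu_percent, manifest=MANIFEST):
    """Scale the replica count in proportion to observed vs. target CPU,
    then clamp the result to the bounds declared in the manifest."""
    rule = manifest[service]
    raw = math.ceil(current_replicas * observed_cpu_percent / rule["target_cpu_percent"])
    return max(rule["min_replicas"], min(rule["max_replicas"], raw))


print(desired_replicas("web", current_replicas=4, observed_cpu_percent=90))
```

Each service scales independently against its own rule, which is what lets the artifacts grow or shrink separately as demand changes.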
The often-missed essential ingredient for scalability is a team that maps well to the technological topology described above. Such a team includes three main groups: developers, ITOps, and SREs (note: naming conventions for titles and division of responsibilities can vary across organizations).
The trick to coordinating your team in such a way as to maximize scale is to have SREs focus on reliability efforts by leveraging the Infrastructure-as-Code that their team members have written, rather than spending time on manual configuration. This makes for a different type of team arrangement than a legacy team structure, in which application code is simply “thrown over the wall” by developers to the ITOps team to deploy and run. The legacy model is a highly manual environment and is prone to error.
To complement this greater infrastructure visibility, engineering teams can implement a greater degree of application trace logging to help discover issues more quickly. As an incentive to create a more highly instrumented application, canary releases can be quickly deployed to a subset of the application’s user base, letting the team test new features and find bugs more quickly without affecting the larger application user base. Canary releases also let you gradually release new features, reducing the likelihood of incident spikes during rollout.
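One common way to pick the subset of users for a canary release is to hash each user ID into a stable bucket. The Python sketch below assumes that approach; the function name `in_canary`, the salt value, and the 100-bucket scheme are illustrative choices, not a description of any specific product.

```python
import hashlib


def in_canary(user_id, percent, salt="checkout-v2"):
    """Return True if this user falls inside the canary cohort.

    The salt is a hypothetical per-release label. Hashing keeps the assignment
    stable, so a user sees the same version on every request.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return bucket < percent


users = [f"user-{n}" for n in range(1000)]
cohort = [u for u in users if in_canary(u, percent=5)]
print(len(cohort), "of", len(users), "users routed to the canary")
```

Raising `percent` step by step widens the rollout gradually, and everyone already in the canary stays in it because their bucket number does not change.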
Last but not least, remember how important communication is. It should go without saying that even the best-structured team will not succeed in enabling scalability unless team members can communicate seamlessly with each other.
Effective communication requires not only tools that can automate communication tasks, but also a commitment to ensuring that everyone on your team “speaks the same language,” meaning that developers, ITOps, and SREs can all talk to one another in a mutually intelligible way because they all understand each other’s roles and needs.
It can be intimidating to take the first steps down a path of application scale. People, processes, and technology all need to change to move from a Waterfall method to DevOps, and to evolve legacy infrastructure management practices into modern ITOps and reliability engineering.
Much in the same way that the agile development revolution added value on a quicker timeline, each step in the scaling journey brings value that can be realized immediately.