by Rachel Obstler
May 16, 2017
You like sleep and weekends. Customers hate losing access to your system due to maintenance. PagerDuty operations engineer Doug Barth has the solution: ditch scheduled maintenance.
That sounds like a bold proposition. But as Doug explained at DevOps Days Chicago, it actually makes a lot of sense.
Scheduled maintenance tends to take place late at night on weekends—a tough proposition for operations engineers and admins. Customers require access at all hours, not just daylight ones. And scheduled maintenance implies your system is less reliable than you think, because you’re afraid to change it during the workday.
The solution? Avoid it altogether, and replace it with fast, iterative maintenance strategies that don’t compromise your entire system.
That might sound a bit ‘out there.’ But shelving scheduled maintenance is easier than you think. In his talk, Doug offered four ways to do it.
First things first: if you discard scheduled maintenance, your deployments need to be rock-solid. They should be scripted, fast and quick to roll back, and the rollbacks themselves should be tested periodically to make sure they don’t lag.
Each release also needs to be forward and backward compatible by one version, because you can’t stop the presses while you push out a new one. Red-blue-green deployments are crucial here, as they ensure only a third of your infrastructure undergoes changes at any given time.
Lastly, stateless apps must be the norm. You should be able to reboot an app without any effect on the customer (like forced logouts or lost shopping carts).
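To make the one-version compatibility rule concrete, here is a minimal sketch in Python. It assumes services exchange JSON messages; the field names (`user_id`, `email`, `locale`) and the function names are hypothetical and not taken from Doug's talk. The old reader ignores fields it doesn't know, and the new reader supplies a default for a field an old writer never sent, so both versions can run side by side during a rollout.

```python
import json


def encode_v1(user_id, email):
    # Old writer: knows nothing about "locale".
    return json.dumps({"user_id": user_id, "email": email})


def encode_v2(user_id, email, locale="en"):
    # New writer: adds "locale" but keeps every v1 field unchanged.
    return json.dumps({"user_id": user_id, "email": email, "locale": locale})


def decode_v1(payload):
    # Old reader: picks out the fields it knows and ignores the rest,
    # so it keeps working when a v2 writer is already deployed.
    data = json.loads(payload)
    return {"user_id": data["user_id"], "email": data["email"]}


def decode_v2(payload):
    # New reader: falls back to a default when an old writer omitted
    # the new field, so it keeps working before every writer is upgraded.
    data = json.loads(payload)
    return {
        "user_id": data["user_id"],
        "email": data["email"],
        "locale": data.get("locale", "en"),
    }
```

With this shape, a rollback is safe too: messages written by the new version are still readable by the old one.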
Use canary deploys judiciously to test rollouts, judge their integrity and compare results. These test deployments affect only a small segment of your system, so bad code or an unexpected error doesn’t spell disaster for your entire service.
Doug suggested a few practical ways to accomplish this:
As Doug summed it up for the DevOps Days crowd: “Avoid knife-edge changes like the plague.”
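One way to picture a canary deploy is as two small decisions: which hosts get the new build first, and whether their results look close enough to everyone else's to continue. The sketch below is an illustration only; the function names, the 5 percent default and the error-rate tolerance are assumptions rather than PagerDuty's actual tooling.

```python
def split_canary(hosts, percent=5):
    # Pick a small slice of hosts to receive the new build first.
    # Always take at least one host so the canary actually runs.
    count = max(1, len(hosts) * percent // 100)
    return hosts[:count], hosts[count:]


def canary_is_healthy(canary_errors, canary_requests,
                      baseline_errors, baseline_requests,
                      tolerance=0.01):
    # Compare the canary's error rate against the rest of the fleet.
    # With no canary traffic there is nothing to judge, so refuse to proceed.
    if canary_requests == 0:
        return False
    canary_rate = canary_errors / canary_requests
    baseline_rate = (
        baseline_errors / baseline_requests if baseline_requests else 0.0
    )
    return canary_rate <= baseline_rate + tolerance
```

If `canary_is_healthy` returns `False`, only the canary hosts need a rollback, and the rest of the service never saw the bad build.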
Your system should be loaded with retries. Build them into every hop between service layers, and use exponential backoff so the retries don’t overwhelm the downstream system. The requests between service layers must be idempotent, Doug emphasized. When they are, you’ll be able to reissue requests to new servers without double-applying changes.
Where you don’t care about the response, use queues to decouple the client from the server. If you’re stuck with a request/response flow, use a circuit breaker approach, where your client library returns minimal results if a service is down. That reduces front-end latency and errors.
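Here is a minimal Python sketch of retries with exponential backoff paired with an idempotent receiver. The names (`call_with_retries`, `IdempotentServer`), the request-ID scheme and the random jitter are assumptions added for illustration; the talk did not prescribe a specific implementation.

```python
import random
import time
import uuid


def call_with_retries(send, payload, max_attempts=5,
                      base_delay=0.1, max_delay=5.0, sleep=time.sleep):
    # One request ID is reused for every attempt, so the receiver can
    # recognize a retry of work it has already applied.
    request_id = str(uuid.uuid4())
    for attempt in range(max_attempts):
        try:
            return send(request_id, payload)
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            # The delay ceiling doubles on each attempt, capped at max_delay.
            # Random jitter (an assumption here) spreads out clients that
            # would otherwise all retry at the same moment.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))


class IdempotentServer:
    """Toy receiver that applies each request ID at most once."""

    def __init__(self):
        self.balance = 0
        self.seen = {}

    def handle(self, request_id, amount):
        if request_id in self.seen:
            # Retry of a request already applied: return the stored result.
            return self.seen[request_id]
        self.balance += amount
        self.seen[request_id] = self.balance
        return self.balance
```

The important case is a request that succeeded on the server but whose response was lost: the client retries, and the stored request ID keeps the change from being applied twice.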
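The circuit breaker idea can be sketched in a few lines of Python. This is a simplified illustration, not the client library PagerDuty uses; the class name, the failure threshold and the reset window are all assumed values.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Circuit is open: skip the failing service entirely and
                # hand back minimal results right away.
                return fallback()
            # Reset window has passed: let one trial call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller might wrap a hypothetical recommendations lookup as `breaker.call(fetch_recommendations, lambda: [])`, so the page renders with an empty list instead of waiting on a service that is down.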
Distribute your data to many servers, so that no one server is so important you can’t safely work on it.
At PagerDuty, the team uses multi-master clusters, which help with operations and vertical scaling. They also run distributed databases such as Cassandra across multiple servers: no one server is that special, which means operational work can happen during the day.
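To illustrate why spreading data makes any single server safe to work on, here is a deliberately simplified Python sketch of replica placement. It is not how Cassandra or PagerDuty's clusters actually place data; the function name, the hashing scheme and the replication factor of three are assumptions for illustration.

```python
import hashlib


def replicas_for(key, servers, replication_factor=3):
    # Order servers on a "ring" by a stable hash of their names.
    ring = sorted(servers, key=lambda s: hashlib.sha256(s.encode()).hexdigest())
    # Hash the key to a starting position, then take the next few servers.
    start = int(hashlib.sha256(key.encode()).hexdigest(), 16) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)] for i in range(count)]
```

Because every key lives on several servers, taking one of them out for maintenance in the middle of the day still leaves other copies available to serve reads and writes.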
Put together, these strategies help admins and operational engineers sleep more, worry less and maintain better—all ahead of schedule.