How to Ditch Scheduled Maintenance
You like sleep and weekends. Customers hate losing access to your system due to maintenance. PagerDuty operations engineer Doug Barth has the solution:
Ditch scheduled maintenance altogether.
That sounds like a bold proposition. But as Doug explained at DevOps Days Chicago, it actually makes a lot of sense.
Scheduled maintenance tends to take place late at night on weekends—a tough proposition for operations engineers and admins. Customers require access at all hours, not just daylight ones. And scheduled maintenance implies your system is less reliable than you think, because you’re afraid to change it during the workday.
The solution? Avoid it altogether, and replace it with fast, iterative maintenance strategies that don’t compromise your entire system.
That might sound a bit ‘out there.’ But shelving scheduled maintenance is easier than you think. In his talk, Doug offered four ways to do it.
Deploy in stages
First things first: if you discard scheduled maintenance, your deployments need to be rock-solid. They should be scripted, fast and quick to roll back, and the rollbacks themselves should be tested periodically to ensure they don’t lag.
Each release also needs to be forward and backward compatible by one version, because old and new code run side by side while a rollout is in progress. It’s not an option to stop the presses when you push out a new version. Red-blue-green deployments are crucial here, as they ensure only a third of your infrastructure undergoes changes at any given time.
Lastly, stateless apps must be the norm. You should be able to reboot an app without any effect on the customer (like forced logouts or lost shopping carts).
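As a rough illustration of the staged approach, the sketch below deploys to one third of the hosts at a time and rolls back everything it has touched as soon as a health check fails. The `deploy`, `rollback` and `healthy` callables are hypothetical stand-ins for whatever scripts you already have; none of this is taken from PagerDuty’s tooling.

```python
from typing import Callable, Sequence


def split_into_stages(hosts: Sequence[str], stages: int = 3) -> list[list[str]]:
    """Split hosts into roughly equal groups, one per deployment stage."""
    return [list(hosts[i::stages]) for i in range(stages)]


def staged_deploy(
    hosts: Sequence[str],
    deploy: Callable[[str], None],     # hypothetical: push the new version
    rollback: Callable[[str], None],   # hypothetical: restore the old version
    healthy: Callable[[str], bool],    # hypothetical: post-deploy health check
    stages: int = 3,
) -> bool:
    """Deploy one stage at a time; roll back every touched host on failure."""
    touched: list[str] = []
    for group in split_into_stages(hosts, stages):
        for host in group:
            deploy(host)
            touched.append(host)
        if not all(healthy(host) for host in group):
            # Undo in reverse order so the newest changes are reverted first.
            for host in reversed(touched):
                rollback(host)
            return False
    return True
```

Because the remaining stages are never touched after a failure, at most one stage at a time is exposed to an unverified release.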
Send canaries into the coal mine
Use canary deploys judiciously to test rollouts, judge their integrity and compare results. These test deployments affect only a small segment of your system, so bad code or an unexpected error doesn’t spell disaster for your entire service.
Doug suggested a few practical ways to accomplish this:
- Gate features so you can put out code dark and slowly apply new features to a subset of customers.
- Find ways to slowly bleed traffic over from one system to another, to reduce risk from misconfiguration or cold infrastructure.
- Run critical path code on the side. Execute it and log errors, but don’t depend on it right away.
As Doug summed it up for the DevOps Days crowd: “Avoid knife-edge changes like the plague.”
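Two of the ideas above, gating a feature for a subset of customers and running critical path code on the side, can be sketched in a few lines. This is a minimal sketch under stated assumptions: the function names, the hashing scheme and the mismatch logging are all illustrative and do not come from Doug’s talk.

```python
import hashlib
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
log = logging.getLogger("dark_launch")


def in_rollout(feature: str, customer_id: str, percent: float) -> bool:
    """Deterministically place a customer in or out of a percentage rollout."""
    digest = hashlib.sha256(f"{feature}:{customer_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100  # percent=5.0 selects buckets 0..499


def run_on_the_side(old: Callable[[], T], new: Callable[[], T]) -> T:
    """Return the old path's result; run the new path only to log problems."""
    result = old()
    try:
        candidate = new()
        if candidate != result:
            log.warning("dark path mismatch: old=%r new=%r", result, candidate)
    except Exception:
        # The new code is not trusted yet, so its failures never reach callers.
        log.exception("dark path raised; customers are unaffected")
    return result
```

Hashing the feature name together with the customer ID keeps each customer’s answer stable between requests, and raising `percent` only ever adds customers to the rollout, which makes a slow ramp-up predictable.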
Make retries your new best friend
Your system should be loaded with retries. Build them into all service layer hops, and use exponential backoffs to avoid overwhelming the downstream system. The requests between service layers must be idempotent, Doug emphasized. When they are, you’ll be able to reissue requests to new servers without double-applying changes.
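To make the combination of retries, exponential backoff and idempotency concrete, here is a minimal sketch. The `TransientError` type, the idempotency key and the toy `Ledger` server are assumptions made for illustration, and the random jitter on each delay is a common addition rather than something mentioned in the talk.

```python
import random
import time
import uuid
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Raised by a call that is safe to retry (timeout, connection reset)."""


def call_with_retries(
    request: Callable[[str], T],
    attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `request` with exponential backoff and jitter.

    The same idempotency key is passed on every attempt so the server can
    recognise a repeat and avoid applying the change twice.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    key = str(uuid.uuid4())
    for attempt in range(attempts):
        try:
            return request(key)
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # jitter spreads out retry storms
    raise AssertionError("unreachable")


class Ledger:
    """Toy server that applies each idempotency key at most once."""

    def __init__(self) -> None:
        self.balance = 0
        self._seen: set[str] = set()

    def credit(self, key: str, amount: int) -> int:
        if key not in self._seen:
            self._seen.add(key)
            self.balance += amount
        return self.balance
```

If a response is lost after the server has already applied the change, the retry carries the same key, so `Ledger.credit` returns the current balance without crediting the account a second time.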
Where you don’t care about the response, use queues to decouple the client from the server. If you’re stuck with a request/response flow, use a circuit breaker approach, where your client library returns minimal results if a service is down. That reduces front-end latency and errors.
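A circuit breaker can be sketched roughly as follows. The thresholds, the class name and the half-open behaviour are illustrative assumptions, not a description of PagerDuty’s client libraries, and the sketch is not thread-safe.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Fail fast with a fallback after repeated errors from a dependency."""

    def __init__(
        self,
        max_failures: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, fn: Callable[[], T], fallback: T) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback  # open: skip the call entirely
            # Otherwise half-open: let a single trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            return fallback
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get the fallback immediately instead of waiting on a timeout, which is where the reduction in front-end latency comes from.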
Don’t put all of your eggs in one basket
Distribute your data to many servers, so that no one server is so important you can’t safely work on it.
At PagerDuty, the team uses multi-master clusters, which help with operations and vertical scaling. They also use distributed databases like Cassandra, where no one server is that special, which means operational work can happen during the day.
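The idea that no one server is special can be illustrated with a toy replica placement scheme. This is a deliberately simplified sketch with hypothetical function names; it is not how Cassandra or PagerDuty actually place data.

```python
import hashlib


def replicas_for(key: str, servers: list[str], copies: int = 3) -> list[str]:
    """Pick `copies` distinct servers for a key by walking a fixed ring."""
    if copies > len(servers):
        raise ValueError("need at least as many servers as copies")
    digest = hashlib.sha256(key.encode()).digest()
    start = int.from_bytes(digest[:8], "big") % len(servers)
    return [servers[(start + i) % len(servers)] for i in range(copies)]


def readable_during_maintenance(
    key: str, servers: list[str], down: str, copies: int = 3
) -> bool:
    """True if at least one replica of `key` lives on a server that is still up."""
    return any(server != down for server in replicas_for(key, servers, copies))
```

With more than one copy of every key, any single server can be taken out for maintenance and each key still has a live replica.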
Put together, these strategies help admins and operational engineers sleep more, worry less and maintain better—all ahead of schedule.