This is the fourth in a series of posts on increasing overall availability of your service or system.
Have you ever gotten paged, and known right away that this problem isn’t like the last 15 operations issues you’ve dealt with this week? That this problem is special, and is really, really bad? You know, that kind of problem that you’ve been worrying about deep in your subconscious for weeks now, and that you’ve been hoping would never happen?
Well, what do you do when it happens? Often in these high-pressure situations, you’ll have a very brief period of time (say, minutes) before a problem goes from ‘pretty-bad-but-our-customers-will-forgive-us-and-some-might-not-even-notice’ to simply catastrophic. If you’re a Boy or Girl Scout, you’d just open up the Pressure Release Valve you’ve prepared beforehand and prevent the problem from escalating out of control.
Build Pressure Release Valves
When building or maintaining one of the systems or services that you own, have you ever said to yourself: “You know, if situation X ever happened, as improbable as it is, we’d be in real trouble”? Situation X could be any hypothetical catastrophic disaster scenario for your given system: both master and slave datastores go down simultaneously; all your customers or clients decide to flood you with their theoretical peak loads of traffic at once; your cloud provider of choice suffers a multiple-availability-zone outage; your multicast-based messaging system suffers from a feedback loop; etc.
The problem is, if you work with a given system long enough, there’s a higher-than-you’d-like chance that “Situation X” will actually crop up.
So what can you do? Yes, you could engineer a system to prevent these catastrophic failures altogether. But building something like that can be time- and cost-prohibitive, and can easily lead to over-engineered systems if you go too far. Spending a lot of development time targeting failure scenarios that perhaps have a 5% chance of happening over the course of your lifetime isn’t the best use of your resources.
Instead, create pressure release valves. You can think of these as a sort of lever or knob that you can adjust during failures in order to reduce the severity of your problem while it is being worked on. They can often take the form of a configuration-based boolean or constant that can be easily changed in case of an emergency, but can come in other forms too.
You can use these pressure release valves to easily flip off (or on) some piece of critical functionality or to dial up or down some important value used in your application. I’ll go into some examples below.
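To make the idea concrete, here’s a minimal sketch of what a configuration-backed valve might look like. All names here (the file path, the flag names) are hypothetical: the point is just that flags live somewhere on-call staff can edit by hand, and the application re-reads them on each check so a change takes effect without a deploy.

```python
import json
from pathlib import Path

# Hypothetical valve store: a small JSON file that on-call staff can edit
# during an incident. The path and flag names are illustrative only.
VALVE_FILE = Path("/etc/myservice/valves.json")

DEFAULTS = {
    "serve_expensive_widget": True,  # boolean valve: flip off to shed load
    "max_requests_per_sec": 500,     # numeric valve: dial down under pressure
}

def read_valve(name, valve_file=VALVE_FILE):
    """Return the current value of a valve, falling back to its default.

    Re-reads the file on every call, so an emergency edit takes effect
    immediately; a missing or malformed file just means 'use defaults'.
    """
    try:
        overrides = json.loads(valve_file.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        overrides = {}
    return overrides.get(name, DEFAULTS[name])
```

In a real system you’d likely back this with your existing config service or feature-flag tooling rather than a flat file, but the failure-safe default behavior is the important part: if the valve store itself is broken, the system keeps running with its normal settings.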
To come up with these pressure release valves, get together with your team and brainstorm some (perhaps even semi-outlandish) ways in which your system or service can fail catastrophically.
For each of these failure modes, figure out a way in which the system could be temporarily patched, re-routed, short-circuited, or generally hacked to temporarily reduce the magnitude of the problem. The goal is to bring the system back to a functioning state; you will probably be forced to sacrifice functionality in order to do so. Usually these hacks must be designed by the one or two people most intimately familiar with a given system. Since those people are not always available in an emergency, it’s good to explore these ideas ahead of time.
After you create a list of all the catastrophic failure modes and the corresponding hacks that would be needed to get the system back in a (semi) working state, you can also start figuring out common patterns in the hacks:
- Would adding a throttle on the incoming requests help in a large number of these failure situations?
- Would disabling the computationally-expensive widget X or Y on your website reduce load?
- Would the ability to re-route all incoming requests from datacenter A to B turn a partial outage into just some latency issues?
- Would relaxing your consistency requirements result in a bit of corrupt data but would make your system available again?
- What other functionality can you sacrifice on-demand from your datastore to get it partially functioning again? Durability? Historical data? The ability to do writes (by using a read-only slave)?
- Would flipping off some of your non-critical background workflows free up capacity for your more important ones?
- Would the ritual sacrifice of an intern appease the operations gods?
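The first item above, a throttle on incoming requests, is one of the most broadly useful valves, so here’s a rough sketch of one. This is an illustration, not a prescription: the `ThrottleValve` class and its names are invented for this example. The key design point is that the limit comes from a callable (such as a config lookup like the one sketched earlier), so on-call staff can dial it down mid-incident without redeploying.

```python
import time

class ThrottleValve:
    """Sliding-window request throttle whose limit can be dialed down live.

    `limit_source` is any zero-argument callable returning the current
    requests-per-window cap; wiring it to a config lookup turns the cap
    into a pressure release valve you can turn during an incident.
    """

    def __init__(self, limit_source, window_secs=1.0):
        self.limit_source = limit_source
        self.window_secs = window_secs
        self.stamps = []  # timestamps of requests admitted in the window

    def allow(self):
        """Admit the request if we're under the current cap, else shed it."""
        now = time.monotonic()
        # Drop timestamps that have aged out of the window.
        self.stamps = [t for t in self.stamps if now - t < self.window_secs]
        if len(self.stamps) < self.limit_source():
            self.stamps.append(now)
            return True
        return False
```

Setting the limit to zero becomes a full kill switch for that traffic; raising it back restores normal service, which is exactly the "lever you can adjust" behavior you want from a valve.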
Limping along at only partial functionality is much better than a complete outage, and it also takes pressure off the on-call staff while they get started on their methodical S.O.P. for fixing the root cause of the problem.
As I said earlier, you could try over-engineering a system to prevent these rare exotic catastrophes before they happen, but it often just isn’t worth it. Plus, even then, there would probably still be other even-more-improbable-but-still-possible failure modes that could benefit from these brainstorming discussions. So don’t necessarily waste large amounts of time engineering ways to prevent these obscure problems, but don’t ignore their possibility either. Talk about them!
If anyone has more examples of pressure release valves you keep in your own operations toolkit, I’d be very interested in hearing about them in the comments.
 Ignore this advice if you’re building something like a nuclear reactor. Make that shit work.
 Just kidding. Operations gods don’t get out of bed for anything less than a fulltime newhire college grad.