This is the fourth in a series of posts on increasing overall availability of your service or system.
Have you ever gotten paged, and known right away that this problem isn’t like the last 15 operations issues you’ve dealt with this week? That this problem is special, and is really, really bad? You know, that kind of problem that you’ve been worrying about deep in your subconscious for weeks now, and that you’ve been hoping would never happen?
Well, what do you do when it happens? Often in these high-pressure situations, you’ll have a very brief period of time (say, minutes) before a problem goes from ‘pretty-bad-but-our-customers-will-forgive-us-and-some-might-not-even-notice’ to simply catastrophic. If you’re a Boy or Girl Scout, you’d just open up the Pressure Release Valve you’ve prepared beforehand and prevent the problem from escalating out of control.
When building or maintaining one of the systems or services that you own, have you ever said to yourself: “You know, if situation X ever happened, as improbable as it is, we’d be in real trouble”? Situation X could be any hypothetical catastrophic disaster scenario for your given system: both master and slave datastores go down simultaneously; all your customers or clients decide to flood you with their theoretical peak loads of traffic at once; your cloud provider of choice suffers a multiple-availability-zone outage; your multicast-based messaging system suffers from a feedback loop; etc.
The problem is, if you work with a given system long enough, there’s a higher-than-you’d-like chance that “Situation X” will actually crop up.
So what can you do? Yes, you could try to engineer a system that prevents these catastrophic failures altogether. But building something like this can be time- and cost-prohibitive, and can easily lead to over-engineered systems if you go too far. Spending a lot of development time targeting failure scenarios that perhaps have a 5% chance of happening over the course of your lifetime isn’t the best use of your resources.
Instead, create pressure release valves. You can think of these as a sort of lever or knob that you can adjust during failures in order to reduce the severity of your problem while it is being worked on. They can often take the form of a configuration-based boolean or constant that can be easily changed in case of an emergency, but can come in other forms too.
You can use these pressure release valves to easily flip off (or on) some piece of critical functionality or to dial up or down some important value used in your application. I’ll go into some examples below.
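As a rough illustration of a configuration-based boolean and constant, here is a minimal Python sketch. Everything in it is an assumption made for the example: the JSON file, the valve names `enable_report_generation` and `max_requests_per_client`, and the `handle_request` function are hypothetical and would be replaced by whatever your own system uses.

```python
import json
from pathlib import Path

# Hypothetical defaults: the system runs normally when no override file exists.
DEFAULT_VALVES = {
    "enable_report_generation": True,   # boolean valve: flip off to shed work
    "max_requests_per_client": 1000,    # dial valve: lower to throttle harder
}


def load_valves(path):
    """Return valve settings, falling back to defaults on any read or parse error."""
    valves = dict(DEFAULT_VALVES)
    try:
        overrides = json.loads(Path(path).read_text())
    except (OSError, ValueError):
        # Missing or malformed file: keep normal behavior rather than crash.
        return valves
    if not isinstance(overrides, dict):
        return valves
    for key, value in overrides.items():
        # Accept only known valves whose value has the expected type.
        if key in valves and type(value) is type(valves[key]):
            valves[key] = value
    return valves


def handle_request(client_request_count, valves):
    """Serve a request, degrading functionality according to the valves."""
    if client_request_count > valves["max_requests_per_client"]:
        return "throttled"
    if not valves["enable_report_generation"]:
        return "served without report"
    return "served with report"
```

In this sketch, an on-call engineer would edit one small file during an emergency instead of shipping a code change. A bad edit to that file falls back to the defaults, so the valve itself cannot take the system down.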
To come up with these pressure release valves, get together with your team and brainstorm some (perhaps even semi-outlandish) ways in which your system or service can fail catastrophically.
For each of these failure modes, figure out a way in which the system could be temporarily patched, re-routed, short-circuited, or generally hacked to reduce the magnitude of the problem. The goal is to bring the system back to a functioning state, and you will probably be forced to sacrifice functionality in order to do so. Usually, the one or two people who are most intimately familiar with a given system must design these hacks. Since these people are not always available in an emergency, it’s good to explore these ideas ahead of time.
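One way to keep the results of such a brainstorm usable when the system experts are not around is to record them in a small catalog that on-call staff can search. The sketch below is only an illustration: the failure modes come from the examples earlier in this post, while the hacks, the sacrificed functionality, and the names `ReleaseValve`, `CATALOG`, and `valves_for` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReleaseValve:
    failure_mode: str   # the "Situation X" being planned for
    hack: str           # the temporary patch, re-route, or short-circuit
    sacrificed: str     # functionality given up while the valve is open


# Hypothetical entries; a real catalog would come out of your team's brainstorm.
CATALOG = [
    ReleaseValve(
        failure_mode="all clients send peak traffic at once",
        hack="lower the per-client request limit",
        sacrificed="full throughput for the heaviest clients",
    ),
    ReleaseValve(
        failure_mode="master and slave datastores down simultaneously",
        hack="switch the service to read-only mode backed by a cache",
        sacrificed="writes and fresh data",
    ),
    ReleaseValve(
        failure_mode="multicast messaging feedback loop",
        hack="disable message re-broadcast",
        sacrificed="delivery to indirectly connected nodes",
    ),
]


def valves_for(keyword, catalog=CATALOG):
    """Return catalog entries whose failure mode mentions the keyword."""
    keyword = keyword.lower()
    return [v for v in catalog if keyword in v.failure_mode.lower()]


if __name__ == "__main__":
    for valve in valves_for("datastore"):
        print(f"{valve.failure_mode}: {valve.hack} (lose: {valve.sacrificed})")
```

Writing down what each hack sacrifices alongside the hack itself makes the trade-off explicit before anyone has to decide under pressure.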
After you create a list of all the catastrophic failure modes and the corresponding hacks that would be needed to get the system back in a (semi) working state, you can also start figuring out common patterns in the hacks:
Limping along at only partial functionality is much better than a complete outage, and it also takes pressure off the on-call staff while they get started on their methodical S.O.P. for fixing the root cause of the problem.
As I said earlier, you could try over-engineering a system to prevent these rare exotic catastrophes before they happen, but it often just isn’t worth it. Plus, even then, there would probably still be other even-more-improbable-but-still-possible failure modes that could benefit from these brainstorming discussions. So don’t necessarily waste large amounts of time engineering ways to prevent these obscure problems, but don’t ignore their possibility either. Talk about them!
If anyone has more examples of pressure release valves you keep in your own operations toolkit, I’d be very interested in hearing about them in the comments.
 Ignore this advice if you’re building something like a nuclear reactor. Make that shit work.
 Just kidding. Operations gods don’t get out of bed for anything less than a fulltime newhire college grad.