The best way to build a distributed system is to avoid doing it. The reason is simple — you can bypass the fallacies of distributed computing (most of which, contrary to some optimists, still hold) and work with the fast bits of a computer.
My personal laptop has a nice sticker by SignalFx; it’s a list of speeds of various transport mechanisms. Basically the sticker says to avoid disks and networks, especially when you go between datacenters. If you do that, and employ some mechanical sympathy, you can build awesome stuff like the LMAX Disruptor, which supports a trading platform that can execute millions of transactions per second on a single node. Keep stuff in memory and on a single machine and you can go a long way; if you’re OK with maybe redoing 15 seconds’ worth of work after a failure, then you can do all work in memory and only write out checkpoints to disk four times a minute. Systems like that will run extremely fast and you can sidestep the question of scaling out completely.
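As a rough illustration of the checkpointing idea, here is a minimal Python sketch. The class name `CheckpointedCounter`, the JSON file format, and the injectable `clock` are all assumptions made for this example; none of them come from LMAX or any real system. All work happens on an in-memory dictionary, and a snapshot is written to disk at most once every 15 seconds.

```python
import json
import os
import tempfile
import time


class CheckpointedCounter:
    """Hypothetical worker: state lives in memory, snapshots go to disk periodically."""

    def __init__(self, path, interval=15.0, clock=time.monotonic):
        self.path = path
        self.interval = interval  # seconds between checkpoints (15 s = four per minute)
        self.clock = clock        # injectable so the behaviour can be tested without sleeping
        self.state = self._load()
        self.last_checkpoint = clock()

    def _load(self):
        # After a crash we resume from the last snapshot and redo whatever came later.
        try:
            with open(self.path) as f:
                return json.load(f)
        except FileNotFoundError:
            return {"processed": 0, "total": 0}

    def apply(self, amount):
        # The hot path touches memory only.
        self.state["processed"] += 1
        self.state["total"] += amount
        if self.clock() - self.last_checkpoint >= self.interval:
            self.checkpoint()

    def checkpoint(self):
        # Write to a temporary file, then rename it over the old snapshot,
        # so a crash in the middle of a write never leaves a half-written file.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(self.state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)
        self.last_checkpoint = self.clock()
```

Anything applied after the last snapshot is lost on a crash and has to be redone from the input, which is exactly the trade-off described above.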
Don’t let yourself be fooled — distributed systems always add complexity and remove productivity. If people tell you otherwise, they are probably selling you snake oil.
There’s this requirement called “highly available” which makes it infeasible to put all your code on one node. This requirement often triggers the very expensive step up to having multiple systems involved. There are two things to do here: challenge assumptions, and challenge requirements. Does this particular system really need to have five nines of availability or can we move it to a more relaxed availability tier? Especially if your software still needs to prove itself, going for HA and other bells and whistles may very well be premature optimization. Instead, skip it for now, get to market faster, and have a strategy in place to add it later on. If business stakeholders assert that yes, it needs to be “HA”, explain the trade-offs and make sure that they know they’re about to invest time and money into something they may never have a use for. (The expected outcome should be that customers won’t like the product or feature: if you only build products or features that you know customers are going to like, you’re not taking any risks and your venture will end in a boring cloud of mediocrity.)
Explain the CAP theorem: tell your stakeholders that when the network partitions they can have availability or consistency, but not both (again, some optimists say that this is not the case anymore, but I think that’s wrong). For example, if you build a system that delivers, say, notifications, they can get a system that delivers a notification exactly once most of the time (consistent, but less available) or a system that delivers a notification at least once almost always (available, but less consistent). Usually, eventually consistent (AP) systems need less coordination, so they are simpler to build and easier to scale and operate. Try to see whether you can get away with it, as it’s usually worth the exercise of redefining your problem into an AP solution.
Remember — if you can’t avoid it, at least negotiate it down towards something simple. Not having to implement a complex distributed system is the best way to have a distributed system.
Complexity is the enemy of our trade, so whatever system you’re designing or code you’re writing, you need to play this game of whack-a-mole where complexity pops up and you hammer it right back into the ground. This becomes even more important as soon as you write software that spans more than one system — distributed systems are intrinsically complex, so you should have no patience with accidental complexity. Some things in distributed systems are simpler to implement than others — try to stick with the simple stuff.
There are several ways to increase availability. You can have a cluster of nodes and coordinate everything (save work state all the time so any node can pick up anything), but that requires a lot of coordination. Coordination makes stuff brittle, so maybe you can do without it? There are various options to avoid coordination and still have good availability:
In both cases, coordination moves from “per transaction” to “per configuration”. Distributed work transactions are hard, so if you can get away with configuration-level coordination, do so. Often, this involves replaying some work — an “exactly once” work process becomes “almost always exactly once unless a machine dies and then we replay the last minute to make sure we don’t miss anything.” Modelling operations to be idempotent helps; sometimes, there’s no avoiding duplicate operations becoming visible and you need to chat with the stakeholders about requirements. Get an honest risk assessment (how often per year do machines just die on you?), an honest impact assessment (how much stuff will be done twice and how will this inconvenience users), and an honest difficulty assessment (extra work, more complexity which begets more brittleness which results in less availability).
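To make the “replay the last minute” idea concrete, here is a small Python sketch of an idempotent receiver. The names `Inbox` and `process`, and the use of a notification ID as the deduplication key, are assumptions for illustration only. The worker may resend entries it has already handled; the receiver ignores any ID it has seen before, so the replay stays invisible to users.

```python
class Inbox:
    """Hypothetical receiving side that survives the worker's crash."""

    def __init__(self):
        self.seen = {}       # notification_id -> message
        self.deliveries = 0  # how many notifications actually reached the user

    def deliver(self, notification_id, message):
        # Idempotent: delivering the same ID again changes nothing.
        if notification_id in self.seen:
            return False
        self.seen[notification_id] = message
        self.deliveries += 1
        return True


def process(log, inbox, from_offset):
    # Work through the log starting at the last known-good offset.
    # After a crash, from_offset may point before work that was already done.
    for notification_id, message in log[from_offset:]:
        inbox.deliver(notification_id, message)


log = [(f"n{i}", f"message {i}") for i in range(5)]
inbox = Inbox()
process(log, inbox, 0)  # normal run handles all five entries
process(log, inbox, 2)  # machine died; replay from an older checkpoint
```

After both calls the inbox still records five deliveries, not eight. When the receiver cannot deduplicate (an SMS gateway, say), the duplicates become visible, and that is the case where the conversation with stakeholders is needed.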
Sometimes you need availability even when datacenters fail. Be extra careful in that case, because things will become extra brittle extra quick, and you’ll want to make sure to only require a minimal amount of coordination.
Sometimes you can’t just get all the work done in a single node. First, try not to be in that position. Open up the hood and see where you are wasting cycles. These pesky LMAX people showed you can do 7-figure transactions per second on a single machine, so it might be enough to go to Amazon for a bigger instance. By now, I would expect decent software to be multi-core capable, so you can indeed get a quick fix by getting beefier hardware. Moreover, if you cannot organize your code to run faster with more cores, you probably have no chance to make it faster by adding more nodes, do you? Even without LMAX-level engineering, I think it is reasonable to expect your software to handle at least low 5-digit business operations per second. If you want to scale out because one node can’t handle a couple of hundred of them per second, you may want to go back to the drawing board first. Most likely, you have some issues in your code that need to be addressed.
When you have to add more machines to crack the problem (this is a great problem to have!), plan it so that coordination is minimal.
Architectural patterns like Command/Query separation and Event Sourcing decouple and often duplicate data storage into multiple specialized stages. These specialized stages work well to support distributed designs, as you can choose what to keep local and what to distribute, so you come up with a hybrid solution that minimizes coordination. For example, you can write update commands to a distributed Kafka cluster, but have everything downstream from there operate locally and separately (e.g. consumers process the update commands and independently update Elasticsearch nodes that are used for querying). The “real” data is highly available and coordinated in message streams; systems just use views of that data for specialized processing like search, analytics, and so on. Such a system is much easier to maintain than the classical configuration where a central database system is the nexus of all the operations and inevitably becomes the bottleneck, whether the database system was built for scalability or not.
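The shape of such a design can be sketched in a few lines of Python. This is a toy, in-memory stand-in: `CommandLog` plays the role of the Kafka topic and `SearchView` and `CountView` play the role of downstream consumers such as a search index. All class names and the command format are assumptions for this example and do not use the Kafka or Elasticsearch APIs. Each view tracks its own offset into the log and catches up on its own schedule, without coordinating with the others.

```python
from collections import defaultdict


class CommandLog:
    """Append-only log of update commands; the single coordinated piece."""

    def __init__(self):
        self._entries = []

    def append(self, command):
        self._entries.append(command)
        return len(self._entries) - 1  # offset of the new entry

    def read_from(self, offset):
        return self._entries[offset:]


class SearchView:
    """Local view optimized for term lookup."""

    def __init__(self):
        self.offset = 0
        self.docs = {}
        self.index = defaultdict(set)  # term -> ids of documents containing it

    def catch_up(self, log):
        for cmd in log.read_from(self.offset):
            doc_id = cmd["id"]
            # Drop the terms of any previous version of this document.
            for term in self.docs.pop(doc_id, "").lower().split():
                self.index[term].discard(doc_id)
            if cmd["op"] == "upsert":
                self.docs[doc_id] = cmd["text"]
                for term in cmd["text"].lower().split():
                    self.index[term].add(doc_id)
            self.offset += 1

    def search(self, term):
        return set(self.index.get(term.lower(), set()))


class CountView:
    """Local view optimized for a different question: which documents are live?"""

    def __init__(self):
        self.offset = 0
        self.live = set()

    def catch_up(self, log):
        for cmd in log.read_from(self.offset):
            if cmd["op"] == "upsert":
                self.live.add(cmd["id"])
            else:
                self.live.discard(cmd["id"])
            self.offset += 1
```

A view that falls behind simply serves slightly stale answers until its next `catch_up`; a view that is lost can be rebuilt by replaying the log from offset 0.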
Feel free to store data redundantly and have multiple independent systems each use their own optimized form of the data. It takes less coordination, and that eventually pays for the relatively small increase in storage cost.
Unless you operate at the scale of Google, the system you’re about to take into the realm of distributed systems is not so special that you have to build it from scratch. It’s quite likely that you’re paid to solve business problems, not to build tools and infrastructure, so there’s zero reason to figure stuff out for yourself in 2017. Implementing a distributed system correctly is hard, so you will likely get it wrong (the same advice holds for persistence and cryptography, by the way). If you think you have a unique problem and need to roll your own, you either haven’t looked hard enough or you haven’t tried hard enough to shape your problem in a format that makes using any of the hundreds of open source projects out there a possibility. You’ve been pushing “the business” to help shape the requirements in a form that makes a distributed solution much simpler (and thus more reliable). Now, push yourself to find the correct software out there that will solve the non-unique parts of your problem so you can focus on what makes your company special.
Yes, tool-smithing is fun — I love it and I could do it all day long. And indeed, framing your problem in a form that makes you look like a unique snowflake is good for your self-esteem. Ditch it and go on solving some real problems, the sort that makes your business successful.