Db2Kafka is an open source service developed at PagerDuty. It is a message pump that takes records written to a MySQL table and publishes them to a Kafka topic. While we realize it's not ideal to be using an SQL table as a queue, it has been a useful tool: it lets us keep our Rails code transactional while the rest of our systems become more asynchronous.
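The core pump idea can be sketched in a few lines. This is a minimal in-memory illustration, not Db2Kafka's actual implementation: the row shape and function names are invented for the example, and the real service reads from MySQL and publishes to Kafka.

```python
def pump(table_rows, publish):
    """Drain queued rows in insertion order, publishing each one.

    table_rows: list of (id, payload) tuples, id being an auto-increment key.
    publish: callable that sends a payload to the Kafka topic.
    Returns the ids of rows that were published (and so could be deleted
    from the table afterwards).
    """
    published_ids = []
    for row_id, payload in sorted(table_rows, key=lambda r: r[0]):
        publish(payload)  # if this raised, we would stop to preserve ordering
        published_ids.append(row_id)
    return published_ids

# Usage with stand-in rows and a stand-in producer:
sent = []
rows = [(2, "b"), (1, "a"), (3, "c")]
ids = pump(rows, sent.append)
# sent == ["a", "b", "c"]: payloads go out in id order
```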
We initially developed this service as a temporary workaround, to be used until we had time to restructure our non-asynchronous systems. But as its usage has grown and more features have come to depend on it, we've decided it merits some time spent making it more robust and resilient to failure. This post goes over how we've done this by adding support for automatic failover across datacenters to Db2Kafka.
Db2Kafka has been used in production at PagerDuty for over a year now, moving messages for services that are not part of the notification pipeline. Features such as alert timelines and user activity summaries were powered by Db2Kafka. Due to some upcoming changes to our infrastructure, Db2Kafka will start taking traffic for services in the notification pipeline, moving it into the realm of our “tier 1” critical infrastructure.
However, there is a complication. For performance reasons, Db2Kafka was designed to run as a singleton – only one instance of the service is expected to be running anywhere in our production system. In this new world where Db2Kafka must have strong uptime guarantees, a problem occurring with this single instance would mean downtime for our notification pipeline, which isn’t acceptable!
To solve this problem, we need to allow for the possibility of downtime of a Db2Kafka instance without loss of availability. We'll have to introduce a second instance which will take over when we encounter failures. We have two requirements on how this should work: failover must be speedy, and we must be able to encode a preference for which datacenter does the processing.
When the requirement is a speedy failover, this usually implies a hot standby: a secondary instance of the service running but not actively processing records. This standby is ready to take over immediately if the primary is in trouble.
But how do we know when to fail over? This type of distributed decision making usually calls for tools like Zookeeper or etcd. These are systems that solve a hard problem known as distributed consensus; that is, how do we make sure services have a consistent view of the world while staying reasonably available when faced with network failures? At PagerDuty we've been using Zookeeper for many years with other systems; we understand it, and operationally it's been rock solid. For us it's the natural choice.
Zookeeper provides two useful concepts which we will use to build our failover implementation: ephemeral nodes and watches. An ephemeral node is an entity whose life is tied to the life of your Zookeeper session. A client connects to Zookeeper and writes the ephemeral node /failover. Once the client’s session expires (through a session timeout), Zookeeper ensures that /failover disappears.
A watch notifies you when the state of a node changes. For example, you can set a watch on the /failover node and it will trigger when the node comes into existence or, if it already exists, when the node disappears. It's important to be aware that a watch triggers at most once: once triggered, the client must set the watch again if it wishes to be notified of future changes.
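To make these semantics concrete, here is a toy in-process model of the two primitives. It is not a Zookeeper client — every name here is invented for illustration — but it captures the two behaviours just described: ephemeral nodes vanish when their session expires, and watches fire at most once.

```python
class ToyZk:
    """In-process stand-in illustrating ephemeral nodes and one-shot watches."""

    def __init__(self):
        self.nodes = set()
        self.watches = {}  # path -> callbacks, each fires at most once

    def create_ephemeral(self, path):
        self.nodes.add(path)
        self._fire(path, "created")

    def expire_session(self, path):
        # Zookeeper deletes ephemeral nodes when the owning session expires.
        self.nodes.discard(path)
        self._fire(path, "deleted")

    def exists(self, path, watch=None):
        # Setting a watch also reports the node's current state.
        if watch:
            self.watches.setdefault(path, []).append(watch)
        return path in self.nodes

    def _fire(self, path, event):
        # One-shot semantics: drain the callbacks; clients must re-register.
        for cb in self.watches.pop(path, []):
            cb(event)

events = []
zk = ToyZk()
zk.exists("/failover", watch=events.append)  # watch set; node absent
zk.create_ephemeral("/failover")             # fires "created", consumes watch
zk.exists("/failover", watch=events.append)  # must re-register after trigger
zk.expire_session("/failover")               # fires "deleted"
# events == ["created", "deleted"]
```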
With these Zookeeper primitives we have a way to make decisions and make sure all interested parties have a consistent view of the world. We now need to decide on the failover model for Db2Kafka. After reviewing a few options, we’ve settled on using the Barrier pattern.
In the context of distributed computing, the Barrier pattern refers to blocking a set of nodes until some condition is met. We can specialize this idea in the context of failover of a singleton service: we want to block the secondary whenever the primary is up.
Life as a primary instance is simple. On startup, it writes the ephemeral node /failover into Zookeeper. If the write succeeds, it proceeds to do work; if the write fails, it retries until it succeeds.
The secondary instance has a bit more to keep track of. On startup, it sets a watch on /failover. Setting a watch also tells the secondary the current state of the node. If the node exists at the time the watch is set, the secondary blocks (does not process records) until the watch triggers. If the watch does trigger, the ephemeral node will have disappeared, meaning the primary is in trouble; the secondary instance can then safely start processing records.
If the ephemeral node does not exist, this means that the primary instance is down and the secondary instance should start processing messages. This time, if the watch triggers, the secondary must stop doing work as this means that the primary instance has come back up.
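Putting the two roles together, the decision logic can be sketched as follows. The Zookeeper client is reduced to a create_ephemeral callable and a watch callback — illustrative stand-ins, not a real API — and real code must additionally handle session expiry and re-register the one-shot watch after every trigger.

```python
def primary_startup(create_ephemeral, max_attempts=5):
    """Primary: retry writing the /failover barrier until it succeeds,
    then start processing records."""
    for _ in range(max_attempts):
        if create_ephemeral("/failover"):
            return True   # barrier held; safe to do work
    return False          # could not register; do not process

def secondary_on_watch(barrier_exists):
    """Secondary: called on startup (with the node's current state) and
    each time the one-shot watch on /failover fires.
    Returns True when the secondary should process records."""
    # Barrier present -> primary is up -> block.
    # Barrier absent  -> primary is in trouble -> take over.
    return not barrier_exists

# Usage: a primary whose first two create attempts fail, then succeed.
attempts = iter([False, False, True])
ok = primary_startup(lambda path: next(attempts))
# ok == True after the third attempt

# Secondary: blocked while the barrier exists, takes over when it
# vanishes, and yields again if the primary recreates it.
states = [secondary_on_watch(e) for e in (True, False, True)]
# states == [False, True, False]
```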
Why do we use this barrier pattern rather than a tried and tested distributed lock? While locks are conceptually simpler and don’t require us to run the primary and secondary instances in different modes, they also don’t allow us to encode a preference for a datacenter. This is one of our requirements.
For more details on implementing the barrier pattern with Zookeeper, see the barrier recipe in the Zookeeper documentation.
When building distributed systems, it is important to check how your system will behave in the presence of network partitions. The CAP theorem tells us that a network partition forces us to make a tradeoff between consistency and availability. This system is fairly simple: we have just four cases to consider.
First, the case of no network partitions: it's business as usual, nothing to see here.

Second, only the secondary is partitioned from Zookeeper. The primary instance is able to connect to Zookeeper, create the ephemeral node, and will begin doing work. Since the secondary cannot make a connection to Zookeeper, it should block and not process any messages.

Third, only the primary is partitioned from Zookeeper. Here, the primary instance was not able to create the barrier node, so it did not start processing records. When the secondary instance connects to Zookeeper, it will notice that the ephemeral node does not exist and begin processing records.

Fourth, both instances are partitioned from Zookeeper. In this case, we are effectively down. Neither the primary nor the secondary instance is able to connect to Zookeeper, and so neither may begin processing messages.
If we preferred a more highly available model, we could amend our plan to have the primary Db2Kafka instance process messages regardless of its connection to Zookeeper. This would mean that we could have two instances of Db2Kafka processing messages if only the primary is partitioned.
Db2Kafka currently has a message ordering guarantee that may be violated if more than one Db2Kafka instance is processing messages. If we were to accept this amended plan we would need to also weaken this ordering guarantee. However, this fourth case of a full Zookeeper/Db2Kafka partition is much less likely to occur than the other cases with only a partial partition.
Our Zookeeper cluster spans all of our datacenters and will survive one datacenter going down with no impact on availability. So rather than inflict duplicate, out-of-order messages on downstream services more often, we've opted to accept the chance of downtime for Db2Kafka in the unlikely case of full Zookeeper isolation.
At PagerDuty we have a strong culture of chaos engineering; large features and services are not considered done until they've undergone at least one Failure Friday.
Db2Kafka was put through a few different failure scenarios to validate our assumptions on its behaviour. Part of the testing involved simulating network partitions between Zookeeper and Db2Kafka. In the graph above we see the number of records being processed by Db2Kafka color coded by datacenter.
This service solves an important problem for us at PagerDuty, and with this new addition to Db2Kafka, we're now ready to use it for more critical systems as well. Db2Kafka is open source, so if you think there may be a place for it in your organization (or you are just itching to read some code), check out the repository.
This project was completed during my internship on PagerDuty’s Core team. Thanks to the team for providing detailed code and design reviews, the result is better for it! If you’ve found this work interesting and want to help us out, join us!