ZooKeeper, for those who are unaware, is a well-known open source project which enables highly reliable distributed coordination. It is trusted by many around the world, including PagerDuty. It provides high availability and linearizability through the concept of a leader, which can be dynamically re-elected, and ensures consistency through a majority quorum.
The leader election and failure detection mechanisms are fairly mature, and typically just work… until they don’t. How can this be? Well, after a lengthy investigation, we managed to uncover four different bugs coming together to conspire against us, resulting in random cluster-wide lockups. Two of those bugs laid in ZooKeeper, and the other two were lurking in the Linux kernel. This is our story.
Background: The Use of ZooKeeper at PagerDuty
Here at PagerDuty, we have several disparate services which power our alerting pipeline. As events are received, they traverse these services as a series of tasks which get picked up off of various work queues. Each one of these services leverages a dedicated ZooKeeper cluster to coordinate which application host processes each task. As such, you can imagine that ZooKeeper operations are absolutely critical to the reliability of PagerDuty at large.
Part I: The ZooKeeper Bugs
Too Many Client Sessions
One day last year, an engineer noticed that one of the ZooKeeper clusters in our load test environment was broken. It manifested itself as lock timeouts in the dependent application. We confirmed that the cluster was reachable and listening, but something was off – each client had 10’s of active sessions to their respective ZooKeeper cluster members. Normally they have only two. As a result, we hit a limit on the number of active sessions allowed per node, and a relevant exception was logged.
How could the client be so silly? Maybe there was a bug in the ZooKeeper library we were using at the time. Restarting the entire ZooKeeper cluster fixed the problem, and we were left without any way to replicate it. After some poking around in the library code, we were not able to find a condition which might cause a session pileup. We were dead in the water, and the worst part is we had no idea if it could occur in production or not.
Less than one week later, the condition recurred in our load test environment. This time, it was affecting a different ZooKeeper cluster, and it occurred while we were generating significant load. We realized that we were able to reproduce the problem by inducing synthetic load and just waiting for an hour or two.
We noticed that session counts climbed linearly across all ZooKeeper nodes, so we deduced that even if it is a problem with the client, there’s likely a condition in ZooKeeper that is triggering the behavior. We began digging further into the ZooKeeper logs. After a while, we found something promising in the leader’s log:
java.lang.OutOfMemoryError: Java heap space
Working through the stacktrace, we found this:
scheme=a_.readString("scheme");. Hmph. Well, the ZooKeeper protocol has a four byte
scheme_len field… maybe the client is miscalculating the value. Regardless, the protocol dictates a maximum size for the scheme. Alas – there is no bounds check on that field.
After many, many packet captures, we were able to find a single problem packet. It contained a
0x6edd0b51… or about 1.7GB. The lack of a bounds check resulted in ZooKeeper trying to allocate memory for the bogus length, which causes an OutOfMemory exception to be thrown, killing the thread. Cool. Well, not so cool, but now we’re starting to get somewhere. There are still so many questions, but the most pertinent problem is clear: if the leader is dead, why doesn’t it get re-elected?
It turns out that ZooKeeper was not catching unhandled exceptions from it’s critical threads, meaning that if one died, the ZooKeeper process would continue to run without it. Unfortunately, that means the heartbeat mechanisms would continue to run as well, deceiving the followers into thinking that the leader is healthy. Since
scheme_len is handled by the request pre-processor thread (through which all requests must pass), the entire cluster effectively becomes catatonic – seemingly alive yet largely unresponsive.
We have process monitoring with zk-specific health checks, though due to the nature of the failure, those health checks continued to (annoyingly) pass. So, the thread would die on the leader and block all subsequent operations while evading several failure detection mechanisms at the same time. The end result is that we’ve discovered we can tip over an entire ZooKeeper cluster with a single poison packet. Wat.
Part II: The Kernel Bugs
TCP Payload Corruption
Now we understand the ZooKeeper failures, but a much larger question remains: How in the world are we seeing these sorts of values for
scheme_len?? Decoding the packet, we could see that it was not only
scheme_len that was affected, but an entire 16-byte chunk was seemingly corrupted. Here’s a snippet of the first bad packet we located:
0170 00 00 00 03 00 00 00 b6 00 03 2a 67 00 00 00 01 ..........*g....
0180 00 00 00 7e 2f 47 65 6d 69 6e 69 2f 63 6f 6d 2e ...~/Gemini/com.
0190 70 61 67 65 72 64 75 74 79 2e 77 6f 72 6b 71 75 pagerduty.workqu
01a0 65 75 65 2e 53 69 6d 70 6c 65 51 75 65 75 65 61 eue.SimpleQueuea
01b0 62 6c 65 2f 52 45 44 41 43 54 45 44 5f 52 45 44 ble/REDACTED_RED
01c0 41 43 54 45 44 5f 52 45 44 41 43 54 45 44 5f 52 ACTED_REDACTED_R
01d0 45 44 41 43 2f 5f 63 5f 35 66 62 65 61 34 37 62 EDAC/_c_5fbea47b
01e0 2d 61 65 62 31 2d 34 36 62 35 2d 62 32 32 33 2d -aeb1-46b5-b223-
01f0 38 65 34 65 31 37 38 34 32 31 34 39 2d 6c 6f 63 8e4e17842149-loc
0200 6b 2d 00 00 00 09 31 32 37 2e 30 2e 30 2e 31 00 k-....127.0.0.1.
0210 7c 0e 5b 86 df f3 fc 6e dd 0b 51 dd eb cb a1 a6 |.[....n..Q.....
0220 00 00 00 06 61 6e 79 6f 6e 65 00 00 00 03 ....anyone....
The corruption does not line up with the fields you’d expect in the ZooKeeper protocol, but instead lines up with 16-byte boundaries within the packet itself (beginning at offset 0210). Given that information, it now seems quite unlikely that ZooKeeper has anything to do with the bad
scheme_len value at all.
This capture was taken as the packet was coming off the wire at one of the ZooKeeper nodes. In other words, it’s already corrupted by the time it reached ZooKeeper. This means it had to be that either the client sent the corrupt packet, or a network device corrupted it. Typically, if an intermediary device is responsible for corruption, a set of checksums will be invalidated, and the receiving system will discard it. The TCP payload was clearly reaching ZooKeeper in this case, so the checksums must all be good… but to our great surprise, they were not!
Before going further, it’s important to know that PagerDuty uses IPSec in Transport Mode to secure all of our inter-host traffic. Without being too verbose, it’s basically the same as your run-of-the-mill IPSec-based VPN, minus the VPN part. The IP payload is encrypted, leaving the IP headers behind so the packet can be routed through the network as it normally would be. What you end up with is VPN-like encryption without the need to maintain separate address space. In addition, the overhead is distributed, with every host encrypting and decrypting it’s own traffic.
Our use of this technology is important to understand as up until now, all of the captures we have been examining have gone through decryption for analysis. Since the TCP headers and payload are encrypted within IPSec, but the IP headers are not, it means that we have one checksum on the outside of IPSec, and one on the inside. It also means that successful decryption of the packet proves that it was not corrupted after it was encrypted.
Our examination of the checksums showed that the IP checksum was valid, but the TCP checksum was not. The only way this could be possible is if the corruption occured after the TCP frame was formed, but before the IP headers were generated. Regardless of how/why/where this TCP payload corruption could be occurring, it should absolutely be discarded by the receiver as the TCP checksum is invalid… what gives?
Bug #3 – Obscure Behavior
The only thing that makes sense at this point is to get word from the horse’s mouth. A bit of Linux source spelunking turns up the following lines, which remain in the Linux master branch as of this writing:
* 2) ignore UDP/TCP checksums in case
* of NAT-T in Transport Mode, or
* perform other post-processing fixes
* as per draft-ietf-ipsec-udp-encaps-06,
* section 3.1.2
if (x->props.mode == XFRM_MODE_TRANSPORT)
skb->ip_summed = CHECKSUM_UNNECESSARY;
RAGE!!! It feels so wrong – how can this be right? RFC 3948 tells the tale. It states that while using IPSec in NAT-T Transport mode, the client MAY forgo the validation of the TCP/UDP checksum under the assumption that packet integrity is already protected by ESP. Who would have ever thought that a case could exist under which one does not validate TCP checksums?? The assumption made by the authors is invalid, as there is clearly ample opportunity for corruption prior to ESP/IP formation. While checksumming is a great way to detect in-flight corruption, it can also be used as a tool to detect corruption during the formation of the packet. It is the latter point that was overlooked, and this optimization has come to bite us. Lack of validation here let our mystery corruption through the gate – giving ZooKeeper bad data which it reasonably believed was protected by TCP. We claim this is a bug – intentional or not.
Over the course of our investigation, we found that replicating the problem was strangely difficult, even though we could still manage to do it occasionally. We needed a simple way to detect and analyze this corruption without complicating things with the use of ZooKeeper, wire capture decryption, etc. After a couple tries, we settled on an extremely simple approach: use
netcat to pipe zeros from
/dev/zero over the wire, and send them to
xxd (a command-line hex tool). Any non-zero value being read by
xxd is obvious corruption. This is what some of our corrupted TCP payloads look like using this approach:
-- evan@hostB:~ $ nc -l 8080 | xxd -a
0000000: 0000 0000 0000 0000 0000 0000 0000 0000 ................
189edea0:0000 1e30 e75c a3ef ab8b 8723 781c a4eb ...0......#x...
189edeb0:6527 1e30 e75c a3ef ab8b 8723 781c a4eb e'.0......#x...
189edec0:6527 1e30 e75c a3ef ab8b 8723 781c a4eb e'.0......#x...
189eded0:6527 1e30 e75c a3ef ab8b 8723 781c a4eb e'.0......#x...
189edee0:6527 9d05 f655 6228 1366 5365 a932 2841 e'...Ub(.fSe.2(A
189edef0:2663 0000 0000 0000 0000 0000 0000 0000 &c..............
189edf00:0000 0000 0000 0000 0000 0000 0000 0000 ................
4927d4e0:5762 b190 5b5d db75 cb39 accd 5b73 982b Wb...u.9..[s.+
4927d4f0:5762 b190 5b5d db75 cb39 accd 5b73 982b Wb...u.9..[s.+
4927d500:5762 b190 5b5d db75 cb39 accd 5b73 982b Wb...u.9..[s.+
4927d510:5762 b190 5b5d db75 cb39 accd 5b73 982b Wb...u.9..[s.+
4927d520:01db 332d cf4b 3804 6f9c a5ad b9c8 0932 ..3-.K8.o......2
4927d530:0000 0000 0000 0000 0000 0000 0000 0000 ................
4bb51110:0000 54f8 a1cb 8f0d e916 80a2 0768 3bd3 ..T..........h;.
4bb51120:3794 54f8 a1cb 8f0d e916 80a2 0768 3bd3 7.T..........h;.
4bb51130:3794 54f8 a1cb 8f0d e916 80a2 0768 3bd3 7.T..........h;.
4bb51140:3794 54f8 a1cb 8f0d e916 80a2 0768 3bd3 7.T..........h;.
4bb51150:3794 20a0 1e44 ae70 25b7 7768 7d1d 38b1 7. ..D.p%.wh}.8.
4bb51160:8191 0000 0000 0000 0000 0000 0000 0000 ................
4bb51170:0000 0000 0000 0000 0000 0000 0000 0000 ................
4de3d390:0000 0000 0000 ......
-- evan@hostB:~ $
It looks awful hardware-y, doesn’t it? Ocurring typically at 16-byte boundaries with repetition. We hypothesized that maybe we have some hosts on bad hardware, so we began running this test on different pairs of hosts in our infrastructure in an attempt to suss them out. What we found was interesting.
The first thing that we noticed was the kernel version. No matter how hard we tried, we were unable to replicate the condition under Linux 2.6. Successful replication was dependent on a Linux 3.0+ kernel. While we could see the issue occurring under Linux 3.0+, it was inconsistent. A set of affected hosts running a particular version of 3.0+ would be affected, whereas a different set of hosts with the same version would be unaffected. So, we started looking for other factors.
After a couple weeks of fruitless research and testing, we eventually made a disturbing discovery: reproduction was dependent on the version of Xen that our hosts were running on top of. Xen 4.4 guests were unaffected, but Xen 4.1 and Xen 3.4 were both affected. This detail explains why our test results were inconsistent while using fixed kernel versions. So, we have found that any host running a Linux 3.0+ kernel on top of Xen 4.1 or 3.4 are experiencing sporadic corruption during IPSec encryption. For the first time, we were able to reliably reproduce the problem, as well as predict which hosts would be affected. Now all we have to do is figure out how to fix it!
At this juncture, we have enough information to take our search beyond the PagerDuty organizational boundaries. We dropped a mail on LKML to see if anyone else had run into this problem before. Within a few days, we had a response. Herbert Xu, one of the crypto subsystem maintainers, replied that he’d seen something similar once before. He reported speculation that Xen’s HVM mode may not be affected, and pointed a finger at a particular Intel instruction:
Bug #4 –
The Intel x86 instruction set includes an AES instruction used to perform AES computations in hardware. A kernel module,
aesni-intel, is responsible for leveraging this instruction in the AES encryption facilities provided by the Linux kernel. Since our IPSec encryption uses AES, this module would likely be used for traffic encryption in the presence of
aes-ni on Intel hardware. On the tip from LKML, a quick check reveals we do indeed have the
aesni-intel kernel module loaded on hosts across our fleet. Upon forcing the module to unload, the corruption disappears! Hallelujah!!
Further testing revealed that Herbert was right about HVM – it was not affected. At last, the problem seems to lie in an interaction between
aesni-intel and Xen paravirtual mode.
With such a serious bug lurking in the
aesni-intel module, you might wonder – how has nobody noticed this before now? After all, AES is occasionally used for SSL traffic too. Well, the answer to that lies in Bug #3 – only in IPSec NAT-T Transport mode does the kernel not validate TCP checksums. That means that under any other condition, checksum validation will fail and the packet will be dropped, protecting the application from the corrupted data. That plus the Xen version and virtualization type restrictions make this problem exceedingly rare… an exotic AES unicorn that can be seen only by those who know where it lies. Lucky us :).
Part III: The Workaround
After more than a month of tireless research and testing, we have finally got to the bottom of our ZooKeeper mystery. Corruption during AES encryption in Xen v4.1 or v3.4 paravirtual guests running a Linux 3.0+ kernel, combined with the lack of TCP checksum validation in IPSec Transport mode, which leads to the admission of corrupted TCP data on a ZooKeeper node, resulting in an unhandled exception from which ZooKeeper is unable to recover. Jeez. Talk about a needle in a haystack… Even after all this, we are still unsure where precisely the bug lies. Despite that fact, we’re still pretty satisfied with the outcome of the investigation. Now all we need to do is work around it.
What We Did
Unloading the module has proved to thwart the bug(s) we’ve been encountering, though there’s a performance impact associated with such an action. We measured the impact and it turns out to only be a problem at very high throughputs, the likes of which we have other ways of mitigating. We also know that Xen HVM guests are not affected, and neither is Linux 2.6. Knowing this, we can start to make a plan of attack.
Existing Xen paravirtual guests running Linux 3.0+ were downgraded back to 2.6, but only if they were on an affected version of Xen. We wrote a Chef recipe to do Xen detection, and blacklist the
aesni-intel module when the problem conditions exist. We have also begun standardizing on HVM hosts rather than paravirtual so that we can still leverage the AES hardware acceleration, among other things.
A proper fix still regrettably eludes us. With later versions of Xen being unaffected, and as HVM adoption grows, there is little interest from the community in sinking the time required to isolate the code which is introducing the problem. We face a similar problem with ZooKeeper – version 3.5 has a fix which prevents a critical thread from dying unnoticed, though 3.5 is still alpha and will be for some time. There is talk of backporting the fix, though the official stance is that it is not considered a blocker for further 3.4.x releases.
We would like to thank the members of our various Engineering teams for their help and contributions related to the isolation of these problems. We take reliability very seriously at PagerDuty, and extracting root causes for even the most exotic issues is something we take a great deal of pride and joy in. If you agree, we’d love it if you could join us.