In light of some interesting twitter, blog, and IM discussions over yesterday's post (thanks to Chad Walters, Billy Newport, Ryan Rawson, Bill de hOra, Jeff Darcy, and others), I've been pondering three CAP points. Last post on this topic (for a bit), I promise, before I return to your regularly scheduled RESTful programming.
1. Definitions Matter. Or, are CA and CP the same thing?
Gilbert & Lynch's paper defines the following (I'm not going to use their exact terms, so as to keep this accessible):
- Consistency (big-C) - aka Atomicity, or a total order on operations (as visible to any user); known to database practitioners as serializable isolation. Makes a distributed system look like a single system.
- Eventual Consistency (little-C) - aka Delayed-t Consistency, or a bounded amount of time before a data object is consistent; also implies a minimum latency between writes to a single piece of data (without specialized conflict resolution).
- Weak Availability (little-A) - every request received by a non-failing node must result in a response, but this response may take forever to come (i.e. latency is not part of this definition).
- Strong Availability (big-A) - every request received by a non-failing node must result in a response even when network partitions occur; again, the response may take forever to come (latency is not part of this definition).
- Strong Network Partition Tolerance (big-P) - the network is allowed to lose arbitrarily many messages sent from one node to another (not counting total network outage). When a network is partitioned, all messages sent across the partition are lost.
I'm going to add the following:
- Weak Network Partition Tolerance (little-P) - a system that can tolerate some network partitions, but not all combinations of them.
I add this due to my reading of their definition of Big-P and sections 3.2.1 and 3.2.2: "If there are no partitions, it is clearly possible to provide atomic, available data. In fact the centralized algorithm described in Section 3.2.1 meets these requirements. Systems that run on intranets and LANs are an example of these type of algorithms". Section 3.2.1 defines an algorithm for Consistent & Partition Tolerant systems. Hang on to this one, I'll finish my point in a moment.
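The little-C definition lends itself to a toy sketch. This is entirely my own construction, not anything from the paper: a replica that applies writes only after a fixed propagation delay, so a read inside the delayed-t window can be stale, while a read after it is consistent.

```python
class Replica:
    """A toy replica that applies writes after a fixed propagation delay."""
    def __init__(self, delay):
        self.delay = delay    # delayed-t: the bound on staleness
        self.pending = []     # (apply_at, key, value) not yet visible
        self.data = {}

    def replicate(self, key, value, now):
        # a write arrives, but won't be visible until `now + delay`
        self.pending.append((now + self.delay, key, value))

    def read(self, key, now):
        # apply any writes whose delay window has elapsed
        ready = [(t, k, v) for (t, k, v) in self.pending if t <= now]
        self.pending = [(t, k, v) for (t, k, v) in self.pending if t > now]
        for _, k, v in sorted(ready):
            self.data[k] = v
        return self.data.get(key)

r = Replica(delay=5)
r.replicate("x", 1, now=0)
assert r.read("x", now=2) is None   # inside the window: stale read
assert r.read("x", now=6) == 1      # after delayed-t: consistent
```

The point of the sketch is just that little-C bounds *when* you converge, not what you see in the meantime.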
So, by these definitions, you basically can have (assuming mixed reads & writes):
- AP: eventual consistency, strong availability, strong partition tolerance
- CP: strong consistency, weak availability, strong partition tolerance
- CA: strong consistency, strong availability, weak partition tolerance
Clearly an AP system is different from CP/CA. But the crux is whether CA and CP are the same kind of system. If you consider the idea of weak vs. strong partition tolerance, the difference becomes clear (to me, anyway, but I'm admittedly blonde).
A CA system might be able to handle some partitions, but not all. Simply put, it has a single point of failure, somewhere. The example I like to give is a shared-disk cluster database with a SAN. Partition the SAN, and you're hosed. Partition the nodes, or even the interconnect, and you will still be available, albeit with higher latency. Redundancy is the common way to reduce this risk.
Whereas a CP system is designed to handle all partitions, i.e. it has no single point of failure. But some non-failed nodes may not be able to service clients at all during a partition. This can suck. Anyone who's managed a synchronous log shipping setup or EMC SRDF probably knows about this one.
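A minimal sketch of that CP trade, with invented names and a simple majority rule standing in for a real consensus protocol: during a partition, the minority side refuses requests rather than risk serving inconsistent data.

```python
class CPStore:
    """Toy CP flavor: writes require a majority of reachable nodes.
    During a partition, the minority side refuses to answer rather
    than serve a possibly-inconsistent value."""
    def __init__(self, nodes):
        self.nodes = set(nodes)
        self.reachable = set(nodes)
        self.data = {}

    def partition(self, lost):
        # simulate a network partition cutting us off from `lost`
        self.reachable -= set(lost)

    def write(self, key, value):
        if len(self.reachable) <= len(self.nodes) // 2:
            raise RuntimeError("no quorum: unavailable during partition")
        self.data[key] = value

s = CPStore(["a", "b", "c"])
s.write("k", 1)                 # quorum of 3/3: ok
s.partition(["b", "c"])         # partition isolates this node
try:
    s.write("k", 2)             # only 1/3 reachable: consistency wins,
except RuntimeError:            # availability is sacrificed
    pass
```

No single point of failure, but a non-failed node on the wrong side of the partition simply can't serve you.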
2. Scope matters
The definition of "node" and "network" can vary depending on your granularity and scope.
The above definitions only make sense if the scope of the system is the server-side of a client/server relationship. (i.e. How can non-failed nodes experiencing a partition still receive and respond to a request? Only if the client isn't the one being partitioned.)
A database that's normally a CA system can be considered CP if you zoom out to notice that it's in a multi-node synchronous replication ring. Or AP if it's in an asynchronous multi-node replication ring. But notice that, probabilistically, it behaves like a CA system 99.9% of the time (or whatever availability your MTBF implies).
An AP system on the other hand has one big drawback, one that's not spoken about often. It's about the scope of the partition: is it recoverable or not?
An unrecoverable partition in a CP or CA system means "no data loss", even if you can't get at the system. That's not true in an AP system: an unrecoverable failure of a certain set of nodes during the "delayed consistency" time window loses data. This occurs during catastrophic media failures: a bug in the SAN controller corrupting all the disks, a truck driving through the data center, floods, bombs, hurricanes, etc.
Even with backups, the transactions that fell inside this "replication latency" have to be re-entered, or you have to hope that a CP or CA system somewhere has a copy.
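To illustrate that window, here's a toy sketch (all names are hypothetical): an AP-flavored node acks writes immediately and replicates asynchronously, so an acked write destroyed before replication is simply gone.

```python
class APNode:
    """Toy AP flavor: acks writes immediately, replicates
    asynchronously. Writes destroyed before replication are lost."""
    def __init__(self):
        self.data = {}
        self.unreplicated = []   # acked but not yet shipped

    def write(self, key, value):
        self.data[key] = value
        self.unreplicated.append((key, value))
        return "ack"             # client sees success right away

    def replicate_to(self, peer):
        for k, v in self.unreplicated:
            peer.data[k] = v
        self.unreplicated = []

primary, replica = APNode(), APNode()
primary.write("k", 1)
primary.replicate_to(replica)
primary.write("k", 2)            # acked, but still in the window...
primary = None                   # ...node destroyed before shipping
assert replica.data["k"] == 1    # the acked write of 2 is gone
```

The client was told the second write succeeded; no surviving node ever saw it.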
3. Definitions are too easy to get hung up on.
One of the main reasons I'm talking about this is because I see bolded, underlined claims that can be easily misconstrued, like, "you can't choose consistency and availability together in a distributed system". This gives me The Fear.
The point of the above quote is not to say you can't have some consistency. You just can't have fully serializable consistency and isolation.
But in practice, when you think about it, this claim is actually rather banal. Practitioners have long understood that there is a spectrum of consistency & isolation tradeoffs at scale, even on a single node. Why?
Because the definition of availability arguably needs to include latency as well, something Daniel Abadi brought up. Serializability means locks, in practice, and that means increased latency, and thus reduced availability. This is why we don't like Two-Phase Commit very much these days.
But, for kicks, go back to the ANSI SQL isolation level debate that Jim Gray stirred up. It arose because databases like Oracle (gasp) don't provide serializable consistency, and haven't since the late 1980s! They provide snapshot consistency, a different beast where readers don't block writers, and arguably the primary technical reason for Oracle's success in the marketplace through the 1990s. Amusingly, this is the same argument that Jim Starkey, who invented the idea of multi-version concurrency control, keeps bringing up when discussing CAP and his new baby, Nimbus DB.
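A toy sketch of the snapshot idea, with invented names and a simple version counter standing in for real transaction timestamps: each write creates a new version, and a reader keeps seeing the snapshot it started with, so writers never block it.

```python
class MVCCStore:
    """Toy multi-version store: each write creates a new version;
    a reader sees the state as of the snapshot point it started with."""
    def __init__(self):
        self.versions = {}   # key -> list of (txn_id, value)
        self.next_txn = 1

    def write(self, key, value):
        txn = self.next_txn
        self.next_txn += 1
        # append a new version instead of overwriting in place
        self.versions.setdefault(key, []).append((txn, value))
        return txn

    def snapshot_read(self, key, as_of):
        # newest version committed at or before the snapshot point
        for txn, value in reversed(self.versions.get(key, [])):
            if txn <= as_of:
                return value
        return None

db = MVCCStore()
db.write("row", "old")
snap = db.next_txn - 1        # a reader takes its snapshot here
db.write("row", "new")        # writer proceeds; no lock taken
assert db.snapshot_read("row", as_of=snap) == "old"    # reader unaffected
assert db.snapshot_read("row", as_of=db.next_txn) == "new"
```

No locks, no blocking: the reader's view stays stable while the writer moves on, at the cost of keeping old versions around.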
So the idea that you can't have fully serializable consistency at scale -- even in a single-node database -- is completely uncontroversial. It's when the followup feels like "...and therefore you need to throw out your RDBMS" that the cognitive dissonance and tempests in a tweetpot stir up.