In light of some interesting twitter, blog, and IM discussions over yesterday's post (thanks to Chad Walters, Billy Newport, Ryan Rawson, Bill de hOra, Jeff Darcy, and others), I've been pondering three CAP points. Last post on this topic (for a bit), I promise, before I return to your regularly scheduled RESTful programming.
1. Definitions Matter. Or, are CA/CP are the same thing?
Gilbert & Lynch's paper defines the following (I'm not going to use their exact terms, so as to keep this accessible)
- Consistency (big-C)- aka Atomicity, or a total order on operations (as visible to any user); aka. by database practitioners as serializable isolation. Makes distributed systems look like a single system.
- Eventual Consistency (little-C)- aka Delayed-t Consistency, or a bounded amount of time before a data object is consistent, also implies a minimum latency between writes on a single piece of data (without specialized conflict resolution)
- Weak Availability (little-A)- every request received from a non-failing node must result in a response, but this response may take forever to come (i.e. latency is not a part of this definition)
- Strong Availability (big-A)- every request received from a non-failing node must result in a response even when network partitions occur, but this response may take forever to come (i.e. latency is not a part of this definition)
- Strong Network Partition Tolerance (big-P)- the network will be allowed to lose arbitrarily many messages sent from one node to another (not counting total network outage). When a network is partitioned, all messages sent across the partition are lost.
I'm going to add the following:
- Weak Network Partition Tolerance (little-P)- a system that can tolerate some network partitions, but not all combinations of them.
I add this due to my reading of their definition of Big-P and sections 3.2.1 and 3.2.2: "If there are no partitions, it is clearly possible to provide atomic, available data. In fact the centralized algorithm described in Section 3.2.1 meets these requirements. Systems that run on intranets and LANs are an example of these type of algorithms". Section 3.2.1 defines an algorithm for Consistent & Partition Tolerant systems. Hang on to this one, I'll finish my point in a moment.
So, by these definitions, you basically can have (assuming mixed reads & writes):
- AP: eventual consistency, strong availability, strong partition tolerance
- CP: strong consistency, weak availability, strong partition tolerance
- CA: strong consistency, strong availability, weak partition tolerance
Clearly an AP system is different from CP/CA. But the crux is whether CA and CP are the same kind of system. If you consider the idea of weak vs. strong partition tolerance, the difference becomes clear (to me, anyway, but I'm admittedly blonde).
A CA system might be able to handle some partitions, but not all. Simply put, it has a single point of failure, somewhere. The example I like to give is a shared-disk cluster database with a SAN. Partition the SAN, and you're hosed. Partition the nodes, or even the interconnect, and you will still be available, albeit with higher latency. Redundancy is the common way to reduce this risk.
Whereas a CP system is designed to handle all partitions, i.e. it has no single point of failure. But some non-failed nodes may not be able to service clients at all during a partition. This can suck. Anyone who's managed a synchronous log shipping setup or EMC SRDF probably knows about this one.
2. Scope matters
The definition of "node" and "network" can vary depending on your granularity and scope.
The above definitions only make sense if the scope of the system is the server-side of a client/server relationship. (i.e. How can non-failed nodes experiencing a partition still receive and respond to a request? Only if the client isn't the one being partitioned.)
A database that's normally a CA system can be considered CP if you zoom out to notice that it's in a multi-node synchronous replication ring. Or AP if it's in an asynchronous multi-node replication ring. But notice that, probabilistically, it's behaving like a CA system for 99.9% of the time (or however long your MTBF is).
An AP system on the other hand has one big drawback, one that's not spoken about often. It's about the scope of the partition: is it recoverable or not?
An unrecoverable partition in a CP or CA system is "no data loss", even if you can't get at the system. That's not true in an AP system, if there's an unrecoverable error in a certain set of nodes during the "delayed consistency" time window. This occurs during catastrophic media failures, like a bug in the SAN controller corrupting all the disks, trucks driving through a data center, or floods, or bombs, hurricanes, etc.
Even with backups, during this "replication latency", you have to re-enter those transactions, or hope that a CP or CA system somewhere has a copy.
3. Definitions are too easy to get hung up on.
One of the main reasons I'm talking about this is because I see bolded, underlined claims that can be easily misconstrued, like, "you can't choose consistency and availability together in a distributed system". This gives me The Fear.
This point of the above quote, is not to say you can't have some consistency. You just can't have fully serializable consistency and isolation.
But in practice, when you think about it, this claim is actually rather banal. Practitioners have long understood that there are a spectrum of consistency & isolation tradeoffs at scale, even at a single node. Why?
Because the definition of availability arguably needs to also include latency, which is something Daniel Abadi brought up. Serializability means locks, in practice, and that means increased latency, and thus reduced availability. This is why we don't like Two Phased-Commit very much these days.
But, for kicks, go back to the ANSI SQL isolation level debate that Jim Gray stirred up. It was because databases like Oracle (gasp) don't provide serializable consistency, and haven't since the late 1980s! They provide snapshot consistency, which is a different beast where readers don't block writers, and arguably was the primary technical reason for Oracle's success in the market place through the 1990s. Amusingly, this is the same argument that Jim Starkey keeps bringing up when discussing CAP, having invented the idea of multi-version concurrency control, when he's talking about his new baby, Nimbus DB.
So the idea that you can't have fully serializable consistency at scale -- even in a single node database -- is completely uncontroversial. It's when the followup feels like "...and therefore you need to throw out your RDBMS", that cognitive dissonance and tempests in a tweetpot stirs up.

It depends on what 'serializable' means... for example take HBase row locks, they operate within a single JVM and use the Java synchronization primitives. These are probably some of the cheapest synchronization locks you can have, apart from having none. So having full serialization on a single row is not expensive, apart from making clients wait. Replication operates at a different level, so synchronizing across datacenters doesnt share row locks and moving to that would be expensive indeed.
Another thing to consider is all systems are 'partition tolerant'. If you dig into the formal definition of what a 'partition' is in the CAP proof, it turns out that even a node crash, node network loss, process crash, etc, are all examples of 'partitions'. This is really different than what other people think of as 'partition tolerant'. HBase is a proud CP system, which means in the (rare) case that machines go down, we suffer data availability but you will never see inconsistent data. I think for many many applications this is the right tradeoff to take. You will _always_ need application code to deal with data store unavailability even if you are a "AP" system, but dealing with inconsistent data is much harder and very application dependent.
So for example, you can use write buffers to handle data store unavailability, you can throw errors and not handle user requests, you can maintain multiple clusters and datacenters and shunt requests that fail to clusters that arent failing.
But there are multiple types of 'inconsistency'... Dynamo shows one model where you have a DHT to achieve reads when machines are down. But there is also the standard replica approach, which I think could be buildable on top of HBase. In this case you would store data at multiple Regionserver nodes and stream edits around, and when a node goes down a client could move to the secondary server. This requires a lot of work since it violates a few of the major operating principles, but it wouldn't be impossible.
Thanks for your response, Ryan.
Gilbert & Lynch's definition of 'partition tolerant' is what I listed above, which is almost verbatim. i.e. "the network will be allowed to lose arbitrarily many messages sent from one node to another (not counting total network outage). When a network is partitioned, all messages sent across the partition are lost."
So, yes, that covers crashes as well as network losses.
My point was that, indeed, all systems are partition tolerant, thus my point of little-P vs. big-P. One way to interpret this is that a CA distributed system is impossible, which is what coda hale claims. I claim that this is absurd. It's absolutely possible, because most CA systems do tolerate at least some (even most) partitions. The problem with CA systems is that they can't tolerate *all* combinations of partitions because they need to have at least one centralized element to get strong consistency.
Thanks for continuing the discussion.
Note that you are using "big-A, little-a" in a different way than I was using it in our Twitter conversation. When I said "little-a availability", I meant the layman's concept of availability, in contract with "big-A Availability", the CAP Theorem's formal definition of Availability.
I still think there is a fundamental confusion here about partition tolerance. What do you mean by partition tolerance as a stand-alone property? I think it is actually meaningless without qualifying it as either consistency that tolerates partitioning or availability that tolerates partitioning. Otherwise, what is it that is being tolerated?
You seem to be equating "CA" with "has a single point of failure" and "CP" with "has no single point of failure". While I agree that this relates to the intuitive notion of availability (and hence why I prefer systems with no single point of failure), I don't see this as related to big-A Availability in the formal sense in any way.
Something with a single point of failure has the possibility of being knocked entirely offline and hence 100% unavailable by one failure (of the appropriate kind). Something with no single point of failure does not have the possibility of being knocked offline and hence 100% unavailable by any single failure - but it can still be knocked offline by multiple failures (of the appropriate kind). But the CAP Theorem is not about being knocked 100% offline or about single points of failure.
What the theorem actually says is that any system that is big-C Consistent cannot maintain big-A Availability in the face of big-P Partitions. Full stop. It doesn't matter if the Consistent system has a single point of failure or not - in either case, it cannot maintain the property of having all requests to non-failing nodes succeed in the face of least-convenient arbitrary message loss between nodes.
You may regard this as overly pedantic but my point is to clear up what I regard as some of the fundamental confusions around the CAP Theorem, with the express purpose of setting CAP aside as not all that important for the purposes of real-world distributed systems engineering.
As I see it, the main confusions are these:
1. Thinking that the CAP Theorem is "C, A, P - choose 2"
2. Thinking that C, A, and P as defined in the CAP Theorem map to directly to intutive definitions of consistency, availability, and partitioning
3. Drawing the conclusion therefore that using strongly consistent subsystems within a larger service is unadvisable if you are interested in having high availability
One of the things that is most difficult about clearing up these confusions is that Brewer, Gilbert, and Lynch themselves actually demonstrate at least both confusions 1 and 2 in their writings on the subject. Note that I have the highest regard for Eric Brewer - but that doesn't exempt him from the all-too human ability to occasionally mistate things especially when not working with formalized notions (God knows I do it oh so often).
Here I am very much in agreement with Daniel Abadi's comments in his blog post on CAP: - we both say:
1. CAP is "C vs A in the face of P"
2. The CA/CP distinction doesn't make any sense
3. C vs A is not the most interesting of tradeoffs
I haven't given enough thought to Abadi's PACELC concept to say whether I think it is definitive -- I think it is probably still missing some factors of import - but it is closer in spirit to what I think is important to real-world distributed systems than CAP.
Thanks for your response, Chad.
Firstly, yes, I realize I'm using big-A vs. little-A differently than our conversation.
Secondly,
It absolutely is a qualifier on availability or consistency.
I'm not equating the two, I'm pointing out the practical difference in the implementation between these two systems, which is what leads to different behaviour in the face of partitions.
A CP system can tolerate any combination of partition other than total network outage. What do you call a system that can't?
A CA system is generally either (a) not a distributed system - more like a traditional database, and thus not worrying about partitions or (b) a type of distributed system with a single point of failure that leads to total outage, like a shared-disk database cluster that runs on a LAN.
This fits with the formal definition of "A". "A" implies that requests to non-failing nodes result in a response (at some point). A "CA" system can do that except when it experiences a specific kind of partition that leads to a full failure of all nodes.
Again, I bring up Oracle RAC losing its SAN as an example. That is the equivalent to the failure of all nodes / the full network. The formal definition of "A" does not cover such catastrophic failures. Contrast this to a CP system, where some nodes, even though they're in perfectly fine shape, cannot respond to requests because they're in a minority partition. That's the distinction.
The "full stop" part is misleading. The theorem also includes, as part of its definitions and proof process, an example of services running on a LAN as an example of an Atomic, Available (i.e. CA) system. Brewer's papers and presentations also include cluster databases as an example of a CA system.
The current argument, made by Coda Hale's essay, is that this is a mistake, and a CA system by definition can't be a distributed system. I disagree with that (as does Jeff Darcy). Jeff and I not too far apart either, in my opinion, on what to do about it: he believes CA systems are highly undesirable in favour of CP or AP systems, whereas I believe that they're popular and useful in certain contexts, but should likely either have heavy redundancy on the risk areas or be surrounded by a larger CP or AP system.
There are few large-scale websites in existence today that *don't* use a strongly consistent subsystem somewhere within their larger service. It would be insanity to avoid this for some use cases. CA or CP systems are just loosely coupled and surrounded by a bigger AP system. Amazon, for example, is a massive Oracle user, both in the single-node and RAC variants, for live services on the website. Google AdWords for a long time was built on MySQL (and Oracle for a short period).
I go back to the original justification for the CAP conjecture. It was about exploring the tradeoffs to scale given state management in distributed systems. The call was to bridge the worlds of database & distributed system research -- distributed computing was overly-focused on computation instead of data, and database work was overly-focused on single node scale-up instead of multi-node scale-out. CA systems were placeholders from traditional database systems, as well as their LAN-clustered brethren that behave a lot more like a single-node database system than a classic distributed system.
But If we equate CA = CP, or just think about "P", we are basically back to ignoring DBMS completely, declaring "distributed systems > databases", and rejecting that the whole space is useful when analyzing the tradeoffs of how to maintain state at scale.
I think we may be talking past eachother a bit at this point but I'm still willing to try this out a bit longer.
I don't think you have pointed out a practical difference between "CA" and "CP" systems. You have pointed out the tautological difference between systems with a single point of failure and those without a single point of failure. You can label the first set of systems as "CA" and the second as "CP" if you like but I don't think this distinction is meaningful in the context of the CAP theorem and so I think you are applying these labels in an inappropriate and misleading manner.
I still don't think you have answered the question I have asked a couple of times: "What does it mean to be Partition tolerant?" You do say "A CP system can tolerate any combination of partition other than total network outage" but what do you mean by "tolerate" here? Remain Consistent? Remain Available? Or something else?
If you mean "remain Consistent" - which is the only logical meaning in the context as far as I can tell - then I say that you can call out two other classes of systems -- one which we can call "AP" (those that remain Available while giving up Consistency) and the other which I will call "mu" (which are neither C nor A in the face of P). Note that by this definition, RDBMS generally fall into the "CP" category -- its Consistency tolerates the Partitioning even though it loses Availability (and its availability drops to zero).
Your example of Oracle RAC losing its SAN does not support your point. The RAC nodes have not failed -- they just can't fulfill requests because they cannot reach the SAN. This is exactly the loss of A in the face of P. The whole system happens to be 100% unavailable as well, but that is not relevant to CAP.
The examples that Gilbert and Lynch provide of "CA" and "CP" systems are actually not part of the proof in any way. As I said above, they are actually confused in these informal classifications in a way that has no bearing on the formal proof, as is Brewer in his slides. And, as I said above, this is a big part of the problem in the discussion of CAP - unintentionally slipping back and forth between the formal and non-formal definitions to suit the argument.
In his post, Coda Hale is not in fact saying what you claim he is saying. You claim he is saying that "CA" systems are not distributed systems. But Coda is in fact saying exactly what I am saying - that, excluding (non-distributed) single node systems so-called "CA" systems, just like so-called "CP" systems, are in fact not both Consistent and Available in the face of Partitions. Why is he sayn this? Because that is exactly what the CAP Theorem says.
As far as I can tell, nobody is saying that distributed DBMS are not distributed systems, at least not in any of the posts I have read. Many people may claim that anything with a single point of failure is not a *good* distributed system but that is a whole different argument.
I think we're more or less done on this particular argument, as I don't think we're making progress. To summarize my current views with regards to your response:
1. I've tried my best to answer "what does it mean to be partition tolerant" in terms of availability or consistency, several times, quite explicitly, over the past two blog posts. If you don't see the point by now, then either I am dreadfully wrong, or I am not capable of explaining it to you.
2. The complete outage scenario is precisely the consequence of a CA system. It's not a loss of "A" in the sense of the Gilbert & Lynch paper. Unfortunately, we're going in circles on this one, and it's a weakness I guess in the way CA systems are described in the Gilbert & Lynch paper.
My point all along is that thinking of an clustered database system as a CP system really misses a fundamental difference between a quorum consensus system (CP) and a shared disk cluster (which is much more CA-like).
3. I do not believe Gilbert, Lynch or Brewer are confused when they provide those CA examples, and think they're fundamental to understanding the CAP tradeoffs when contrasting distributed state management from traditional DBMS'.
4. Regarding coda hale's assertions, apparently we've read different essays. "Partition Tolerance is mandatory in distributed systems. You cannot not choose it." is equivalent to saying "a CA system is not a distributed system". Read his tweets on the matter, as well.
5. Oracle RAC is not a distributed DBMS, it's a shared-disk cluster DBMS, they're quite different constructs (in my opinion, CP vs. CA).
You seem to feel that Coda's position was an attack on RDBMS systems. It is not - in fact, it can be interpreted as a defense of RDBMS (and other Consistent systems) from the claims of the "AP" crowd, who are applying the CAP Theorem erroneously to promote their systems.
He does think, as I do, that "CA" is a misclassification. He proposes instead to focus on the notions of Harvest and Yield, which to my mind at least lie closer to the practical concerns.
WRT #4 above: I think that one line is a bit of misstatement by Coda -- in fact, I think that statement by itself is entirely wrong but not for the reason you are concerned about. Daniel Abadi's post had an example of a functional system that sacrificed both C and A.
Coda's statement should not be interpreted as saying that "CA" systems aren't real distributed systems. What he is saying is that "CA" systems are Partition tolerant, in that they are Consistent in the face of Partitions (as opposed to Available in the face of Partitions) -- this is a property they share with "CP" systems. All the so-called "CA" systems and all the so-called "CP" systems are equivalent from the perspective of CAP - they privilege Consistency over Availability in the face of Partitioning.
I wouldn't focus on that one statement overly much - the rest of the essay makes his position clear, and it is pretty close to mine and to Daniel Abadi's. I also looked through his tweets going back at least to the beginning of October. I could only find one tweet which I thought was relevant to your concern, which he takes issue with what he sees as the misclassification of a particular system, not an attack on "CA" systems in general.
I think I see where you are coming from and where we are missing each other in this conversation but I can't seem to get them across to you somehow. Nevertheless, I have gotten value out of our conversation and I hope that you have as well.
I see where you're coming from as well, I think it really comes down to whether CA is just a variant of a CP system, or if it is distinguishable as its own category, which is depends on how you read the various papers.
And yes, this was a valuable conversation.
another thing to think of is different classes of 'partitions' are easier/harder to handle. For example single node loss is fairly trivial to detect and route around. When your cluster goes from N interconnected nodes to N/2 and N/2 bisected nodes, aka "split brain syndrome" things get a lot harder.
I'm not sure there are any distributed data stores that could take a split brain mode and not suffer major degradation of some sort. And in terms of serving your queries, usually performance degradation is worse than systems outage. You end up cascading and watching a big pile of fail as your apache workers pile up and your load balancers start going wacky, etc, etc.
So this notion that you can handle, ride over and recover from any arbitrary network split is not so realistic I think. Unless you are over provisioning yourself at least 2x then you will always see performance hits in a split brain. Never mind tri-brain, quad brain and other higher orders of magnitude brains.