Put on your Thinking CAP


In light of some interesting Twitter, blog, and IM discussions over yesterday's post (thanks to Chad Walters, Billy Newport, Ryan Rawson, Bill de hOra, Jeff Darcy, and others), I've been pondering three CAP points. Last post on this topic (for a bit), I promise, before I return to your regularly scheduled RESTful programming.

1. Definitions Matter. Or, are CA and CP the same thing?

Gilbert & Lynch's paper defines the following (I'm not going to use their exact terms, so as to keep this accessible):

  • Consistency (big-C) - aka Atomicity, or a total order on operations (as visible to any user); known to database practitioners as serializable isolation. Makes a distributed system look like a single system.
  • Eventual Consistency (little-C)- aka Delayed-t Consistency, or a bounded amount of time before a data object is consistent, also implies a minimum latency between writes on a single piece of data (without specialized conflict resolution)
  • Weak Availability (little-A) - every request received by a non-failing node must result in a response, but this response may take forever to arrive (i.e. latency is not a part of this definition)
  • Strong Availability (big-A) - every request received by a non-failing node must result in a response even when network partitions occur, though the response may still take forever to arrive
  • Strong Network Partition Tolerance (big-P)- the network will be allowed to lose arbitrarily many messages sent from one node to another (not counting total network outage). When a network is partitioned, all messages sent across the partition are lost.


I'm going to add the following:

  • Weak Network Partition Tolerance (little-P)- a system that can tolerate some network partitions, but not all combinations of them.

I add this due to my reading of their definition of Big-P and sections 3.2.1 and 3.2.2: "If there are no partitions, it is clearly possible to provide atomic, available data. In fact the centralized algorithm described in Section 3.2.1 meets these requirements. Systems that run on intranets and LANs are an example of these type of algorithms". Section 3.2.1 defines an algorithm for Consistent & Partition Tolerant systems. Hang on to this one, I'll finish my point in a moment.

So, by these definitions, you basically can have (assuming mixed reads & writes):
- AP: eventual consistency, strong availability, strong partition tolerance
- CP: strong consistency, weak availability, strong partition tolerance
- CA: strong consistency, strong availability, weak partition tolerance

Clearly an AP system is different from CP/CA. But the crux is whether CA and CP are the same kind of system. If you consider the idea of weak vs. strong partition tolerance, the difference becomes clear (to me, anyway, but I'm admittedly blonde).

A CA system might be able to handle some partitions, but not all. Simply put, it has a single point of failure, somewhere. The example I like to give is a shared-disk cluster database with a SAN. Partition the SAN, and you're hosed. Partition the nodes, or even the interconnect, and you will still be available, albeit with higher latency. Redundancy is the common way to reduce this risk.

Whereas a CP system is designed to handle all partitions, i.e. it has no single point of failure. But some non-failed nodes may not be able to service clients at all during a partition. This can suck. Anyone who's managed a synchronous log shipping setup or EMC SRDF probably knows about this one.
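To make the contrast concrete, here's a toy sketch (the class names and behavior are my own invention, not any particular product's semantics) of how a two-replica pair might respond to a partition under CP rules versus AP rules:

```python
# Toy sketch: a CP pair refuses writes during a partition rather than risk
# divergence; an AP pair accepts them and lets replicas disagree for a while.

class Replica:
    def __init__(self):
        self.value = None

class CPPair:
    """Consistent under partition: refuses writes it can't replicate."""
    def __init__(self):
        self.a, self.b = Replica(), Replica()
        self.partitioned = False

    def write(self, value):
        if self.partitioned:
            raise RuntimeError("unavailable: cannot reach peer")  # gives up A
        self.a.value = self.b.value = value  # synchronous: both or neither

class APPair:
    """Available under partition: accepts writes, replicas may diverge."""
    def __init__(self):
        self.a, self.b = Replica(), Replica()
        self.partitioned = False

    def write(self, value, side="a"):
        target = self.a if side == "a" else self.b
        target.value = value
        if not self.partitioned:  # replicate only when the link is up
            (self.b if side == "a" else self.a).value = value

cp, ap = CPPair(), APPair()
cp.partitioned = ap.partitioned = True
ap.write("x", side="a")   # succeeds; the two AP replicas now disagree
try:
    cp.write("x")
except RuntimeError:
    pass                  # the CP pair stays consistent by refusing the write
```

The AP pair's replicas will reconcile eventually (the "delayed consistency" window above); the CP pair's non-failed node simply can't serve the write, which is exactly the weak-availability behavior described.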


2. Scope matters

The definition of "node" and "network" can vary depending on your granularity and scope.

The above definitions only make sense if the scope of the system is the server-side of a client/server relationship. (i.e. How can non-failed nodes experiencing a partition still receive and respond to a request? Only if the client isn't the one being partitioned.)

A database that's normally a CA system can be considered CP if you zoom out to notice that it's in a multi-node synchronous replication ring, or AP if the ring is asynchronous. But notice that, probabilistically, it behaves like a CA system 99.9% of the time (or whatever your MTBF works out to).

An AP system on the other hand has one big drawback, one that's not spoken about often. It's about the scope of the partition: is it recoverable or not?

An unrecoverable partition in a CP or CA system means "no data loss", even if you can't get at the system. That's not true in an AP system: if there's an unrecoverable failure in some set of nodes during the "delayed consistency" time window, those unreplicated writes are gone. This occurs during catastrophic media failures, like a bug in the SAN controller corrupting all the disks, trucks driving through a data center, floods, bombs, hurricanes, etc.

Even with backups, you have to re-enter the transactions caught in this "replication latency" window, or hope that a CP or CA system somewhere has a copy.
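Here's a tiny sketch of that data-loss window (the names are mine, not any real replication product's): writes acknowledged by an asynchronously-replicated primary sit in a window where only the primary holds them.

```python
# Toy sketch: an async primary acks writes before shipping them. Lose the
# primary unrecoverably during that window, and the unshipped writes are gone.

class AsyncPrimary:
    def __init__(self):
        self.log = []        # acknowledged writes
        self.replicated = 0  # how many have reached the replica so far

    def write(self, txn):
        self.log.append(txn)  # the client gets its ack here...

    def replicate_one(self, replica):
        replica.append(self.log[self.replicated])  # ...the data ships later
        self.replicated += 1

replica = []
primary = AsyncPrimary()
primary.write("txn-1")
primary.replicate_one(replica)
primary.write("txn-2")   # acknowledged, but not yet shipped
del primary              # catastrophic, unrecoverable failure of the primary
# replica == ["txn-1"]; "txn-2" exists nowhere and must be re-entered
```

A synchronous (CP or CA) setup would have refused to ack "txn-2" until the replica had it, which is precisely the latency-for-durability trade discussed above.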

3. Definitions are too easy to get hung up on.

One of the main reasons I'm talking about this is because I see bolded, underlined claims that can be easily misconstrued, like, "you can't choose consistency and availability together in a distributed system". This gives me The Fear.

The point of the above quote is not to say you can't have some consistency. You just can't have fully serializable consistency and isolation.

But in practice, when you think about it, this claim is actually rather banal. Practitioners have long understood that there is a spectrum of consistency & isolation tradeoffs at scale, even on a single node. Why?

Because the definition of availability arguably needs to include latency as well, something Daniel Abadi brought up. Serializability means locks in practice, and that means increased latency, and thus reduced availability. This is why we don't like Two-Phase Commit very much these days.

But, for kicks, go back to the ANSI SQL isolation level debate that Jim Gray stirred up. It arose because databases like Oracle (gasp) don't provide serializable consistency, and haven't since the late 1980s! They provide snapshot consistency, a different beast where readers don't block writers, and arguably the primary technical reason for Oracle's success in the marketplace through the 1990s. Amusingly, this is the same argument that Jim Starkey, who invented the idea of multi-version concurrency control, keeps bringing up when discussing CAP and his new baby, NimbusDB.
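To illustrate the "readers don't block writers" idea, here's a minimal multi-versioning sketch (my own toy, not Oracle's actual implementation): a reader pins a snapshot timestamp, and a concurrent writer simply appends a new version rather than taking a lock the reader would wait on.

```python
# Minimal MVCC sketch: writers append (timestamp, value) versions; a reader
# frozen at a snapshot timestamp keeps seeing its version, lock-free.

class MVCCStore:
    def __init__(self):
        self.versions = {}   # key -> list of (commit_ts, value), ts ascending
        self.clock = 0

    def write(self, key, value):
        self.clock += 1
        self.versions.setdefault(key, []).append((self.clock, value))

    def snapshot(self):
        return self.clock    # a reader's frozen view of the world

    def read(self, key, snap_ts):
        # newest version committed at or before the snapshot
        for ts, value in reversed(self.versions.get(key, [])):
            if ts <= snap_ts:
                return value
        return None

store = MVCCStore()
store.write("x", "old")
snap = store.snapshot()
store.write("x", "new")    # the writer proceeds; the old reader is unaffected
assert store.read("x", snap) == "old"             # reader's snapshot view
assert store.read("x", store.snapshot()) == "new" # a fresh reader sees the new value
```

This is weaker than full serializability (write skew is the classic anomaly snapshot isolation permits), which is exactly the kind of pragmatic tradeoff the paragraph above is describing.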

So the idea that you can't have fully serializable consistency at scale -- even in a single-node database -- is completely uncontroversial. It's when the followup feels like "...and therefore you need to throw out your RDBMS" that the cognitive dissonance and tempests in a tweetpot stir up.


9 Comments

It depends on what 'serializable' means. Take HBase row locks, for example: they operate within a single JVM and use the Java synchronization primitives. These are probably some of the cheapest synchronization locks you can have, apart from having none, so full serialization on a single row is not expensive, apart from making clients wait. Replication operates at a different level, so synchronizing across datacenters doesn't share row locks, and moving to that would be expensive indeed.

Another thing to consider is that all systems are 'partition tolerant'. If you dig into the formal definition of what a 'partition' is in the CAP proof, it turns out that even a node crash, node network loss, process crash, etc., are all examples of 'partitions'. This is really different from what most people think of as 'partition tolerant'. HBase is a proud CP system, which means in the (rare) case that machines go down, we sacrifice availability, but you will never see inconsistent data. I think for many, many applications this is the right tradeoff to take. You will _always_ need application code to deal with data-store unavailability even if you are an "AP" system, but dealing with inconsistent data is much harder and very application-dependent.

So, for example, you can use write buffers to handle data store unavailability, you can throw errors and not handle user requests, or you can maintain multiple clusters and datacenters and shunt requests that fail over to clusters that aren't failing.
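One way the write-buffer approach might be sketched (the class and store names here are hypothetical, not HBase APIs): queue writes while the store is unavailable, then flush them in order once it recovers. The application stays up at the cost of delayed durability in the real store.

```python
# Hypothetical sketch of a client-side write buffer for a flaky data store.
from collections import deque

class BufferedWriter:
    def __init__(self, store):
        self.store = store       # any object with .put(k, v) that may raise
        self.buffer = deque()

    def put(self, key, value):
        try:
            self.flush()         # drain anything already queued, in order
            self.store.put(key, value)
        except ConnectionError:
            self.buffer.append((key, value))  # park it until recovery

    def flush(self):
        while self.buffer:
            k, v = self.buffer[0]
            self.store.put(k, v)  # may raise; the entry stays queued if so
            self.buffer.popleft()

class FlakyStore:
    """Stand-in for a data store that can be partitioned away."""
    def __init__(self):
        self.data, self.up = {}, True
    def put(self, k, v):
        if not self.up:
            raise ConnectionError("store partitioned away")
        self.data[k] = v

store = FlakyStore()
writer = BufferedWriter(store)
store.up = False
writer.put("k1", "v1")   # store is down: buffered, not lost
store.up = True
writer.put("k2", "v2")   # buffered write flushes first, then this one lands
```

Note the buffer itself is now a durability risk (lose the client, lose the queue), which is the same data-loss window discussed in the post above.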

But there are multiple types of 'inconsistency'. Dynamo shows one model, where you use a DHT to achieve reads when machines are down. But there is also the standard replica approach, which I think could be built on top of HBase. In this case you would store data at multiple RegionServer nodes and stream edits around, and when a node goes down a client could move to the secondary server. This requires a lot of work, since it violates a few of the major operating principles, but it wouldn't be impossible.

Thanks for continuing the discussion.

Note that you are using "big-A, little-a" in a different way than I was using it in our Twitter conversation. When I said "little-a availability", I meant the layman's concept of availability, in contrast with "big-A Availability", the CAP Theorem's formal definition of Availability.

I still think there is a fundamental confusion here about partition tolerance. What do you mean by partition tolerance as a stand-alone property? I think it is actually meaningless without qualifying it as either consistency that tolerates partitioning or availability that tolerates partitioning. Otherwise, what is it that is being tolerated?

You seem to be equating "CA" with "has a single point of failure" and "CP" with "has no single point of failure". While I agree that this relates to the intuitive notion of availability (and hence why I prefer systems with no single point of failure), I don't see this as related to big-A Availability in the formal sense in any way.

Something with a single point of failure has the possibility of being knocked entirely offline and hence 100% unavailable by one failure (of the appropriate kind). Something with no single point of failure does not have the possibility of being knocked offline and hence 100% unavailable by any single failure - but it can still be knocked offline by multiple failures (of the appropriate kind). But the CAP Theorem is not about being knocked 100% offline or about single points of failure.

What the theorem actually says is that any system that is big-C Consistent cannot maintain big-A Availability in the face of big-P Partitions. Full stop. It doesn't matter if the Consistent system has a single point of failure or not - in either case, it cannot maintain the property of having all requests to non-failing nodes succeed in the face of least-convenient arbitrary message loss between nodes.

You may regard this as overly pedantic but my point is to clear up what I regard as some of the fundamental confusions around the CAP Theorem, with the express purpose of setting CAP aside as not all that important for the purposes of real-world distributed systems engineering.

As I see it, the main confusions are these:
1. Thinking that the CAP Theorem is "C, A, P - choose 2"
2. Thinking that C, A, and P as defined in the CAP Theorem map directly to intuitive definitions of consistency, availability, and partitioning
3. Drawing the conclusion, therefore, that using strongly consistent subsystems within a larger service is inadvisable if you are interested in having high availability

One of the things that is most difficult about clearing up these confusions is that Brewer, Gilbert, and Lynch themselves actually demonstrate at least confusions 1 and 2 in their writings on the subject. Note that I have the highest regard for Eric Brewer - but that doesn't exempt him from the all-too-human ability to occasionally misstate things, especially when not working with formalized notions (God knows I do it oh so often).

Here I am very much in agreement with Daniel Abadi's comments in his blog post on CAP; we both say:
1. CAP is "C vs A in the face of P"
2. The CA/CP distinction doesn't make any sense
3. C vs A is not the most interesting of tradeoffs

I haven't given enough thought to Abadi's PACELC concept to say whether I think it is definitive -- I think it is probably still missing some factors of import - but it is closer in spirit to what I think is important to real-world distributed systems than CAP.

I think we may be talking past each other a bit at this point, but I'm still willing to try this out a bit longer.

I don't think you have pointed out a practical difference between "CA" and "CP" systems. You have pointed out the tautological difference between systems with a single point of failure and those without a single point of failure. You can label the first set of systems as "CA" and the second as "CP" if you like but I don't think this distinction is meaningful in the context of the CAP theorem and so I think you are applying these labels in an inappropriate and misleading manner.

I still don't think you have answered the question I have asked a couple of times: "What does it mean to be Partition tolerant?" You do say "A CP system can tolerate any combination of partition other than total network outage" but what do you mean by "tolerate" here? Remain Consistent? Remain Available? Or something else?

If you mean "remain Consistent" - which is the only logical meaning in the context as far as I can tell - then I say that you can call out two other classes of systems -- one which we can call "AP" (those that remain Available while giving up Consistency) and the other which I will call "mu" (which are neither C nor A in the face of P). Note that by this definition, an RDBMS generally falls into the "CP" category -- its Consistency tolerates the Partitioning even though it loses Availability (indeed, its availability drops to zero).

Your example of Oracle RAC losing its SAN does not support your point. The RAC nodes have not failed -- they just can't fulfill requests because they cannot reach the SAN. This is exactly the loss of A in the face of P. The whole system happens to be 100% unavailable as well, but that is not relevant to CAP.

The examples that Gilbert and Lynch provide of "CA" and "CP" systems are actually not part of the proof in any way. As I said above, they are actually confused in these informal classifications in a way that has no bearing on the formal proof, as is Brewer in his slides. And, as I said above, this is a big part of the problem in the discussion of CAP - unintentionally slipping back and forth between the formal and non-formal definitions to suit the argument.

In his post, Coda Hale is not in fact saying what you claim he is saying. You claim he is saying that "CA" systems are not distributed systems. But Coda is in fact saying exactly what I am saying: that, excluding (non-distributed) single-node systems, so-called "CA" systems, just like so-called "CP" systems, are in fact not both Consistent and Available in the face of Partitions. Why is he saying this? Because that is exactly what the CAP Theorem says.

As far as I can tell, nobody is saying that distributed DBMS are not distributed systems, at least not in any of the posts I have read. Many people may claim that anything with a single point of failure is not a *good* distributed system but that is a whole different argument.

You seem to feel that Coda's position was an attack on RDBMS systems. It is not - in fact, it can be interpreted as a defense of RDBMS (and other Consistent systems) from the claims of the "AP" crowd, who are applying the CAP Theorem erroneously to promote their systems.

He does think, as I do, that "CA" is a misclassification. He proposes instead to focus on the notions of Harvest and Yield, which to my mind at least lie closer to the practical concerns.

WRT #4 above: I think that one line is a bit of misstatement by Coda -- in fact, I think that statement by itself is entirely wrong but not for the reason you are concerned about. Daniel Abadi's post had an example of a functional system that sacrificed both C and A.

Coda's statement should not be interpreted as saying that "CA" systems aren't real distributed systems. What he is saying is that "CA" systems are Partition tolerant, in that they are Consistent in the face of Partitions (as opposed to Available in the face of Partitions) -- this is a property they share with "CP" systems. All the so-called "CA" systems and all the so-called "CP" systems are equivalent from the perspective of CAP - they privilege Consistency over Availability in the face of Partitioning.

I wouldn't focus on that one statement overly much - the rest of the essay makes his position clear, and it is pretty close to mine and to Daniel Abadi's. I also looked through his tweets going back at least to the beginning of October. I could only find one tweet which I thought was relevant to your concern, in which he takes issue with what he sees as the misclassification of a particular system, not an attack on "CA" systems in general.

I think I see where you are coming from and where we are missing each other in this conversation but I can't seem to get them across to you somehow. Nevertheless, I have gotten value out of our conversation and I hope that you have as well.

Another thing to think about is that different classes of 'partitions' are easier or harder to handle. For example, single-node loss is fairly trivial to detect and route around. When your cluster goes from N interconnected nodes to two bisected halves of N/2 nodes each, aka "split-brain syndrome", things get a lot harder.

I'm not sure there are any distributed data stores that could take a split-brain mode and not suffer major degradation of some sort. And in terms of serving your queries, performance degradation is usually worse than a systems outage: you end up cascading, watching a big pile of fail as your Apache workers pile up, your load balancers start going wacky, etc.

So this notion that you can handle, ride over, and recover from any arbitrary network split is not so realistic, I think. Unless you are over-provisioning yourself by at least 2x, you will always see performance hits in a split brain. Never mind tri-brain, quad-brain, and other higher-order brains.


About this Entry

This page contains a single entry by Stu published on October 28, 2010 9:35 AM.


About Me
(C) 2003-2011 Stuart Charlton


Disclaimer: All opinions expressed in this blog are my own, and are not necessarily shared by my employer or any other organization I am affiliated with.