October 2010 Archives

Put on your Thinking CAP


In light of some interesting twitter, blog, and IM discussions over yesterday's post (thanks to Chad Walters, Billy Newport, Ryan Rawson, Bill de hOra, Jeff Darcy, and others), I've been pondering three CAP points. Last post on this topic (for a bit), I promise, before I return to your regularly scheduled RESTful programming.

1. Definitions Matter. Or, are CA and CP the same thing?

Gilbert & Lynch's paper defines the following (I'm not going to use their exact terms, so as to keep this accessible):

  • Consistency (big-C) - aka atomicity, or a total order on operations (as visible to any user); known to database practitioners as serializable isolation. Makes distributed systems look like a single system.
  • Eventual Consistency (little-C) - aka delayed-t consistency, or a bounded amount of time before a data object is consistent; also implies a minimum latency between writes to a single piece of data (without specialized conflict resolution).
  • Weak Availability (little-A) - every request received by a non-failing node must result in a response, but only when the network isn't partitioned, and the response may take forever to come (i.e. latency is not part of this definition).
  • Strong Availability (big-A) - every request received by a non-failing node must result in a response even when network partitions occur, but the response may take forever to come (again, latency is not part of this definition).
  • Strong Network Partition Tolerance (big-P) - the network is allowed to lose arbitrarily many messages sent from one node to another (short of total network outage). When a network is partitioned, all messages sent across the partition are lost.

I'm going to add the following:

  • Weak Network Partition Tolerance (little-P)- a system that can tolerate some network partitions, but not all combinations of them.

I add this due to my reading of their definition of Big-P and sections 3.2.1 and 3.2.2: "If there are no partitions, it is clearly possible to provide atomic, available data. In fact the centralized algorithm described in Section 3.2.1 meets these requirements. Systems that run on intranets and LANs are an example of these type of algorithms". Section 3.2.1 defines an algorithm for Consistent & Partition Tolerant systems. Hang on to this one, I'll finish my point in a moment.

So, by these definitions, you basically can have (assuming mixed reads & writes):
- AP: eventual consistency, strong availability, strong partition tolerance
- CP: strong consistency, weak availability, strong partition tolerance
- CA: strong consistency, strong availability, weak partition tolerance
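
Just to make those three combinations concrete, here's a trivial Python sketch (the names and encoding are mine, not Gilbert & Lynch's):

```python
from dataclasses import dataclass

# Toy encoding of the property combinations above. "strong", "weak", and
# "eventual" follow the definitions in this post, nothing more official.

@dataclass(frozen=True)
class SystemClass:
    consistency: str          # "strong" or "eventual"
    availability: str         # "strong" or "weak"
    partition_tolerance: str  # "strong" or "weak"

AP = SystemClass(consistency="eventual", availability="strong", partition_tolerance="strong")
CP = SystemClass(consistency="strong",   availability="weak",   partition_tolerance="strong")
CA = SystemClass(consistency="strong",   availability="strong", partition_tolerance="weak")

# Note that no combination gets "strong" for all three properties at once --
# that's the theorem's whole claim.
```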

Clearly an AP system is different from CP/CA. But the crux is whether CA and CP are the same kind of system. If you consider the idea of weak vs. strong partition tolerance, the difference becomes clear (to me, anyway, but I'm admittedly blonde).

A CA system might be able to handle some partitions, but not all. Simply put, it has a single point of failure, somewhere. The example I like to give is a shared-disk cluster database with a SAN. Partition the SAN, and you're hosed. Partition the nodes, or even the interconnect, and you will still be available, albeit with higher latency. Redundancy is the common way to reduce this risk.

Whereas a CP system is designed to handle all partitions, i.e. it has no single point of failure. But some non-failed nodes may not be able to service clients at all during a partition. This can suck. Anyone who's managed a synchronous log shipping setup or EMC SRDF probably knows about this one.
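
To make that CP trade-off concrete, here's a minimal majority-quorum sketch in Python (a toy of my own, not modelled on any particular product): nodes on the minority side of a partition stay up, but refuse to serve requests rather than risk inconsistency.

```python
# Toy majority-quorum register: a node only serves requests if it can reach
# a majority of the cluster. During a partition, minority-side nodes are
# alive but unavailable -- the CP behaviour described above.

class Node:
    def __init__(self, name, cluster_size):
        self.name = name
        self.cluster_size = cluster_size
        self.reachable = set()   # peers this node can currently reach
        self.value = None

    def has_quorum(self):
        # Ourselves plus reachable peers must form a strict majority.
        return (1 + len(self.reachable)) > self.cluster_size // 2

    def write(self, value):
        if not self.has_quorum():
            raise RuntimeError(f"{self.name}: no quorum, refusing write")
        self.value = value   # a real system would replicate to the quorum here
        return "ok"

# Five-node cluster, partitioned 3 / 2:
majority_node = Node("n1", cluster_size=5)
majority_node.reachable = {"n2", "n3"}          # sees 3 of 5 -> still serves requests
minority_node = Node("n4", cluster_size=5)
minority_node.reachable = {"n5"}                # sees 2 of 5 -> unavailable

print(majority_node.write("x"))                 # "ok"
try:
    minority_node.write("y")
except RuntimeError as e:
    print(e)                                    # n4: no quorum, refusing write
```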

2. Scope matters

The definition of "node" and "network" can vary depending on your granularity and scope.

The above definitions only make sense if the scope of the system is the server-side of a client/server relationship. (i.e. How can non-failed nodes experiencing a partition still receive and respond to a request? Only if the client isn't the one being partitioned.)

A database that's normally a CA system can be considered CP if you zoom out to notice that it's in a multi-node synchronous replication ring. Or AP if it's in an asynchronous multi-node replication ring. But notice that, probabilistically, it's behaving like a CA system for 99.9% of the time (or however long your MTBF is).

An AP system on the other hand has one big drawback, one that's not spoken about often. It's about the scope of the partition: is it recoverable or not?

An unrecoverable partition in a CP or CA system means "no data loss", even if you can't get at the system. That's not true in an AP system if there's an unrecoverable failure in a certain set of nodes during the "delayed consistency" time window. This occurs during catastrophic media failures: a bug in the SAN controller corrupting all the disks, a truck driving through the data center, floods, bombs, hurricanes, etc.

Even with backups, any transactions caught in that "replication latency" window have to be re-entered, or you have to hope that a CP or CA system somewhere has a copy.
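
Here's a back-of-the-envelope sketch of that exposure window (toy Python, made-up numbers): with asynchronous replication, anything acknowledged to the client but not yet shipped to a replica is gone if the primary is destroyed.

```python
# Toy model of the "replication latency" window in an eventually-consistent
# setup: the primary acknowledges writes immediately and ships them to the
# replica later. Anything still in the unshipped backlog when the primary is
# destroyed is simply gone.

acknowledged = []      # writes the client believes are durable
replicated = []        # writes that actually reached the replica
backlog = []           # acknowledged but not yet shipped

def write(txn):
    acknowledged.append(txn)
    backlog.append(txn)          # async: we do NOT wait for the replica

def replicate_one():
    if backlog:
        replicated.append(backlog.pop(0))

# 10 writes arrive, but only 7 have been shipped when disaster strikes.
for i in range(10):
    write(f"txn-{i}")
for _ in range(7):
    replicate_one()

# Primary (and its disks) are now unrecoverable:
lost = [t for t in acknowledged if t not in replicated]
print(f"acknowledged={len(acknowledged)}, survived={len(replicated)}, lost={lost}")
# -> acknowledged=10, survived=7, lost=['txn-7', 'txn-8', 'txn-9']
```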

3. Definitions are too easy to get hung up on.

One of the main reasons I'm talking about this is because I see bolded, underlined claims that can be easily misconstrued, like, "you can't choose consistency and availability together in a distributed system". This gives me The Fear.

The point of the above quote is not to say you can't have some consistency. You just can't have fully serializable consistency and isolation.

But in practice, when you think about it, this claim is actually rather banal. Practitioners have long understood that there is a spectrum of consistency & isolation tradeoffs at scale, even on a single node. Why?

Because the definition of availability arguably needs to include latency as well, something Daniel Abadi brought up. Serializability means locks in practice, and that means increased latency, and thus reduced availability. This is why we don't like Two-Phase Commit very much these days.
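
For illustration, a minimal two-phase commit sketch (toy Python, not a real protocol library) shows where the latency comes from: two coordination round trips per commit, with locks held the whole time.

```python
# Minimal two-phase commit coordinator sketch (toy code, no real RPC):
# each commit needs a "prepare" round trip and a "commit" round trip, and
# participants hold their locks for the whole exchange. The extra round
# trips and lock hold time are the latency cost described above.

NETWORK_RTT_MS = 2.0   # assumed round-trip time to any participant

def two_phase_commit(participants):
    # Phase 1: prepare. Sent in parallel, so roughly one round trip of
    # latency, but every participant keeps its locks from this point on.
    votes = [p["can_commit"] for p in participants]
    decision = "commit" if all(votes) else "abort"
    # Phase 2: broadcast the decision. Locks are released only after this.
    lock_hold_ms = 2 * NETWORK_RTT_MS
    messages = 2 * 2 * len(participants)   # request + reply, for both phases
    return decision, lock_hold_ms, messages

decision, hold, msgs = two_phase_commit([{"can_commit": True}] * 3)
print(decision, f"locks held ~{hold} ms, {msgs} messages, just for coordination")
# -> commit locks held ~4.0 ms, 12 messages, just for coordination
```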

But, for kicks, go back to the ANSI SQL isolation level debate that Jim Gray stirred up. It arose because databases like Oracle (gasp) don't provide serializable consistency, and haven't since the late 1980s! They provide snapshot isolation, which is a different beast where readers don't block writers, and which arguably was the primary technical reason for Oracle's success in the marketplace through the 1990s. Amusingly, this is the same argument that Jim Starkey, who invented the idea of multi-version concurrency control, keeps bringing up when discussing CAP and his new baby, NimbusDB.
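
As a flavour of "readers don't block writers", here's a minimal multi-version sketch in Python (my own toy, not Oracle's or NimbusDB's implementation): a reader pins the snapshot that existed when it started, so a concurrent writer never makes it wait or see a half-finished update.

```python
# Toy multi-version store: each write appends a new version tagged with a
# commit timestamp; a reader remembers the timestamp at which it started and
# only sees versions committed before that point. No read locks, so writers
# never block readers and vice versa.

class MVStore:
    def __init__(self):
        self.clock = 0
        self.versions = {}               # key -> list of (commit_ts, value)

    def begin_snapshot(self):
        return self.clock                 # the reader's frozen view of time

    def write(self, key, value):
        self.clock += 1                   # commit immediately as a new version
        self.versions.setdefault(key, []).append((self.clock, value))

    def read(self, key, snapshot_ts):
        # Return the newest version visible to this snapshot, if any.
        visible = [v for ts, v in self.versions.get(key, []) if ts <= snapshot_ts]
        return visible[-1] if visible else None

store = MVStore()
store.write("x", "v1")
snap = store.begin_snapshot()                   # reader starts here
store.write("x", "v2")                          # concurrent writer commits
print(store.read("x", snap))                    # -> "v1": the reader's snapshot is stable
print(store.read("x", store.begin_snapshot()))  # -> "v2": a new reader sees the new value
```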

So the idea that you can't have fully serializable consistency at scale -- even in a single-node database -- is completely uncontroversial. It's when the follow-up feels like "...and therefore you need to throw out your RDBMS" that cognitive dissonance sets in and tempests in a tweetpot stir up.

Confused CAP Arguments


A bit of a Twittergasm lately over the CAP Theorem, with relevant blog posts all around.

I am confused. I've narrowed it down to 7 matters.

Confusion #1: Changing system scope from underneath our feet
The biggest confusion I have is that the scope of "distributed system" under the CAP theorem seems to have changed, silently.

If you look at the original papers:

All of them seem to scope the CAP theorem to the server side of a client/server relationship. That is, it's assumed that the distributed system under consideration is a client/server architecture, but the server-side is distributed in some way. This is a reasonable assumption, since the Web is a client/server architecture variation, and CAP was originally about the tradeoffs in huge scale web sites.

But practically speaking, this is particularly important when considering Partition Tolerance. In the original papers, CA (Consistent, Available) was absolutely common, and in fact, the most common scenario, with Brewer giving examples like cluster databases (on a LAN), single-site databases, LDAP, etc. The most popular commercial cluster databases today arguably retain this definition of CA, such as Oracle's RAC.

But now, AP (availability, partition tolerance) advocates are questioning whether CA distributed systems are even possible, or are merely something that insane people do. It's a very strange, extreme view, and I don't think it offers any enlightenment to this debate. It would be good to clarify, or at least offer two scopes of CAP: one within a server-side scope, and another system-wide.

Confusion #2: What is Partition Intolerance?

This is never well defined. Like lactose intolerance, the results can be, err, unpredictable. It seems to imply one of: complete system unavailability, data inconsistency, or both. It's also unclear whether this is recoverable or not. Generally it seems to depend on the implementation of the specific system.

But because of this confusion, it's easy to let one's mind wander and say that a CA system "acts like" an AP or CP system when experiencing a partition. That leads to more confusion as to whether a CA system really was just a CP system in drag all along.

Confusion #3: Referring to Failures Uniformly

It seems to be a simplifying assumption to equate failures with network partitions, but the challenge is that there are many kinds of failures - node failures, network failures, media failures - and each is handled differently in practice.

Secondly, there are systems with many different kinds of nodes, which tolerate partitions in one kind of node but not another, or maybe just not a "full" partition across two distinct sets of nodes. Surely this is an interesting property, worthy of study?

Take Oracle RAC or IBM DB2 pureScale as an example. These systems actually tolerate partitions of many sorts - cluster interconnect failures, node failures, media failures, etc. What they don't tolerate is a total network partition of the SAN. That's a very specific kind of partition. Even given such a partition, the failure is generally predictable in that the system still prefers consistency over availability.

Confusion #4: Rejecting Weak vs. Strong Properties

Coda Hale sez: "You cannot, however, choose both consistency and availability in a distributed system [that also tolerates partitions]".

Yet this seems very misleading. Brewer's original papers and Gilbert/Lynch talk about Strong vs. Weak consistency and availability. The above statement makes it sound as if there is no such thing.

Furthermore, most distributed systems are layered or modular. One part of that system may be CA-focused, the other may be AP-focused. How does this translate to a global system?

Confusion #5: Conflating Reads & Writes

Consider the Web Architecture. The Web is arguably an AP system for reads, due to caching proxies, but is a CP system for writes.
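
A toy sketch of that split (my own illustration, not real proxy code): a caching layer keeps serving reads from whatever it last saw even when the origin is unreachable, while writes simply fail.

```python
# Toy caching proxy in front of an origin server: reads stay available
# (possibly stale) during a partition from the origin; writes do not.

class Origin:
    def __init__(self):
        self.data = {"/page": "version 1"}
        self.reachable = True

class CachingProxy:
    def __init__(self, origin):
        self.origin = origin
        self.cache = {}

    def get(self, path):
        if self.origin.reachable:
            self.cache[path] = self.origin.data[path]   # refresh the cache
        # If the origin is partitioned away, fall back to the cached copy.
        return self.cache.get(path, "503 unavailable")

    def put(self, path, value):
        if not self.origin.reachable:
            return "503 unavailable"      # writes must reach the origin
        self.origin.data[path] = value
        return "200 ok"

origin = Origin()
proxy = CachingProxy(origin)
print(proxy.get("/page"))        # "version 1" (cached on the way through)
origin.reachable = False         # partition between proxy and origin
print(proxy.get("/page"))        # still "version 1" -- available but possibly stale
print(proxy.put("/page", "v2"))  # "503 unavailable" -- writes lose availability
```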

Is it useful to categorize a system flatly as binary one or the other?

Confusion #6: Whether AP systems can be the generic or "default" database for people to choose

AP systems to my understanding are application-specific, as weak consistency is difficult to correct without deep application knowledge or business involvement.

The question and debate is, what's the default database for most IT systems... should it be a traditional RDBMS that's still a CA system, like MySQL or Oracle? Or should it be an AP system like a hacked-MySQL setup with Memcached (as AP), or Cassandra?

When I see claims that CA advocates are "not getting it", or "not in their right mind", I have a severe case of cognitive dissonance. Most IT systems don't require the scale of an AP system. There are plenty of examples of rather scalable, highly available CA systems, if one looks at the TPC benchmarks. They're not Google or Amazon scale, and I completely agree that an AP system is necessary under those circumstances. I even agree that consistency needs to be sacrificed, especially as you move to multi-site continuous availability. But it's very debatable whether the basic database should be CA by default or AP by default.

For example, many IT shops have built their mission-critical applications in a CA manner within a single LAN/rack/data-centre, and an AP manner across racks or data-centres via asynchronous log shipping. Some can't tolerate any data loss and maintain a CP system with synchronous log shipping (using things like WAN accelerators to keep latency down). And not just for RDBMS - this is generally how I've seen Elastic Data Cache systems (Gigaspaces or Coherence or IBM WebSphere ExtremeScale) deployed. They deal with the lack of partition tolerance at a LAN level through redundancy.

It's useful to critique the CA single-site / AP multi-site architecture, since it's so common, but it's still not clear this should be replaced with AP-everywhere as a useful default.

Confusion #7: Why people are being a mixture of dismissive and douchey in their arguments with Stonebraker

Stonebraker's arguments make a lot of sense to IT practitioners (I count myself as one). He speaks their language. Yes, he can be condescending, but he's a rich entrepreneur - they get that way. Yes, he gets distracted and talks about somewhat off-topic matters. Neither of those things implies he's wrong overall. Neither of these things imply that you're going to get your point across more by being dismissive or douchey, however good it feels. (I'm not referring to anyone specifically here, just a general tone of twitter and some of the snark in the blog posts. )

One particular complaint I find rather naive: that he's characterizing CAP theorem (really, AP system) proponents as using only the CAP theorem to evaluate distributed system availability. Clearly that's not true, but on the other hand, it is kind of true.

Here's the problem: when I hear venture capitalists discussing the CAP theorem over cocktails because of their NoSQL investments, or young newly-graduated devs bugging their IT managers to use Cassandra, you can bet that plenty of people are making decisions based on a twisted version of the new fad, rather than reality. But that's how every technology fad plays out. I think Stonebraker has seen his share of them, so he's coming from that angle, from my view. Or he could just be an out of touch dinosaur like many seem fond of snarking. I personally doubt it.

Yeah, I've been kind of silent lately. Hopefully that will change now that I'm out of my cave.

I'm no longer involved full-time with Elastra; I do remain a part-time advisor. One day the story will be told. Until then, I highly recommend reading Steve Blank's blog & books for an apropos understanding of how startups often struggle and remain stubborn in the face of said struggle. On the bright side, the current team has their heads screwed on straight and is still doing great work.

What am I up to? Consulting, for now at least! Jetstream Cirrus is my company, and yes, there are intentionally many ways of reading into that name. I'm offering independent advice & implementation on integration architecture, REST, agile operations, scalable systems, and cloud computing, consulting in the U.S. and Canada.
