October 2009 Archives

The Trouble with NoSQL


I have ambivalent feelings about this NoSQL trend, on a few levels.

a) "RDBMS don't perform or scale".

I've seen this in presentations, blog posts, and even in the O'Reilly Hadoop book. I'm not sure if this is sloppiness, ignorance, or plain dishonesty. To anyone paying attention, it's pretty clear that RDBMS do perform and scale: there are several 1+ petabyte Teradata implementations, and Oracle RAC is used heavily at Amazon.com (70 TB) and Yahoo! (250 TB), for example. Of course, this is about scalability in terms of data volume and huge queries. On the OLTP side, the TPC benchmarks continue to show that Oracle and DB2 can pull off staggering numbers, both in classic SMP and in clustered configurations (yes, DB2 can do shared-nothing).

This is not to say RDBMS are the solution to all data persistence problems. I'm an old object database guy and there were (and are) many reasons why one would use that (or one of the newer scalable key/value stores like Cassandra). But, please present the technology on its merits, not based on completely misleading claims.

One almost gets the impression that "If it's not open source, it doesn't exist", which is absurd considering the billions Oracle, IBM, Microsoft and Sybase continue to rake in.

b) "In the CAP tradeoffs, availability > consistency, almost always"

Except when you're running financial analyses. Regulators don't like "eventually consistent" accounting statements. Even when you have terabytes of them to go through.

To me, the best approach would be to give developers and data architects a knob to adjust the tradeoff between consistency, availability, and partition tolerance depending on the circumstances (the query, the data, etc.).
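As a minimal sketch of what such a knob could look like, consider Dynamo/Cassandra-style quorum tuning: with N replicas, a read quorum R and write quorum W overlap (and so give strong consistency) exactly when R + W > N. The function name and API below are illustrative, not any real client library.

```python
# Hypothetical per-request consistency "knob", in the style of
# Dynamo/Cassandra quorum tuning. With N replicas, requiring R replicas
# to answer a read and W to acknowledge a write gives strong consistency
# iff R + W > N; smaller R or W trades consistency for availability.

N = 3  # replicas per key (assumed for this sketch)

def is_strongly_consistent(r: int, w: int, n: int = N) -> bool:
    """Read and write quorums are guaranteed to overlap iff R + W > N."""
    return r + w > n

# "Eventually consistent" settings favour availability and latency...
assert not is_strongly_consistent(r=1, w=1)
# ...while quorum reads and writes guarantee overlap, at some availability cost.
assert is_strongly_consistent(r=2, w=2)
# A regulator-facing financial query could demand R = N ("read all").
assert is_strongly_consistent(r=3, w=1)
```

The point is that the choice is made per query or per dataset, not baked into the entire datastore.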

c) What happened to "Data Management"?

I'd be willing to sacrifice the "R" in RDBMS for certain reasons, but I'm less interested in sacrificing the "MS" part, i.e. "Management System".

There's an eternal battle between those who want the data intertwined with the code and those who want the data separate from the code. I grew up in the former camp, and learned to appreciate the latter.

Every generation of programmers seems to go through a phase where the next-gen persistence engine becomes all the rage: from CODASYL to ODBMS to XML databases to object caches, and now key/value stores or "cloud databases".

Managing data, scale, and partitioning in the application is a workaround, not a very pleasing solution. I understand people have to get their jobs done, but enterprises seem to have different data management requirements than young companies. In most cases the data exists to support business operations or business decisions. Quality is paramount, and poor management leads to data duplication and to mistakes when other applications need to access that data. One tends not to notice these problems early in the life of an application; they tend to surface across applications that integrate with one another over time.

Similarly, "schema-less" data persistence is only beneficial in the early stages of development. Later on, a schema becomes pretty useful, and over time it's almost essential if you want to reuse or repurpose that data and interpret it consistently without having to crack open the supporting codebase.
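A made-up illustration of the problem: once the same logical field has drifted across a few application versions, every new consumer of the "schema-less" data must rediscover history that lives only in the old code.

```python
# Hypothetical records written by three versions of the same application.
# Without a managed schema, the field names drifted over time.
records = [
    {"name": "Ada Lovelace"},                 # v1 of the app
    {"full_name": "Grace Hopper"},            # v2 renamed the field
    {"first": "Edsger", "last": "Dijkstra"},  # v3 split it in two
]

def display_name(rec):
    # Every reader of this data must reimplement this accumulated
    # knowledge, which otherwise lives only in the application codebase.
    if "name" in rec:
        return rec["name"]
    if "full_name" in rec:
        return rec["full_name"]
    return f'{rec["first"]} {rec["last"]}'

assert [display_name(r) for r in records] == [
    "Ada Lovelace", "Grace Hopper", "Edsger Dijkstra",
]
```

With a managed schema, the rename and the split would have been migrations applied once, not conditionals copied into every consumer.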

And strong DBAs have a unique perspective on data and performance management, one I've found lacking in many a programmer (the ones at Google being a notable exception; they truly seem to have instilled the advanced DBA's sensibility into their engineering work).

d) Are you really sure that SQL is the problem?

I can agree that many of the cheaper (or free) RDBMS don't scale well. But why do people think SQL is the reason they don't scale? It seems like conflating logical issues with physical ones. The traditional SQL RDBMS model may not be the only way to do logical data management, but relying on programmatic solutions and ad hoc query languages certainly isn't very satisfying; it all seems very 1970s. Throwing out logical data design and management implies horrible long-term consequences for data quality and for correct modelling of a business domain.

On this note, there's a new paper contrasting Hadoop with parallel databases for large-scale data analysis tasks (written in part by the Vertica guys -- Mike Stonebraker's new company). The conclusions are interesting: Hadoop isn't the clear performance leader, but it certainly wins major points for being simpler to get going than a traditional DBMS. On the other hand, specifying SQL statements looks quite compelling versus writing a bunch of map and reduce functions. And the results show these SQL databases certainly perform well on queries (assuming you can load them fast enough). Another paper, by some of the same authors, looks at combining Hadoop with a parallel DBMS (probably Vertica), with encouraging results.
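To make the SQL-versus-MapReduce contrast concrete, here's a toy aggregation done both ways. The data and names are invented for illustration, and the "runner" is a stand-in for the framework's shuffle phase, not Hadoop's actual API.

```python
from collections import defaultdict

# Made-up input: (region, bytes) pairs.
rows = [
    ("us-east", 120), ("us-west", 80),
    ("us-east", 30),  ("eu",      50),
]

# In SQL, the whole job is one declarative statement:
#   SELECT region, SUM(bytes) FROM traffic GROUP BY region;

# In MapReduce style, the same job is two hand-written functions:
def map_fn(row):
    region, size = row
    yield region, size

def reduce_fn(key, values):
    return key, sum(values)

def run_mapreduce(data, mapper, reducer):
    """Minimal local runner standing in for the framework's shuffle phase."""
    groups = defaultdict(list)
    for record in data:
        for k, v in mapper(record):
            groups[k].append(v)
    return dict(reducer(k, vs) for k, vs in groups.items())

totals = run_mapreduce(rows, map_fn, reduce_fn)
assert totals == {"us-east": 150, "us-west": 80, "eu": 50}
```

Even in this trivial case, the declarative version states the intent in one line, while the imperative version spells out the mechanics; the gap only widens as queries grow joins and multiple aggregation stages.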

Having lived through the object database wars, and watched my beloved databases get trampled into a niche, my sense is this:

- The IT world didn't reboot when clouds came out; there are a lot of assumptions worth challenging, but also lots to learn from history.

- There's a real chance most NoSQL solutions will remain niche while RDBMS continue to dominate, because customers will force their vendors to scale them out. The real question is whether that's an impossibility due to the relational model or SQL. I'd say that's highly doubtful. They'll find a way, if their customers pressure them.

- On the other hand, there's a real chance for a NoSQL alternative to make it big and succeed IF it evolves to be a true DBMS, not just a persistence engine, provides adjustable CAP tradeoffs in its interface, and offers us a worthy successor to SQL.

I've uploaded my position paper for the OOPSLA 2009 Cloud Design Workshop next week. This provides a detailed technical overview of what Elastra has been working on for the past year.

Cloud computing has been a catalyst accelerating a long-needed convergence between IT Operations and Application Architecture. We need to build systems to be operated, managed, and governed -- not as an afterthought. And we need better collaboration between IT specialists. Through a mix of web architecture and a dose of autonomic computing, we may have the beginnings of a new inter-cloud architecture. It feels like the end of a marathon, but we've only reached the first checkpoint.


About Me
(C) 2003-2011 Stuart Charlton


Disclaimer: All opinions expressed in this blog are my own, and are not necessarily shared by my employer or any other organization I am affiliated with.