February 2008 Archives
"You just gave a talk where you highlighted the importance of the user interface and of domain knowledge. And you had to tell them that!", from Jim Coplien's blog post
I'm a few months late seeing this one; the recent InfoQ debate between Uncle Bob & Jim sparked this entry.
It's been a number of years since I regularly participated in the Agile community. I used to be on Ward's Wiki quite a bit, managed to get quoted in a book, and have been branded a pain in the ass by P(i)MPs, yet I have had some reasonable successes with XP and Scrum techniques in the past.
First, a sobering observation: Having worked at BEA the past 3+ years, and seeing a large cross-section of financial, telecom, and government organizations in the U.S. and Canada, I will say this: the vast majority of development is still not Agile. (This is an obvious corollary of Online Rule #1: "The Internet is Not Reality".)
IMO, the big reason has nothing to do with the merits of Agile; it has to do with the cycle and politics of annual capital budgeting, combined with the process to actually use the funds once they've been budgeted. The Agile way of incremental funding, value-based pricing, and continuous improvement is about as foreign as anarchism to some business executives. It doesn't help that many CIOs I've spoken with believe that Agile is just a rebranding of James Martin's Rapid Application Development (RAD) from 1991. There are some CIOs who really do get it, but even they can't easily coax their bifurcated organizations down that path.
Second, there is a part of me that is glad that many have not adopted Agile: they don't have to deal with the ridiculous religious posturing that goes on with regard to TDD, architecture, and design.
I have never bought the idea that the design would fully emerge through TDD. It leads, as Cope says, to overly procedural designs, and misses most of the benefits of object-orientation. For years I've read the debates over "when to introduce patterns" into a TDD process, with the consensus seeming to say "when you refactor". Except that some codebase mistakes (architectural ones, or core domain misunderstandings) are not refactorable -- they require overhauls that don't have the safety properties of the refactoring operations.
Again, from Cope's blog (emphases mine)...
If we're going to do testing, that's where the focus should lie. The interface is our link to the end user. Agile is about engaging that user; TDD, about engaging the code. TDD is a methodology. TDD is about tools and processes over people and interactions. Woops..... Sorting these things out requires a system view. It used to be called "systems engineering," but the New Age Agile movements consistently move away from such practices.
What's funny is that I see a lot of evidence that companies agree with this, even if they still get the development methods and funding models horribly wrong. The "architect" has been a rising profession, and while the term is abused in many cases, sometimes an object of scorn, there is a desire for people who can see the whole.
From my view, TDD is a complement to, not a replacement for:
- Usability design & testing
- Domain knowledge
- Architectural constraints
- Stable intermediate forms & leverage at the interfaces (two classic systems heuristics)
- Design for performance
Simple framing question: has a TDD-like emergent design process ever led to a breakthrough product?
... is smouldering.
I live 4 blocks away from the area... waking to lots of smoke this morning. My girlfriend's bike shop, open since 1914, is gone. I eat almost weekly at Shanghai Cowgirl directly across the street, and two of my favorite alternative clubs (BSC and Savage Garden) are across the street too -- all of those seem to be OK thankfully but closed for now.
Comments, trackbacks, and OpenID have been broken, but are now fixed.
Upgrading from MT 2.6 to 4.1 has been painful -- some old setting seemed to be poisoning my configuration.
Here's a case, I think, where the scale of solutions in a corporate Intranet is different from the solutions at Internet scale.
Say you're in an IT department, want to use RESTful web services for your SOA, but have your own canonical XML schemas for representing data in many of your business domains. How do you register those media types?
1. Use the plain application/xml media type and hope consumers will sniff the XML namespace, and that it accurately describes what's in the document (most common, not very RESTful).
2. Use your own media type with your own private registry (pretty common, but not necessarily interoperable, and consumers require a priori knowledge of where the registries are).
3. Use the most general media type you can for the representation, plus a URI as a media-type parameter that points to a registry with more metadata (which could lead to some interoperability, cacheability, etc.).
4. Go back to using SOAP and UDDI. (....)
Obviously #3 seems to make the most sense, with caveats. I echo other commenters when I say that "application/data-format" is too general, that the metadata shouldn't just be RDDL (an HTML page may be more useful in practice!), and that the number of registries should be minimal.
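To make option #3 concrete, here is a minimal sketch of a general media type carrying a parameter whose URI points at a registry with more metadata. The parameter name `registry` and the URL are illustrative assumptions, not anything standardized:

```python
# Sketch: parse a Content-Type whose hypothetical "registry" parameter
# carries a URI pointing to more metadata about the representation.
from email.message import Message

def parse_content_type(value):
    """Split a Content-Type header value into its type and parameters."""
    msg = Message()
    msg["Content-Type"] = value
    # get_params() returns [(type, ''), (param, value), ...]; drop the type.
    return msg.get_content_type(), dict(msg.get_params()[1:])

ctype, params = parse_content_type(
    'application/xml; registry="http://registry.example.com/po/v2"'
)
print(ctype)                # application/xml
print(params["registry"])   # http://registry.example.com/po/v2
```

A consumer that doesn't understand the parameter can still process the document as plain application/xml, which is what makes the scheme at least partially interoperable.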
Media type proliferation is a governance problem. On the Internet, the IANA is the governing body. In an Intranet, .... it depends on your governance model. What's clear is that having every IT department register its own vnd media type seems both silly and untenable, because those media types will not likely be general. So they'll have their own corporate & partners registry.
As for mixed vocabulary semantics, we do have a problem -- but RDF/OWL is a non-starter for most IT departments. I agree this should change some day, but, baby-steps are needed. So, what can an IT department that wants to use RESTful media types for its SOA do to indicate representation meaning *today*, without adopting the Semantic Web?
For this I imagine a registry that points to a model, whether written text, UML, ERD, or something more formal, that shows an architect or developer how the mixed elements relate to one another. In other words, use configuration management as a palliative. This does not solve the problem in general, but it arguably makes for a workable solution in a smaller scale.
So, coming back to decentralized media types, here's what I see:
- There are many who feel a need to introduce a standardized "more information on this representation" hook, beyond just the IANA media type.
- A URI likely is the best candidate format for this hook.
- Other media types are already offering this feature inside the representation body (e.g. XMLNS declarations, GRDDL declarations in HTML) ....
- ... But to work best with the deployed web, and to be most general-purpose, it seems this URI should be somewhere in the HTTP header.
- The debate is mostly a matter of whether a) there is such a thing as a general-purpose "more info on this media type" resource, and b) if so, where to place the link, so that it fits well with the deployed Web and doesn't cause problems for a future Semantic Web.
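One way the URI could ride in an HTTP header rather than the body is a Link-style header with a "describedby" relation. The header shape, the relation name, and the URL below are illustrative assumptions, not a settled convention:

```python
# Sketch: parse a hypothetical '<uri>; rel="describedby"' Link-style
# header into a {relation: uri} map.
import re

def parse_link_header(value):
    """Parse comma-separated '<uri>; rel="x"' pairs into {rel: uri}."""
    links = {}
    for part in value.split(","):
        m = re.match(r'\s*<([^>]+)>\s*;\s*rel="([^"]+)"', part)
        if m:
            links[m.group(2)] = m.group(1)
    return links

hdr = '<http://registry.example.com/po/v2>; rel="describedby"'
links = parse_link_header(hdr)
print(links["describedby"])   # http://registry.example.com/po/v2
```

Keeping the hook in a header means intermediaries and clients can discover it with a HEAD request, without having to parse the representation body at all.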
Well, 4 years after the last major change, I've upgraded to Movable Type 4.1, and re-activated both Trackbacks and Comments on entries from the past few months. Spam filters FTW.
Enjoy the facelift (I always liked the Tokyo skyline). More will come as I explore new templates & plugins...
I continue to think the trend towards treating the RDBMS as a dumb indexed filesystem is rather ridiculous. So, here's a rant, coming from an old Data Warehousing guy with an Oracle Certified Professional past, who also happens to be a web developer, distributed systems guy, etc.
Witness the blogosphere reaction to DeWitt and Stonebraker's recent critique of MapReduce. I thought Stonebraker's critique was spot on. Apparently I'm the only person in my Bloglines list that thought so.
A major complaint is that people seem to think Stonebraker missed the point that MapReduce is not a DBMS, so why critique it as if it were one? But this seemed obvious to me: there is a clear trend that certain developers, architects, and influential techies are advocating that the DBMS should be seen as a dumb bit bucket, and that the state of the art is moving back to programmatic APIs to manipulate data, in an effort to gain scalability and partition-tolerance. MapReduce is seen as a sign of the times to come. These are the "true believers" in shared-nothing architecture. This is Stonebraker's (perhaps overstated) "step backwards".
My cynical side thinks this is the echo chamber effect -- it grows in developer circles, through blogs, conferences, mailing lists, etc., self-reinforcing a misconception about the quality of what an RDBMS gives you. From what I've seen in the blogosphere, most web developers, even the really smart ones, have a complete lack of experience with a) the relational model and b) a modern RDBMS like Oracle 10g, MS SQL 2005, or DB2 UDB. And even practitioners in enterprises have a disconnect here (though I find it's not as pronounced). There clearly are _huge_ cultural and knowledge divides between developers, operating DBAs, and true database experts in my experience. It doesn't have to be this way, but it's a sign of our knowledge society leading to ever-more-specialized professions.
Now, to qualify my point, I completely understand that one has to make do with what one has, and come up with workable solutions. So, yes, de-normalize your data if your database doesn't have materialized views. Disable your integrity constraints if you're just reading a bunch of data for a web page. But, please let's remember:
- massively parallel data processing over hundreds or sometimes 1000+ nodes really _has_ been done since the 1980's, and has not required programmatic access (like MapReduce) for a long, long time -- it can be done with a SQL query.
- denormalization is appropriate for read-mostly web applications or decision support systems. Many OLTP applications have a mixed read/write profile, and data integration in a warehouse benefits from normalization (even if the queries do not).
- modern databases allow you to denormalize for performance while retaining a normalized structure for updates: it's called a materialized view.
- many analysts require very complicated, unpredictable, exploratory queries that are generated at runtime by OLAP tools, not developers.
- consistency is extremely important in many data sets, though not all. There is definitely a clear case for relaxing it in some scenarios to eventual consistency, expiry-based leasing & caching, and compensations. But generating the aggregate numbers for my quarterly SEC filings, even if it involves scanning *billions* of rows, requires at least snapshot consistency across all of those rows, unless you want your CFO to go to jail.
- data quality is extremely important in many domains, and poor data quality is a huge source of customer dissatisfaction. Disabling integrity constraints, relaxing normalization for update-prone data, disallowing triggers & stored procs, etc., will all contribute to degrading quality.
- Teradata has been doing massively parallel querying for almost 25 years (1024 nodes in 1983, the first terabyte DBMS in 1992 with Walmart, many hundreds of terabytes with others now!).
- Oracle's Parallel Server (OPS) has been out for almost 17 years. Real Application Clusters is OPS with networked cache coherency, and is going to be 7 years old this year.
- Take a look at this 2005 report of the top Data Warehouses. This is a voluntary survey; there are much larger systems out there. You'll notice that Yahoo! was running a single node 100 terabyte SMP warehouse. Amazon.com is running a couple of Linux-based Oracle RAC warehouses in the 15-25 terabyte range since 2004.
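The declarative-vs-programmatic contrast above can be shown in miniature: the same aggregation written as a hand-rolled map/reduce pass and as one SQL query. This uses sqlite3 only because it ships with Python; it is single-node, of course -- the point is the programming model, not the scale:

```python
# Sketch: the same aggregation done programmatically (MapReduce-style)
# and declaratively (one SQL query a parallel engine could partition).
import sqlite3
from collections import defaultdict

rows = [("widget", 3), ("gadget", 5), ("widget", 4)]

# Programmatic route: map each row to (key, value), reduce by key by hand.
counts = defaultdict(int)
for product, qty in rows:      # "map"
    counts[product] += qty     # "reduce"

# Declarative route: state the question; the engine picks the plan. A
# parallel RDBMS is free to spread this across nodes behind your back.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (product TEXT, qty INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?)", rows)
sql_counts = dict(
    con.execute("SELECT product, SUM(qty) FROM sales GROUP BY product")
)

print(counts == sql_counts)   # True
```

The hand-rolled version hardwires one execution strategy; the SQL version leaves the partitioning, parallelism, and access path to the optimizer, which is exactly the property the parallel warehouses above exploit.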
The point is that there is no magic here. Web developers at Amazon, eBay, Youtube, Google, SixApart, Del.icio.us, etc. are doing what works for them *today*, in their domain. There is no evidence that their solutions will be a general purpose hammer for the world's future scalable data management challenges. There's a lot more work and research to be done to get there, and I don't think it's going to primarily come out of the open source community the way it did for the Web. Sorry.
Look, I think products such as MySQL + InnoDB are fantastic and even somewhat innovative. They give IBM, MS, and Oracle a big run for their money in many applications.
On the other hand, *no* open source RDBMS that I'm aware of has a general purpose built-in parallel query engine. Or a high-speed parallel data loader. But, if it isn't open source, it doesn't seem to exist to some people. I can understand why ($$ + freedom), though I think usage-based data grids will greatly reduce the first part of that challenge.
It's been 3 years since I discussed (here too) Adam Bosworth's "there are no good databases" blog entry. I felt that many of the problems he expressed had to do with the industry's vociferous ignorance, but I did agree there was room for innovation. The trend towards column-oriented DBMSs seems to be playing out as expected, encouraging innovation at the physical layer. I still haven't seen a good unification of querying vs. searching in general databases yet -- they still feel like independent islands. But, if anything, the vociferous ignorance has gotten worse, and that's a shame.
So, what's the trend?
- Many of the limitations of RDBMSs have nothing to do with the relational model; they have to do with an antiquated physical storage format, and alternatives are fast emerging. Take a look at the latest TPC-H benchmarks: between ParAccel and EXASOL, not to mention Stonebraker's Vertica, there's a revolution underway.
- I do think parallel data processing will graduate out of its proprietary roots and become open source commoditized. But this is going to take a lot longer than people think, and will be dominated by commercial implementations for several more years, unless someone decides to donate their work (hint).
- I think the trend will be towards homegrown, programmatic data access and integrity solutions over the coming years, as a new generation re-learns data management and makes the same mistakes our parents made in the 1960's and 70's, and our OODBMS colleagues made in the 1990's. Whether this is maintainable or sustainable depends on who implemented it.
- I think the Semantic Web may actually turn out to be the renaissance of the RDBMS, and a partial way out of this mess. RDF is relational, very flexible, very partitionable across a column-oriented DBMS on grid, solves many of the agility problems with traditional schema and constraints, and simplifies some aspects of data integration. The obstacles will be: making it simpler for everyday use (eliminating the need for a degree in formal logic), and finding organizations who will make the leap.