There has been some discussion in recent days regarding Object-Relational Mappers (ORMs), Drupal, and why the latter doesn't use the former. There are, actually, many reasons for that, and for why Drupal doesn't do more with the Active Record pattern.

Rather than tuck such discussion away in an issue queue, I figured it better to document a bit more widely.

On ORMs in Drupal

I highly recommend that people read The Vietnam of Computer Science. It does an extremely good job of laying out why OO<->SQL mapping is a fundamentally intractable problem. We can do some, in some cases, but it very very quickly becomes very very ugly. The only viable solution is to rely on SQL as the primary system and treat the object structure as a secondary add-on, or to treat the object structure as the primary system and the data persistence system as an unstructured data store.

Because of its stateless nature, PHP doesn't lend itself well to the latter method. Anything done in-memory needs to be completely reinitialized on every page request. That means you need to move the entire data storage and searching (never forget searching/querying/slicing/dicing!) into a separate persistent storage engine akin to an SQL database. Such systems do exist, but have three important problems where our usage is concerned:

They're still relatively immature and have not stabilized, and most have not proven themselves capable of scaling very large and very small. They are not at all standardized between vendors and implementations, meaning we would need to pick one and hope that it would be the winning strategy later on. They are not universally available, and universal availability is a key goal for Drupal.

In contrast, it is difficult to find a web host that doesn't offer MySQL. PostgreSQL is available on many, and anyone running their own server can install it. SQLite requires effort to not have available in PHP 5.2. While there are big differences between different SQL implementations, they are far far more standardized than object databases are, especially if you include the new wave of "naive" databases. The available tools to work with them (both PHP-based tools like PDO and user-space tools like desktop GUIs) are very mature and robust and widely available. SQL is also a very widely available skill set. You can pick it up anywhere (to varying degrees of competency, of course), and any CS program in the world worth its salt includes at least one class in some SQL dialect.

So for now and for the foreseeable future, favoring relational data over object data for PHP, and Drupal in particular, seems the better strategy than gutting and rewriting Drupal entirely for CouchDB, Bigtable, Oracle-specific object stores, or whatever else.

Given that a relational database is still the most sensible primary storage system for Drupal to use, we should make it as easy as possible to leverage, but no easier lest we lose power in the process. That fits with the Rasmus Lerdorf-coined term for PHP's architecture of "Leaky Abstraction". We don't want to prevent people from accessing the data store directly, which means giving them SQL access so they can do things we didn't think of when creating the API. That precludes creating our own heavy ORM, as that's yet another Drupal specific API people would need to learn.

While yes, it may be possible to create a Drupal-specific data access system that is easier to learn than SQL, there would be about 10 people who understood it and a few hundred who knew how to use it. Compare that to the hundreds of thousands of people who already know SQL well enough to hit the ground running in Drupal. I'd like to use more conventional and standard approaches, patterns, and techniques in Drupal, not fewer.

There's also the challenge of creating such an abstraction in the first place. Currently, the trend in Drupal is toward a unified API and structure for primary data entities (nodes, users, etc.) that forces data into a very specific structure, fields. Think "CCK everywhere". Now look at the hundreds of hours that have gone into CCK and Fields. Now look at how messy the code for CCK is in Drupal 6. Now look at how the API breaks in so many places already when you do unconventional things. Now look at how there are already massive performance issues that haven't been resolved yet, except with the cop-out of "uh, use caching". And the people working on this problem are really really smart people. It's really hard to do well, and that's just for a small subset of the larger ORM problem. Multiply all of those issues by 10 and you approach the challenge that is making Drupal all-ORM without someone else solving the problem for us already.

We did briefly consider Doctrine early in the DBTNG development life cycle. (More specifically, some of the Doctrine folks tried for a few weeks to sell us on just using Doctrine.) We decided against it on the grounds that Doctrine was way too big for us to include (I think someone said the code base was larger than core, but I'm not sure if that's true), it was yet another custom string-based pseudo-language that people would have to learn (whereas basic SQL is a widespread skill), and it didn't give us the flexibility to deal with non-primary-data entities whereas SQL allows all kinds of unexpected flexibility. DBTNG, by contrast, is very thin, flexible, and for someone who already knows SQL is almost trivial to pick up even with the query builders.

Every abstraction layer adds a performance cost, too. "Just add more abstraction" is not always a viable answer, as there are costs to any abstraction.

The DBTNG approach

When I started working on Drupal 7's new database layer, back before it was called "DBTNG", my goal was to make writing secure, portable SQL easier, not easy :-), and provide a lot of new features for SQL writers. Although SQL is far more standardized than object databases are, it's still an iffy standard at best. Imagine trying to write a definitive grammar parser for English, in all its variants: British, Scottish, New Zealander, Outback Australian, plus the dozen or so distinct North American versions. (Really, can anyone understand what someone from Brooklyn is saying? :-) ) That's what writing a truly vendor-agnostic SQL abstraction layer is like, except that you can't rely on the human brain's ability to guess what is meant by context. SQL servers don't do fuzzy parsing. Doing that at runtime in interpreted code like PHP is a recipe for performance disaster.

PDO takes care of a lot of the abstraction for us, by providing a layer, in compiled C code, where differences in prepared statements are normalized and abstracted. Of course, even that is imperfect at times as we see from the need to provide alternate handling for BLOB and LOB data types in PostgreSQL and Oracle, or provide our own user-space buffering for SQLite.

The only viable solution that does not involve runtime parsing and re-creation of SQL strings (which as noted above is slow, error-prone, and extremely hard to do) is to require developers to provide a pre-parsed data structure that can be customized for each database. That's what the query builders are for, as they describe the SQL behavior in structured terms that are then fairly easily compiled to each target database. In all modesty, I think we did a far better job there than I every expected us to on the first try. (OK there were several iterations, but this approach has been fairly stable for over a year now.) But even those introduce overhead. SelectQuery-based queries are more verbose to write and slower to execute than just providing an SQL string, because they have to be built up (a dozen or more method calls) and then compiled before being executed. Sometimes that is a very good trade-off, other times not.

Of course, that means there are many use cases for which we do not have a solution. Not all queries are valid in all SQL dialects. The only ways to ensure that such queries work cross-database are:

Parse all query strings and mutate them into DB-specific SQL. As discussed above, this is simply infeasible due to both performance concerns and the shear effort involved. Some databases have concepts that simply do not exist in other databases. Require every single database query to go through a query builder so that we have a pre-parsed data structure to work with. We do this already for modifier queries, but that's pushing it. Select queries are much more common, and the performance cost there much higher. And even then, some mappings would be non-trivial. Plus, writing a built Select query is indisputably more work for the developer than just writing a literal query in the majority of cases. (Those cases where it's not? Well, use the query builder. It's a worthwhile trade-off there.) Require module developers to write portable SQL where possible. The only performance cost here is on the developer, not at runtime. Most queries are, in fact, reasonably portable. The cost to the developer here, given that PDO handles prepared statements for us, is fairly low. It's only an issue on more complex queries.

The only option that doesn't make Drupal Way Too Slow(tm) is #3: Write portable SQL where possible and submit bug reports where it hasn't been done. That's the approach that Drupal has taken, since doing more would be cost prohibitive.

If someone manages to come up with a brilliant solution that lets us do #1 without serious performance degradation, I'm all for it. However, I don't see that happening until Drupal 8, if ever. Database vendors need to get their act together and better standardize their SQL dialects.

Some problems we simply cannot solve on our own.