By Kore Nordmann, first published at Tue, 28 Aug 2007 10:16:10 +0200

Why active record sucks

It is not really Active Record (AR) which sucks but the implied, perhaps just misinterpreted, common usage as an ORM (object relational mapping). To summarize the following blog post in one sentence, so that you may skip reading it:

Active Record may be used to implement ORM, but it should never be used as an ORM.

What is Active Record again?

Active Record (AR), Table Row Gateway, Table Data Gateway ... let's not difference between them, but just consider the concept of plain mapping of a row to an object with properties representing the columns of a table.

Of course you always have some mechanism which enables you to fetch the related contents depending on your ERM (entity relationship model). This may happen with explicit calls to fetch, let's say, the author of a news post, or a more advanced implementation of AR may even construct JOINS in the initial query to fetch all requested related data.

This sounds perfect to map my business objects to the database, or?

The basic model

Considering a very simple DB model AR may work for you. Let's consider the following structure for something like a very simple news system:

+-----------+ | Article | +-----------+ | id | +--------+ | author_id | ---> | Author | | title | +--------+ | text | | id | +-----------+ | name | | email | +--------+

For such a structure even AR fulfills all basic requirements for an ORM, when you extend the basic classes with some validations, which are not topic of this blog post.

Causing some trouble

Having a model and an easy to use interface which provide full access to the data, and abstracts away the SQL by generating the queries still seems a quite perfect solution - so let's introduce some requirements which are not properly realizable with Active Record. This is - of course - not a complete list, but very basic and quite common requirements for a data model.

Translation, Versioning and Status

A quite common requirement is translation of contents. In this case you want to add some key to your table containing the contents of your application which somehow maps to a language. The logical consequence is, that you get more then one article with the same value in the column 'id' and need to define a multi column index on the columns 'id' and '~language'.

If somebody is able to edit contents, which commonly will be true for articles managed by some web interface, you should introduce content versioning in your application, so that no content is lost, if somebody accidentally deleted some text or somebody abuses his rights to do bad stuff.

Here the same applies as for translation - you need an additional column representing the version and you get a three column primary index for your data.

I am not really sure if there are an AR implementation which handle the case of multi column indexes properly - if you know some please point me to them, but even if it is supported you end up with a messy table containing not normalized data, with duplicate titles, which are not required to store, if they did not change between two versions.

You may also have some other attributes like states ("published", "draft", ...) for your content which will pollute your table even more. State changes on one version of an article / in one row will result in the loss of change histories - and otherwise you might just need to copy the complete the contents of an article, because it has been published by the editor without any changes in an article draft...

Variants

Think of the case you do not only want to show articles on your website, but some other contents, like blog posts which also contain a list of tags or categories, perhaps you want to include project news which also contain some release information...

This will either mess up you basic table by adding optional attributes which destroys the semantic meaning of the table by containing completely different stuff, or you would need to add several new tables which will be aggregated into one list, sorted by some common criteria and handled by different views.

Now think of a online shop, or some other complex website, where you do not only have three types of contents, but hundreds or thousands - you may come up with the idea of serializing the content specific data in some way (XML, PHP specific stuff, JSON, ...), but you are not serious about that, are you? This would lead to a whole bunch of other problems like the requirement for specialized search engines, unnecessary parsing, using semantical information from the DB view...

Relations

A different requirement where AR fails totally used as an ORM are n:m relations with associated metadata. To stay with model introduced above you could imagine that you want to add references to existing articles in a new article you write - of course it should be a potentially finite count of article references. The metadata could be a description why you add this article as a reference, some date and / or link title - a table like this would be the result:

+----------------+ | reference | +----------------+ | source_id | | destination_id | | title | | description | | date | +----------------+

In the optimal cases - the AR implementation automagically fetches the right articles for you - you would end up with an article object containing a bunch of objects with the reference metadata, each containing the destination article object.

But, from a business object point of view, there are two things wrong now with your data structure. You normally just some reference objects to render in a defined way in your views, and the articles of course should not be fetched multiple times, if one article references another one multiple times.

You can work with this case, you may be able to handle it properly in your views, but this won't be the data structure you probably would design when not having the database scheme in your mind - and letting the abstraction layer define you business object structure just is the wrong way to go.

Conclusion

The conclusion is simple:

You either destroy your business objects or your DB model.

This "OR" is not a "XOR" - you may even destroy both, just because you used the wrong tool for the right job - ORM per se is nothing bad. The above described examples force you to use hacks of a model which originally were thought to clean up your data storage system and offer a nice model. You may reach the opposite.

Proper ORM

As usually, doing something the proper way, it gets harder. I am sorry for this ;). And as always, there is no magic store, where you may just put your data, without thinking about your application and hope that everything works perfectly...

A trivial model

Let's take a look at one model which can map the above described problems - for now ignoring the relations. This is of course not the ultimate solution, but just an example. I will mention some of its weaknesses later.

Entity Relationship Model: Object Relation Mapping

The model is structured in four table clusters.

Content class

The content class is the equivalence to some class definition in your object orientated code. It defines the structure of some content, which has been done in the AR example above by the SQL table. To define the content class for an article you create an entry in the table content_class with a well chosen name and create entries in the table content_class_attribute with the properties of your content class - title and text in the example above. Those attributes are of some attribute type which may be anything from string, text, date, integer up to complex data types like (multi) select boxes, geo information, models, images or something completely different.

Obviously you are then able to automagically create forms using various widgets (combinations) - but this is not really relevant when it comes to the data model.

Instance

Of course you should be able to have a finite count of instances (articles) of your content class (article). Those are stored once , with an optional name and a reference to the defining content class in the table content_object.

The real instances, beside the identifier, are the versions - if you want language and / or status aware - in the table content_object_version. The attributes data, which actually somehow defines the object, is stored content_object_data with the obvious references to the obvious tables. The version and object data tables are using a multi column index on version, language and id here.

Metadata

You see very few essential metadata in the example, like the language, status, creator and the creation date. The last two ones probably would better fir in some abstracted metadata storage extended by tags, ratings or whatever you want to add here. This is not really relevant.

Attributes

As written earlier you can easily generate forms (not only for (X)HTML) from those content class definitions, but you can also easily include descriptions, configurations of data types (with some default configuration options in the table attribute_type), which may define valid value ranges, limit the host for some URL or whatever you imagine.

If an attribute has not been changed in a new version of a content object you may automatically fetch it from earlier versions of the content objects without any real drawbacks, but this may result in more complex queries. It is up to you to find the proper balance for your application - as always.

And the n:m relations?

Until now nothing is defined for the code basis you use. You just may define a common n:m relation with some metadata. In the case of busimess.org, for example, the relation data again is defined by a structure like the one above.

Considerations

This model only eliminates the drawbacks on the DB model side, no possible drawback on the application side, like mentioned above with relations in AR. Implementing this model as a backend in your application you should do the mapping or use some content management framework which already implements a model like this.

Enhancements to the model

eZ Publish, for example does not only versions and translate the content objects, but also the content classes - you may imagine that this makes some of the queries even a bit more complex, but when good caching comes in place here, this shouldn't be such a big deal with common requirements - and if you have very special performance requirements you still may use views, another model or more caching ;). And - as written earlier - this is not meant being the optional model for each use case. Something like this obviously does not exist.

Including model in your application

The earlier mentioned relations between objects are one way to include such a model in your application, if you have data only structured by some graph, as known from computer science. This is true for wikipedia, for example, where the only structure is given by references between articles - hard to brows for the visitor and not a very common access method.

Tree

Far more common are trees which put the data into a hierarchic structure - this can be read as some navigation, or maybe you also want to structure the data not displayed on your website in a tree and only display a subtree.

This can be done easily by using some extra table defining the hierarchic structure using some of the common algorithms to map trees into a database like parent child relations, or nested sets where the nodes relate to some object. You may want to consider being able to connect one object to multiple nodes...

eZ Publish

In eZ Publish you get one tree which contains objects, similar to the ones described above, connected to one or more nodes of the tree (where one is the main node). Additionally you get the possibility to add relations between the objects (so that you don't have a tree any more in theory, but this does not really matter anywhere).

Each object in eZ Publish is connect to at least one node, and nearly everything is such an content object. Even users, for example, are only sub content objects of a subtree which is an instance of the content class user group.

Of course you may also add user content objects to other nodes, you may define permissions on parent content class, define which content class / object may have childdren at all, or which children it may have, for users or groups, or complete sub trees, either by sub tree or section. Going into detail here obviously won't fit any blog post. ;)

Other ORMs

There are complete different ways to define your ORM. You may use an arbitrary table structure and define some random mapping using some XML definitions, or even a mapping in your database. Both ways can make the mapping completely independent from your business objects and your database schema.

Then ... why does AR exist at all?

AR still may be used for accessing the tables in the model described above. If this makes sense again depends on your use case, and - for example - may be useful as an abstraction layer of the used relational database management system.

Also you may try using some AR framework to just do rapid prototyping. But you should know the limited scope of this approach - and hope your boss will not come into your office and request something which may not be implemented using AR - or you can just do a quick hack, of course. ;)

Conclusion

As shown, AR is just a wrapper for database access and not appropriate as an ORM. Used carefully, or with very simple applications it really may proof useful.

But there are other ways to abstract the access to your database like query construction using some object orientated interface, like it is done in the database component in eZ Components. The classes then may create SQL which work with the currently used RDBMS. Another (slow) way is parsing the query given by the user and restructuring for the current relational database management system.

The possibilities for object relational mapping described in this blog post are initially more complicate to use, but also may offer great extensibility, real mapping with future proof concepts. You won't run in the discussed problems. But, as always, higher abstraction cause more complex code and may bring you a performance impact - which could be solved by proper caching.

And, there is no perfect system to solve all your needs, which some people think Active Record was, but you should review your modelling concepts.

Other solutions

There are completely different solutions which may work better with your object structures like CouchDB - something Jan blogged about and is a really cool solution which does not force any relational database structures and still are ACID.