On Persistence, Still Not Happy

Persistence is hard and something you need to deal with in every app. I've written about what's available in Squeak, written about simpler image-based solutions for really small systems where just dumping out to one file is sufficient; however, nothing I've used so far has satisfied me completely for various reasons, so before I get to the point of this post, let me do a quick review of my current thoughts on the matter.

Relational Databases

Tired of em, I don't care how much they have to offer me in the areas of declarative indexing and queries, transactions, triggers, stored procedures, views, or any of the handful of things they offer that I don't really want from them. The price they make me pay in programming just isn't worth it for small systems. I don't want my business logic in the database. I don't want to use a big mess of tables to model all my data as a handful of global variables, aka tables, that multiple applications share and modify freely. What I do want from them, transactional persistence of my object model, they absolutely suck at and all attempts to shoehorn an object model into a relational database ends up being an exercise in frustration, compromise, and cussing. I think using a database as an integration point between multiple applications is a terrible idea that just leads to a bunch of fragile applications and a data model you can't change for fear of breaking them. Enough said, on to more object-oriented approaches!

Active Record

Ruby on Rails has brought the ActiveRecord pattern mainstream, which was as far as I know, first popularized in Martin Fowler's book Patterns Of Enterprise Application Architecture which largely dealt with all the various known methods of mapping objects to databases. Initially, I wasn't a fan of the pattern and preferred the more complex domain model with a metadata mapping, but having written an object-relational mapper at a previous gig, used several open-source ones, as well as tried out several pure object databases, I've come to appreciate the simplicity and explicitness of its simple API.

If you have to work with a relational database, this is a fairly good compromise for doing so. You can't bind a real object model to a relational database cleanly without massive effort, so don't try, just revel in the fact that you're editing rows rather than trying to hide it. It works reasonably well, and it's easy to get other team members to use it because it's simple.

"Simplicity is the ultimate sophistication" -- Leonardo Da Vinci

Other Approaches

A total OO purist, or a young one still enamored with patternitis, wouldn't want objects to save themselves as an ActiveRecord does. You can see this in the design of most object-oriented databases available, it's considered a sin to make you inherit from a class to obtain persistence. I used to be one of those guys too, but I've changed my mind in favor of pragmatism. The typical usage pattern is to create a connection to the OODB server which basically presents itself to you as a persistent dictionary of some sort where you put objects into it and then "commit" any unsaved changes. They will save any object and leave it up to you what your object should look like, intruding as little as possible on your domain, so they say.

Behind the scenes there's some voodoo going on where this persistent dictionary tries to figure out what's actually been changed either by having installed some sort of write barrier that marks objects dirty automatically when they get changed, comparing your objects to a cached copy created when they were originally read, or sometimes even explicitly forcing the programmer to manually mark the object dirty. The point of all of this complexity of course, is to minimize writes to the disk to reduce IO and keep things snappy.

Simplicity Matters

What seems to be overlooked in this approach is the amount of accidental complexity that is imposed upon the programmer. If I have to open a connection to get a persistent dictionary to work with, I now have to store this configuration information, manage the creation of this connection, possibly pool it if it's an expensive resource, and decide where to hang this dictionary so I can have access to it from within my application. This is usually some sort of current session object I can always reach such as a WASession subclass in Seaside. Now, this all actually seems pretty normal, but should it be?

I'm not saying this is wrong, but one has to be aware of the trade-offs made for any particular API or style. At some point, you have to wonder if we're not suffering from some form of technical Stockholm syndrome where we forget that all this complexity is killing us and we forget just how painful it really is because we've grown accustomed to it.

Sit down and try explaining one of your programs that use some of this stuff to another programmer unfamiliar with your setup. If you really pay attention, you'll notice just how much of the explaining you're doing has nothing to do with the actual problem you're trying to solve. Much of it is just accidental complexity for plumbing and scaffolding that crept in. If you spend more time explaining the persistence framework than your program and the actual problem it's solving, then maybe that's a problem you'll want to revisit sometime. Do I really want to write code somewhat like...

user := User firstName: 'Ramon' lastName: 'Leon'. self session commit: [ self session users at: user id put: user ].

with all the associated configuration setup and cognitive load of remembering what I called the accessor to get #users and how I'm hashing the user for this or that class while remembering the semantics of what exactly is committed, or whether I forgot to mark something dirty, or would I rather do something more straight forward and simple like this...

user := User firstName: 'Ramon' lastName: 'Leon'. user save.

And just assume the object knows how to persist itself and there's no magic going on? If I say save I just know it commits to disk, whether there were any changes or not. No setup, no configuration, no magic, just save the damn object already.

Contrary to popular belief, disk IO is not the bottleneck, my time is the bottleneck. Computers are cheap, ram is cheap, disks are cheap, programmer's time is usually by far the largest expense on any project. Something simple that just works OK but solidly every time is far more useful to me than something complex that works really really well most of the time but still breaks in weird ways occasionally, forcing me to dig into someone else's complex code for change detection or topological insertion sorting and blow a week of programmer time working on god damn plumbing. I want to spend as much time as possible when programming working on my actual problem, not fighting with the persistence framework to get it to behave correctly or map my object correctly.

A Real Solution

Of course, GemStone is offering GLASS, a 4 gig persistent image that just magically solves all your problems. That will be the preferred option for persistence when you really need to scale in the Seaside world, and I for one will be using it when necessary; however, it does require a 64 bit server and introduces the small additional complexity of changing to an entirely different Smalltalk and learning its class library. Definitely an option if you outgrow Squeak. But will you? I'll get into GemStone more in another post when I can get more into it and give it the attention it deserves, but my main point now is that there's still a need for simple GemStone'ish like persistence for Squeak.

Reality Check

Let's be honest, most apps don't need to scale. Most apps in the real world are written to run small businesses, which DHH calls the fortune five million. The simple fact is, in all likelihood scaling is not and probably won't ever be your problem. We might like to think we're writing the next YouTube or Twitter, but odds are we're not. You can make a career just replacing spreadsheets from hell with simple applications that make people lives easier without ever once hitting the limits of a single Squeak image (such was the inspiration for DabbleDb), so don't waste your time scaling.

You don't have a scaling problem unless you have a scaling problem. Even if you do have an app that needs to scale, it'll probably need 2 or 3 back end supporting applications that don't and it's a waste of time making them scale if they don't need too. If scaling ever becomes a problem, be happy, it's a nice problem to have unless you're doing something stupid like giving away all of your services for free and hoping you'll figure out that little money thing later on.

Conventions Rule

Ruby on Rails has shown us that beyond making things easier with ActiveRecord, things often need to be made more structured and less configurable. Configuration is a hidden complexity that Java has shown can kill any chance for any real productivity, sometimes having more configuration than actual code. It's amazing how much simpler programs can get if you just have the guts to make a few tough choices, decide how you want to do things, and always do it that way. Ruby on Rails true contribution to the programming community was its convention over configuration philosophy, ActiveRecord itself was in use long before Rails.

Convention over configuration is really just a nice way of the framework writer saying "This is how it's done and if you don't like it, tough." The problem then of course becomes finding a framework with conventions you agree with, but it's a big world, you're probably a programmer if you're reading this, so if you can't find something, write your own. The only problem with other people's frameworks is that they're other people's frameworks. There's nothing quite like living in a world of your own creation.

What I Wanted

I wanted something like ActiveRecord from Rails but not mapped to a relational database, that I could use with Seaside and Squeak for small applications. I've accepted that if I need to scale, I'll use GemStone, this limits what I need from a persistence solution for Squeak.

For Squeak, I need a simple, fast, configuration free, crash-proof, easy to use object database that doesn't require heavy thinking to use, optimize, or explain to others that allows me to build and iterate prototypes and small applications quickly without having to keep a schema in sync or stop to figure out why something isn't working, or why it's too slow to be usable.

I don't want any complex indexing schemes to be necessary, which means I want something like a prevalence system where all the objects are kept in memory all the time so everything is just automatically fast. I basically just want my classes in Squeak to be persistent and crash-proof. I don't need a query language, I have the entire Smalltalk collections hierarchy at my disposal, and I sure as hell don't need SQL.

I also don't want a bunch of configuration. If I want to find all the instances of a User in memory I can simply say...

someUsers := User allInstances.

Without having to first go and configure what memory #allInstances will refer to because obviously I want #allInstances in the current image. After all, isn't a persistent image what we're really after to begin with? Don't we just want our persistent objects to be available to us as if they were just always in memory and the image could never crash? Shouldn't our persistent API be nearly as simple?

Since I'm basically after a persistent image, I don't need any configuration; the image is my configuration. It is my unit of deployment and I've already got one per app/customer anyway. I don't currently, nor do I plan on running multiple customers out of a single image so I can simply assume that when I persist an instance, it will be stored automatically in some subdirectory in the directory my image itself is in, overridable of course, but with a suitable default. If I want to host another instance of a particular database, I'll put another image in a different directory and fire it up.

And now I'm finally getting to the point...

SandstoneDb

Since I couldn't find anything that worked exactly the way I wanted, though Prevayler was pretty close, I just wrote my own. It's a simple object database that uses SmartRefStreams to serialize clusters of objects to disk. Ordinary ReferenceStreams can mix up your instance variables when deserializing older versions of a class.

The root of each cluster is an ActiveRecord / OODB hybrid. It makes ActiveRecord a bit more object oriented by treating it as an aggregate root and its class as a repository for its instances. I'm mixing and matching what I like from Domain Driven Design, Prevayler, and ActiveRecord into a single simple framework that suits me.

SandstoneDb API

To use SandstoneDb, just subclass SDActiveRecord and restart your image to ensure the proper directories are created, that's it, there is no further configuration. The database is kept in a subdirectory matching the name of the class in the same directory as the image. This is a Prevayler like system so all data is kept in memory written to disk on save; on system startup, all data is loaded from disk back into memory. This keeps the image itself small.

Like Prevayler, there's a startup cost associated with loading all the instances into memory and rebuilding the object graph, however once loaded, accessing your objects is blazing fast and you don't need to worry about indexing or special query syntaxes like you would with an on-disk database. This of course limits the size of the database to whatever you're willing to put up with in load time and whatever you can fit in ram.

To give you a rough idea, loading up a 360 meg database containing about 73,000 hotel objects on my 3ghz Xeon Windows workstation takes about 57 minutes. That's an average of about 5k per object. Hefty and definitely pushing the upper limits of acceptable. Of course, load time will vary depending upon your specific domain and the size of the objects. This blog is nearly two years old and only has a few hundred objects varying from 2k to 90k, some of my customers have been using their small apps for nearly a year and only accumulated 500 to 600 business objects averaging 0.5k each. Load time for apps this small is insignificant and using a relational database would be akin to using a sledgehammer to hang an index card with a thumbtack.

API

SandstoneDb has a very simple API for querying and iterating on the class side representing the repository for those instances:

queries

#atId: (for fetching a record by its #id)

#atId:ifAbsent:

#do: (for iterating all records)

#find: (for finding first matching record)

#find:ifAbsent:

#find:ifPresent:

#findAll (for grabbing all records)

#findAll: (for finding all matching record)

Being pretty much just variations of #select: and #detect:, little if any explanation is required for how to use these. The #find naming is to make it clear these queries could potentially be more expensive than just the standard #select: and #detect:.

Though it's memory-based now, I'm leaving open the option of future implementations that could be disk-based allowing larger databases than will fit in memory; the same API should work regardless.

There's an equally simple API for the instance side:

Accessors that come in handy for all persistent objects.

#id (a UUID string in base 36)

#createdOn

#updatedOn

#version (useful in critical sections to validate you're working on the version you expect)

#indexString (all instance variable's asStrings as a single string for easy searching)

Actions you can perform on a record.

#save (thread safe)

#save: (same as above but you can pass a block if you have other work you want done while the object is locked)

#critical: (grabs or creates a Monitor for thread safety)

#abortChanges (rollback to the last saved version)

#delete (thread safe)

#validate (for subclasses to override and throw exceptions to prevent saves)

You can freely have records holding references to other records but a record must be saved before it can be referenced. If you attempted to save an object that references another record that answers true to #isNew, you'll get an exception. Saves are not cascaded, only the programmer can know the proper save order his object model requires. To do safe cascaded saves would require actual transactions. Saves are always explicit, if you didn't save it, it wasn't saved, there is no magic, and you should never be left scratching your wondering if your objects were saved or not.

Events you can override to hook into a records life cycle.

#onBeforeFirstSave

#onAfterFirstSave

#onBeforeSave

#onAfterSave

#onBeforeDelete

#onAfterDelete

Be careful with these, if an exception occurs you will prevent the life cycle from completing properly, but then again, that might be what you intend.

A testing method you might find useful on occasion.

#isNew (answers true prior to the first successful save)

Only subclass SDActiveRecord for aggregate roots where you need to be able to query for the object, for all other objects just use ordinary Smalltalk objects. You DO NOT need to make every one of your domain objects into ActiveRecords, this is not Ruby on Rails, choosing your model carefully gives you natural transaction boundaries since the save of a single ActiveRecord and all ordinary objects contained within is atomic and stored in a single file. There are no real transactions so you can not atomically save multiple ActiveRecords.

A good example of an aggregate root object would an #Order class, while its #LineItem class just be an ordinary Smalltalk object. A #BlogPost is an aggregate root while a #BlogComment is an ordinary Smalltalk object. #Order and #BlogPost would be ActiveRecords. This allows you to query for #Order and #BlogPost but not for #LineItem and #BlogComment, which is as it should be, those items don't make much sense outside the context of their aggregate root and no other object in the system should be allowed to reference them directly, only aggregate roots can be referenced by other objects.

This of course means should you improperly reference say an #OrderItem from an object other than its parent #Order (which is the root of the file they're bother stored in), then you'll ultimately end up referencing a copy rather than the original because such a reference won't be able to maintain its identity after an image restart.

In the real world, this is more than enough to write most applications. Transactions are a nice to have feature, they are not a must-have feature and their value has been grossly oversold. Starbucks doesn't use a two-phase commit, and it's good to remind yourself that the world chugs on anyway, mistakes are sometimes made and corrective actions are taken, but you don't need transactions to do useful work. MySql became the most popular open-source database in existence long before they added transactions as a feature.

Here are some examples of using an ActiveRecord...

person := Person find: [ :e | e name = 'Joe' ]. person save. person delete. user := User find: [ :e | e email = 'Joe@Schmoe.com' ] ifAbsent: [ User named: 'Joe' email: 'Joe@Schmoe.com' ]. joe := Person atId: anId. managers := Employee findAll: [ :e | e subordinates notEmpty ].

Concurrency is handled by calling either #save or #save: and it's entirely up to the programmer to put critical sections around the appropriate code. You are working on the same instances of these objects as other threads and you need to be aware of that to deal with concurrency correctly. You can wrap a #save: around any chunk of code to ensure you have a lock on that object like so...

auction save:[ auction addBid: (Bid price: 30 dollars user: self session currentUser) ].

While #critical: lets you decide when to call #save, in case you want other stuff inside the critical section of code to do something more complex than a simple implicit save. When you're working with multiple distributed systems, like a credit card processor, transactions don't really cut it anyway so you might do something like save the record, get the auth, and if successful, update the record again with the new auth...

auction critical: [ [ auction acceptBid: aBid; save; authorizeBuyerCC; save ] on: Error do: [ :error | auction reopen; save ] ]

That's about all there is to using it, there are some more things going on under the hood like crash recovery and startup but if you really want to know how that works, read the code. SandstoneDb is available on SqueakSource and is MIT licensed and makes a handy development and prototyping or small application database for Seaside. If you happen to use it and find any bugs or performance issues, please send me a test case and I'll see what I can do to correct it quickly.