If your first thought after reading the title is “because you’re crazy”, you won’t hear any disagreement from us. Deciding to build your own database is not something a completely sane person does. We don’t have any better reason than that we thought the idea was too cool to ignore and it fit our needs like a glove. We asked ourselves: “how hard could it be?” (HHCIB). Well, pretty hard, as it turns out.

The idea did not just come out of the blue. We started out working on a project called Emit Cloud. What we tried to build with Emit Cloud was a file synchronisation solution anyone could install and use (i.e. a private Dropbox) — something that could easily be used by an individual, a large company with thousands of employees, or a telecom that wants to offer a file synchronisation service to its users or corporate customers.

So our goals for Emit Cloud were:

Easy to use

Easy to administer

No single point of failure (SPOF)

Horizontal scalability

The last two, of course, are the kicker. We were aiming for the proverbial moon. If we hit it, great; if not, we would have settled for less. We really didn’t want to, though.

When you’re talking about no SPOF and horizontal scalability, the entire issue is how you are going to store your data and how you are going to keep it consistent. Emit Cloud data is basically files, directories and events. Files and directories form a hierarchy; events are a list of things that happened: move, new, delete, etc.

Events are easy to store. You could use any of the great key-value stores available and be scalable with no SPOF. You would lose a point, however, by requiring the administrator to maintain an external database they might not be familiar with. KV and document stores are all the rage in the web world; in the corporate world they are not as common.

Files and directories are quite an issue, however. A user can drop a large directory into their desktop client. You need a way to store that data and always keep it consistent; you can’t allow an inopportune crash to cause data inconsistency. Files and directories also form one big hierarchy. KV stores are unsuitable, document stores lack transactions, and SQL databases… well, they’re awesome. But storing lots of data and handling lots of queries quickly reaches a limit on a single master node, and you’ve got to keep a constant eye on it. Horizontal scalability and no SPOF are not something you can get easily out of SQL databases.

So we were torn on how to proceed. When you’re in such a dilemma, weighing the pros and cons of one technology over another, you can often find the right path by asking yourself: “what can I live without?” What does one solution offer that is entirely unnecessary for me? You need to ponder the nature of your data.

The nature of our data was such that, yes, it was relational and we needed transactions. But we only needed those transactions within a given user: the file system hierarchy of one user has nothing to do with that of a different user. Theoretically we could easily have user X on database A and user Y on database B. If we give every user a simple SQLite database, we can actually be horizontally scalable.
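The per-user split can be sketched as a simple deterministic routing function. This is an illustrative sketch only, not ActorDB’s actual API: the names (`NODES`, `db_for_user`) are assumptions, and it just shows how hashing a user id gives each user a fixed home database with no shared state between users.

```python
import hashlib

# Hypothetical node list; in a real deployment this would be the set
# of servers (or database files) available to hold per-user databases.
NODES = ["db-a", "db-b", "db-c"]

def db_for_user(user_id: str) -> str:
    """Deterministically map a user to one node/database.

    Because one user's hierarchy never references another user's,
    a transaction only ever has to touch the single database this
    function returns — which is what makes horizontal scaling easy.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]
```

The same user always lands on the same node, and adding capacity means growing the node list (in practice with consistent hashing, so existing users don’t all move).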

SQLite is pretty awesome: rock-solid stable, ACID, foreign keys, recursive queries (you can read an entire tree in one go), and good performance with WAL mode. We were getting excited.
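The “read an entire tree in one go” part is SQLite’s recursive common table expression support. A minimal sketch using Python’s stdlib `sqlite3` (the schema here is invented for illustration, not Emit Cloud’s actual one; the WAL pragma is shown for completeness, though it only takes effect on file-backed databases):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA journal_mode=WAL")   # illustrative; applies to on-disk DBs
conn.execute("PRAGMA foreign_keys=ON")
conn.execute("""CREATE TABLE nodes (
    id     INTEGER PRIMARY KEY,
    parent INTEGER REFERENCES nodes(id),
    name   TEXT NOT NULL)""")
conn.executemany("INSERT INTO nodes VALUES (?, ?, ?)", [
    (1, None, "/"),
    (2, 1,    "docs"),
    (3, 2,    "report.txt"),
    (4, 1,    "pics"),
])

# Recursive CTE: walk the whole hierarchy from the root in ONE query.
rows = conn.execute("""
    WITH RECURSIVE tree(id, name, depth) AS (
        SELECT id, name, 0 FROM nodes WHERE parent IS NULL
        UNION ALL
        SELECT n.id, n.name, t.depth + 1
        FROM nodes n JOIN tree t ON n.parent = t.id
    )
    SELECT name, depth FROM tree ORDER BY id
""").fetchall()
# rows: [('/', 0), ('docs', 1), ('report.txt', 2), ('pics', 1)]
```

One round trip returns the whole directory tree with depth information — exactly what a file-sync client needs when listing a user’s hierarchy.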

But of course there is that nagging issue of no SPOF. SQLite is just a software library, not a database server; it has no replication. This is where our “well, how hard could it be?” comes out.

There are multiple ways to implement replication for a database. You can replicate SQL statements and apply them on every server. This is the simpler solution and the one we went with first. It worked reasonably well but was not optimal by any means.

Right now what we do is replicate the actual WAL log: one node executes the query, the others just append the resulting frames to their WAL. This way no non-deterministic function like random() can cause inconsistency.
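The problem with statement replication is easy to demonstrate: replaying the same SQL text on two nodes diverges as soon as a statement is non-deterministic. A small sketch (the `apply_statement` helper is invented for illustration; each call plays the role of one replica):

```python
import sqlite3

def apply_statement(sql: str) -> int:
    """Simulate one replica applying a replicated SQL statement."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (v INTEGER)")
    conn.execute(sql)
    return conn.execute("SELECT v FROM t").fetchone()[0]

stmt = "INSERT INTO t VALUES (random())"
node_a = apply_statement(stmt)   # replica A evaluates random() itself...
node_b = apply_statement(stmt)   # ...and so does replica B
# node_a and node_b will almost certainly differ: the two "replicas"
# now hold different data even though they applied identical SQL.
```

Shipping WAL frames avoids this entirely: only one node evaluates the statement, and the others receive the resulting pages, i.e. the outcome rather than the recipe.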

But how do you achieve consensus if you have a database running on multiple nodes? At first we went with something homegrown (that damn HHCIB) that replicated SQL statements. It seemed to work well, but there was a much better option out there: the Raft consensus algorithm.

It’s not like we were not aware that it existed; the problem was how to actually use it with SQLite’s WAL. Raft requires some additional values to be stored, which meant patching SQLite’s WAL implementation with additional functionality, and we were hesitant to do that. Writing a server on top of a database engine is one thing (basically what ActorDB is); trying to write or change an existing database engine is an entirely different matter.

But for kicks, we started tinkering with it, seeing if we could get anywhere, and as it turned out, everything went pretty smoothly. SQLite is really well designed and documented, and the WAL code has a nice, clean interface to the rest of the DB. First we made a simple change: extending the WAL format. This allowed us to implement Raft. Then we went ahead with a larger update: combining the WAL files of every SQLite database (there can be thousands) into a single file, all while keeping the existing WAL logic and extending it with new functionality. Now even if we are sending writes to thousands of SQLite databases, every write is actually an append to the same, single WAL file.
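Conceptually, the combined log can be pictured as one append-only sequence of frames, where each frame is tagged with which database (actor) it belongs to plus the Raft bookkeeping values. This is a conceptual sketch only — the real format is SQLite’s binary WAL extended by ActorDB, and all field and class names here are assumptions, not ActorDB’s internals:

```python
from dataclasses import dataclass, field

@dataclass
class WalFrame:
    actor_id: int     # which SQLite database (actor) this page belongs to
    raft_term: int    # Raft metadata stored alongside the frame
    raft_index: int
    page_no: int
    page_data: bytes

@dataclass
class CombinedWal:
    """All actors append to one shared log instead of one WAL file each."""
    frames: list = field(default_factory=list)

    def append(self, frame: WalFrame) -> None:
        # Every write, regardless of actor, is an append to the same log.
        self.frames.append(frame)

    def frames_for(self, actor_id: int) -> list:
        # An actor's logical WAL is the subsequence of frames tagged
        # with its id — recovery replays just these for that database.
        return [f for f in self.frames if f.actor_id == actor_id]
```

The appeal of this layout is that writes from thousands of databases become sequential appends to one file, while each actor can still recover its own WAL by filtering on its id.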

One nice, unique feature of ActorDB is data locality. We don’t have one giant dataset that queries must navigate; we have many small datasets. If a user is stored in the system but is not doing anything, their data is out of the way. Active users execute queries only over their own private data.

So what we ended up with is something that works very well for us, and we think others might find it useful as well. There are many aspects of ActorDB that I have not elaborated on in this post, but they are explained at www.actordb.com

Written by Sergej Jurečko