Feb 6, 2010 by Sean Cribbs

His reply was quite humorous (no, John, Riak won’t merge into MongoDB), but it got me thinking. So here are my…

Last fall, I heard about Riak and thought it sounded awesome, and it boasts a lot of neat features that I’ll go into detail about below. I was further impressed when I met the Basho team at nosqleast in Atlanta. They really seem to know what they’re doing.

7 Reasons why Riak is the Merb of databases (or better)

…and why it should be the database for your next Rails app.

Scales horizontally without pain

Riak runs just about in the same way on 1 node as it does 100. This is because, in short, it is an implementation of Amazon’s Dynamo. Need more storage capacity, throughput and availability? Add more nodes.

Now, a lot of people throw around the word “scalability” without really knowing what it means. It’s really a ratio of the desired properties of your system to the cost inherent in achieving those goals. High scalability means you have a low cost in proportion to your ability to grow. It has nothing to do with performance (although good performance can be a side-effect of good scalability). By this measure, actually, Rails scales well (shared nothing) — but that doesn’t mean it’s always performant (slow Ruby implementations, large memory consumption, etc). Despite the trolls, it’s pretty well known that Twitter’s scaling problems were not really caused by Rails.

So what makes Riak more capable of handling growth than MongoDB, CouchDB or MySQL? It’s built with master-less replication in mind from the beginning. When you grow a MongoDB or MySQL database, you typically add a slave server which receives and replays the transaction logs of the master server. Ideally, this slave can take over if the master goes down. There are also master-master setups and other ways of configuring replication, but in general it’s a thing you add on, not built-in.

Instead, Riak has no notion of a master database node. Every node participates equally in the cluster and spreads the load in a predictable way using consistent hashing (another feature of Dynamo). So yes, increasing capacity and throughput and all those other things you expect from your database is as easy as starting up new nodes and having them join the ring. Data will be replicated in the background. You can even run Riak nodes on less-powerful machines (think EC2) and get decent results. Large, monolithic DB servers are the last vestige of the mainframe era, so democratize your database layer!

HTTP-Compliant REST interface

Rails nerds like me love REST, but most don’t really understand what it means, or rather think that it is a one-to-one mapping to CRUD. Rails’ concept of REST is at best incomplete and potentially misleading. (See also Scott Raymond’s RailsConf 2007 talk)

The Riak developers spent a lot of time creating an awesome toolkit in Erlang for building RFC-compliant HTTP resources, and Riak has several of them. This means that when interacting with Riak, you use HTTP in a natural way and get predictable responses — even the hard stuff like content-negotiation, entity tags and conditional methods.

The side effect of composing Riak of well-behaved HTTP resources is that it plays very nicely with other HTTP-related infrastructure, including proxies, load balancers, caches, and clients of all stripes.

Robust failure recovery

One of Riak’s most compelling features is that it handles node failure robustly. If a node goes down in your cluster, its replicas will take over for it until it comes back, a feature known as hinted handoff. If it doesn’t come back — as happens all too often on EC2 — you can add a new node and the cluster will rebalance. The failure recovery story for MySQL, even in a replicated scenario, is much more difficult, time-consuming, and costly.

Riak has fine-grained robustness as well. It was built primarily in Erlang and with the OTP Design Principles in mind. One aspect of the OTP design is that the processes in the system are organized in a supervision tree, so that when one crashes, it will be restarted by its supervisor process (which is in turn supervised). Send some bad data to the web interface and get a 500 error? That internal error doesn’t crash the whole system, bringing your database node down.

Justin Sheehy, Basho’s CTO, tells a story of a customer whose cluster had two nodes go down late one night. The Basho team looked for less than 10 minutes at the cluster, verified that it was still going, and then decided to wait until the morning to fix the lost nodes. As Joe Armstrong, Father of Erlang, says (paraphrased), “In order to have fault tolerance, you need more than one computer.” Riak embraces that philosophy.

Links make graph-like structures possible

Most Dynamo implementations are simple key-value stores. While that’s still pretty useful — Dynomite is really awesome at storing large binary objects, for example — sometimes you need to find things without a priori knowledge of the key.

The first way that Riak deals with this is with link-walking. Every datum stored in Riak can have one-way relationships to other data via the Link HTTP header. In the canonical example, you know the key of a band that you have stored in the “artists” bucket (Riak buckets are like database tables or S3 buckets). If that artist is linked to its albums, which are in turn linked to the tracks on the albums, you can find all of the tracks produced in a single request. As I’ll describe in the next section, this is much less painful than a JOIN in SQL because each item is operated on independently, rather than a table at a time. Here’s what that query would look like:

GET /raw/artists/TheBeatles/albums,_,_/tracks,_,1

“/raw” is the top of the URL namespace, “artists” is the bucket, “TheBeatles” is the source object key. What follows are match specifications for which links to follow, in the form of bucket,tag,keep triples, where underscores match anything. The third parameter, “keep” says to return results from that step, meaning that you can retrieve results from any step you want, in any combination. I don’t know about you, but to me that feels more natural than this:

SELECT tracks.* FROM tracks INNER JOIN albums ON tracks.album_id = albums.id INNER JOIN artists ON albums.artist_id = artists.id WHERE artists.name = "The Beatles"

The caveat of links is that they are inherently unidirectional, but this can be overcome with little difficulty in your application. Without referential integrity constraints in your SQL database (which ActiveRecord has made painful in the past), you have no solid guarantee that your DELETE or UPDATE won’t cause a row to become orphaned, anyway. We’re kind of spoiled because ActiveRecord handles the linkage of associations automatically.

The place where the link-walking feature really shines is in self-referential and deep transitive relationships (think has_many :through writ large). Since you don’t have to create a virtual table via a JOIN and alias different versions of the same table, you can easily do things like social network graphs (friends-of-friends-of-friends), and data structures like trees and lists.

Powerful Map-Reduce and soon, Lucene search

The second way to find data stored in Riak is using Map-Reduce. Unlike links, where you have to start with a known key, you can run map-reduce jobs over any number of keys or an entire bucket.

One thing that might not be immediately obvious when using an SQL database is that the declarative nature of the query language hides the imperative nature of performing the query. SQL databases have sophisticated query planners that decompose your query into several steps (see EXPLAIN command on MySQL) and attempt to optimize the order of those steps based on known table properties, like indices, keys, and row counts. Riak’s map-reduce, while still largely declarative, lets you decide which order to run steps in. You trade a little abstraction for a lot of power.

Riak’s map-reduce is also quite different from CouchDB’s views:

You’re not trying to build an index which you’ll query later, you’re performing the query. You can have any number and combination of map, reduce and link phases (link is actually a special case of map). Since it’s not an index, your query need not be contiguous across the keyspace. There’s no concept of re-reduce because reduce phases are only run once.

Up until recently, you could only write map-reduce jobs in Erlang, but thanks to Kevin Smith’s awesome work, you can now write jobs in Javascript and submit them over the HTTP interface. This opens up a lot of doors to developers who aren’t familiar with Erlang but use Javascript every day.

Although not complete or released, Basho also has an awesome Lucene-compatible search system in the pipeline (already in use at Collecta, or so I hear). I saw it in action this past week while in Boston and was impressed. It’s fairly comparable in performance to Solr for large datasets, but scales across your Riak cluster.

Beyond schema-less: Content-type agnostic

One thing that may seem unintuitive at first is that Riak doesn’t care what type of content you put in it. CouchDB and MongoDB use JSON and BSON (Binary JSON), respectively, to store objects. Instead of forcing a format, Riak lets you store pretty much anything. There’s no concept of “attachments” or “GridFS” (unless you add it yourself), so just store the file directly in a bucket/key with a PUT or POST request. As long as you specify the “Content-Type” header, Riak will remember it and give you back the right thing when you access it the next time. No need to do base 64 encoding to store a picture, PDF, Word document, podcast, or whatever. This could enable you to replace a distributed filesystem like GFS, or even S3, with a Riak cluster. Bonus: you get replication and fail-over for no extra cost.

There are some minor kinks with large files (>100MB), but I’ve been assured by Basho that they are addressing the issue. In the meantime, you could chunk the file manually and follow links to get the pieces and put them back together.

Tunable levels of consistency, durability, and performance

The knee-jerk argument against Dynamo-like data stores is that they don’t have ACID properties and that “eventual consistency” is too eventual. In practice, especially in large deployments and high throughput scenarios, ACID breaks down because it requires actions to be “all or nothing”, effectively creating bottlenecks while clients wait for transaction handles. If you’ve done any distributed or concurrent computing, you know that contention for shared resources is a primary cause of failure, including problems like deadlock, starvation and race conditions. For the sake of reducing single points of failure (bottlenecks), Riak implements eventual consistency — where storage operations are accepted immediately, and then propagated across the cluster in an asynchronous fashion. This makes it nearly always available for writes and reads.

In order to deal with inconsistency, Riak tags each datum with a vector clock that internally reveals the datum’s lineage (who modified what version). When there are conflicts — that is, two parallel versions of the same datum — Riak returns you both versions so that your application can decide how to resolve it. In an SQL database with this kind of conflict, your transaction might fail and rollback, forcing you to resolve it and retry anyway.

Beyond just eventual consistency, Riak lets you tune the amount of consistency, durability, and availability you want, even somewhat at request-time. If you need more read availability and replication, you can increase the N value for a bucket (number of replicas, default 3). If you want to be sure the data you’re reading is consistent across the cluster, increase the R value at request time (R, read quorum, is always <= N). If you want high assurance that your data is stored, increase the W and DW values (W = write quorum, DW = durable-write quorum, both always <= N). If you want better performance, e.g. to use Riak as a persistent cache, keep the R, W, and DW values low. You also have the choice of which backend to use when storing your data on the replicas (anything from in-memory to on-disk, or combinations), including the recently released Innostore. Although in many cases the defaults are good enough, you have the flexibility to choose a model that fits your application.

(Note: currently, the N value must be set before you insert any data into the bucket).