Beyond the RDBMS: the Brave New (Old) World of NoSQL by Andrew Grumet and Philip Greenspun in January 2011, as part of Three Day RDBMS

Consider the following:

Relational database management ideas and systems, introduced in 1970 by E.F. Codd, have been with us for over 40 years.

Michael Stonebraker, one of the pioneers of the RDBMS, says that "these legacy systems are at the end of their useful life" in "The End of a DBMS Era (Might be Upon Us)".

The Wikipedia page for NoSQL lists over 50 non-relational databases at the time of this writing.

With its native support for concurrent updates, the RDBMS enabled programmers of ordinary skill to build the early collaborative Internet applications in a reliable enough fashion to be useful. The RDBMS may have become a victim of its own success, as those applications were so useful to the millions of early Web users that they encouraged billions to sign up for Internet access (users by country). Twitter, a primarily text-based service, was collecting 7 TB of new data every day in early 2010 (slide show). An accepted definition of a "very large database" in 2000 was "more than 1 TB" and a database size between 5 and 50 terabytes (one week of Twitter data!) was something to write about in an academic journal (example).

Will a single RDBMS server be adequate?

When the relational model is a natural fit to your data, the simplest and usually least expensive way to run any Internet application is with a single physical computer running an RDBMS from a set of local hard drives. The system administration and hosting costs of running one computer are lower than those of running more than one computer. The cost in time, hardware, and dollars of synchronizing data is never cheaper than when all of the data are in the main memory of one machine. The costs of software development and maintenance are also low, since SQL is so widely understood and SQL programs tend to be reliable.

Below are some criteria for evaluating when it is time to consider abandoning the simple one-server RDBMS as the backend for an application.

Transaction throughput

It is difficult to find realistic benchmarks for the kind of database activity imposed by an Internet application. The TPC-E benchmark is probably the closest among industry standards. TPC-E is a mixture of fairly complex reads (SELECTs) and writes (INSERTs and UPDATEs) into an indexed database that grows as statements are processed. The "transactions per second" figures put out by the TPC include both reads and writes. In 2010, a moderately priced Dell 4-CPU (32-core) server hooked up to, literally, more than 1000 hard disk drives, processed about 2,000 transactions per second into an 8-terabyte database.

The authors crowd-sourced the question in this blog posting; for smaller databases that can be stored mostly on solid-state drives, it seems that a modest $10,000 or $20,000 computer should be capable of 10,000 or more SQL queries and updates per second.

If you think that there is a realistic chance of exceeding one thousand SQL operations every second, you should put some effort into benchmarking your hardware to see if it falls over. Note that at three SQL statements per page served, 10 pages per user session, and 86,400 seconds in a day, 1,000 SQL operations per second translates to a capacity of 2,880,000 users per day (assuming that the load is smooth, which is probably unreasonable, so divide by two to get 1,440,000 users per day).
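A quick sanity check of that arithmetic:

```python
# Capacity estimate from the figures above.
ops_per_second = 1_000
sql_per_page = 3
pages_per_session = 10
seconds_per_day = 86_400

pages_per_day = ops_per_second * seconds_per_day // sql_per_page
users_per_day = pages_per_day // pages_per_session
print(users_per_day)       # perfectly smooth load
print(users_per_day // 2)  # with a 2x peak-to-average allowance
```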

Note that some Web developers manage to load the database with more than 1,000 SQL requests every second even when there are only a handful of users. In tools such as Ruby on Rails, for example, it is possible for a programmer to generate code that, unbeknownst to him or her, will fetch 50,000 rows from the database using 50,000 SQL queries rather than a single query that returns 50,000 rows (and then of course the programmer will use Ruby to filter that down to the 30 rows that are displayed to the user!). See this tale of a sluggish Ruby on Rails server for more.
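A sketch of the anti-pattern using Python's built-in sqlite3 module, with made-up tables (the principle is the same in Rails):

```python
import sqlite3

# Hypothetical schema to illustrate the "N+1 queries" anti-pattern.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INT)")
db.executemany("INSERT INTO customers VALUES (?, ?)",
               [(i, f"customer{i}") for i in range(100)])
db.executemany("INSERT INTO orders VALUES (?, ?)",
               [(i, i % 100) for i in range(1000)])

# Anti-pattern: one query per row, i.e. 1 + N round trips to the database.
orders = db.execute("SELECT id, customer_id FROM orders").fetchall()
names_slow = [db.execute("SELECT name FROM customers WHERE id = ?",
                         (cid,)).fetchone()[0] for _, cid in orders]

# Better: let the database do the join in a single query.
names_fast = [row[0] for row in db.execute(
    "SELECT c.name FROM orders o JOIN customers c ON c.id = o.customer_id "
    "ORDER BY o.id")]

assert names_slow == names_fast  # same answer, 1 query instead of 1001
```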

Global reach

Users who live far, in network terms, from the data center will have a slower experience of the service than users who live close to the data center. There are simply more network hops for the data to travel, and if those data are not cacheable, users will have to make the full round trip every time. In some cases the round trips can be sped up using route optimization such as Akamai's dynamic site accelerator. Unlike a traditional content delivery network (CDN) setup, the edge servers on these systems do not cache; they simply proxy uncached user data. Between the data center and the user, the data travel a CDN-managed "express lane" that delivers higher speeds to the user. This arrangement allows you to improve global performance without having to distribute the data.

An alternative is to move the data closer to the users. If the data can be partitioned by the user id, you can set up a second data center in Europe and place data for EU community users there, a third data center in China and so on. Amazon.com did this when they set up Amazon UK and Amazon Germany. They copied all of the software onto a new server and set up shop (literally) in those foreign countries. It may be cumbersome to do a query comparing sales of a product in Germany to sales of the same product in the U.S., but management of each database is easier and the consequences of a failure don't shut down the business everywhere in the world.

A similar approach can be taken whenever there is limited value in comparing data from different users. Consider supporting smartphone users with an RDBMS storing contacts, emails, and calendar. Is there any value in comparing the content of Joe's email with Susan's phone numbers? If not, why lump them all together in one huge database? When that one huge database server fails, for example, every customer of the phone company will be calling in at once asking "What happened to my contacts?" [This is not a hypothetical, see "When the Cloud Fails: T-Mobile, Microsoft Lose Sidekick Customer Data".] Instead of one enormous cluster of exotic machines and hard drives, why not buy 100 1U ("pizza box") machines and assign customers to them according to the last two digits of their phone numbers?
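A minimal sketch of that customer-to-server assignment (the function name and digit handling are our assumptions):

```python
def shard_for(phone_number: str, shard_count: int = 100) -> int:
    """Assign a customer to one of `shard_count` pizza-box servers
    using the last two digits of the phone number, as suggested above."""
    digits = "".join(ch for ch in phone_number if ch.isdigit())
    return int(digits[-2:]) % shard_count

print(shard_for("+1 617-555-0142"))  # 42
print(shard_for("+1 415-555-0199"))  # 99
```

When one of the 100 machines fails, only the customers whose numbers end in those two digits are affected, rather than everyone at once.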

Service redundancy

Given a standard hard drive layout, committed transactions can always be recovered in the event of hardware or software failure. That doesn't mean, however, that data will always be available. The simplest approach to redundancy is a "hot standby" server that has access to the transaction logs of the production machine. If the production server dies for any reason, the hot standby machine can roll forward from the transaction logs and, as soon as it is up to date with the last committed transaction, take the place of the dead server. As a bonus, the hot standby machine can be used for complex queries that don't need to include up-to-the-second changes. For Oracle 11g, look for "Oracle Data Guard" to learn more about this approach.

To facilitate disaster recovery, e.g., after a fire that destroys a server room, changes to the database must be sent to a server in another building or in another part of the world. If transaction logs are sent every night, the worst possible consequence of the disaster will be the loss of a day's transactions. If that isn't acceptable, it will be necessary to use various replication and distribution strategies to ensure that transactions are transmitted to the disaster recovery server as part of the commit process (look up "two phase commit" in the Oracle documentation, for example).

RDBMS distributed databases

Here are some examples of products that go beyond the "one computer" architecture but are not part of the "NoSQL", "NoACID" fad:

HyperDB: HyperDB powers Wordpress.com, one of the 20 most visited sites on the Internet. HyperDB partitions Wordpress data by user and table, routing queries accordingly. Hence each partition is an independent RDBMS instance (when MySQL Replication, which introduces data inconsistency, is not used).

Oracle RAC: Oracle's Real Application Cluster (RAC) builds service redundancy into the monolithic architecture by allowing multiple servers to open a single database residing on shared physical storage. Every server can write as well as read, but the writes must be coordinated via locking.

VoltDB: VoltDB, founded in 2009 by DBMS R&D pioneer Mike Stonebraker and serial entrepreneur Andy Palmer, implements data partitioning while providing full ACID guarantees and a SQL interface. Data are partitioned at the table level, by user-specified shard keys. Data partitions are also replicated for durability.

What are the major types of non-relational databases?

Key-value databases

At its simplest, a key-value database is a persistent associative array. The most familiar example is BerkeleyDB, a key/value store derived from the 1979 dbm. Since a key-value database can be straightforwardly implemented in any RDBMS as a single table with a VARCHAR key column and a BLOB value column, modern key/value databases tend to arise in the context of solving one or more of the RDBMS's deficiencies. Examples:

Memcached is a non-redundant, non-persistent, distributed key/value db with very fast in-memory access. It is typically used for caching the results of RDBMS queries in web applications. For example, suppose that the home page of a sports site contains scores from 45 different games and some statistics that must be computed or that require looking at historical data in RDBMS tables. All or a portion of that home page could be computed every minute and then stored in memcached to be served to thousands or tens of thousands of users who requested the page until the next cache update. This reduces the load on the DBMS, at least for that page, by a factor of however many visitors come to the site between cache updates. Memcached is used by some of the largest sites on the Internet, including YouTube, Twitter and wordpress.com. Caveat: Remember that if you keep "stuffing crud into RAM" you'll eventually need a lot more RAM. If your core database is, say, N GB, you'll want N GB of RAM on your RDBMS server so that typical queries aren't slowed down by going to disk. If you have 20 load distribution machines, each with a cache for much of the data, you'll now have to buy 20*N GB of RAM.

Dynamo is Amazon.com's highly-available key/value store, developed to support millions of reliable shopping carts every day in the face of a "small but significant number of server and network components that are failing at any given time." Redundant copies of data are stored across an array of nodes that can be added and removed (e.g., if dead) without requiring manual partitioning or redistribution. Dynamo is not available for public use, but a number of projects including Voldemort aim to replicate the design set forth in Amazon's white paper.

Cassandra, developed at Facebook and later open-sourced, is a "structured key-value store". Like Dynamo, Cassandra spreads copies of data across a distributed set of nodes that can be added to and removed from the live system without disrupting it. Cassandra adds timestamps to the values, as well as the notion of "SuperColumns", which allows nested key/value pairs.
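As an aside, the single-table RDBMS implementation mentioned at the top of this section really is only a few lines; here is a sketch using Python's built-in sqlite3 module (names are illustrative):

```python
import sqlite3

# A minimal key/value store on top of an RDBMS: one table with a
# VARCHAR key column and a BLOB value column.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE kv (k VARCHAR PRIMARY KEY, v BLOB)")

def put(key, value):
    # INSERT OR REPLACE is SQLite's upsert; other RDBMSes spell it differently.
    db.execute("INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (key, value))

def get(key):
    row = db.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
    return row[0] if row else None

put("user:42", b"Joe")
put("user:42", b"Susan")   # overwrite: last write wins
print(get("user:42"))      # b'Susan'
```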

If you'd like to say that you're running both an RDBMS and a NoSQL DBMS, this posting on HandlerSocket explains how to bypass the SQL parser and turn MySQL into a NoSQL database (thanks to Michael Edwards for pointing this out).

Object databases

A familiar problem for users of object-oriented languages is how to map the runtime class hierarchy to a relational database when object persistence is needed. To meet the challenge, a variety of object-relational mapping systems, such as Java's Hibernate and Ruby on Rails' ActiveRecord, have evolved. Object databases, on the other hand, represent objects directly, without any translation overhead. Given the popularity of object-oriented languages such as Java, it is a mystery why object databases aren't more popular. Indeed, folks in the 1980s were already talking about the imminent death of the RDBMS, to be supplanted by the mighty new ODBMS. One theory: database users turned out to care more about attributes than identity. Object DBMSes are very fast if you already know what you're looking for, e.g., the thing that this other thing points to. Relational databases, on the other hand, are better suited when you need to query for objects matching certain criteria, e.g., the things that are blue, happened on a Tuesday, and were not given away free.

Document databases

MongoDB is a distributed database system that supports data partitioning and redundancy. Data are stored in BSON, a binary JSON format. Mongo supports indexing and ad hoc queries. MongoDB is in production deployments on a number of high-profile web sites including Shutterfly, Foursquare and bit.ly. Commercial support is available from 10gen.com.

CouchDB is a distributed database system that supports data redundancy through replication, implements multi-version concurrency control (MVCC) to isolate readers from writers, and provides ACID guarantees. Data are stored in JSON format. CouchDB is deployed on a number of Facebook applications and other web sites.

What about MapReduce?

MapReduce is a framework for splitting up a big computing task into a number of smaller tasks that can be run in parallel. Let's illustrate with an example. Suppose that you have a very large HTTP server log file to parse on a mostly-idle four-core machine. Perhaps you need to count up the number of bytes returned for all responses delivered with either a 200 or 206 status code. You write a software routine that identifies lines matching the criteria for the request portion and status code, extracts the bytes transferred for those lines, and adds them to a running sum. Then you kick off the job and wait. As the parsing proceeds, you notice that only one of your four cores is busy. How could you use the full power of the machine to speed up the job?

One option is to split the log file into four roughly equal-sized parts, making sure to split along line boundaries in order to avoid parse errors. Now you can run four copies of the parser in parallel, one on each of the smaller files. Each routine runs on an available core, taking a filename as input and returning a byte count as output. Assuming that the processes are CPU-bound, this should run about four times faster than the original program. A final piece of computation is still required: you must add up the outputs of the 4 jobs to get the final byte count.

This is the essence, then, of a system that implements MapReduce:

1. A method for splitting up the input into roughly equal-sized parts.
2. Specification of a map routine --- the parser in this example --- that runs identically on each of the inputs.
3. A method for dispatching the map routines against the inputs, taking into account the available processing power and keeping track of which inputs have been processed and which still need work.
4. Specification and execution of a reduce routine --- SUM in this example --- that combines the outputs of the map routines when all have completed.
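Those four pieces can be sketched with Python's standard multiprocessing module; the log format and field positions below are simplifying assumptions:

```python
from multiprocessing import Pool

def map_bytes(lines):
    """Map routine: sum the bytes-transferred field of log lines whose
    status code is 200 or 206. Assumes Common Log Format, where the
    status and byte count are the last two fields ('-' means no body)."""
    total = 0
    for line in lines:
        fields = line.split()
        if len(fields) >= 2 and fields[-2] in ("200", "206") \
                and fields[-1].isdigit():
            total += int(fields[-1])
    return total

def total_bytes(log_lines, workers=4):
    # Split along line boundaries into `workers` parts (round-robin is
    # fine here because addition is order-independent).
    chunks = [log_lines[i::workers] for i in range(workers)]
    with Pool(workers) as pool:
        partials = pool.map(map_bytes, chunks)  # dispatch the map phase
    return sum(partials)                        # reduce phase: SUM

if __name__ == "__main__":
    log = [
        '10.0.0.1 - - [t] "GET /a HTTP/1.1" 200 1000',
        '10.0.0.1 - - [t] "GET /b HTTP/1.1" 404 50',
        '10.0.0.1 - - [t] "GET /c HTTP/1.1" 206 500',
        '10.0.0.1 - - [t] "GET /d HTTP/1.1" 200 -',
    ]
    print(total_bytes(log))  # 1500
```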

It turns out that a lot of the basic work of splitting, mapping, dispatching and reducing can be formalized for reuse in building MapReduce systems; Apache Hadoop is one such framework. Note that MapReduce is not itself a database management system. Instead, a database management system may employ MapReduce to run queries against large data populations using multiple cores and/or machines.

More: Wikipedia

Conclusion

More