SCALE 8x: Relational vs. non-relational

LWN.net needs you! Without subscribers, LWN would simply not exist. Please consider signing up for a subscription and helping to keep LWN publishing

PostgreSQL hacker Josh Berkus set out to do some "mythbusting" about differences in database technologies in his talk at SCALE 8x. While there are plenty of differences between the various approaches taken by database systems, those are not really the ones that are being highlighted by the technical press. In particular, the so-called "NoSQL movement" makes for a great soundbite, but is "not very informative or accurate". Berkus went on to survey the current database landscape while giving advice on how to approach choosing a database for a particular application.

This is a "more exciting time" to be a "database geek" than ever before, he said. Looking back seven years to 2003, he noted that there were essentially seven different free choices, all of which are SQL-based. In 2010, there are "dozens of new databases breeding like rabbits", with some 60 choices available. As an example of how quickly things are moving, Berkus noted that while he was in New Zealand at linux.conf.au, where a colleague was giving a related talk, two new databases were released.

Mythbusting

Berkus likened the NoSQL term to a partition that is created by putting dolphins, clown fish, and 1958 Cadillacs on one side and octopuses, Toyota Priuses, and redwood trees on the other—labeled as the "NoFins" group. The non-relational databases that are lumped together as NoSQL have "radically different" organizations and use cases. But, that's not just true of the non-relational databases, it's also true for the various relational databases as well.

Another myth that he pointed out was the "revolutionary" tag that gets associated with all of the new types of databases. Once again, that is a convenient soundbite that isn't accurate. He has not seen a new database algorithm since 2000, and all of the new crop of database systems are new implementations and combinations of earlier techniques. The new systems are not revolutionary, just evolutionary.

As an example, he put up a slide with the following description of a database: "A database storing application-friendly formatted objects, each containing collections of attributes which can be searched through a document ID, or the creation of ad-hoc indexes as needed by the application." He noted that it applies equally well to one of his current favorites, CouchDB, which was created in 2007, and to the Pick database system—the original object of the description—which was created in 1965.

Instead of a revolution, what we are seeing now is a "renaissance of non-relational databases". That description is far more accurate, Berkus said, and is a better way to view the change. It is a "big thing" that is going to "change the way that people use databases", so it is important to label it correctly.

Another myth is that non-relational databases are "toys", which is something that is often pushed by people who work on relational systems. Berkus pointed out that many SCALE sponsors would disagree: Google using Bigtable, Facebook using Memcached, Amazon with Dynamo, and so on.

The other side of that myth is that relational databases will become obsolete. Unsurprisingly, that myth is often promulgated by those who work on non-relational databases, and it is something that the relational community has heard before. Berkus pointed to a keynote speech in 2001 proclaiming that relational databases would be replaced with XML databases. He then asked if anyone even remembered or used XML databases; when even the crickets were silent, he pointed out that various relational and non-relational databases had hybridized with XML databases, incorporating the best features of XML databases into existing systems. He predicted that "over the next five years, we will see more hybridization" between different types of database technologies.

"Relational databases are for when you need ACID transactions" was myth number five. Support for transactions is "completely orthogonal" to the relational vs. non-relational question. There are systems like Berkeley DB and Amazon Dynamo that provide robust transactions in non-relational databases, as well as MS Access and MySQL that provide SQL without transactions.

The final myth that needs busting is the Lord of the Rings inspired "one ring theory of database use", Berkus said. There is "absolutely no reason" to choose one database for all of one's projects. He recommends choosing the database system that fits the needs of the application, or to use more than one, such as MySQL with Memcached or PostgreSQL with CouchDB. Another alternative is to use a hybrid, like MySQL NDB, which puts a distributed object database as a back-end to MySQL, or HadoopDB which puts PostgreSQL behind the Hadoop MapReduce implementation.

So, what about relational vs. non-relational?

Relational databases provide better transaction support than non-relational databases do, mostly because of the age and maturity of relational databases, Berkus said. Transaction support is something that many open source people don't know about because the most popular database (MySQL) doesn't implement it. Relational databases enforce data constraints and consistency because that is the basis of the relational model. There are other benefits of today's relational databases, he said, including complex reporting capabilities and vertical scaling to high-end hardware. He also noted that horizontal scaling was not that well-supported and that relational databases tend to have a high administrative overhead.

On the question of SQL vs. Not-SQL, Berkus outlined the tradeoffs. SQL promotes portability, multiple application access, and has ways to manage database changes over time. There are many mature tools to work with SQL, but SQL is a full programming language that must be learned to take advantage of it. Not-SQL allows fast interfaces to the data, without impedance-matching layers, which in turn allows for faster development. Typically, there are no separate database administrators (DBAs) for Not-SQL databases, with programmers acting in that role.

"It's always a tradeoff", Berkus said, but one place that a SQL-relational database makes the most sense is where you have "immortal data". If the data being stored has a life independent of the specific application and needs to be available to new applications down the road, SQL-relational is probably the right choice.

How to choose

For other situations, you need to define the "features you actually need to solve that particular problem" plus another list of features you'd like, "then go shopping". Chances are, he said, there is a database or combination of databases that fits your needs. He then went on to some specific application requirements, suggesting possible choices of database or databases to satisfy them.

I need a database for my blog : " use anything ", including MySQL, PostgreSQL, SQLite, CouchDB, flat files, DBase III, etc. Pick " whatever is easiest to install " because " it doesn't matter ".

: " ", including MySQL, PostgreSQL, SQLite, CouchDB, flat files, DBase III, etc. Pick " " because " ". I need my database to unify several applications and keep them consistent : For example a data warehousing application written C/C++ with reporting tools in Ruby and Rails, should use an OLTP SQL-Relational database like PostgreSQL. He also couldn't resist noting that the PostgreSQL 9 alpha was released the day before: " download it and test it out ".

: For example a data warehousing application written C/C++ with reporting tools in Ruby and Rails, should use an OLTP SQL-Relational database like PostgreSQL. He also couldn't resist noting that the PostgreSQL 9 alpha was released the day before: " ". I need my application to be location aware : a geographical database, such as PostGIS, is needed. Geographical databases allow queries like "what's near" and "what's inside".

: a geographical database, such as PostGIS, is needed. Geographical databases allow queries like "what's near" and "what's inside". I need to store thousands of event objects per second on embedded hardware : db4object is probably the right choice, but SQLite might also be considered.

: db4object is probably the right choice, but SQLite might also be considered. I need to access 100K objects per second over thousands of web connections : Memcached is a distributed in-memory key-value store, which is used by all of the biggest social networks. It can be used as a supplement to a back-end relational database. He also mentioned Redis and TokyoTyrant as possible alternatives.

: Memcached is a distributed in-memory key-value store, which is used by all of the biggest social networks. It can be used as a supplement to a back-end relational database. He also mentioned Redis and TokyoTyrant as possible alternatives. I have hundreds of government documents I need to serve on the web and mine for data : It's hard to get the government to release the data, so the structure of the data may not come with it, which means that the structure must be derived from examining the documents. For that, he suggests CouchDB.

: It's hard to get the government to release the data, so the structure of the data may not come with it, which means that the structure must be derived from examining the documents. For that, he suggests CouchDB. I have a social application and I need to know who-knows-who-knows-who-knows-who-knows-who : This is a very hard problem for relational databases and what's needed is a graphing database such as Neo4j. Long chains of relationships are difficult for relational databases, but graphing databases, used in conjunction with another database, can handle these kinds of queries, as well as queries to find items "you may also like".

: This is a very hard problem for relational databases and what's needed is a graphing database such as Neo4j. Long chains of relationships are difficult for relational databases, but graphing databases, used in conjunction with another database, can handle these kinds of queries, as well as queries to find items "you may also like". and so on ...

The slides [PDF] from Berkus's talk have additional examples. The basic idea is that "different database systems do better at different tasks" and it is impossible for any database system to do everything well, "no matter what a vendor or project leader may claim". For those who are looking for open source solutions, he recommended the Open Source Database survey which Selena Deckelmann has put together. While it is, as yet, incomplete, it does list around a dozen lesser-known database systems.

It is clear from the talk that it is an exciting time to be a database developer—or user for that matter. There are many different options to choose from, each with their own strengths and weaknesses, some of which can be combined in interesting ways. It is also very clear that there are many more axes to the database graph than just the overly simplified SQL vs. NoSQL axis that seems to dominate coverage of these up-and-coming database systems.

