Warning: this post is intended for developers. It gets a bit technical!

Sometimes I'm asked what platform we're running the Business Insider on. Well, we're using LAMP, of course: Linux, Apache, Mongo, PHP. After I get past defending our choice of PHP to the haters (you know who you are!), people ask what the M stands for. Our database is not MySQL, but Mongo.

So what's Mongo?

MongoDB is an open-source, non-relational database that combines three key qualities: scalable, schemaless, and queryable. It has native drivers for pretty much every major language, and a small but growing community.

Mongo's design trades off a few traditional features of databases (notably joins and transactions) in order to achieve much better performance. It is perhaps most comparable to CouchDB for its JSON document-oriented approach, but has much better querying capabilities: you can do dynamic queries without pregenerating expensive views.

So Mongo occupies a sweet spot for powering web apps.

Full disclosure: TBI and 10gen, the developers of MongoDB, share certain investors and board members. (Specifically, 10gen was also founded by our co-founder, Dwight Merriman, and Dwight still owns stock in both companies.) In early 2008, 10gen assisted in the development of what was then Alleyinsider, and the site adopted MongoDB at that time. But that was all before I got here. We continue to use MongoDB today, despite rearchitecting the rest of the platform, because I believe it's the best technology for us.

Here's why and how we use it:

It's Scalable

TBI gets fairly high traffic, and we're growing quickly. On a typical business day we serve upwards of 600k pageviews, and we're rapidly approaching the 1m mark. We have three load-balanced Apache webservers, but our database is just running on a single box. (We do have a slave, but use it only as a hot backup.) Our one box isn't running anywhere near total capacity despite fielding a few hundred reads and writes per second. Typically, even when the site's busy, we're running at under 5% of total CPU time.

When we do eventually need to scale up further, Mongo has automatic sharding features to distribute data and load across multiple boxes. We don't need these features yet, but it's good to know they exist.

Document-oriented, not relational

RDBMSs were invented in the 1970s, long before object-oriented programming and dynamic scripting languages became popular. By now, we're all accustomed to the process of translating our code's data structures back and forth to the tables in our database, but it doesn't have to be that way.

Rather than rows in a table, Mongo stores documents in collections. Documents are (slightly enhanced) JSON objects, so you can stash much more complex structured data in a single document than you can store in a table row. They map directly onto the data structures your code already uses: arrays, objects, dictionaries. Data modelling becomes a much more natural process.

Embedding objects

Our data modelling approach is different -- instead of using multiple tables and joining them together with foreign keys, we can embed objects within a single document.

For example, each post on our site is a document. Similarly, in a MySQL-based system, a post would be a row in a table. But comments are different. We embed comments directly within the post document as an array of objects. All of the comment data, including the text of each comment, information on who posted it, and the thumbs up/thumbs down voting, is stored directly within the post document.

{ comments: [
    { author: 'Ian White', comment: 'Great article!' },
    { author: 'Joe Smith', comment: 'You suck!',
      replies: [ { author: 'Jane Smith', comment: 'No you suck!' } ]
    }
  ]
}

When our code pulls up a post like this one, the database doesn't have to query over a separate comments table. The comments are right there as part of the post object, ready to be displayed. This is faster, and makes intuitive sense.
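
To make the shape concrete, here is a minimal sketch in Python (not the PHP the site actually runs) that treats a post as a plain dictionary and walks its embedded comments. The `title` field and the `render_comments` helper are assumptions made for illustration; only the `comments`, `author`, `comment`, and `replies` fields come from the example above.

```python
# Hypothetical post document, shaped like the example above.
post = {
    "title": "Example post",  # assumed field, not from the article
    "comments": [
        {"author": "Ian White", "comment": "Great article!"},
        {
            "author": "Joe Smith",
            "comment": "You suck!",
            "replies": [{"author": "Jane Smith", "comment": "No you suck!"}],
        },
    ],
}


def render_comments(comments, depth=0):
    """Flatten a nested comment list into indented display lines."""
    lines = []
    for c in comments:
        lines.append("  " * depth + f"{c['author']}: {c['comment']}")
        # Replies are optional, so fall back to an empty list.
        lines.extend(render_comments(c.get("replies", []), depth + 1))
    return lines


thread = render_comments(post["comments"])
```

Everything needed to display the thread arrives with the one document, so there is no second lookup to stitch comments back onto the post.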

No Object-Relational Mappers

And we don't have to use an ORM. They've been described as "The Vietnam of Computer Science" for a reason. Our code winds up simpler, since we don't need to introduce an artificial layer of abstraction.

(We do use a light wrapper library I've written for PHP called SimpleMongoPhp -- you're welcome to make use of it if it helps you. There are other similar libraries for PHP and other languages, including plugins to popular existing frameworks.)

Schemaless

If you've ever managed a medium-sized site, you know what a pain making changes to your schema can be. On a large dataset, you can lock the database for a long time doing an alter, meaning you have to schedule downtime concurrent with code releases. Rollbacks can be even worse. Even though some frameworks and libraries will help you a little, it's a big deployment problem.

Plus you have the day-to-day nuisance of managing a schema: making sure your dev database has the changes, as well as all of your individual developers' versions of their databases.

We don't have to deal with that using Mongo. There's no database-enforced schema, so when we make a big change (like adding thumbs-ups to the comments, as we did last month), we can easily make it in a backwards-compatible way. We just make sure the code handles the case where a field isn't set.
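
A rough sketch of what "handles the case where a field isn't set" can look like, in Python for illustration. The field names `thumbs_up` and `thumbs_down` are hypothetical; the article doesn't say what the real fields are called.

```python
def thumbs(comment):
    """Return (up, down) vote counts, treating missing fields as zero."""
    # Comments written before the feature shipped have neither field.
    return comment.get("thumbs_up", 0), comment.get("thumbs_down", 0)


# An older comment, stored before voting existed.
old_comment = {"author": "Ian White", "comment": "Great article!"}

# A newer comment that carries the (hypothetical) vote fields.
new_comment = {
    "author": "Joe Smith",
    "comment": "You suck!",
    "thumbs_up": 1,
    "thumbs_down": 7,
}
```

Old and new documents can sit side by side in the same collection, and the read path supplies the default instead of an alter backfilling every record.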

Life isn't perfect: once in a while, we still have to do a data migration that goes along with a code release. Rather than an alter, maybe we need to do some kind of transformation of pre-existing data. But it doesn't happen nearly as often: maybe about a tenth as often as it did when I used MySQL.

Tagging

Mongo is excellent at indexed "tag" type queries. If we have a post tagged Apple and iPhone, we store that internally as an array of strings:

{ categories: ['Apple', 'iPhone'] }

Mongo can index that field and understand that it needs to search the contents of the array, so we can query for all posts tagged "iPhone" very easily.
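
Conceptually, an index over an array field behaves like an inverted index with one entry per array element. The sketch below is a toy in-memory model of that idea in Python, not Mongo's actual implementation; the sample posts and the `build_tag_index` helper are made up for illustration.

```python
from collections import defaultdict

# Sample documents; only the `categories` field mirrors the article.
posts = [
    {"_id": 1, "categories": ["Apple", "iPhone"]},
    {"_id": 2, "categories": ["Apple", "Mac"]},
    {"_id": 3, "categories": ["Google"]},
]


def build_tag_index(docs, field):
    """Map each array element to the set of document ids containing it."""
    index = defaultdict(set)
    for doc in docs:
        for value in doc.get(field, []):
            index[value].add(doc["_id"])
    return index


tag_index = build_tag_index(posts, "categories")
iphone_posts = sorted(tag_index["iPhone"])
apple_posts = sorted(tag_index["Apple"])
```

With a real driver the lookup is just an equality match on the array field (in the Python driver, something along the lines of `find({"categories": "iPhone"})`), and the database does the element-level matching for you.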

Caching

It's still useful to have a caching layer, and so we do -- we use memcached. But we do a lot less caching than we would on a MySQL database. Mongo is very fast at retrieving individual objects, so we don't need to cache individual posts. The post you're looking at right now is not cached; it's being pulled live from the database. That doesn't kill Mongo because it will generally keep that document in memory.

But we do still do some caching on more complex queries. For example, all of our homepages are on a three-minute cache delay. Or the "Most Popular" listing on the sidebar to your right. But because Mongo is usually going to be as fast as memcached for retrieving individual documents, a lot of common situations that I used to cache don't have to be anymore.
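
The three-minute delay is ordinary time-based expiry. Here is a minimal sketch of that pattern in Python, using an in-process dictionary in place of memcached; the `TTLCache` class and its injectable `clock` parameter are assumptions for illustration, not the site's actual caching code.

```python
import time


class TTLCache:
    """Tiny time-to-live cache: recompute a value once it is too old."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested
        self._store = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh, skip the expensive query
        value = compute()
        self._store[key] = (now + self.ttl, value)
        return value


# 180 seconds matches the three-minute delay mentioned above.
homepage_cache = TTLCache(ttl_seconds=180)
```

A caller would wrap only the expensive query, for example `homepage_cache.get_or_compute("home", build_homepage)` with a hypothetical `build_homepage` function, and keep fetching individual posts straight from the database.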

And Mongo itself can be used as an effective caching layer. If your collection is small, Mongo will keep it entirely in memory and performance will be comparable to a cache. We do this with our "settings" collection, which stores dynamically customizable options, like which ads are currently turned on, and the "Hot" links below our nav bar.

Real-Time Analytics

Like many sites, we use Google Analytics and other packages to get detailed information about our traffic. But things move too fast to wait until the next day for data. We have statistics pages that provide up-to-the-second live data on what's happening on our site: what pages people are looking at and clicking on right now. Editors can use that data throughout the day for instant feedback on what they're doing.

Mongo is ideally suited to these real-time analytics. Our internal tracker does between 3 and 8 upserts on the database per pageview, and Mongo handles these without any trouble.
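
An upsert means "update the matching document, or create it if it doesn't exist yet", which is what makes counters cheap to maintain. The Python sketch below models that with a plain dictionary standing in for a collection. The article doesn't say which counters the tracker keeps, so the per-URL and per-minute keys here are hypothetical.

```python
def upsert_inc(collection, key, field, amount=1):
    """Increment a counter, creating the document if it is missing."""
    doc = collection.setdefault(key, {"_id": key})
    doc[field] = doc.get(field, 0) + amount
    return doc


def track_pageview(stats, url, minute):
    """Record one pageview as several independent counter upserts."""
    upsert_inc(stats, ("url", url), "views")  # hypothetical per-page counter
    upsert_inc(stats, ("minute", minute), "views")  # hypothetical per-minute counter


stats = {}
track_pageview(stats, "/some-post", "12:01")
track_pageview(stats, "/some-post", "12:02")
```

Against a real database each of these would be a single update with an increment operator and the upsert option enabled, so the tracker never has to check first whether a counter document exists.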

This is such an ideal use of Mongo that it'd be a great way to augment a lot of sites that plan to use an RDBMS for the foreseeable future. Just get a spare box, throw Mongo on it, and have Mongo power your real-time analytics. There's more about this topic here.

Image Storage

Mongo can store binary data in the database, so that we don't have to deal with the common hassle of having files in the filesystem and metadata in the database. Using its GridFS API, we can easily stash all of our images on the site in Mongo.
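
As I understand it, GridFS works by splitting each file into numbered chunks stored as ordinary documents, with a separate metadata document describing the file. The Python sketch below illustrates only that chunking idea in memory; the helper names are made up, and the tiny chunk size is for demonstration, not the real default.

```python
def split_into_chunks(file_id, data, chunk_size):
    """Split raw bytes into numbered chunk documents."""
    return [
        {"files_id": file_id, "n": n, "data": data[start:start + chunk_size]}
        for n, start in enumerate(range(0, len(data), chunk_size))
    ]


def reassemble(chunks):
    """Rebuild the original bytes by ordering chunks on their sequence number."""
    return b"".join(c["data"] for c in sorted(chunks, key=lambda c: c["n"]))


image_bytes = bytes(range(10))  # stand-in for real image data
chunks = split_into_chunks("logo.png", image_bytes, chunk_size=4)
```

In practice the driver's GridFS API does the splitting and reassembly for you, so application code just writes and reads a file.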

We do use a CDN (CDNetworks) in front of our images, but on the occasions we've taken the CDN off, Mongo has performed fine serving the images.

Why Not Use Mongo?

Mongo's pretty great in general, and it's become my default choice for a datastore in a web app. But there are a few things it's not great at, at least right now:

it lacks transactions, so if you're a bank, I wouldn't use it.

it doesn't support SQL, so if you have a legacy codebase that relies on SQL, or if you need to use some of the more complex collation/grouping features of SQL-based DBs, Mongo may not be the best choice.

it doesn't have any built-in revisioning like CouchDB.

it doesn't have real fulltext searching features (slow regexes and tag-stemming are the best you can do).

But overall, MongoDB is probably well-suited for a lot of web applications -- maybe as many as 50%. The 10gen folks have a list of sites that are using Mongo in production, including some interesting names like Sourceforge, Electronic Arts, and Disqus.

There's an active mailing list, and I'm happy to answer questions about how we use Mongo. The developers themselves are also friendly and good about answering questions.

(This article was adapted from notes for a live webinar I did last week with Dwight Merriman, CEO of 10gen. You can watch that presentation here.)