Thomas Wanschik on September 21, 2010

JOINs via denormalization for NoSQL coders, Part 1: Intro

Non-relational databases like App Engine or MongoDB natively don't support operations like JOINs because they don't scale. However in some situations there just exists the need to use such operations. This is the first part of a series presenting one way to handle JOINs (at first in the to-one direction) on NoSQL databases while maintaining scalability. Additionally you'll learn some useful NoSQL coding techniques throughout this series.

Why would I need JOINs?

Let's take the example of users and their photos. Here users represent the to-one side and photos the to-many side:

It's common that users have the possibility to search for pictures of other users. While searching for photos of a specific user is easy to achieve on non-relational databases via

Photo.objects.filter(user=some_user)

for example, searching for photos of users given some specification like

Photo.objects.filter(user__gender='female', popularity='high')

isn't equally simple to achieve. Obviously we need JOINs here (which aren't supported on NoSQL databases) because we span a filter over two models. A straightforward solution to this problem is to get all female users first and then to get all popular photos whose user property points to these users:

user_pks = User.objects.filter(gender='female').values('pk') female_user_photos = Photo.objects.filter(user__in=user_pks, popularity='high')

Getting only the primary keys of users here is just an optimization so we don't have to fetch whole user instances out of the database. However this solutions comes with the problem of scalability. To demonstrate that, let's take a look at the following situation:

We have a high-traffic website

We want to display 20 photos per result page

But only 1 in 5000 users has popular photos

In such a situation we need to fetch more users in order to get more popular photos. It's easy to see that this doesn't scale because we have to fetch too much content out of the database at the same time. In addition to the scalability problem the site would become very slow and the database would get overloaded. On App Engine we would run into timeouts too.

Denormalization is the answer

Another way to solve the problem is to denormalize the users' gender into their photos i.e. to copy the user's gender into the Photo model.

class Photo(models.model): .... # copy the user's gender into denormalized_gender denormalized_gender = models.CharField(max_length=10) ...

Each time we create a new photo we have to denormalize the corresponding user's gender:

new_photo = Photo(user=some_user, denormalized_gender=some_user.gender) new_photo.save()

Once done, a query for popular photos whose user is female becomes a simple and efficient exact filter on the denormalized gender:

Photo.objects.filter(denormalized_gender='female', popularity='high')

But what if you need more than just the user's gender for some of your queries? Maybe we need the user's age too. Following the example above we just denormalize it into the Photo model. That's it.

If the user's gender doesn't change we've worked around the need for JOINs while maintaining scalability. However we have to keep in mind that changes to a user's gender can happen :P (if you refuse to agree with that replace the user's gender with his hair color :). In such cases we'll get wrong result sets because the query doesn't match the user's denormalized gender on the Photo model anymore. In the next post we'll discuss how to handle these situations.