Better Django models

Using model managers and improving query performance

In Django, most interactions with a database are through its Object-Relational Mapper (ORM), a feature Django shares with other recent Web frameworks, such as Rails. ORMs are increasingly popular with developers, as they automate many common interactions with a database and use familiar object-oriented approaches rather than SQL statements.

Django programmers may choose to bypass the native ORM in favor of the popular SQLAlchemy package, but although SQLAlchemy is quite powerful, it is also more difficult to use and requires more lines of code. Django applications can and have been developed using SQLAlchemy instead of the native ORM, but some of the most attractive features of Django, such as its automatically generated administrative interface, require use of the ORM.

Frequently used acronyms API: Application programming interface

Application programming interface HTML: Hypertext Markup Language

Hypertext Markup Language RDBMS: Relational database management system

Relational database management system SQL: Structured Query Language

This article sheds light on some of the lesser-known features of the Django ORM in particular, but SQLAlchemy users may find some of the cautions about inefficient query generation applicable to their own code.

Software versions used in this article include:

Django V1.0.2 (parts 1 and 2)

Django V1.1 alpha (part 3)

sqlite3

Python V2.4-2.6 (Django does not yet support Python V3.)

IPython (for the sample output)

The Django ORM supports many database back ends, but sqlite3 is the easiest to install and comes bundled with many operating systems. These examples should work with any back end. For a list of databases that Django supports, see Related topics.

Avoiding common traps in ORM query generation

Django's design encourages an agile-development style that rewards rapid prototyping and experimentation. In early stages, it's best not to worry about performance and instead favor readability and ease of implementation.

Sometimes, it doesn't take long to hit performance problems. Often, it happens the first time you try your application with real data. It may become obvious when the test suite execution creeps over the five-minute mark, but contains just a few tests. Other times, the application is just noticeably slow. Luckily, there are some easily identifiable patterns that are correspondingly easy to fix. A common case is shown in Listing 1 (the application's models.py file) and Listing 2.

Listing 1. Basic models for the examples application: models.py

from django.db import models # Some kind of document, like a blog post or a wiki page class Document(models.Model): name = models.CharField(max_length=255) # A user-generated comment, such as found on a site like # Digg or Reddit class Comment(models.Model): document = models.ForeignKey(Document, related_name='comments') content = models.TextField()

About the code samples Django provides a handy shortcut to setting up a working environment in stand-alone code: running python manage.py shell . All code samples in this article presuppose that the environment has been invoked in this way. In Django lingo, the following is assumed in this article: The Django project is called better_models.

The better_models project contains an application called examples. The examples application models a basic blog-like system of documents and zero or more comments on those documents.

Listing 2 shows how to access the models set up in Listing 1 in an inefficient way.

Listing 2. Accessing those models very slowly

from examples.model import * import uuid # First create a lot of documents and assign them random names for i in range(0, 10000): Document.objects.create(name=str(uuid.uuid4())) # Get a selection of names back to be looked up later names = Document.objects.values_list('name', flat=True)[0:5000] # The really slow way to get back a list of Documents that # match these names documents = [] for name in names: documents.append(Document.objects.get(name=name))

This is a contrived example, but it illustrates a fairly common use case: Given a list of identifiers, get all the items that correspond to those identifiers from the database.

The naively coded example above took 65 seconds to run when using sqlite3 in-memory. With a file system-dependent database, it would be even longer. However, Listing 3 has a fix for this slow-running query. Instead of issuing multiple database queries for each name value, use the fieldname__in operator to generate an SQL query similar to:

SELECT * FROM model WHERE fieldname IN ('1', '2', ...)

(The actual query syntax generated will vary by database engine.)

Listing 3. A fast query for getting a list of items

from examples import models import uuid for i in range(0, 10000): Document.objects.create(name=str(uuid.uuid4())) names = Document.objects.values_list('name', flat=True)[0:5000] documents = list(Document.objects.filter(name__in=names))

This code executes in only 3 seconds. Note that this code casts the result of the query to a list in order to force the query to be evaluated. Because Django queries are evaluated lazily, simply assigning the result of the query would not cause any database access to occur, invalidating the comparison.

Database gurus may find this example obvious, especially if they're accustomed to writing raw SQL, but many Python programmers don't have a database background. Sometimes, the best instincts of programmers can actually work against efficiency. Listing 4 demonstrates one way you might choose to refactor the code in Listing 2, not realizing it will turn out to be a trap.

Listing 4. A common pattern resulting in slow database usage

for name in names: documents.append(get_document_by_name(name)) def get_document_by_name(name): return Document.objects.get(name=name))

On the surface, it might seem like a good idea to create a separate method for retrieving documents from the database. Maybe there's other work that should be done here, such as adding data to the model before returning it. Watch out for this pattern because refactoring into discrete methods can appear to be an improvement on the code. Writing unit tests from the very beginning of development and including at least some tests that operate on large datasets can help identify when refactoring results in sudden performance drops.

Encapsulating common queries with manager models

All Django developers use the built-in Manager class: It's what gets called for all methods of the form Model.objects.* . The base Manager class is automatically available and provides commonly used methods that return QuerySets (for example, all() ), those that return values (for example, count() ), and those that return Model instances (for example, get_or_create() ).

Django developers are encouraged to override the base Manager class. To illustrate why this feature is useful, extend the examples application to include a new model, Format , which describes the file format of the documents in the system. An example is shown below.

Listing 5. Adding a model to examples

from django.db import models class Document(models.Model): name = models.CharField(max_length=255) format = models.ForeignKey('Format') class Comment(models.Model): document = models.ForeignKey(Document, related_name='comments') content = models.TextField() class Format(models.Model): type = models.CharField(choices=( ('Text file', 'text'), ('ePub ebook', 'epub'), ('HTML file', 'html')), max_length=10)

Best practices for database updates Any time you add tables or columns in models.py, you need to re-synchronize the related database. There are a few best practices for this: In early development, use an in-memory database like sqlite3 only and employ database fixtures to automatically load sample content. In-memory databases can be quite fast for a single user and reduce a lot of waiting time in dropping and re-creating tables in a traditional RDBMS like MySQL.

Use a test-driven development approach. The Django testing framework re-creates your database from scratch each time, so the tables are always up to date. Combining this functionality with sqlite3 in-memory databases makes testing even faster.

Try one of the many Django add-ons that manage database synchronization. I have had good experiences with the django-evolution package, but there are others. For more information about django-evolution , see Related topics. If you choose to use sqlite3 in your development or testing, be sure to perform final integration tests with your production database. Django's ORM helps to flatten out the differences between RDBMS engines for most common cases, but not all behaviors are guaranteed to be identical.

Next, use the changed models to create some sample documents that have Format instances assigned.

Listing 6. Create some documents with assigned formats

# First create a series of Format objects \ and save # them to the database them to the database format_text = Format.objects.create(type='text') format_epub = Format.objects.create(type='epub') format_html = Format.objects.create(type='html') # Create a few documents in various formats for i in range(0, 10): Document.objects.create(name='My text document', format=format_text) Document.objects.create(name='My epub document', format=format_epub) Document.objects.create(name='My HTML document', format=format_html)

Imagine that this application will provide a way to first filter the documents by format, then to filter that QuerySet by other fields such as title. Following is a simple query to return just the text documents: Document.objects.filter(format=format_text) .

In this example, the meaning of the query is quite clear, but in a mature application, you may need to apply many more restrictions to the result set. You may want to restrict the listing to only those documents marked public or documents no older than 30 days. If you have to call this query from multiple places in your application, keeping all of the filtering clauses in sync can become a real maintenance headache and a source of bugs.

This is where a custom manager can help. Custom managers offer the ability to define unlimited numbers of "canned" queries — similar to the built-in manager methods, such as latest() (which returns only the most recent instance of a given model) or distinct() (which issues a SELECT DISTINCT clause in the generated query). Not only do these queries reduce the amount of code that may need to be duplicated across your application but managers improve readability. Over time, would you prefer to read this:

Documents.objects.filter(format=format_text,publish_on__week_day=todays_week_day, is_public=True).distinct().order_by(date_added).reverse()

Or will it be easier for you or a new developer to understand:

Documents.home_page.all()

Creating a custom manager is extremely simple. Listing 7 shows the get_by_format example.

Listing 7. Custom Manager class that provides methods for each format type

from django.db import models class DocumentManager(models.Manager): # The model class for this manager is always available as # self.model, but in this example we are only relying on the # filter() method inherited from models.Manager. def text_format(self): return self.filter(format__type='text') def epub_format(self): return self.filter(format__type='epub') def html_format(self): return self.filter(format__type='html') class Document(models.Model): name = models.CharField(max_length=255) format = models.ForeignKey('Format') # The new model manager get_by_format = DocumentManager() # The default model manager now needs to be explicitly defined objects = models.Manager() class Comment(models.Model): document = models.ForeignKey(Document, related_name='comments') content = models.TextField() class Format(models.Model): type = models.CharField(choices=( ('Text file', 'text'), ('ePub ebook', 'epub'), ('HTML file', 'html')), max_length=10) def __unicode__(self): return self.type

A few comments about this code:

If you define a custom manager, Django automatically removes the default manager. I prefer to leave the default manager and the custom manager so other developers (or myself, when I forget) can still use objects and it will behave exactly as expected. However, because my new get_by_format manager is just a subclass of Django's models.Manager , all the default methods, such as all() , are available to it. Whether to include the default manager as well as your custom manager is a personal preference.

and it will behave exactly as expected. However, because my new manager is just a subclass of Django's , all the default methods, such as , are available to it. Whether to include the default manager as well as your custom manager is a personal preference. It's also possible to assign your new manager to objects directly. The only disadvantage is if you then want to override the initial QuerySet itself. Then your new objects will have an unexpected behavior that other developers might not anticipate.

directly. The only disadvantage is if you then want to override the initial itself. Then your new will have an unexpected behavior that other developers might not anticipate. You need to define the manager class in models.py prior to defining your model class, or the class won't be available to Django. This is similar to the restrictions around ForeignKey class references.

class references. I could have simply implemented DocumentManager with a single method that takes an argument, like with_format(format_name) . Generally, I prefer manager methods that have verbose method names but take no arguments.

with a single method that takes an argument, like . Generally, I prefer manager methods that have verbose method names but take no arguments. There's no technical limit to the number of custom managers you can assign to a class, but you won't likely need more than one or two.

Using the new manager methods is simple enough.

In [1]: [d.format for d in Document.get_by_format.text_format()][0] Out[1]: <Format: text> In [2]: [d.format for d in Document.get_by_format.epub_format()][0] Out[2]: <Format: epub> In [3]: [d.format for d in Document.get_by_format.html_format()][0] Out[3]: <Format: html>

Now there is a convenient place to hang any functionality related to these queries, and you can apply additional restrictions without cluttering up the code. It's also in keeping with the spirit of Django's brand of model-view-controller (MVC) to put this kind of functionality in models.py rather than littering it in views or template tags.

Overriding the initial QuerySet that a custom manager returns

Another coding pattern that can apply to manager classes is one that may have no custom methods at all. Instead of defining a new method that returns only HTML-format documents, for example, you could define a custom manager that operates entirely on that restricted set, such as the one shown below.

Listing 8. Custom manager for HTML documents

class HTMLManager(models.Manager): def get_query_set(self): return super(HTMLManager, self).get_query_set().filter(format__type='html') class Document(models.Model): name = models.CharField(max_length=255) format = models.ForeignKey('Format') html = HTMLManager() get_by_format = DocumentManager() objects = models.Manager()

The get_query_set() method is inherited from models.Manager and is overridden in this example to take the basic query (the same one that all() generates) and apply the additional filter to it. Any subsequent methods you add to this manager would call your get_query_set() method first, then apply additional query methods on that result, as shown below.

Listing 9. Using the custom format manager

# Our HTML query returns the same number of results as the manager # which explicitly filters the result set. In [1]: Document.html.all().count() Out[1]: 10 In [2]: Document.get_by_format.html_format().count() Out[2]: 10 # In fact we can prove that they return exactly the same results In [3]: [d.id for d in Document.get_by_format.html_format()] == [d.id for d in Document.html.all()] Out[3]: True # It is not longer possible to operate on the unfiltered # query in HTMLManager() In [4]: Document.html.filter(format__type='epub') Out[4]: []

Use this class-based method for filtering queries when you expect to have many operations on a subset of your data and want to reduce the amount of code and the complexity of the queries you need to generate.

Using class and static methods with models

There are no restrictions on the types of methods you can add to a manager. Methods can return QuerySet s, as shown above, or they can return instances of the related model class (available as self.model ).

TThere may be cases where you want to do operations that are related to a model, but don't return instances or QuerySets . The Django documentation states that all methods that aren't on instances of a model class should go on the manager, but another possibility is to use Python's class and static methods.

Here's a simple example of a utility method that's related to the Format class but not to an individual instance:

# Return the canonical name for a format extension based on some # common values that might be seen "in the wild" def check_extension(extension): if extension == 'text' or extension == 'txt' or extension == '.csv': return 'text' if extension.lower() == 'epub' or extension == 'zip': return 'epub' if 'htm' in extension: return 'html' raise Exception('Did not get known extension')

This code doesn't take or return an instance of the Format class, so it isn't appropriate as an instance method. You could add it to a FormatManager , but because it doesn't access the database at all, it doesn't feel right to put it there.

One solution is to add it to the Format class and declare it as a static method using the @staticmethod decorator, as shown below.

Listing 10. Adding a utility function as a static method on a model class

class Format(models.Model): type = models.CharField(choices=( ('Text file', 'text'), ('ePub ebook', 'epub'), ('HTML file', 'html')), max_length=10) @staticmethod def check_extension(extension): if extension == 'text' or extension == 'txt' or extension == '.csv': return 'text' if extension.lower() == 'epub' or extension == 'zip': return 'epub' if 'htm' in extension: return 'html' raise Exception('Did not get known extension') def __unicode__(self): return self.type

This method would be called as Format.check_extension(extension) without requiring an instance of Format or the creation of a manager.

Python also offers the @classmethod decorator, which generates methods on classes that take the class itself as the first argument. This might be useful in cases where you want to perform some introspection on the class object itself without instantiating it.

Aggregation queries in Django V1.1

In V1.1, Django's ORM includes powerful query methods that provide functionality previously only available via raw SQL. For Python developers leery of SQL — and for anyone wanting to keep his or her Django applications usable across multiple database engines — this is a real boon.

In today's socially mediated applications, it's extremely common to order items not just by rigid fields like alphabetical order or creation date but by dynamic data. For example, in the examples application, you may want to list documents by popularity, based on the number of comments made on each one. Prior to Django V1.1, you could do such a thing only by writing some custom SQL code, creating a nonportable stored procedure, or — worst of all — writing some inefficient object-oriented queries. Another approach was to define a dummy database field that would contain the desired value by which to count (for example, the number of Comment rows) and update that manually by overriding the document's save() method.

Django aggregation removes the need for all of this. You can now order documents by the number of comments they receive with just one QuerySet method: annotate() . Listing 11 provides an example.

Listing 11. Using aggregation to order results by number of comments

from django.db.models import Count # Create some sample Documents unpopular = Document.objects.create(name='Unpopular document', format=format_html) popular = Document.objects.create(name='Popular document', format=format_html) # Assign more comments to "popular" than to "unpopular" for i in range(0,10): Comment.objects.create(document=popular) for i in range(0,5): Comment.objects.create(document=unpopular) # If we return results in the order they were created (id order, by default), we get # the "unpopular" document first. In [1]: Document.objects.all() Out[1]: [<Document: Unpopular document>, <Document: Popular document>] # If we instead annotate the result set with the total number of # comments on each Document and then order by that computed value, we # get the "popular" document first. In [2]: Document.objects.annotate(Count('comments')).order_by('-comments__count') Out[2]: [<Document: Popular document>, <Document: Unpopular document>]

The annotate() QuerySet method does not perform any aggregation by itself. Instead, it directs Django to assign the value of the passed expression to a pseudo-column in the result set. By default, that column name will be the supplied field name (here, the value of Comment.document.related_name() ) prepended to the name of the aggregation method. This code is calling django.db.models.Count , which is one of the simple mathematical functions available in the aggregation library. (For the complete list, see Related topics.)

The result of Document.objects.annotate(Count('comments')) is a QuerySet to which a new property — comments__count — has been added. If you want to override that default name, you can pass the name as a keyword argument.

Document.objects.annotate(popularity=Count('comments'))

Now that the intermediate QuerySet contains a count of all the comments associated with each document, you can order by that field. Because you want the documents with the most comments to appear first, use descending order, as .order_by('-comments__count') .

Using aggregation not only reduces the amount of code you need to write but it also guarantees that these operations will be fast, as they're relying on the database engine to do the mathematical calculations. This process is much more efficient than extracting all the relevant data through the ORM and calculating over the result set manually.

Other types of aggregation in Django V1.1

The new aggregation library doesn't just allow for returning more complex result sets. You can also return non- QuerySet results extracted directly out of the database. For example, to get the average number of comments for all the documents in the database, use this code:

In [1]: from django.db.models import Avg In [2]: Document.objects.aggregate(Avg('comments')) Out[2]: {'comments__avg': 8.0}

You can apply aggregation to filtered or unfiltered queries, and columns generated with annotate can themselves be filtered on like normal fields. You can also apply aggregation methods across joins. For example, you could aggregate documents based on ratings assigned to comments, as in a Slashdot-style site. For more information about aggregation, see Related topics.

Conclusion

One charge leveled at object-relational mappers is that they abstract away so much of the database engine that writing efficient, scalable applications with them is impossible. For some kinds of applications — those with millions of accesses and highly interrelated models — this assertion is often true.

The vast majority of Web applications never reach such huge audiences and don't achieve that level of complexity. ORMs are designed to get projects off the ground quickly and to help developers jump into database-driven projects without requiring a deep knowledge of SQL. As your Web site gets bigger and more popular, you will certainly need to audit performance as described in the first section of this article. Eventually, you may need to start replacing ORM-driven code with raw SQL or stored procedures.

Happily, the capabilities of easy-to-use ORMs like Django's continue to evolve. Django V1.1's aggregation library is a major step forward, allowing efficient query generation while still providing a familiar object-oriented syntax. For even greater flexibility, Python developers should also look at SQLAlchemy, especially for Python Web applications that don't rely on Django.

Downloadable resources

Related topics