Although ORMs are very useful for developers, abstracting access to a database comes at a price. Developers who decide to go deep into the database will find that some things could have been made easier.

This article was inspired by our experience of optimizing database usage to improve page load time. We made many mistakes on the way to the best possible result and decided to share them. This article will be useful for you if you plan to develop or support a project written in Django.

Use model.fk_id instead of model.fk.id

At first glance, this advice seems useless. What can you find out about an object from its identifier alone? In fact, filters, maps, and other operations that only need the related object's ID can get by without fetching the object itself.

For example, you need to find all the events from the same store as our event. The first thing that comes to mind is a simple query:

Event.objects.filter(store_id=event.store.id)

Everything seems simple and cloudless, but if we check the database query log, we will see two queries there. The first query fetches the store object, even though we need nothing from the store except its ID. To fix this, access the foreign key value directly. The code then looks like this:

Event.objects.filter(store_id=event.store_id)

Now there is no extra query for the store, and the code makes just one query to the database.

Getting related objects with select_related and prefetch_related methods

Let’s imagine that you need to get all the events from the database, and then insert them into the template along with the stores and the list of brands for each event.

Let's implement the view using ListView:

class EventsListView(ListView):
    template_name = 'events/events_list.html'
    model = Event
    context_object_name = 'events'

In the template you display the information about the event, store and brands:

<div>
  <h2>{{ event.title }}</h2>
  <p>Store: {{ event.store.title }}</p>
  <p>Brands:
    {% for brand in event.brands.all %}
      {{ brand }}{% if not forloop.last %}, {% endif %}
    {% endfor %}
  </p>
</div>

In this case, all the events are fetched with a single SQL query, but then the store and the brands are queried separately for each event. You need to force Django to fetch all this data with a smaller number of queries.

Let's start by getting the stores. For the QuerySet to fetch the data for certain foreign keys in advance, there is the select_related method. Update the QuerySet in your view to use it:

queryset = Event.objects.select_related('store')

The select_related() method returns a QuerySet that automatically includes the data of related objects in the query when it is executed. Accessing those related objects on the model instances then requires no additional database queries. It is convenient for foreign-key ("many-to-one") or "one-to-one" relationships.

The select_related method works only with foreign keys on the current model. To reduce the number of queries when fetching a set of related objects (such as brands in our example), you need to use the prefetch_related method.

Again, update the queryset attribute of the EventsListView class:

queryset = Event.objects.select_related('store').prefetch_related('brands')

The prefetch_related() method returns a QuerySet that fetches the related objects for each of the specified lookups in a single additional query per lookup.

Restricting the Fields in Queries (defer, only)

If you take a closer look at the SQL queries from the previous example, you will see that more fields are fetched than you need: all the fields of the events, stores, and brands, including the event's huge description. You can significantly reduce the amount of transferred data with the defer method, which lets you postpone loading certain fields. If the code later accesses such a field, Django makes an additional query to fetch it. Add a defer call to the queryset:

Event.objects.select_related('store').prefetch_related('brands').defer('description')

Now the unnecessary 'description' field is not fetched, which reduces query processing time.

Still, a lot of event fields you don't use are fetched. It would be simpler to specify only those fields that you really need. For this, there is the only() method, which takes the names of the needed fields and defers all the rest:

Event.objects.select_related('store').prefetch_related('brands').only('title', 'store__title', 'brands__title')

Thus, defer() and only() solve the same task of limiting the fields in queries. The only difference is that:

defer() postpones loading the fields passed as arguments;

only() postpones loading all fields except those passed.

Never use len(queryset)

If you need to get the number of objects in a QuerySet, do not use len(). The count() method is much better for this purpose.

For example, if you need to get the total number of all events, the wrong way would be:

len(Event.objects.all())

In this case, a query selecting all the data from the table is executed first, the result is transformed into Python objects, and only then is the length computed with len(). This is far from the best option: all you need from the database is a single number, the number of events.

For this, use the count() method:

Event.objects.count()

With count(), a much simpler query is executed in the database and fewer resources are spent on the Python side.

if queryset is always a bad idea

If you need to check whether the result of the QuerySet exists, do not use QuerySet as a boolean value or queryset.count() > 0 . Use queryset.exists() instead.

The method exists() returns True if the QuerySet contains any results, and False if not. It tries to perform the query in the simplest and fastest way possible but executes nearly the same query as a normal QuerySet query.

exists() is useful both for checking whether an object belongs to a QuerySet and for checking whether a QuerySet contains any objects at all, particularly in the context of a large QuerySet.

Database Indexes

Make sure that the fields you search by are indexed: use the db_index=True field parameter in your model.

The index is stored as a B-tree, so a lookup runs in logarithmic time, O(log n). With a billion rows, finding an object takes about as long as a linear search through 30 items.

Without db_index, the search results in a full table scan. Imagine a dictionary in which the words are completely shuffled and the only way to find one is to turn the pages one by one. You can imagine how long the query above would take on a billion unindexed rows.

Indexes are useful not only when filtering data but also when sorting it. Besides, many DBMSs support indexes on several fields, which helps if you filter by a set of fields. We advise you to consult the documentation for your DBMS for details.

Bulk Operations

a. Bulk insert with bulk_create method

Suppose that your new Django application replaces an old one and you need to transfer user data to the new models. You exported the data from the old application into huge JSON files.

The file with users has the following structure:

[
    {
        "email": "example@example.com",
        "first_name": "Example",
        "last_name": "Example"
    }
]

Let’s make a method for importing users from the JSON file to the database:

def import_users(data_file_path):
    with open(data_file_path, 'r') as json_file:
        data = json.loads(json_file.read())
    for user_data in data:
        user = User(
            email=user_data['email'],
            first_name=user_data['first_name'],
            last_name=user_data['last_name'],
        )
        user.save()

Check how many SQL queries are executed when 200 users are loaded: you will see 200 queries. A separate INSERT is executed for each user. With a large amount of data this approach can be very slow. Let's use the bulk_create method of the User model manager:

def import_users(data_file_path):
    with open(data_file_path, 'r') as json_file:
        data = json.loads(json_file.read())
    user_instances = []
    for user_data in data:
        user = User(
            email=user_data['email'],
            first_name=user_data['first_name'],
            last_name=user_data['last_name'],
        )
        user_instances.append(user)
    User.objects.bulk_create(user_instances)

After calling the method, you will see that one huge query to the database was executed for all users.

If you really need to insert a large amount of data, you may have to split it into several queries. For this, there is the batch_size parameter of bulk_create, which specifies the maximum number of objects inserted in one query. So, with 200 objects and batch_size=50, you get 4 queries.

The bulk_create method has a number of restrictions that you can find in the documentation.

b. Bulk insert to M2M relation

Next, you need to insert into the database the events and brands stored in a separate JSON file with the following structure:

[
    {
        "title": "Best event",
        "brands": ["Brand", "Famous brand", "Most famous brand"],
        "slug": "...",
        "description": "..."
    }
]

The method for this will be the following:

def import_events(data_file_path):
    with open(data_file_path, 'r') as json_file:
        data = json.loads(json_file.read())
    for event_data in data:
        event = Event(
            title=event_data['title'],
            slug=event_data['slug'],
            description=event_data['description'])
        event.save()
        for brand in event_data['brands']:
            brand_instance, _ = Brand.objects.get_or_create(title=brand)
            event.brands.add(brand_instance)

Calling this method produces a huge number of SQL queries!

The problem is that each brand is added to the event with a separate query. This can be improved by passing the whole list of brands to the event.brands.add method:

brands = []
for brand in event_data['brands']:
    brand_instance, _ = Brand.objects.get_or_create(title=brand)
    brands.append(brand_instance)
event.brands.add(*brands)

This version sends almost half as many queries, which is a good result considering that you changed only a couple of lines of code.

c. Bulk update

After the data transfer, you may decide that old events (created before 2018) should be made inactive. For this, a boolean field 'active' was added to the Event model, and now you need to set its value:

for event in Event.objects.filter(created_at__year__lt=2018):
    event.active = False
    event.save()

By running this code, you get a number of queries equal to the number of old events.

Moreover, each matching event is saved with a separate SQL query, and all the fields of the event are overwritten, not just 'active'. Besides the performance problem, this creates a race condition: changes made between the SELECT and the UPDATE can be silently overwritten.

Instead, you can use the update method which is available on QuerySet objects:

Event.objects.filter(created_at__year__lt=2018).update(active=False)

This code generates only one SQL query. Awesome!

d. Bulk delete

Next, you need to delete the inactive events.

for event in Event.objects.filter(active=False):
    event.delete()

This code will generate 2N + 1 queries to the database.

For each event, the relations between the event and its brands in the intermediate table are deleted first, and then the event itself. You can do this in fewer queries using the delete method of the QuerySet class:

Event.objects.filter(active=False).delete()

This code does the same thing in just 3 queries to the database.

First, a single request gets a list of identifiers of all inactive events, then the second request deletes all relations between events and brands at once, and the last request deletes events.

Nice!

Using Iterator

Suppose you need to add the ability to export events to the CSV format. Let’s make a simple method for this, taking into account only the work with the database:

def export_events(export_file_path, columns):
    with open(export_file_path, 'w') as export_file:
        events_writer = csv.writer(export_file, delimiter=';')
        events_writer.writerow(columns)
        for event in Event.objects.select_related('store').all():
            events_writer.writerow([getattr(event, column) for column in columns])

To test this command, about 100,000 events were generated. Running it under a memory profiler showed that it uses about 200 MB of memory: when the query is executed, the QuerySet fetches all the events from the database at once and caches them in memory, so that subsequent operations on the QuerySet don't trigger additional queries. You can reduce memory usage with the iterator() method of the QuerySet class, which fetches results one by one using a server-side cursor and disables caching of results in the QuerySet:

for event in Event.objects.select_related('store').iterator():

Running the updated example under the profiler, the command uses only about 40 MB. Better still, with the iterator the command uses a constant amount of memory for any data size.

Conclusions