In our previous post about SQLAlchemy, we fixed our order search query to get the right rows, but the endpoint was still slower than we expected. Why?

We hit the “N+1” problem that comes with lazy loading relationships in any Object Relational Model framework. Let’s say we query for N orders, and we also want the line items for each order. In a lazy loading scenario, we would have to issue an additional query for the line items for each order, so N+1 in total. Now let’s say we want the product info for each line item, and the brand info for each product, and all of a sudden we’re making a lot of trips to the database.

Instead of doing a lazy load for each relationship, we could do a second query to get the children for all parent objects, so instead of N+1 queries, we do 2. This will be familiar if you’ve worked with Rails, and in SQLAlchemy, this is known as subquery loading. But we can do even better than that! With joined eager loads, we can ask SQLAlchemy to fetch everything we need in a massive JOIN query and deserialize the rows into objects.

Lightening the load / the burdens of “easy”

Before we get into the particulars of eager loading, let’s take a step back and ask if we actually need to load all of these related objects. To answer this question, we have to talk about marshmallow and marshamallow-sqlalchemy. The former provides an easy way to serialize a Python object into a JSON object, and the latter extends this functionality to our SQLAlchemy models. This is convenient because we can add a column or a relationship to the model, and it will show up in the serialized JSON without additional coding. On the other hand, this creates an obligation to do a lot of sub-loads, which we established above as being potentially burdensome.

The solution here is easy. In a marshmallow schema — a class that controls what gets serialized by marshmallow— you can control which fields are serialized by whitelisting only the fields we want using only or by blacklisting the ones we don’t want using exclude . In our case, as our model got more and more complex, maintaining a blacklist became hard to manage, so we switched to a whitelist.

This reduces the number of objects that we need to retrieve, but it doesn’t address the N+1 problem.

Don’t get mad, get eager!

Here is an abridged version of our base order query. Let’s say an Order has many Supplier Fragments which represent the various vendors that partner with Spring; each Supplier Fragment has Line Items, which represent what is being bought; each Line Item has a Variant, which represents the exact variation of the product being sold (e.g., a specific color and size); each Variant belongs to a Product, which represents the common attributes shared all variants (e.g., name, brand, description, country of origin).

For good measure, we add a raiseload to throw an exception if we try to load anything that we didn’t load here. This is helpful for discovering relationships that might throw an exception later when the object becomes detached from the session. More on this in a bit.

The query that gets emitted is pretty gnarly.

This may seem counterintuitive, because we’re now blowing up the number of rows retrieved to the total number of variants, but in practice, it can be more efficient to make one round trip to a remote service and assemble it in the client. The database is a limited resource, while your application instances can be scaled out. We wouldn’t want to code something like this by hand, but SQLAlchemy is able to do this reliably.

Results

Here’s a graph from New Relic of the deploy (the green line just after the 1:00 PM mark).

Before the change, requests averaged close to a full second. Today, that average is around 350ms. Not great, but acceptable for an internal tool, and we got to learn about eager loading.

Post-script: eager loading one-to-many relationships

Remember when I said we would talk a little more about raiseload ? A funny thing happened when we used it as part of a joined load on one of our product queries. Here’s an abridged version of our product model:

To reiterate, a product has many variants (a.k.a. SKUs) to represent all the permutations of the different sizes, colors, and other attributes of a particular product unit. A product contains attributes common to all of its variants, so it’s not uncommon to refer back to the parent product when processing a variant. Not a bad candidate for eager loading! We’ll do a joined load on the variants, and the variant can refer back to its parent product.

Let’s say we call that variant_names function after the session is done. We found that when we tried to access variant.product for any given variant for this product, an InvalidRequestError was raised. If we remove the raiseload , we would get a different error complaining about how we’re in a detached session and we can’t no longer load more fields. What the fudge?!? Didn’t we just say that products and variants are connected to each other?

We did, but we didn’t. When we define a backref , we’re actually defining two one-way relationships, product-to-variants and variant-to-product, and technically, the joined load only implies the product-to-variant relationship. Without any additional context, the variant-to-product relationship loads it like it would any other relationship — i.e., the session loads this as product.variant[0].product the same way it would load, say, product.variant[0].inventory . I.e., lazily.

Let’s go to the documentation! After a few readings of Routing Explicit Joins/Statements into Eagerly Loaded Collections, I figured out we can use contains_eager() to tell SQLAlchemy that there is indeed a there there, and it will “eager load” the variant-to-product relationship from the existing data instead of going back to the db.

Conclusions

Eager loading improves object retrieval performance for large collections of related objects by reducing the number of queries. Counterintuitively, it can be more performant to join everything into one query and let SQLAlchemy generate the relationships from that.

If you are using an automated serializer like marshmallow, review the fields that are actually being serialized and make sure you’re not incurring unnecessary subloads for objects you are not using.