Joins are an important class of operations in databases and data pipelines. There are many kinds of joins in SQL databases, but this post will focus only on the INNER JOIN .

It is possible to make the JOIN between two large tables very efficient if the result is going to be small. That being a common scenario in practice, makes JOIN s an interesting topic of research.

A naive JOIN algorithm can be very slow because JOIN s are essentially nested loops.

NestedLoop JOIN based on sequential scans 1 2 3 4 for row_i in table_a : for row_j in table_b : if ( row_i . key == row_j . key ): emit_join_result ( row_i , row_j )

A better alternative to the algorithm above is the HashJoin . In the first step of a HashJoin , a hash table of one of the JOIN operands — table_a — is built. Then a cheap lookup can replace the expensive innermost loop doing a sequential scan on the whole table_b for each row of table_a .

HashJoin 1 2 3 4 5 6 7 for row_i in table_a : hash_table [ row_i . key ] = row_i for row_j in table_b : row_i = hash_table [ row_j . key ] if ( row_i AND row_i . key == row_j . key ): emit_join_result ( row_i , row_j )

This code snippet above is very common in hand-written backend and UI code stitching data together to show something useful to users. So it should be no surprise that this is what databases commonly do to solve the same problem.

A Practical Example

Let’s look at an e-commerce database containing two tables: product and cart_item .

The product table contains all the product the store sells.

`product` table 1 2 3 4 5 6 7 8 9 10 11 12 13 14 id | name ----+------------- ... 34 | Iogurt 35 | Tea 36 | Sugar 37 | Chocolate 38 | Butter 39 | Detergent 40 | Panettone 41 | Coffee 42 | Shampoo 43 | Toothpaste ...

The cart_item contains products added to the cart of all the users.

`cart_item` table 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 id | user_id | product_id | quantity_added ------+---------+------------+---------------- 1000 | 776 | 34 | 4 1001 | 494 | 35 | 2 1002 | 494 | 34 | 2 1003 | 494 | 36 | 1 1004 | 494 | 37 | 3 1005 | 494 | 38 | 1 1006 | 494 | 39 | 2 1007 | 494 | 40 | 4 1008 | 494 | 41 | 2 1009 | 494 | 42 | 2 1010 | 574 | 34 | 2 1011 | 574 | 36 | 1 1012 | 574 | 37 | 4 1013 | 574 | 43 | 1 ...

An useful view of the cart of a user will need more information about the product other than just the id (e.g. the name of the product). That is when a JOIN becomes necessary.

JOIN example 1 2 3 4 5 6 SELECT cart_item . user_id , cart_item . product_id , cart_item . quantity_added , product . name FROM cart_item JOIN product ON cart_item . product_id = product . id ;

Results of the JOIN example query 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 user_id | product_id | quantity_added | name ---------+------------+----------------+------------ 776 | 34 | 4 | Iogurt 574 | 34 | 2 | Iogurt 494 | 35 | 2 | Tea 494 | 36 | 1 | Sugar 574 | 36 | 1 | Sugar 494 | 37 | 3 | Chocolate 574 | 37 | 4 | Chocolate 494 | 38 | 1 | Butter 494 | 39 | 2 | Detergent 494 | 40 | 4 | Panettone 494 | 41 | 2 | Coffee 494 | 42 | 2 | Shampoo 574 | 43 | 1 | Toothpaste ...

PostgreSQL might use a HashJoin to execute this query. The EXPLAIN command shows the query plan.

HashJoin query plan 1 2 3 4 5 Hash Join Hash Cond : ( product . id = cart_item . product_id ) -> Seq Scan on product -> Hash -> Seq Scan on cart_item

The sequential scan on cart_item is necessary to build the hash table mapping cart_item.product_id to the cart_item record. That is the first for loop in the pseudo-code above. The second for loop is the sequential scan on product that queries the hash table with each product.id it finds.

If you mess with the PostgreSQL query planner really hard , you can convince it to use the NestedLoop join based on nested sequential scans of both tables.

NestedLoop query plan based on sequential scans 1 2 3 4 Nested Loop Join Filter : ( cart_item . product_id = product . id ) -> Seq Scan on product -> Seq Scan on open_cart_item cart_item

And this is the most naive way to JOIN two tables as shown in the first code snippet — two nested loops doing sequential scans.

Indexing

The hash table built in the HashJoin is a form of indexing (temporary index) that speeds up the second for loop. Hash tables allow efficient random queries (i.e. keys used in a sequence of lookups do not obbey any particular order) for an specific key value.

An issue with the HashJoin query plan above is that the hash table is built at the time of the query. Indexing in the context of databases more commonly means that an index is already built when a query comes in (persistent index). This requires knowing which index is more suitable for the queries that will come later. Creating all possible indexes is not a good strategy because they would be many and indexes are not free. When the table changes, indexes have to be updated to reflect the changes. So each index adds an extra cost to writes and deletes.

Let’s disable HashJoin and see how our JOIN query is planned.

NestedLoop query plan using a single index 1 2 3 4 Nested Loop -> Seq Scan on cart_item -> Index Scan using product_pkey1 on product Index Cond : ( id = cart_item . product_id )

This query plan execution is similar to the second for loop in the HashJoin algorithm, but instead of using a temporary hash table that needs to be built at the time of the query, it uses the readily available primary key index of the product table on the id column.

Pseudo-code of the NestedLoop query plan 1 2 3 4 for cart_item_j in cart_item : product_i = product_pkey1 . lookup ( cart_item_j . product_id ) if ( product_i . id == cart_item . product_id ): emit_join_result ( product_i , cart_item_j )

Indexing with B-Trees

Classical relational databases were designed at a time when RAM was expensive and small. A lot of effort went into making sure that queries could be performed without bringing too much data into memory. Optimizing the disk access patterns is crucial for good performance of the algorithms operating on data stored in disks. Unlike main memory, disk are very bad at serving random access since the cost of seeking to the right position is considerable.

Sequentially scanning the randomly sorted product_id s in the cart_item table and querying the product_pkey1 index stored in disk every time is not very good because it might require the disk to seek to random locations far too often.

The product_pkey1 index stored in disk is not a hash table, but a B-Tree. B-Trees (and other types of ordered indexes) allow more interesting queries than the specific key lookup provided by hash tables. In addition to key lookups, B-trees allow range queries.

Queries a B-Tree can efficiently resolve 1 2 3 4 5 6 7 8 9 10 Specific key lookup key = ? Range queries key < ? key <= ? key >= ? key > ?

B-Trees shine when a range of keys is read sequentially. This access pattern minimizes the number of disk seeks and number of page loads into memory (storage blocks containing sorted keys).

Let’s create a B-Tree index for the product_id column in the cart_item table and see what happens to the query plan.

NestedLoop query plan using two indexes 1 2 3 4 Nested Loop -> Index Scan using cart_product_id_idx on cart_item -> Index Scan using product_pkey1 on product Index Cond : ( id = cart_item . product_id )

Now the two indexes are being used. This is very good, because the scan for product_id on the cart_product_id_idx index will not be in the random order they appear in the cart_item table. The lookups for product by the product id will happen in the order of the product_pkey1 index. This will maximize the use of the page cache. Entry 43 of the product_pkey1 B-Tree is very likely to be in the same page of the entry 42 and so on. This locality is what guarantees the Nested Loop will not be quadratic in time. The cost of the innermost lookup will be amortized. Some will be slower than others when a page has to be loaded, but on average they will take a constant time. This is how the algorithm looks like in pseudo-code.

Pseudo-code of the NestedLoop query plan using two indexes 1 2 3 for ( product_id , cart_item_row_id ) in cart_product_id_idx : product_i = product_pkey1 . lookup ( product_id ) emit_join_result ( cart_item [ cart_item_row_id ], product_i )

Joining Sorted Relations

MergeJoin can take advantage of the two JOIN operands (relations) being sorted to efficiently produce output. The pseudo-code for the MergeJoin is not as straightforward as NestedLoop and HashJoin , but is not very complicated.

Pseudo-code for a simple MergeJoin 1 2 3 4 5 6 7 8 9 10 11 12 13 // Cursors for sequential scanning of table_a and table_b cursor_a = 0 cursor_b = 0 while cursor_a . hasNext and cursor_b . hasNext : if cursor_a . key < cursor_b . key : cursor_a . next elif cursor_a . key == cursor_b . key : emit_join_result ( cursor_a . row , cursor_b . row ) cursor_a . next () cursor_b . next () else : cursor_b . next ()

On every loop iteration at least one record is consumed from one of the two operands. This guarantees that MergeJoin will take time proportional to the sum of the size of the two sorted JOIN operands.

Since B-Tree indexes are sorted, they lend themselves very well to MergeJoin . This is a possible query plan to our example query. The query planner is taking advantage of two sorted indexes for the MergeJoin – cart_product_id_idx and product_pkey1 .

Pseudo-code for a simple MergeJoin 1 2 3 4 Merge Join Merge Cond : -> Index Scan using cart_product_id_idx on open_cart_item -> Index Scan using product_pkey1 on product

If a persistent index is not available (or data is small enough to be completely in-memory), PostgreSQL might sort one or both operands to be able to run a MergeJoin on two sorted relations. This is possible query plan when the cart_product_id_idx index is not available.

Pseudo-code for a simple MergeJoin 1 2 3 4 5 6 Merge Join Merge Cond : -> Sort Sort Key : open_cart_item . product_id -> Seq Scan on open_cart_item -> Index Scan using product_pkey1 on product

Conclusion

JOIN s are essentially nested loops.

s are essentially nested loops. Multiple techniques are used to improve the performance of JOIN s.

s. When datasets are small or no appropriate index is being maintained, query planners will likely decide to use memory to make queries run faster. The query executor can build hash tables (or sort records in memory, or…) and use HashJoin or MergeJoin to more efficiently join the two operands according to the join-condition in the query.

or to more efficiently join the two operands according to the join-condition in the query. Maintaining index data structures like B-Trees up to date allows Nested Loop algorithm to run much faster even when the dataset doesn’t conveniently fit in memory. Indexed JOIN s take time proportional to the size of the result and are not affected by the size of the tables.

There’s much more about JOIN s and techniques to speed them up, but these are the basics that will allow you to start thinking about how data layout and choice of algorithms can greatly affect query performance.

Understanding the three classical JOIN algorithms – NestedLoop , HashJoin , MergeJoin – how they take advantage of different indexes and how they behave when there is no index can give you a lot of insight on how databases run queries.

Further Reading