April 09, 2019 at 05:28 Tags Databases

If you store data in a relational database, it's good practice to have the data normalized. This typically requires splitting data to multiple tables that are logically connected through keys. As a result, most non-trivial queries require joins on multiple tables to gather all the interesting columns. This post is a brief tour of SQL joins, focusing on the differences between inner and outer joins.

Cross join To understand SQL joins, it's best to start with cross joins, because they are the simplest combination of tables supported by SQL. A cross join occurs when we write: select * from t1 , t2 ; Throughout this post, we'll be working with two sample tables called t1 and t2 : t1 t2 id | name code | id ----+---------- ------+---- 1 | Joanne x | 2 2 | Sam z | 3 3 | Emmanuel a | 7 4 | Brayden The SQL code to create these tables and run all the examples in this post is available here. All the code was tested on PostgreSQL 9.5. Running the cross join on these tables results in: id | name | code | id ----+----------+------+---- 1 | Joanne | x | 2 2 | Sam | x | 2 3 | Emmanuel | x | 2 4 | Brayden | x | 2 1 | Joanne | z | 3 2 | Sam | z | 3 3 | Emmanuel | z | 3 4 | Brayden | z | 3 1 | Joanne | a | 7 2 | Sam | a | 7 3 | Emmanuel | a | 7 4 | Brayden | a | 7 The cross join performs a cross product (or Cartesian product) between the two tables. For each row in t1 , it adds all possible rows from t2 . The resulting table has all the columns of t1 and of t2 , and its number of rows is the product of numbers of rows in t1 and t2 . I find cross joins to be a good starting point because they make inner joins much easier to understand. They are also the basis of joins in relational algebra. SQL also supports a more explicit way to invoke a cross join: select * from t1 cross join t2 ; This is equivalent to the first statement.

Inner join An important component of SQL queries is filtering results with a where clause. For example, we can create the following (slightly nonsensical) filter on the cross join shown earlier: select * from t1 , t2 where t2 . code = 'x' and t1 . name like '%d%' Resulting in: id | name | code | id ----+----------+------+---- 4 | Brayden | x | 2 One filter that's particularly useful when crossing two tables is checking whether there's a match on some column value. Both t1 and t2 have an id column; let's assume these IDs refer to the same thing, and that we want to find all combinations of rows from the two tables where the IDs match. We can do: select * from t1 , t2 where t1 . id = t2 . id ; Resulting in: id | name | code | id ----+----------+------+---- 2 | Sam | x | 2 3 | Emmanuel | z | 3 This kind of filtering is so useful that it has its own concept: the inner join : select * from t1 inner join t2 on t1 . id = t2 . id ; It produces the exact same result table. When the names of the columns we compare are the same in the two tables, there's an even shorter syntax that can be used: select * from t1 inner join t2 using ( id ); The result of this will only have a single id column, since we're making it explicit that id s match between the tables: id | name | code ----+----------+------ 2 | Sam | x 3 | Emmanuel | z I find the filtering equivalence very useful to understand inner joins. Just remember that it's a cross product of the two tables where only rows that satisfy a certain condition are returned. You may be wondering what's the difference between using where filtering and inner join ... on . While the two are logically equivalent, some things to keep in mind: At least theoretically, inner join ... on is more efficient because in multi-table joins (which is common) we get to apply the filtering per join and not at the end on one huge table. With modern SQL query optimizers it's not clear whether this is a real advantage, however. It's quite likely that the optimizer will generate exactly the same sequence of low-level operations for the two.

is more efficient because in multi-table joins (which is common) we get to apply the filtering per join and not at the end on one huge table. With modern SQL query optimizers it's not clear whether this is a real advantage, however. It's quite likely that the optimizer will generate exactly the same sequence of low-level operations for the two. In terms of readability, it's much nicer to be able to see what the join is on close to the join itself, rather than in the end of the query in one large where filter. This can be significant for multi-table joins. As an example, consider customers making orders, with order details in a separate table (since customers could have multiple orders). We could have a complex join done as: select * from customers , orders , order_details where customers . id = order_details . customerid and orders . id = order_details . orderid Compared with: select * from customers inner join order_details on customer . id = order_details . customerid inner join orders on orders . id = order_details . orderid In the latter it's much clearer what the criteria for each join is. Finally, I'll mention that some databases support the natural join, which is a shortcut for "inner join tables on the columns that have the same name". The following query is equivalent to the variant with using shown above: select * from t1 natural join t2 ; Natural join is a term from relational algebra, and it's not commonly used in SQL queries.

Outer join While the inner join is simple to understand as a special case of the cross product, outer join is a bit trickier. Luckily, it's not hard to grok outer joins once you undererstand inner joins, so we can build this knowledge step by step. Let's get back to our tables t1 and t2 . We could assign a logical meaning to the inner join using (id) as "show me all the codes (from t2 ) matching names (from t1 )". The result is two rows where a match on id was found in the two tables. However, sometimes we want something slightly different; we want to ask "show me all the names (from t1 ) and all the codes (from t2 ) that match them, if any". In other words, we want all the names to be in the results, perhaps with null values for code where no match was found in the t2 table . Let's break this request to pieces. We want: All names from t1 that have a match in t2 , with the code from t2

that have a match in , with the code from All names from t1 that have no match in t2 , with null for the code In SQL we can express this as follows: select id , name , code from t1 inner join t2 using ( id ) union select id , name , null from t1 where id not in ( select id from t2 ); Some things to note: The first query is precisely our inner join from the previous section, and it's answering the first piece.

The second query lists all the names that don't have a match in t2 using a subquery.

using a subquery. We're listing the column names explicitly here because column names must match exactly for the two tables being union -ed. What we just wrote is called a left outer join in SQL , and can be more easily written as: select * from t1 left outer join t2 using ( id ); The result is: id | name | code ----+----------+------ 2 | Sam | x 3 | Emmanuel | z 4 | Brayden | 1 | Joanne | This is the left outer join because we want all the rows from the left-hand side table to appear in the result. As you may have guessed, there's also a right outer join: select * from t1 right outer join t2 using ( id ); id | name | code ----+----------+------ 2 | Sam | x 3 | Emmanuel | z 7 | | a Here all the rows from the right-hand side table appear in the result, with a matching column ( name ) from the left-hand side if found, null otherwise. Finally, we may want rows from both sides of the join to always appear in the result table. That's called a full outer join: select * from t1 full outer join t2 using ( id ); Resulting in: id | name | code ----+----------+------ 2 | Sam | x 3 | Emmanuel | z 7 | | a 4 | Brayden | 1 | Joanne | A full outer join is straightforward to express using a union of left and right joins: select * from t1 left join t2 using ( id ) union select * from t1 right join t2 using ( id ); There's a slight caveat, though. While union removes duplicates, full outer join does not; therefore, the results can be different in some special cases. In the event that you care about seeing duplicates in the output and the database doesn't support a full outer join , this is a more accurate (though less efficient) translation: select * from t1 left join t2 using ( id ) union all select * from t1 right join t2 using ( id ) where t1 . id is null ;

Joins on multiple columns The examples so far showed joins on a single shared column - id . While this is the most common case, sometimes more complex matching criteria are used. SQL doesn't restrict the syntax of join to a single condition, so we can join on multiple columns and arbitrary conditions. Let's add another column to our two tables: t1 t2 id | name | ranking code | id | ranking ----+----------+--------- ------+----+-------- 1 | Joanne | 7 x | 2 | 8 2 | Sam | 7 z | 3 | 6 3 | Emmanuel | 6 4 | Brayden | 2 We can run joins on both id and ranking : select * from t1 inner join t2 on t1 . id = t2 . id and t1 . ranking = t2 . ranking ; Resulting in: id | name | ranking | code | id | ranking ----+----------+---------+------+----+--------- 3 | Emmanuel | 6 | z | 3 | 6 And with using : select * from t1 inner join t2 using ( id , ranking ); Resulting in: id | ranking | name | code ----+---------+----------+------ 3 | 6 | Emmanuel | z Similarly, we can run outer joins: select * from t1 left outer join t2 using ( id , ranking ); Resulting in: id | ranking | name | code ----+---------+----------+------ 3 | 6 | Emmanuel | z 2 | 7 | Sam | 4 | 2 | Brayden | 1 | 7 | Joanne | And so on.