The SQL language has one great advantage over procedural, object oriented, and “ordinary” functional programming languages. The fact that it is truly declarative (i.e. a 4GL / fourth generation programming language) means that a sophisticated optimiser can easily transform one SQL expression into another, equivalent SQL expression, which might be faster to execute.

How does this work?

Expression Tree Equivalence

Here’s an introduction to equivalent expression trees from my SQL Masterclass: Let’s assume we want to evaluate the following expression:

A + (A - B)

Now in maths, it can be proven trivially that by the laws of associativity, the above expression is really the same as this one:

(A + A) - B

We didn’t really win anything yet, but we can equally trivially turn the above addition into a multiplication that can be proven to be exactly equivalent:

(2 * A) - B

Now, imagine that A is an extremely “expensive” value, e.g. a value that we have to fetch from the disk, or worse, from the network. The cost of accessing A is thus very high and if we can avoid one access to A by the above transformation, then the resulting expression is much faster to evaluate than the original one, even if mathematically speaking, it does not matter at all.

That’s what optimisers do all the time. They transform expressions into equivalent expressions which are faster to execute. And they constantly adapt, so if the DBA chooses to move A from a remote server to the local server, thus reducing the cost of access to A, perhaps, suddenly, the original plan will be better again, because of the cost of multiplication (just an example).

A SQL Example

An equally trivial SQL example from my SQL Masterclass shows that it really doesn’t matter mathematically if we run this query:

SELECT first_name, last_name FROM customer WHERE first_name = 'JAMIE'

Or this one:

SELECT * FROM ( SELECT first_name, last_name FROM customer ) WHERE first_name = 'JAMIE'

With the SQL language, it may be a bit harder to see that these are exactly equivalent SQL statements, but if we translate the above queries to relational algebra, it may become more visible:

Selection before projection

… or WHERE before SELECT :

Projection then selection

… or SELECT before WHERE :

Don’t be fooled by relational algebra‘s term “selection”. It does not correspond to the SELECT clause, but to the WHERE clause!

We can prove (let’s leave the exercise to the reader), that both expressions are exactly equivalent, so optimisers can pick whichever they have a more efficient matching algorithm for:

Ordinary row stores will probably apply the selection first, before projecting, as the expensive operation is accessing rows and if we can filter rows early (e.g. through indexes) then the whole query is less expensive)

A column store might (just a hypothesis) apply the projection first, as the expensive operation is accessing the columns. We then might have to traverse the entire column anyway to filter out the rows

Let’s Talk About JOINs (and Their Elimination)

JOIN elimination is one of the most basic and yet most powerful SQL transformations, which is implemented by most modern databases in one way or another. It is so powerful, because we’re writing (potentially useless) JOINs all the time, when writing a bit more complex queries. See also our article about JOINs for a general overview.

Now, consider the following simplified schema, taken from the Sakila database:

CREATE TABLE address ( address_id INT NOT NULL, address VARCHAR(50) NOT NULL, CONSTRAINT pk_address PRIMARY KEY (address_id) ); CREATE TABLE customer ( customer_id INT NOT NULL, first_name VARCHAR(45) NOT NULL, last_name VARCHAR(45) NOT NULL, address_id INT NOT NULL, CONSTRAINT pk_customer PRIMARY KEY (customer_id), CONSTRAINT fk_customer_address FOREIGN KEY (address_id) REFERENCES address(address_id) );

Let’s ignore indexing and other useful features for this example.

INNER JOIN Elimination

The following query shows a common JOIN use-case, joining a parent table ( ADDRESS ) to a child table ( CUSTOMER ) in a to-one relationship:

SELECT c.* FROM customer c JOIN address a ON c.address_id = a.address_id

We intended to fetch all customers and their addresses. But observe: We project only columns from the CUSTOMER table and we don’t have any predicates at all, specifically not predicates using the ADDRESS table. So, we’re completely ignoring any contributions from the ADDRESS table. We never really needed the JOIN in the first place!

And in fact, the optimiser can prove this too, because of the FOREIGN KEY constraint on C.ADDRESS_ID , which guarantees that every CUSTOMER record has exactly one corresponding ADDRESS record. The JOIN does not duplicate, nor remove any CUSTOMER rows, so it is unneeded and thus eliminated (by some, not all databases, will list each database at the end of the article).

So, the database can rewrite the SQL statement to the following, equivalent SQL statement in the presence of said FOREIGN KEY:

SELECT * FROM customer c

Now, quite obviously, this query will be faster than the previous one, if the entire JOIN can be avoided, and thus the entire access to the ADDRESS table. Neat, huh? Who would have thought that FOREIGN KEY s can be so useful in terms of performance.

The above works if there’s also a NOT NULL constraint on the FOREIGN KEY . If there isn’t, e.g. as in this query:

SELECT title FROM film f JOIN language l ON f.original_language_id = l.language_id

The JOIN can still be eliminated, but there needs to be a replacement NOT NULL predicate, as such:

SELECT title FROM film WHERE original_language_id IS NOT NULL

OUTER JOIN Elimination

A LEFT [ OUTER ] JOIN will JOIN the right table to the left table but keep rows from the left table if there is no match (again, an explanation of joins can be seen here). When we apply LEFT JOIN to the previous query…

SELECT c.* FROM customer c LEFT JOIN address a ON c.address_id = a.address_id

… then we’ll fetch all rows from CUSTOMER regardless if that customer has any ADDRESS . This is useful if the FOREIGN KEY is optional (nullable), or completely absent, e.g. through:

ALTER TABLE customer DROP CONSTRAINT fk_customer_address

OUTER JOIN is even easier to eliminate, as it doesn’t require any FOREIGN KEY constraint for the database to prove that it is unneeded. A UNIQUE constraint on the parent table (here: ADDRESS.ADDRESS_ID ) is sufficient to show that for every CUSTOMER there can be at most one ADDRESS , so the LEFT JOIN won’t duplicate any CUSTOMER rows (Unlike INNER JOIN , OUTER JOIN never remove rows).

Hence, the above query can again be rewritten to the more optimal:

SELECT * FROM customer c

OUTER JOIN Elimination with DISTINCT

Another interesting case of OUTER JOIN elimination is the following one, which unfortunately didn’t work on Oracle for a customer of ours, recently, in a complex query that ran rogue. Let’s look at some other tables of the Sakila database, which expose a to-many relationship:

CREATE TABLE actor ( actor_id numeric NOT NULL , first_name VARCHAR(45) NOT NULL, last_name VARCHAR(45) NOT NULL, CONSTRAINT pk_actor PRIMARY KEY (actor_id) ); CREATE TABLE film ( film_id int NOT NULL, title VARCHAR(255) NOT NULL, CONSTRAINT pk_film PRIMARY KEY (film_id) ); CREATE TABLE film_actor ( actor_id INT NOT NULL, film_id INT NOT NULL, CONSTRAINT pk_film_actor PRIMARY KEY (actor_id, film_id), CONSTRAINT fk_film_actor_actor FOREIGN KEY (actor_id) REFERENCES actor (actor_id), CONSTRAINT fk_film_actor_film FOREIGN KEY (film_id) REFERENCES film (film_id) );

Now, consider this query:

SELECT DISTINCT first_name, last_name FROM actor a LEFT JOIN film_actor fa ON a.actor_id = fa.actor_id

We’re looking for all actors and their films, but then we project only distinct actors. Again, the JOIN to FILM_ACTOR doesn’t contribute anything to the result, but because we’re joining a to-many relationship here (from parent table ACTOR to child table FILM_ACTOR ), the JOIN is producing duplicate rows. Without DISTINCT , we’d get something like:

FIRST_NAME LAST_NAME ---------------------- ... PENELOPE GUINESS PENELOPE GUINESS PENELOPE GUINESS NICK WAHLBERG NICK WAHLBERG ...

But thanks to the DISTINCT keyword, the result is (provably) no different from the result of this much simpler query:

SELECT DISTINCT first_name, last_name FROM actor a

(Note, DISTINCT cannot be eliminated, unless we already have a UNIQUE constraint on (FIRST_NAME, LAST_NAME) ).

Why not Just Refactor the SQL Manually?

Of course, all of this shouldn’t be needed if our SQL were perfect. In the above trivial examples, the SQL can (and should) be re-written manually to improve quality. But note that:

Developers make mistakes, and those mistakes may be very subtle when queries get more complex. I’ll show an example below.

The presence of this feature actually encourages writing more complex SQL, especially when using reusable views. I’ll show another example below.

Finally, I’ve previously advocated avoiding needless, mandatory work, like SELECT * . Such work is mandatory because the optimiser cannot prove its needlessness. In the case of these JOINs , the optimiser can prove the needlessness, so the work is no longer mandatory. It can be eliminated.

Here are some complex examples as promised, where this optimiser feature really shines:

Subtle Mistakes

Let’s consider the following query (in PostgreSQL syntax):

SELECT c.name, count(*) FROM actor a JOIN film_actor fa USING (actor_id) JOIN film f USING (film_id) JOIN film_category fc USING (film_id) JOIN category c USING (category_id) WHERE actor_id = 1 GROUP BY c.name ORDER BY count(*) DESC

What does it do? For ACTOR_ID = 1 (Penelope Guiness), we’re looking for all the different film categories she played in, and the number of films per category. This is easier to understand when we look at the result:

NAME COUNT Horror 3 Family 2 New 2 Classics 2 Games 2 Music 1 Sci-Fi 1 Animation 1 Sports 1 Children 1 Comedy 1 Documentary 1 Foreign 1

Now, can you spot the unneeded JOINs? In fact, we never needed ACTOR , nor did we need FILM

SELECT c.name, count(*) FROM film_actor fa JOIN film_category fc USING (film_id) JOIN category c USING (category_id) WHERE actor_id = 1 GROUP BY c.name ORDER BY count(*) DESC

Cool, eh? The JOINs can be eliminated (again, in some databases, see below) and our “mistake” is no longer relevant to the query. The mistake could have also snuck (or sneaked?) in from a previous query version, which may have looked like this, projecting also the actor information and the list of films per category, in case of which the additional JOIN are needed:

SELECT c.name, count(*), a.first_name, a.last_name, array_agg(f.title ORDER BY f.title) FROM actor a JOIN film_actor fa USING (actor_id) JOIN film f USING (film_id) JOIN film_category fc USING (film_id) JOIN category c USING (category_id) WHERE actor_id = 1 GROUP BY c.name, a.first_name, a.last_name ORDER BY count(*) DESC

The result being:

NAME COUNT FIRST_NAME LAST_NAME FILMS Horror 3 PENELOPE GUINESS {"ELEPHANT TROJAN","LADY STAGE","RULES HUMAN"} Family 2 PENELOPE GUINESS {"KING EVOLUTION","SPLASH GUMP"} New 2 PENELOPE GUINESS {"ANGELS LIFE","OKLAHOMA JUMANJI"} Classics 2 PENELOPE GUINESS {"COLOR PHILADELPHIA","WESTWARD SEABISCUIT"} Games 2 PENELOPE GUINESS {"BULWORTH COMMANDMENTS","HUMAN GRAFFITI"} Music 1 PENELOPE GUINESS {"WIZARD COLDBLOODED"} Sci-Fi 1 PENELOPE GUINESS {"CHEAPER CLYDE"} Animation 1 PENELOPE GUINESS {"ANACONDA CONFESSIONS"} Sports 1 PENELOPE GUINESS {"GLEAMING JAWBREAKER"} Children 1 PENELOPE GUINESS {"LANGUAGE COWBOY"} Comedy 1 PENELOPE GUINESS {"VERTIGO NORTHWEST"} Documentary 1 PENELOPE GUINESS {"ACADEMY DINOSAUR"} Foreign 1 PENELOPE GUINESS {"MULHOLLAND BEAST"}

As you can see, this optimisation can be very useful on your legacy SQL, because if we maintain a complex query, we might not always be able to see all the JOINs that are really needed.

Reusable Views

Sometimes, we simply add additional JOINs for convenience, when building complex queries from simpler ones, e.g. by using views (which is a completely underrated RDBMS feature! You should all write more views).

Consider this view:

CREATE VIEW v_customer AS SELECT c.first_name, c.last_name, a.address, ci.city, co.country FROM customer c JOIN address a USING (address_id) JOIN city ci USING (city_id) JOIN country co USING (country_id)

It’s not unlikely that we will write a view like this, simply because we’re incredibly bored to constantly join all these tables all the time. Every time we do something with customers and addresses, we need the CITY and COUNTRY table as well.

From now on (with this view), we can simply select from the view and “forget” about how it came to be. Now, let’s consider we completely forget about the underlying table, because the view was so useful all the time. We could think about doing this:

SELECT first_name, last_name FROM v_customer

What do you think will happen? Exactly. JOIN elimination. A view isn’t really anything special, just a “macro” of some stored SQL (beware of some databases, where this isn’t always the case, e.g. MySQL, which Bill Karwin was kind enough to hint me at). So the above statement will be transformed into:

SELECT first_name, last_name FROM ( SELECT c.first_name, c.last_name, a.address, ci.city, co.country FROM customer c JOIN address a USING (address_id) JOIN city ci USING (city_id) JOIN country co USING (country_id) ) v_customer

… which can be transformed into this (we don’t need all columns in the nested select):

SELECT first_name, last_name FROM ( SELECT c.first_name, c.last_name FROM customer c JOIN address a USING (address_id) JOIN city ci USING (city_id) JOIN country co USING (country_id) ) v_customer

… which can be transformed into this (JOINs can be eliminated):

SELECT first_name, last_name FROM ( SELECT c.first_name, c.last_name FROM customer c ) v_customer

… and finally (the subquery is not useful):

SELECT first_name, last_name FROM customer

The view is even very useful for this particular query, thanks to JOIN elimination!

Note, the SQL transformations exposed above are simply educational. Actual optimisers may perform transformations in an entirely differently, or in a different order. This is just to show what’s possible, and what kind of stuff is being done.

Cool, So Can My Database Do It?

Perhaps! Let’s look at the three different types of JOIN elimination in the context of these databases:

DB2 LUW 10.5

MySQL 8.0.2

Oracle 12.2.0.1

PostgreSQL 9.6

SQL Server 2014

INNER JOIN Elimination

Remember, this depends on the presence (and usefulness) of a FOREIGN KEY constraint. The SQL statement we’re using here is:

SELECT first_name, last_name FROM customer c JOIN address a ON c.address_id = a.address_id

We’re hoping to get:

SELECT first_name, last_name FROM customer c

DB2 LUW

The following execution plan (created with Markus Winand’s cool utility) shows that this works in DB2, there’s no access to the ADDRESS table:

Explain Plan | -----------------------------------------------------------------------| ID | Operation | Rows | Cost | 1 | RETURN | | 61 | 2 | FETCH CUSTOMER | 599 of 599 (100.00%) | 61 | 3 | IXSCAN IDX_CUSTOMER_FK_ADDRESS_ID | 599 of 599 (100.00%) | 20 |

MySQL

MySQL 8, apart from finally introducing CTE and window functions (yay), has a lot of new optimiser features, read Morgan Tocker’s useful optimiser guide for details. Unfortunately, INNER JOIN elimination is not implemented:

ID TABLE TYPE REF ROWS EXTRA 1 c ALL 599 1 a eq_ref sakila.c.address_id 1 Using index

Not only is the JOIN executed, but it is executed using a nested loop with 599 index lookups, as MySQL still only supports NESTED LOOP JOINs, not HASH JOINs.

Bummer.

Oracle

No problem at all for Oracle:

------------------------------------------------------------------------------ | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ------------------------------------------------------------------------------ | 0 | SELECT STATEMENT | | 599 | 28752 | 5 (0)| 00:00:01 | | 1 | TABLE ACCESS FULL| CUSTOMER | 599 | 28752 | 5 (0)| 00:00:01 | ------------------------------------------------------------------------------

… the JOIN is eliminated as expected.

PostgreSQL

Unfortunately, PostgreSQL cannot eliminate INNER JOIN:

Hash Join (cost=19.57..42.79 rows=599 width=13) Hash Cond: (c.address_id = a.address_id) -> Seq Scan on customer c (cost=0.00..14.99 rows=599 width=15) -> Hash (cost=12.03..12.03 rows=603 width=4) -> Seq Scan on address a (cost=0.00..12.03 rows=603 width=4)

Not as bad as in MySQL, though, as PostgreSQL chose to use a HASH JOIN to combine the two tables.

SQL Server

No problemo for SQL Server, the ADDRESS table access is gone!

|--Table Scan(OBJECT:([sakila].[dbo].[customer] AS [c]))

Notice, however, that SQL Server can only eliminate INNER JOIN on NOT NULL FOREIGN KEYs!

Excellent. What about…

OUTER JOIN Elimination

This one is a bit easier to prove for the database, remember? We don’t rely on any FOREIGN KEY anymore. A UNIQUE key in the parent table is sufficient to eliminate an OUTER JOIN . We can safely expect that if the INNER JOIN could be eliminated (DB2, Oracle, SQL Server), then an OUTER JOIN can be eliminated, too.

Here’s the query:

SELECT first_name, last_name FROM customer c LEFT JOIN address a ON c.address_id = a.address_id

And the outcome:

DB2 LUW

Good

Explain Plan | -----------------------------------------------------------------------| ID | Operation | Rows | Cost | 1 | RETURN | | 61 | 2 | FETCH CUSTOMER | 599 of 599 (100.00%) | 61 | 3 | IXSCAN IDX_CUSTOMER_FK_ADDRESS_ID | 599 of 599 (100.00%) | 20 |

MySQL

Still nope:

ID TABLE TYPE REF ROWS EXTRA 1 c ALL 599 1 a eq_ref sakila.c.address_id 1 Using index

Oracle

Perfect:

------------------------------------------------------------------------------ | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ------------------------------------------------------------------------------ | 0 | SELECT STATEMENT | | 599 | 28752 | 5 (0)| 00:00:01 | | 1 | TABLE ACCESS FULL| CUSTOMER | 599 | 28752 | 5 (0)| 00:00:01 | ------------------------------------------------------------------------------

PostgreSQL

Unlike INNER JOIN elimination, this works. Great!

Seq Scan on customer c (cost=0.00..14.99 rows=599 width=13)

SQL Server

As expected, good:

|--Table Scan(OBJECT:([sakila].[dbo].[customer] AS [c]))

Finally…

OUTER JOIN Elimination with DISTINCT

Remember, the query here navigates a to-many relationship, producing duplicate records of the parent table, but then removes all those duplicates again by

Ignoring contributions from the child table

Removing duplicates with DISTINCT

It’s easy to prove that this:

SELECT DISTINCT first_name, last_name FROM actor a LEFT JOIN film_actor fa ON a.actor_id = fa.actor_id

Is equivalent to this:

SELECT DISTINCT first_name, last_name FROM actor a

Remember also, this only works with OUTER JOIN , not with INNER JOIN , as the latter might remove rows, so we have to execute it to see if it does.

DB2 LUW

Cool, this actually works!

Explain Plan | ----------------------------------------------------------------------| ID | Operation | Rows | Cost | 1 | RETURN | | 20 | 2 | TBSCAN | 200 of 200 (100.00%) | 20 | 3 | SORT (UNIQUE) | 200 of 200 (100.00%) | 20 | 4 | TBSCAN ACTOR | 200 of 200 (100.00%) | 20 |

There’s no access to the FILM_ACTOR table, nor to its indexes. Very nice.

MySQL

As this is a more sophisticated transformation than the previous ones, we don’t have high hopes here.

ID TABLE TYPE REF ROWS EXTRA 1 a ALL 200 Using temporary 1 fa ref sakila.a.actor_id 27 Using index; Distinct

This has become a rather expensive query, again because of the lack of HASH JOIN support in MySQL!

Oracle

I’m very surprised to see that Oracle doesn’t support this optimisation, we’re executing the full query:

--------------------------------------------------------------- | Id | Operation | Name | Rows | --------------------------------------------------------------- | 0 | SELECT STATEMENT | | 5462 | | 1 | HASH UNIQUE | | 5462 | | 2 | NESTED LOOPS OUTER| | 5462 | | 3 | TABLE ACCESS FULL| ACTOR | 200 | |* 4 | INDEX RANGE SCAN | IDX_FK_FILM_ACTOR_ACTOR | 27 | ---------------------------------------------------------------

Curiously, Oracle also chose a NESTED LOOP JOIN in this case, even if we could have loaded the entire index on the FILM_ACTOR table into memory and then HASH JOINed it to ACTOR . Note that the cardinality estimate of the resulting query is quite off, too, despite the DISTINCT operation. This can lead to significant effects in an upstream query, which selects from this query (e.g. if stored as a view) – which is what happened to our customer.

PostgreSQL

PostgreSQL also doesn’t support this elimination, but at least gets cardinality estimates much more accurately and chooses a HASH JOIN operation:

HashAggregate (cost=193.53..194.81 rows=128 width=13) Group Key: a.first_name, a.last_name -> Hash Right Join (cost=6.50..166.22 rows=5462 width=13) Hash Cond: (fa.actor_id = a.actor_id) -> Seq Scan on film_actor fa (cost=0.00..84.62 rows=5462 width=2) -> Hash (cost=4.00..4.00 rows=200 width=17) -> Seq Scan on actor a (cost=0.00..4.00 rows=200 width=17)

SQL Server

The pleasant surprise came from SQL Server, which does support this optimisation too:

|--Sort(DISTINCT ORDER BY:([a].[first_name] ASC, [a].[last_name] ASC)) |--Table Scan(OBJECT:([sakila].[dbo].[actor] AS [a]))

As you can see, no access to any FILM_ACTOR related objects!

Summary

Here’s a summary of what databases can eliminate:

Database INNER JOIN:

to-one INNER JOIN nullable:

to-one OUTER JOIN:

to-one OUTER JOIN DISTINCT:

to-many DB2 LUW 10.5 Yep Yep Yep Yep MySQL 8.0.2 Nope Nope Nope Nope Oracle 12.2.0.1 Yep Yep Yep Nope PostgreSQL 9.6 Nope Nope Yep Nope SQL Server 2014 Yep Nope Yep Yep

Conclusion

JOIN elimination is a very simple to understand, yet incredibly powerful feature that modern databases support to help developers build and maintain complex SQL queries without worrying too much about performance side effects.

It is possible because SQL is a 4GL (fourth-generation programming language), i.e. a declarative language whose parsed expression trees can “easily” be transformed into equivalent, simpler, and faster to execute expression trees. This work is constantly done by the optimiser behind the scenes, often without you even noticing (see example where unnecessary ACTOR and FILM tables were removed).

Currenly, DB2 and SQL Server are the leaders here with Oracle being a close runner-up, at least in my investigation. There’s hope for PostgreSQL, and a bit less hope for MySQL right now. There are tons of other interesting SQL transformations, which I’ll blog about in future blog posts, which may make the difference in other kinds of complex queries.

If you were intrigued by this sort of functionality, do have a look at my most recent SQL talk, which helps developers understand the real value of SQL in the context of the optimiser: