SQL: beware of NULL in WHERE NOT IN

If you have ever written some SQL you are probably aware of the special status of NULL in SQL.

To refresh your mind: NULL indicates absence of a value. You cannot use NULL in a comparison, it will always result in NULL. If this does not lead to the desired result there are special operators to deal with NULL explicitly. It behaves similar to NaN in floating point operations.

Example code:

[sourcecode language=”sql”]

CREATE TABLE pony

(

id INT PRIMARY KEY,

name VARCHAR(255)

);

INSERT INTO pony (id, name)

VALUES

(1, ‘Twilight Sparkle’),

(2, ‘Rainbow Dash’),

(3, ‘Pinkie Pie’),

(4, ‘Rarity’),

(5, ‘Applejack’);

SELECT * FROM pony;

— 5 rows as expected

SELECT * FROM pony

WHERE id = NULL;

— 0 rows as expected

SELECT * FROM pony

WHERE id != NULL;

— 0 rows, slight wtf

SELECT * FROM pony

WHERE id IS NOT NULL;

— 5 rows as expected

[/sourcecode]

NULL still works intuitively when using WHERE IN:

[sourcecode language=”sql”]

SELECT * FROM pony

WHERE id IN (1, 2, NULL);

— 2 rows as expected

— equivalent statement:

SELECT * FROM pony

WHERE id = 1

OR id = 2

OR id = null;

[/sourcecode]

WHERE NOT IN is where things get tricky:

[sourcecode language=”sql”]

SELECT * FROM pony

WHERE id NOT IN (1, 2, NULL);

— 0 rows, major wtf

[/sourcecode]

It makes sense if you split the clause into individual comparisons:

[sourcecode language=”sql”]

SELECT * FROM pony

WHERE NOT (

id = 1

OR id = 2

OR id = NULL

);

[/sourcecode]

And then remove the parenthesis using De Morgan’s laws:

[sourcecode language=”sql”]

SELECT * FROM pony

WHERE id != 1

AND id != 2

AND id != NULL;

[/sourcecode]

Like explained in the intro, id != NULL is always NULL, therefor the entire WHERE clause is always FALSE.

Practical advice and the subquery problem

If you are using external data, for example from XML or a CSV, you should filter out empty values.

More insidiously NULL can sneak in via a subquery. Lets say we want to know how many ponies are currently out of town:

[sourcecode language=”sql”]

WITH ponies_in_town AS (

SELECT pony.id, pony.name

FROM town

LEFT JOIN town_pony ON town

WHERE town.name = ‘Ponyville’

)

SELECT

COUNT(*) AS num_ponies_out_of_town

FROM pony

WHERE id NOT IN (

SELECT id FROM ponies_in_town

)

[/sourcecode]

If Applejack is alone in town, it says 4 ponies have left. If Applejack also leaves, it says 0 ponies have left. A fix for this is left as an exercise for the reader.

Subqueries are good

Let me emphasize this. The alternative to a subquery is to place table joins in the main query. This can lead to a more subtle insidious problem, which is duplicate rows. SQL using subqueries is more “modular” and therefore easier to understand and reuse. It can also be faster.

The example above as it stands does not demonstrate this but if we add a few more requirements we can demonstrate these benefits. So keep using WHERE NOT IN, just mind NULL.

But what if i actually want to take into account NULLS?

Make a separate where clause for null: WHERE val NOT IN (1, 2) OR val IS NULL

Author | 02 mrt 2018