Tony Hoare, who is mostly referred to as the inventor of the NULL reference, now calls it a billion-dollar mistake which pretty much all languages are now “suffering” from, including SQL.

Quoting Tony (from his Wikipedia article):

I call it my billion-dollar mistake. It was the invention of the null reference in 1965. At that time, I was designing the first comprehensive type system for references in an object oriented language (ALGOL W). My goal was to ensure that all use of references should be absolutely safe, with checking performed automatically by the compiler. But I couldn’t resist the temptation to put in a null reference, simply because it was so easy to implement. This has led to innumerable errors, vulnerabilities, and system crashes, which have probably caused a billion dollars of pain and damage in the last forty years.

The interesting thing here is that Tony was tempted to implement that reference because it was easy to do. But why did he even need such a reference?

The different meanings of NULL

In a perfect world, we wouldn’t need NULL. Every person has a first name and a last name. Every person has a birth date, a job, etc. Or do they?

Unfortunately, they don’t.

Not all countries use the concept of first and last names.

Not all people have a job. Or sometimes, we don’t know their job. Or we don’t care.

This is where NULL is extremely useful. NULL can model all of these states that we don’t really want to model. NULL can be:

The “undefined” value , i.e., the value that is not yet defined (probably for technical reasons) but may well be defined later. Think about a person that we want to add to the database in order to use it in other tables. At some later stage, we’ll add that person’s job.

, i.e., the value that is not yet defined (probably for technical reasons) but may well be defined later. Think about a person that we want to add to the database in order to use it in other tables. At some later stage, we’ll add that person’s job. The “unknown” value , i.e., the value that we do not know (and may never know). Perhaps we can no longer ask this person or their relatives about their birth date – the info will be forever lost. But we still want to model the person, so we use NULL in the sense of UNKNOWN (which is its true meaning in SQL, as we’ll see later on).

, i.e., the value that we do not know (and may never know). Perhaps we can no longer ask this person or their relatives about their birth date – the info will be forever lost. But we still want to model the person, so we use NULL in the sense of UNKNOWN (which is its true meaning in SQL, as we’ll see later on). The “optional” value , i.e., the value that doesn’t need to be defined. Note that the “optional” value also appears in the case of an OUTER JOIN, when the outer join doesn’t produce any values on one side of the relationship. Or also when using GROUPING SETS, where different combinations of GROUP BY columns are combined (or left empty).

, i.e., the value that doesn’t need to be defined. Note that the “optional” value also appears in the case of an OUTER JOIN, when the outer join doesn’t produce any values on one side of the relationship. Or also when using GROUPING SETS, where different combinations of GROUP BY columns are combined (or left empty). The “deleted” or “avoided” value , i.e., the value that we don’t want to specify. Perhaps we usually register a person’s marital status as is done in some jurisdictions, but not in others, where it is not legal to register any personal data of this type. Hence, we don’t want to know this value in some cases.

, i.e., the value that we don’t want to specify. Perhaps we usually register a person’s marital status as is done in some jurisdictions, but not in others, where it is not legal to register any personal data of this type. Hence, we don’t want to know this value in some cases. The “special” value in a given context , i.e., the value that we cannot model otherwise in the range of possible values. This is often done when working with date ranges. Let’s assume a person’s job is bounded by two dates, and if the person is currently working in that position, we’ll use NULL to say that the period is unbounded at the end of the date range.

, i.e., the value that we cannot model otherwise in the range of possible values. This is often done when working with date ranges. Let’s assume a person’s job is bounded by two dates, and if the person is currently working in that position, we’ll use NULL to say that the period is unbounded at the end of the date range. The “accidental” NULL, i.e., the NULL value that is just NULL because developers didn’t pay attention. In the absence of an explicit NOT NULL constraint, most databases assume for columns to be nullable. And once columns are nullable, developers might just “accidentally” put NULL values in their rows, where they didn’t even intend to.

As we’ve seen above these are only a select few of the 50 Shades of NULL.

The following example displays various different meanings of NULL in a concrete SQL example:

CREATE TABLE company ( id int NOT NULL, name text NOT NULL, CONSTRAINT company_pk PRIMARY KEY (id) ); CREATE TABLE job ( person_id int NOT NULL, start_date date NOT NULL, -- If end_date IS NULL, the “special value” of an unbounded -- interval is encoded end_date date NULL, description text NOT NULL, -- A job doesn’t have to be done at a company. It is “optional”. company_id int NULL, CONSTRAINT job_pk PRIMARY KEY (person_id,start_date), CONSTRAINT job_company FOREIGN KEY (company_id) REFERENCES company (id) ); CREATE TABLE person ( id int NOT NULL, first_name text NOT NULL, -- Some people need to be created in the database before we -- know their last_names. It is “undefined” last_name text NULL, -- We may not know the date_of_birth. It is “unknown” date_of_birth date NULL, -- In some situations, we must not define any marital_status. -- It is “deleted” marital_status int NULL, CONSTRAINT person_pk PRIMARY KEY (id), CONSTRAINT job_person FOREIGN KEY (person_id) REFERENCES person (id) );

People have always argued about the absence of a value

When NULL is such a useful value, why do people keep criticizing it?

All of these previous use-cases for NULL (and others) are displayed in this interesting, recent talk by C.J. Date on “The Problem of Missing Information” (watch the video on YouTube).

Modern SQL can do a lot of awesome things that few developers of general-purpose languages like Java, C#, PHP are unaware of. I’ll show you an example further down.

In a way, C.J. Date agrees with Tony Hoare that (ab)using NULL for all of these different types of “missing information” is a very bad choice.

For instance, in electronics, similar techniques are applied to model things like 1, 0, “conflict”, “unassigned”, “unknown”, “don’t care”, “high impedance”. Notice though, how in electronics, different special values are used for these things, rather than a single special NULL value. Is this really better? How do JavaScript programmers feel about the distinction between different “falsy” values, like “null”, “undefined”, “0”, “NaN”, the empty string ‘’? Is this really better?

Speaking of zero: When we leave the SQL space for a moment and go into maths, we’ll see that ancient cultures like the Romans or the Greek had the same problems with the number zero. In fact, they didn’t even have any way to represent zero unlike other cultures as can be seen in the Wikipedia article about the number zero. Quoting from the article:

Records show that the ancient Greeks seemed unsure about the status of zero as a number. They asked themselves, “How can nothing be something?”, leading to philosophical and, by the Medieval period, religious arguments about the nature and existence of zero and the vacuum.

As we can see, the “religious arguments” clearly extend to computer science and software, where we still don’t know for sure what to do with the absence of a value.

Back to reality: NULL in SQL

While people (including academics) still do not agree on the fact whether we need any encoding for “undefined”, “unknown”, “optional”, “deleted”, “special”, let us get back to reality and the bad parts about SQL’s NULL.

One thing that is frequently forgotten when dealing with SQL’s NULL is that it formally implements the UNKNOWN case, which is a special value that is part of so-called three-valued logic, and it does so, inconsistently, e.g. in the case of UNION or INTERSECT operations.

If we go back to our model:

If, for instance, we want to find all people who are not registered as being married, intuitively, we’d like to write the following statement:

SELECT * FROM person WHERE marital_status != 'married'

Unfortunately, because of three-valued logic and SQL’s NULL, the above query will not return those values that do not have any explicit marital_status. Hence, we’ll need to write an additional, explicit predicate:

SELECT * FROM person WHERE marital_status != 'married' OR marital_status IS NULL

Or, we coerce the value to some NOT NULL value prior to comparing it

SELECT * FROM person WHERE COALESCE(marital_status, 'null') != 'married'

Three valued logic is hard. And it isn’t the only problem with NULL in SQL. Here are more disadvantages of using NULL:

There is only one NULL, when we really wanted to encode several different “absent” or “special” values. The range of useful special values highly depends on the domain and the data types that are used. Yet, domain knowledge is always required to correctly interpret the meaning of a nullable column, and queries must be designed carefully in order to prevent the wrong results from being returned, as we saw above.

Again, three-valued logic is very hard to get right. While the above example is still rather simple, what do you think the following query will yield? SELECT * FROM person WHERE marital_status NOT IN ('married', NULL) Exactly. It will not yield anything at all, as explained in this article here. In short, the above query is the same as the below one: SELECT * FROM person WHERE marital_status != 'married' AND marital_status != NULL -- This is always NULL / UNKNOWN

The Oracle database treats NULL and the empty string '' as the same thing. This is very tricky as you won’t immediately notice why the following query always returns an empty result: SELECT * FROM person WHERE marital_status NOT IN ('married', '')

Oracle (again) doesn’t put NULL values in indexes. This is the source of many nasty performance issues, e.g., when you’re using a nullable column in a NOT IN predicate as such: SELECT * FROM person WHERE marital_status NOT IN ( SELECT some_nullable_column FROM some_table ) With Oracle, the above anti-join will result in a full table scan, regardless if you have an index on some_nullable_column. Because of three-valued logic and because Oracle doesn’t put NULLs in indexes, the engine will need to hit the table and check every value just to be sure there isn’t at least one NULL value in the set, which would make the whole predicate UNKNOWN.

Conclusion

We haven’t solved the NULL problem yet in most languages and platforms. While I claim that NULL is NOT the billion dollar mistake that Tony Hoare tries to apologize for, NULL is certainly far from perfect either.

If you want to stay on the safe side with your database design, avoid NULLs at all costs, unless you absolutely need one of those special values to encode using NULL. Remember, these values are: “undefined”, “unknown”, “optional”, “deleted”, and “special”, and more: The 50 Shades of NULL. If you are not in such a situation, always default to adding a NOT NULL constraint to every column in your database. Your design will be much cleaner, and your performance much better.

If only NOT NULL were the default in DDL, and NULLABLE the keyword that needed to be set explicitly…

What are your takes and experiences with NULL? How would a better SQL work in your opinion?