A while ago, I blogged about the importance of avoiding unnecessary COUNT(*) queries:

https://blog.jooq.org/2014/08/08/sql-tip-of-the-day-be-wary-of-select-count

… and how to replace them with equivalent EXISTS queries

As I’m updating the jOOQ SQL Masterclass to show also PostgreSQL performance characteristics in addition to Oracle, I really have to reiterate this topic. Please repeat after me:

Thou shalt not use COUNT(*) when EXISTS sufficeth thy needs

The rationale is simple

COUNT(*) needs to return the exact number of rows. EXISTS only needs to answer a question like:

“Are there any rows at all?”

In other words, EXISTS can short-circuit after having found the first matching row. If your client code (e.g. written in Java or in PL/SQL, or any other client language) needs to know something like:

“Did actors called “Wahlberg” play in any films at all?”

Then you have two options to write that query:

Very very bad: Use COUNT(*)

Using PostgreSQL syntax:

SELECT count(*) FROM actor a JOIN film_actor fa USING (actor_id) WHERE a.last_name = 'WAHLBERG'

The above query will return a number > 0 if we any Wahlberg played in a film, or 0 if not. Notice that we don’t care how many films all the Wahlbergs played in, yet we ask the database to calculate the precise number.

Let’s run the above query against the Sakila database. The execution plans for the above query in Oracle:

And in PostgreSQL:

Much much better: Use EXISTS()

Using PostgreSQL syntax:

SELECT EXISTS ( SELECT * FROM actor a JOIN film_actor fa USING (actor_id) WHERE a.last_name = 'WAHLBERG' )

The execution plans for the above query in Oracle:

And in PostgreSQL:

How to read this?

As you can see from the above execution plans, the cost in Oracle is slightly better (going from 3 to 2) when using EXISTS than when using COUNT(*), because of a much better cardinality estimate in the middle of the execution plan. In other words, Oracle “knows” that we’re looking for only one record and as soon as that record has been found, we can stop looking.

In PostgreSQL, things are more drastic (going from 123 to 3.4). The EXISTS version has an associated cost that is almost 30x lower than the version that uses COUNT(*) for the same result.

You can gaze at the plan for a while and figure out what the exact difference is, or you can believe me that this is true:

It is obviously much faster to check for existence rather than to count all results, if what you’re really trying to do is checking for existence

Duh.

Does this apply to me?

Yes. I’m taking bets. Many many code bases out there get this wrong all the time. Checking for sizes to be zero is just too convenient. Not only in SQL, but also in Java. Consider this. Which one is better?

Collection<?> collection = ... // EXISTS if (!collection.isEmpty()) doSomething(); // COUNT(*) if (collection.size() == 0) doSomething();

Sometimes, this doesn’t really matter, e.g. in ArrayList , whose isEmpty() method reads:

public boolean isEmpty() { return size == 0; }

But what if your collection is a lazy loaded Hibernate collection? Not all collections cache this size value, and even if they do, they may still produce overhead in the source system in order to calculate the exact size. In fact, they might even run a completely unnecessary query fetching all the child entities from the database just to check for existence.

Bonus exercise for my Hibernate-aficionado readers out there: Do the exercise with Hibernate. Because at this point, I for one would say: Just use SQL™

OK, costs. But what does it mean?

Let’s benchmark these two statements in Oracle and PostgreSQL.

Oracle

SET SERVEROUTPUT ON DECLARE v_ts TIMESTAMP WITH TIME ZONE; v_repeat CONSTANT NUMBER := 10000; BEGIN v_ts := SYSTIMESTAMP; FOR i IN 1..v_repeat LOOP FOR rec IN ( SELECT CASE WHEN EXISTS ( SELECT * FROM actor a JOIN film_actor fa USING (actor_id) WHERE a.last_name = 'WAHLBERG' ) THEN 1 ELSE 0 END FROM dual ) LOOP NULL; END LOOP; END LOOP; dbms_output.put_line('Statement 1 : ' || (SYSTIMESTAMP - v_ts)); v_ts := SYSTIMESTAMP; FOR i IN 1..v_repeat LOOP FOR rec IN ( SELECT count(*) FROM actor a JOIN film_actor fa USING (actor_id) WHERE a.last_name = 'WAHLBERG' ) LOOP NULL; END LOOP; END LOOP; dbms_output.put_line('Statement 2 : ' || (SYSTIMESTAMP - v_ts)); END; /

We get a slight but significant performance improvement of factor 1.3x:

Statement 1 : 3 Statement 2 : 4

(not actual times, because thank you Oracle legal for prohibiting all sorts of stuff). But you can check out the Sakila database yourself and run the above benchmark on your machine.

PostgreSQL

DO $$ DECLARE v_ts TIMESTAMP; v_repeat CONSTANT INT := 1000; rec RECORD; BEGIN v_ts := clock_timestamp(); FOR i IN 1..v_repeat LOOP FOR rec IN ( SELECT CASE WHEN EXISTS ( SELECT * FROM actor a JOIN film_actor fa USING (actor_id) WHERE a.last_name = 'WAHLBERG' ) THEN 1 ELSE 0 END ) LOOP NULL; END LOOP; END LOOP; RAISE INFO 'Statement 1: %', (clock_timestamp() - v_ts); v_ts := clock_timestamp(); FOR i IN 1..v_repeat LOOP FOR rec IN ( SELECT count(*) FROM actor a JOIN film_actor fa USING (actor_id) WHERE a.last_name = 'WAHLBERG' ) LOOP NULL; END LOOP; END LOOP; RAISE INFO 'Statement 2: %', (clock_timestamp() - v_ts); END$$;

A whopping factor 40x in terms of wallclock time gain!

INFO: Statement 1: 00:00:00.023656 INFO: Statement 2: 00:00:00.7944

Let me repeat this:

Factor 40x on PostgreSQL

That’s something! It looks as though COUNT(*) is much better optimised on Oracle (e.g. by counting leaf nodes in an index) than on PostgreSQL, but in any case, the amount of extra work is prohibitive in both databases.

Conclusion

I’m repeating myself, but this is important. Print it out and put it on your office wall:

Thou shalt not use COUNT(*) when EXISTS sufficeth thy needs

Thank you.