In PostgreSQL (as of 8.3, at least), performance of DISTINCT clause in SELECT list is quite poor.

Probably because DISTINCT code in PostgreSQL is very, very old, it always acts in same dumb way: sorts the resultset and filters out the duplicate records.

GROUP BY that can be used for the same purpose is more smart, as it employs more efficient HashAggregate , but its performance is still poor for large dataset.

All major RDBMS 's, including MySQL, are able to jump over index keys to select DISTINCT values from an indexed table. This is extremely fast if there are lots of records in a table but not so many DISTINCT values.

This behavior can be emulated in PostgreSQL too.

Let's create a sample table:



CREATE TABLE t_distinct ( id INT NOT NULL PRIMARY KEY, grouper INT NOT NULL, value VARCHAR(50) NOT NULL ); CREATE INDEX ix_distinct_grouper ON t_distinct (grouper); BEGIN TRANSACTION; SELECT SETSEED(0.20090503); INSERT INTO t_distinct (id, grouper, value) SELECT s, (s - 1) % 10 + 1, 'Value ' || s FROM generate_series(1, 1000000) s; COMMIT; ANALYZE t_distinct;

This table has 1,000,000 rows and 10 DISTINCT grouper 's. There is an index on grouper .

The built-in DISTINCT clause:

SELECT DISTINCT grouper FROM t_distinct

grouper 1 2 3 4 5 6 7 8 9 10 10 rows fetched in 0.0003s (1.3437s)

Unique (cost=0.00..56583.08 rows=10 width=4) -> Index Scan using ix_distinct_grouper on t_distinct (cost=0.00..54082.85 rows=1000090 width=4)

works for 1.34 seconds, while GROUP BY :

SELECT grouper FROM t_distinct GROUP BY grouper

grouper 6 9 3 5 7 1 2 10 4 8 10 rows fetched in 0.0002s (0.4325s)

HashAggregate (cost=18871.13..18871.22 rows=10 width=4) -> Seq Scan on t_distinct (cost=0.00..16370.90 rows=1000090 width=4)

is little more efficient, but still takes 432 milliseconds.

To emulate the INDEX GROUP BY used by MySQL , we just need to jump over the index keys, issuing

SELECT MIN(grouper) FROM t_distinct WHERE grouper > current_grouper

PostgreSQL is smart enough to use the indexes for such a query.

Unfortunately, PostgreSQL doesn't offer a simple way to use session variables (you need to edit configuration files to access them which can be a pain on a database you have no control over).

Even PL/PgSQL doesn't get installed by default on some distributions and cannot be relied upon.

Fortunately, there is a workaround using recursive set-returning functions.

PostgreSQL allows recursion for SQL functions, and the recursion stack can be used to record the current_grouper used in the query.

The function looks like this:

CREATE OR REPLACE FUNCTION "20090503_distinct".skipscan_distinct(INT) RETURNS SETOF "20090503_distinct".t_distinct.grouper%TYPE AS $$ SELECT MIN(grouper) FROM "20090503_distinct".t_distinct WHERE grouper = $1 HAVING MIN(grouper) IS NOT NULL UNION ALL SELECT * FROM "20090503_distinct".skipscan_distinct ( ( SELECT MIN(grouper) FROM "20090503_distinct".t_distinct WHERE grouper > $1 HAVING MIN(grouper) IS NOT NULL ) ) WHERE $1 IS NOT NULL $$ LANGUAGE SQL;

This function takes a current_grouper as a parameter, and selects the current_grouper UNION 'ed with the function's result for the next grouper .

Let's see the performance:

SELECT * FROM "20090503_distinct".skipscan_distinct(-1)

skipscan_distinct 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 15 rows fetched in 0.0003s (0.0381s)

Function Scan on skipscan_distinct (cost=0.00..260.00 rows=1000 width=4)

38 ms, much faster than GROUP BY .

However, there are side effects:

This function uses recursion, and a stack overflow can occur if there are too many DISTINCT values

values This function is efficient only if there is an index on the DISTINCT field

But for the cases like described above, this function is a real boon.