So, couple of days ago, some guy, from Periscope company wrote a blogpost about getting number of distinct elements, per group, faster using subqueries.

This was then submitted to Hacker News and r/Programming on Reddit.

Then, the original authors submitted second blogpost comparing speed between four different DB engines. Which, in turn, was also commented on Reddit.

I found the numbers presented by Periscope (as their improvement) as not that great.

Unfortunately – their blog doesn't allow for comments, so I decided to test it, and write on my own blog, what I can find about it.

First, what I gathered from their blogposts:

2 tables: dashboards (with columns: name, id and possibly something else) and time_on_site_logs (with columns: user_id, dashboard_id, and possibly something else)

dashboards has 1200 rows

time_on_site_logs has 14 million rows

there is 1700 distinct (dashboard_id, user_id) pairs in time_on_site_logs

Since, I have no idea about table width (which can be important), I will make some guesses about number of columns that should be there. There is also no information about data distribution in time_on_site_logs (as in: average number of dashboards per user, average number of rows per user, and so on). So I'll have to improvise.

Dashboards:

$ CREATE TABLE dashboards ( id serial PRIMARY KEY , name text NOT NULL UNIQUE , created timestamptz NOT NULL DEFAULT now ( ) , visited int8 NOT NULL DEFAULT 0 ) ; $ INSERT INTO dashboards ( name , created , visited ) WITH x AS ( SELECT now ( ) - '1 year' :: INTERVAL * random ( ) AS c FROM generate_series ( 1 , 1200 ) ORDER BY c ) SELECT 'Dashboard #' || ROW_NUMBER ( ) OVER ( ORDER BY c ) , c , random ( ) * 10000 FROM x;

This gives me data like these:

SELECT * FROM dashboards ORDER BY id DESC LIMIT 20 ; id | name | created | visited ------+-----------------+-------------------------------+--------- 1200 | Dashboard # 1200 | 2014 - 01 - 29 09: 16 : 01.894632 + 01 | 3264 1199 | Dashboard # 1199 | 2014 - 01 - 29 05: 27 : 19.501032 + 01 | 5889 1198 | Dashboard # 1198 | 2014 - 01 - 29 02: 52 : 03.939432 + 01 | 595 1197 | Dashboard # 1197 | 2014 - 01 - 28 11 : 22 : 32.134632 + 01 | 8283 1196 | Dashboard # 1196 | 2014 - 01 - 28 07: 43 : 48.406632 + 01 | 5713 1195 | Dashboard # 1195 | 2014 - 01 - 28 03: 16 : 50.537832 + 01 | 4082 1194 | Dashboard # 1194 | 2014 - 01 - 27 23 : 22 : 24.877032 + 01 | 7836 1193 | Dashboard # 1193 | 2014 - 01 - 27 11 : 16 : 10.765032 + 01 | 9849 1192 | Dashboard # 1192 | 2014 - 01 - 27 10 : 34 : 12.032232 + 01 | 7207 1191 | Dashboard # 1191 | 2014 - 01 - 27 10 : 34 : 10.304232 + 01 | 2657 1190 | Dashboard # 1190 | 2014 - 01 - 27 03: 49 : 03.565032 + 01 | 1996 1189 | Dashboard # 1189 | 2014 - 01 - 26 08:06: 21.257832 + 01 | 9084 1188 | Dashboard # 1188 | 2014 - 01 - 26 02: 46 : 08.921832 + 01 | 9549 1187 | Dashboard # 1187 | 2014 - 01 - 25 21 : 19 : 41.869032 + 01 | 7073 1186 | Dashboard # 1186 | 2014 - 01 - 25 15 :07: 59.264232 + 01 | 9089 1185 | Dashboard # 1185 | 2014 - 01 - 25 14 : 12 : 43.750632 + 01 | 3228 1184 | Dashboard # 1184 | 2014 - 01 - 25 10 :03: 10.198632 + 01 | 9176 1183 | Dashboard # 1183 | 2014 - 01 - 25 07: 31 : 07.395432 + 01 | 5257 1182 | Dashboard # 1182 | 2014 - 01 - 25 01: 29 : 44.710632 + 01 | 633 1181 | Dashboard # 1181 | 2014 - 01 - 24 21 : 51 : 53.773032 + 01 | 9663 ( 20 ROWS )

Now, for the time_on_site_logs table:

$ CREATE TABLE time_on_site_logs ( id serial PRIMARY KEY , dashboard_id int4 NOT NULL REFERENCES dashboards ( id ) , user_id int4 NOT NULL , session_started timestamptz NOT NULL , session_time INTERVAL NOT NULL ) ;

Schema is ready. Now for the data. Since I have no idea about data distribution or anything like this, let's just create 1700 random dashboard/user pairs, and repeat the data 14000000/1700 times, and call it a day:

INSERT INTO time_on_site_logs ( dashboard_id , user_id , session_started , session_time ) WITH x AS ( SELECT d , u FROM generate_series ( 1 , 1200 ) AS d , generate_series ( 1 , 10 ) AS u ORDER BY random ( ) LIMIT 1700 ) SELECT d , u , now ( ) - random ( ) * '2 years' :: INTERVAL , random ( ) * '30 minutes' :: INTERVAL FROM x , generate_series ( 1 , 8236 ) AS q;

Stats afterwards:

WITH x AS ( SELECT dashboard_id , user_id , COUNT ( * ) FROM time_on_site_logs GROUP BY dashboard_id , user_id ) SELECT SUM ( COUNT ) AS all_rows , COUNT ( * ) AS distinct_pairs FROM x; all_rows | distinct_pairs ----------+---------------- 14001200 | 1700 ( 1 ROW )

Some of the comments (very ill-informed, in my opinion) on Reddit suggested that the speed of the queries (as shown in original blogposts) depends on indexes, and the fact that the query plan showed seq scans means that there are no indexes.

I don't believe this explanation, so let me add some additional indexes:

$ CREATE INDEX i1 ON time_on_site_logs ( dashboard_id ) ; CREATE INDEX $ CREATE INDEX i2 ON time_on_site_logs ( user_id ) ; CREATE INDEX $ CREATE INDEX i3 ON time_on_site_logs ( dashboard_id , user_id ) ; CREATE INDEX $ CREATE INDEX i4 ON time_on_site_logs ( user_id , dashboard_id ) ; CREATE INDEX

Now tables look like this:

$ \d dashboards TABLE "public.dashboards" COLUMN | TYPE | Modifiers ---------+--------------------------+--------------------------------------------------------- id | INTEGER | NOT NULL DEFAULT NEXTVAL ( 'dashboards_id_seq' ::regclass ) name | text | NOT NULL created | TIMESTAMP WITH TIME zone | NOT NULL DEFAULT now ( ) visited | BIGINT | NOT NULL DEFAULT 0 Indexes: "dashboards_pkey" PRIMARY KEY , btree ( id ) "dashboards_name_key" UNIQUE CONSTRAINT , btree ( name ) Referenced BY : TABLE "time_on_site_logs" CONSTRAINT "time_on_site_logs_dashboard_id_fkey" FOREIGN KEY ( dashboard_id ) REFERENCES dashboards ( id ) $ \d time_on_site_logs TABLE "public.time_on_site_logs" COLUMN | TYPE | Modifiers -----------------+--------------------------+---------------------------------------------------------------- id | INTEGER | NOT NULL DEFAULT NEXTVAL ( 'time_on_site_logs_id_seq' ::regclass ) dashboard_id | INTEGER | NOT NULL user_id | INTEGER | NOT NULL session_started | TIMESTAMP WITH TIME zone | NOT NULL session_time | INTERVAL | NOT NULL Indexes: "time_on_site_logs_pkey" PRIMARY KEY , btree ( id ) "i1" btree ( dashboard_id ) "i2" btree ( user_id ) "i3" btree ( dashboard_id , user_id ) "i4" btree ( user_id , dashboard_id ) Foreign - KEY constraints: "time_on_site_logs_dashboard_id_fkey" FOREIGN KEY ( dashboard_id ) REFERENCES dashboards ( id )

That should handle all indexing needs for now 🙂

Now – Periscope guys wrote that they tested it on Amazon EC2 instance. I'm cheap, so I'm testing it on my desktop. So the numbers will not be directly comparable. But the ratios should be.

So, let's test the queries.

First, the naive approach:

SELECT dashboards . name , COUNT ( DISTINCT time_on_site_logs . user_id ) FROM time_on_site_logs JOIN dashboards ON time_on_site_logs . dashboard_id = dashboards . id GROUP BY name ORDER BY COUNT DESC ;

I ran it 3 times, and got best result: 492 seconds.

Their (Periscope's) second approach was:

SELECT dashboards . name , log_counts . ct FROM dashboards JOIN ( SELECT dashboard_id , COUNT ( DISTINCT user_id ) AS ct FROM time_on_site_logs GROUP BY dashboard_id ) AS log_counts ON log_counts . dashboard_id = dashboards . id ORDER BY log_counts . ct DESC ;

Again, best of three runs was 23.1 second.

Third query by Periscope:

SELECT dashboards . name , log_counts . ct FROM dashboards JOIN ( SELECT distinct_logs . dashboard_id , COUNT ( 1 ) AS ct FROM ( SELECT DISTINCT dashboard_id , user_id FROM time_on_site_logs ) AS distinct_logs GROUP BY distinct_logs . dashboard_id ) AS log_counts ON log_counts . dashboard_id = dashboards . id ORDER BY log_counts . ct DESC

Yielded best time of slighly less than 4.6s.

So far, we have:

Query Periscope test depesz test Naive 348s 492s Aggregate, Then Join 10.6s 23.1s First, Reduce The Data Set 7.13s 4.6s

This wraps me re-testing what Periscope wrote.

Couple of comments:

to the person that said (on Reddit) that PostgreSQL cannot use Hash for count(distinct …) – well, true. But rewriting the query so that it will use hashes is trivial, as shown above

to the people saying that “Yet again they say the query plans include table scans, which imply either no indexes or an awful index design" – well, look at this – I have indexes on everything, yet PostgreSQL doesn't use them in such queries. You can of course provide information how do you think indexes can help us with these queries. With sample schema, data and queries.

But – all things said – the numbers are, in my opinion, still too large.

I mean – if I was to get how many rows there are in time_on_site_logs for every (dashboard/user) – sure. It can take a while, as it has to scan all of the table. But just getting distinct count? Especially so few rows (946 rows of output)? That's got to be optimizable.

Luckily, it is:

I can define a simple function which gives me list of distinct (dashboard_id, user_id):

CREATE OR REPLACE FUNCTION get_list_of_unique_pairs ( OUT dashboard_id INT4 , OUT user_id INT4 ) RETURNS setof record AS $$ DECLARE v_dash ALIAS FOR dashboard_id; v_user ALIAS FOR user_id; BEGIN SELECT l . dashboard_id , l . user_id INTO v_dash , v_user FROM time_on_site_logs l ORDER BY l . dashboard_id , l . user_id LIMIT 1 ; LOOP EXIT WHEN NOT FOUND; RETURN NEXT ; SELECT l . dashboard_id , l . user_id INTO v_dash , v_user FROM time_on_site_logs l WHERE ( l . dashboard_id , l . user_id ) > ( v_dash , v_user ) ORDER BY l . dashboard_id , l . user_id LIMIT 1 ; END LOOP; RETURN ; END ; $$ LANGUAGE plpgsql;

With this function in place, I can just:

WITH x AS ( SELECT dashboard_id , COUNT ( * ) FROM get_list_of_unique_pairs ( ) GROUP BY dashboard_id ) SELECT d . name , x . count FROM x JOIN dashboards d ON x . dashboard_id = d . id ORDER BY x . count DESC ;

This query runs faster than 4.6s. It runs faster than 2.13s. It runs faster than 1s. It runs in 39 ms (0.039s).

Of course – you can say that using functions is cheating. Well, it can be done without functions, but the query is more complicated then:

WITH recursive distinct_pairs AS ( ( SELECT l AS rl FROM time_on_site_logs l ORDER BY l . dashboard_id , l . user_id LIMIT 1 ) UNION ALL SELECT ( SELECT l FROM time_on_site_logs l WHERE ( l . dashboard_id , l . user_id ) > ( ( p . rl ) . dashboard_id , ( p . rl ) . user_id ) ORDER BY l . dashboard_id , l . user_id LIMIT 1 ) FROM distinct_pairs p WHERE ( p . rl ) . id IS NOT NULL ) , unpacked_counts AS ( SELECT ( rl ) . dashboard_id , COUNT ( * ) FROM distinct_pairs WHERE ( rl ) . dashboard_id IS NOT NULL GROUP BY ( rl ) . dashboard_id ) SELECT d . name , uc . count FROM dashboards d JOIN unpacked_counts uc ON d . id = uc . dashboard_id ORDER BY uc . count DESC ;

This query (which I didn't write myself, RhodiumToad helped me a lot), runs in just a bit below 23ms.

Final comments:

I would really expect more from a company that writes: “We make … tool that makes SQL data analysis really fast." or “building a much faster way to do SQL data analysis." – sorry, but it can be done faster. Way faster.

The queries that I wrote are based on the fact that there is little (comparatively) distinct (dashboard_id, user_id) values. If there were more (for example – 50% of row count of time_on_site_logs rows) – it wouldn't work nicely.

Doing this kind of analysis should be done using rollup tables, which gather information in some side tables, and then you just query these side tables to get results. Otherwise – it might be fast for 14M rows, but for 14B rows, it will be slow again

The whole process, as shown in Periscope posts can be, relatively simply, parallelized – even in PostgreSQL

and finally – that was fun.