This question (and its variants) shows up quite often on #postgresql on IRC. People get sequential scans, and worry that they are slow and bad.

So, I hope that this blog post will shed some light on why indexes are chosen to be used, or not.

Before I go deep into explain plans, queries and statistics, let's ask: what is an index? Is it magic? Is it free?

An index is a data structure, created alongside a table, that is used to speed up searches or sorts.

It's not free – an index has to be kept up to date, so every INSERT/UPDATE/DELETE operation gets a bit more expensive.
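You can see this cost for yourself with a quick psql sketch (table names here are made up, and actual timings depend entirely on your hardware):

```sql
-- \timing makes psql print how long each statement takes.
\timing on
CREATE TABLE no_idx ( x integer );
CREATE TABLE with_idx ( x integer );
CREATE INDEX with_idx_x ON with_idx ( x );
-- The second INSERT has to maintain the index as well, so it takes longer:
INSERT INTO no_idx   SELECT generate_series( 1, 1000000 );
INSERT INTO with_idx SELECT generate_series( 1, 1000000 );
```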

So, how does PostgreSQL use an index, when it chooses to?

Let's assume we have some table, with a column "whatever" (integer, not unique), and an index on it.

If PostgreSQL chooses to use it (let's assume the query is: SELECT * FROM table WHERE whatever = 123), it will:

find the record in the index

based on information from the index, find the appropriate page and record in the table (heap data)

It looks great, but we have to understand that "find the record in the index" is a relatively complicated (although usually fast) operation, which can easily involve multiple random page reads.

What do I mean by random page reads? When we go down to the disk level, we basically have two different kinds of access:

sequential – the next block of data is read right after the previous one, so no disk head movement is needed

random – the next block is read from a random place on the disk, which requires head movement, and is thus slower (think: average seek time in a disk's technical specification)

So, even in the best case, an index scan requires (from disk):

seek to where the index is

fetch a page from the index

seek to where the table is

fetch a page from the table

Of course the disk cache and/or shared_buffers help with this, but it's still "a lot of" operations.

On the other hand – the best case for a sequential scan is simpler:

seek to where the table is

fetch a page from the table

Much nicer.

This could mean that if a table is smaller than 2 pages (a page is 8192 bytes, unless you changed it at compile time), PostgreSQL shouldn't use an index at all! In reality it's a bit more complicated, so the easiest way to tell is simply to test it:

```
$ CREATE TABLE test ( whatever INTEGER );
CREATE TABLE

$ CREATE INDEX testi ON test ( whatever );
CREATE INDEX

$ INSERT INTO test ( whatever ) SELECT i FROM generate_series( 1, 10 ) i;
INSERT 0 10

$ SELECT pg_relation_size( 'test' );
 pg_relation_size
------------------
             8192
(1 row)

$ ANALYZE test;
ANALYZE

$ EXPLAIN SELECT * FROM test WHERE whatever = 5;
                     QUERY PLAN
----------------------------------------------------
 Seq Scan on test  (cost=0.00..1.12 rows=1 width=4)
   Filter: (whatever = 5)
(2 rows)
```

As you can see, I ran ANALYZE manually (to avoid having to wait for autovacuum). ANALYZE is very important, and I will get back to it a bit later.

As expected – for such a small table – the index was not used.

After some tests with various row counts, I found the exact threshold at which this particular Pg instance switches to an index scan:

```
$ TRUNCATE test;
TRUNCATE TABLE

$ INSERT INTO test ( whatever ) SELECT i FROM generate_series( 1, 501 ) i;
INSERT 0 501

$ ANALYZE test;
ANALYZE

$ EXPLAIN SELECT * FROM test WHERE whatever = 5;
                     QUERY PLAN
----------------------------------------------------
 Seq Scan on test  (cost=0.00..8.26 rows=1 width=4)
   Filter: (whatever = 5)
(2 rows)

$ SELECT pg_relation_size( 'test' );
 pg_relation_size
------------------
            16384
(1 row)

$ INSERT INTO test ( whatever ) VALUES ( 502 );
INSERT 0 1

$ ANALYZE test;
ANALYZE

$ SELECT pg_relation_size( 'test' );
 pg_relation_size
------------------
            16384
(1 row)

$ EXPLAIN SELECT * FROM test WHERE whatever = 5;
                            QUERY PLAN
------------------------------------------------------------------
 Index Scan using testi on test  (cost=0.00..8.27 rows=1 width=4)
   Index Cond: (whatever = 5)
(2 rows)
```

That was the first situation in which PostgreSQL can sensibly choose not to use an index.

Next situation. Remember that random reads from disk are much more expensive than sequential ones (due to seeks). This means that if we are fetching more than some percentage of the table – using the index is not sensible.

Let's test this threshold. For this I'll need a much larger test table:

```
$ TRUNCATE test;
TRUNCATE TABLE

$ INSERT INTO test ( whatever ) SELECT generate_series( 1, 10000000 );
INSERT 0 10000000

$ ANALYZE test;
ANALYZE

$ SELECT pg_relation_size( 'test' );
 pg_relation_size
------------------
        321257472
(1 row)
```

With this table, and this simple one-liner:

```
perl -e 'for (1..100) { printf "%3u : ", $_; my $thr = 10000000 / 100 * $_; system(qq{psql -qAt -c "explain analyze select * from test where whatever <= $thr" | head -n 1}) }'
```

I can test when the index is being used, and when it switches to a seq scan.

And it seems that the threshold in my case was ~ 55% of the table:

```
$ EXPLAIN ANALYZE SELECT * FROM test WHERE whatever < 5400000;
                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Index Scan using testi on test  (cost=0.00..163836.29 rows=5424809 width=4) (actual time=0.100..9531.022 rows=5399999 loops=1)
   Index Cond: (whatever < 5400000)
 Total runtime: 17338.177 ms
(3 rows)

$ EXPLAIN ANALYZE SELECT * FROM test WHERE whatever < 5500000;
                                                      QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..164217.00 rows=5525712 width=4) (actual time=0.056..10039.352 rows=5499999 loops=1)
   Filter: (whatever < 5500000)
 Total runtime: 17824.766 ms
(3 rows)
```

Please note how close the times are. With this in mind – let's force an index scan and test at 55%:
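(A common way to force this for testing – and my assumption about what was done here, given the huge cost estimates in later seq scan plans – is to disable sequential scans for the session:)

```sql
-- For testing only: this doesn't literally forbid seq scans, it just makes
-- the planner price them absurdly high, so it avoids them when possible.
SET enable_seqscan = off;
-- ... run the EXPLAIN ANALYZE here ...
-- and remember to turn it back on afterwards:
SET enable_seqscan = on;
```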

```
$ EXPLAIN ANALYZE SELECT * FROM test WHERE whatever < 5500000;
                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Index Scan using testi on test  (cost=0.00..166882.09 rows=5525712 width=4) (actual time=0.145..9589.029 rows=5499999 loops=1)
   Index Cond: (whatever < 5500000)
 Total runtime: 17553.429 ms
(3 rows)
```

Nice. The time for the index scan is actually a bit faster, but not by much. I will explain how to fix this small "problem" (i.e. make the switch to seq scan happen a bit later, perhaps at 60%) later in the post.

So, right now we know that an index will not be used if:

the table is too small

we fetch too many records (as a percentage of the table – getting 7 million rows out of 10 million rows is faster with a seq scan, but getting the same 7 million rows from a 1-billion-row table should use an index).

Another situation is that not all operations are indexable. As a very simple case – the LIKE operator, with '%' at the beginning of the pattern, is not indexable (you can work around it using reverse(), but with LIKE '%whatever%' – you still can't use a plain index).
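On sufficiently new versions there is a way out: trigram indexes from the pg_trgm contrib module. A sketch (some_table and text_field are hypothetical; pg_trgm's LIKE support and CREATE EXTENSION require PostgreSQL 9.1+, on older versions you'd load the module's SQL script instead):

```sql
-- Trigram indexes can support LIKE '%whatever%' style searches:
CREATE EXTENSION pg_trgm;
CREATE INDEX some_table_trgm ON some_table USING gin ( text_field gin_trgm_ops );
-- The planner can now use the index even for patterns with leading wildcards:
EXPLAIN SELECT * FROM some_table WHERE text_field LIKE '%whatever%';
```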

So, when you're using special data types, or non-trivial operators – always check whether it's possible to index them at all.

The last situation is simple to explain, but not always simple to understand.

Let's assume that you have:

```
SELECT * FROM test WHERE whatever + 2 < 5;
```

That's the same as:

```
SELECT * FROM test WHERE whatever < 3;
```

Right? No!

```
$ EXPLAIN SELECT * FROM test WHERE whatever + 2 < 5;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Seq Scan on test  (cost=10000000000.00..10000189217.20 rows=3333360 width=4)
   Filter: ((whatever + 2) < 5)
(2 rows)

$ EXPLAIN SELECT * FROM test WHERE whatever < 3;
                              QUERY PLAN
----------------------------------------------------------------------
 Index Scan using testi on test  (cost=0.00..37.63 rows=1000 width=4)
   Index Cond: (whatever < 3)
(2 rows)
```

Why is that so?

Operations are done using functions. So basically "whatever + 2" is the same as "int4pl(whatever, 2)" – i.e. a call to the function int4pl, with 2 arguments.

PostgreSQL doesn't know what a given function does. Knowing that the name of the function is int4pl doesn't mean anything to it, so it can't rewrite the query into the latter form, and it has to assume that int4pl() can return anything. Which means – an index on (column) can't be used to search for functionname(column)!

This means that if you want your index to be used – you have to keep the indexed column "alone" on its side of the comparison operator.

While the "+ 2" example might sound trivial, I have quite often seen queries like:

```
SELECT * FROM TABLE WHERE timestamp_column + '5 minutes'::interval < now();
```

Which has exactly the same problem!
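The fix follows the rule above – move the arithmetic to the other side, so the column stays alone (a sketch; "TABLE" and "timestamp_column" are placeholders, as in the query above):

```sql
-- Equivalent condition, but now an index on timestamp_column can be used:
SELECT * FROM TABLE WHERE timestamp_column < now() - '5 minutes'::interval;
```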

Now, while we're at functions – let's assume you want a case-insensitive search. This is done using:

```
SELECT * FROM TABLE WHERE lower( text_field ) = lower( 'some_value' );
```

Which is all nice, but an index on text_field cannot be used for it! Luckily there is a solution: a functional index:

```
CREATE INDEX some_name ON TABLE ( lower( text_field ) );
```

As you can see, this creates an index not on the column, but on the result of a function call on the value from the column. This is very cool. But does it always help?

Let's assume you want to find rows from a given week. And by "given" I mean: you pass (as a parameter to the query) some timestamp, and you want all records that happened in the same week as the timestamp you gave. Example query:

```
$ EXPLAIN ANALYZE SELECT * FROM test WHERE date_trunc( 'week', whatever ) = date_trunc( 'week', '2010-02-04 12:34:56+02'::timestamptz );
                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=10000000000.00..10000002193.00 rows=500 width=8) (actual time=0.727..208.189 rows=425 loops=1)
   Filter: (date_trunc('week'::text, whatever) = date_trunc('week'::text, '2010-02-04 11:34:56+01'::timestamp with time zone))
 Total runtime: 208.888 ms
(3 rows)
```

Table now looks like this:

```
$ \d test
           Table "public.test"
  Column  |           Type           | Modifiers
----------+--------------------------+-----------
 whatever | timestamp with time zone |
Indexes:
    "test1" btree (whatever)

$ SELECT count(*), min(whatever), max(whatever) FROM test;
 count  |              min              |              max
--------+-------------------------------+-------------------------------
 100000 | 2005-09-09 09:48:57.190155+02 | 2010-09-09 08:56:04.841355+02
(1 row)
```

Now let me ask you – what index should you create to be able to use it?

If you said "it's trivial: an index on date_trunc('week', whatever)" – well, you failed:

```
$ CREATE INDEX w ON test ( date_trunc( 'week', whatever ) );
ERROR:  functions in index expression must be marked IMMUTABLE
```

What is this? Well, let's take an extreme case – can you sensibly build an index on random()? No. It changes all the time. And to be able to build an index on something – we need to be sure that the value of this something doesn't change on its own. That's why you can't index on random().
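You can check this yourself – trying it should fail with the same class of error (a sketch; the exact message can vary between versions):

```sql
CREATE INDEX r ON test ( random() );
-- ERROR:  functions in index expression must be marked IMMUTABLE
```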

But why can't you index on date_trunc('week', whatever)? It cannot change on its own (without a change in the table). Right? Wrong!

```
$ SELECT date_trunc( 'week', '2010-09-06 08:56:04.841355+02'::timestamptz );
       date_trunc
------------------------
 2010-09-06 00:00:00+02
(1 row)

$ SET timezone = 'America/Los_Angeles';
SET

$ SELECT date_trunc( 'week', '2010-09-06 08:56:04.841355+02'::timestamptz );
       date_trunc
------------------------
 2010-08-30 00:00:00-07
(1 row)
```

As you can see – the value of date_trunc(), with the same arguments, depends on some external data (namely: the client's time zone).

Some people in this case create a wrapper function, which calls date_trunc(), but they mark the wrapper function as IMMUTABLE.

It will work, but it's a really nice, and hard to debug, way to shoot yourself in the foot, in case your application later gets changed to accommodate various time zones.
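For illustration only, a sketch of that hack (the function name is made up – do not copy this pattern):

```sql
-- DON'T do this: the IMMUTABLE marking is a lie, because date_trunc() on
-- timestamptz depends on the session's timezone setting. Index entries will
-- silently reflect whatever timezone was active when each row was indexed.
CREATE FUNCTION week_immutable( timestamptz ) RETURNS timestamptz AS $$
    SELECT date_trunc( 'week', $1 )
$$ LANGUAGE sql IMMUTABLE;
```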

What is the correct way to solve it? Unfortunately, it's a bit more convoluted, but it works:

```
$ CREATE INDEX w ON test ( date_trunc( 'week', whatever AT TIME ZONE 'UTC' ) );
CREATE INDEX
```

Instead of UTC, you can use any time zone, as long as it's a single, hardcoded time zone.

The problem with it is that you have to use exactly the same expression later on in your queries:

```
$ EXPLAIN ANALYZE SELECT * FROM test WHERE date_trunc( 'week', whatever ) = date_trunc( 'week', '2010-02-04 12:34:56+02'::timestamptz );
                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..2193.00 rows=500 width=8) (actual time=0.727..210.051 rows=425 loops=1)
   Filter: (date_trunc('week'::text, whatever) = date_trunc('week'::text, '2010-02-04 11:34:56+01'::timestamp with time zone))
 Total runtime: 210.890 ms
(3 rows)

$ EXPLAIN ANALYZE SELECT * FROM test WHERE date_trunc( 'week', whatever AT TIME ZONE 'UTC' ) = date_trunc( 'week', '2010-02-04 12:34:56+02'::timestamptz );
                                                                           QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=12.14..486.23 rows=500 width=8) (actual time=1.060..3.214 rows=426 loops=1)
   Recheck Cond: (date_trunc('week'::text, timezone('UTC'::text, whatever)) = date_trunc('week'::text, '2010-02-04 11:34:56+01'::timestamp with time zone))
   ->  Bitmap Index Scan on w  (cost=0.00..12.02 rows=500 width=0) (actual time=0.915..0.915 rows=426 loops=1)
         Index Cond: (date_trunc('week'::text, timezone('UTC'::text, whatever)) = date_trunc('week'::text, '2010-02-04 11:34:56+01'::timestamp with time zone))
 Total runtime: 4.289 ms
(5 rows)
```

Luckily you can wrap it in a simple SQL function:

```
$ CREATE FUNCTION start_of_utc_week( timestamptz ) RETURNS timestamp AS $$
    SELECT date_trunc( 'week', $1 AT TIME ZONE 'UTC' )
$$ LANGUAGE sql IMMUTABLE;
CREATE FUNCTION

$ CREATE INDEX w ON test ( start_of_utc_week( whatever ) );
CREATE INDEX

$ EXPLAIN ANALYZE SELECT * FROM test WHERE start_of_utc_week( whatever ) = start_of_utc_week( '2010-02-04 12:34:56+02'::timestamptz );
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=12.14..484.98 rows=500 width=8) (actual time=0.561..2.699 rows=426 loops=1)
   Recheck Cond: (date_trunc('week'::text, timezone('UTC'::text, whatever)) = '2010-02-01 00:00:00'::timestamp without time zone)
   ->  Bitmap Index Scan on w  (cost=0.00..12.02 rows=500 width=0) (actual time=0.409..0.409 rows=426 loops=1)
         Index Cond: (date_trunc('week'::text, timezone('UTC'::text, whatever)) = '2010-02-01 00:00:00'::timestamp without time zone)
 Total runtime: 3.770 ms
(5 rows)
```

(Of course I didn't have to use the start_of_utc_week() call on the right side of the comparison, but I chose to, to make it simpler.)

At the beginning of the post I wrote that ANALYZE is important. Why is that? Let me show you a simple example. I turned off autovacuum, restarted PostgreSQL, and then ran this:

```
$ CREATE TABLE test ( whatever int4 );
CREATE TABLE

$ CREATE INDEX test1 ON test ( whatever );
CREATE INDEX

$ INSERT INTO test ( whatever ) SELECT generate_series( 1, 100000 );
INSERT 0 100000
```

With this table, and no ANALYZE yet, I ran three explains:

```
$ EXPLAIN SELECT * FROM test WHERE whatever > 1000000;
                                QUERY PLAN
--------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=539.92..1325.92 rows=31440 width=4)
   Recheck Cond: (whatever > 1000000)
   ->  Bitmap Index Scan on test1  (cost=0.00..532.06 rows=31440 width=0)
         Index Cond: (whatever > 1000000)
(4 rows)

$ EXPLAIN SELECT * FROM test WHERE whatever < 0;
                                QUERY PLAN
--------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=539.92..1325.92 rows=31440 width=4)
   Recheck Cond: (whatever < 0)
   ->  Bitmap Index Scan on test1  (cost=0.00..532.06 rows=31440 width=0)
         Index Cond: (whatever < 0)
(4 rows)

$ EXPLAIN SELECT * FROM test WHERE whatever < 123;
                                QUERY PLAN
--------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=539.92..1325.92 rows=31440 width=4)
   Recheck Cond: (whatever < 123)
   ->  Bitmap Index Scan on test1  (cost=0.00..532.06 rows=31440 width=0)
         Index Cond: (whatever < 123)
(4 rows)
```

Please note that the rows= values are all the same, and they don't make any sense!

Now, let's analyze the table, and repeat these explains:

```
$ ANALYZE test;
ANALYZE

$ EXPLAIN SELECT * FROM test WHERE whatever > 1000000;
                             QUERY PLAN
-------------------------------------------------------------------
 Index Scan using test1 on test  (cost=0.00..8.43 rows=10 width=4)
   Index Cond: (whatever > 1000000)
(2 rows)

$ EXPLAIN SELECT * FROM test WHERE whatever < 0;
                             QUERY PLAN
-------------------------------------------------------------------
 Index Scan using test1 on test  (cost=0.00..8.43 rows=10 width=4)
   Index Cond: (whatever < 0)
(2 rows)

$ EXPLAIN SELECT * FROM test WHERE whatever < 123;
                              QUERY PLAN
---------------------------------------------------------------------
 Index Scan using test1 on test  (cost=0.00..10.36 rows=120 width=4)
   Index Cond: (whatever < 123)
(2 rows)
```

Much better. Why is that? PostgreSQL keeps statistics about the values in your tables. These statistics are updated by the ANALYZE command (either run by hand, or via autovacuum). If these statistics are bad or missing, you can get bad plans. You just saw what happens when statistics are missing. But what happens with bad stats?

```
$ UPDATE test SET whatever = -1 * whatever;
UPDATE 100000
```

And now, let's test the query that previously returned ~120 rows, but would now return the whole table (100% of rows). It should use a sequential scan, but:

```
$ EXPLAIN SELECT * FROM test WHERE whatever < 123;
                              QUERY PLAN
---------------------------------------------------------------------
 Index Scan using test1 on test  (cost=0.00..12.46 rows=239 width=4)
   Index Cond: (whatever < 123)
(2 rows)
```

Oooops.

Of course, it's also true "the other way around". A query that would previously have returned 90% of the rows (whatever > 10000) will now not return anything, so it should use the index, but:

```
$ EXPLAIN ANALYZE SELECT * FROM test WHERE whatever > 10000;
                                                QUERY PLAN
--------------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..3281.83 rows=180104 width=4) (actual time=38.424..38.424 rows=0 loops=1)
   Filter: (whatever > 10000)
 Total runtime: 38.468 ms
(3 rows)
```

Ooops. Of course with such a small table it's not a really big difference, but usually your tables will be larger than 100k rows with a single column of type int4 🙂

Also – please remember that ANALYZE stores only a certain number of statistics, and you can still get bad estimates. If that happens – consider raising default_statistics_target, or just run ALTER TABLE … ALTER COLUMN … SET STATISTICS with some higher value.
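For example, a sketch using the test table from above (1000 is purely an illustrative value, not a recommendation):

```sql
-- Collect more detailed statistics for one problematic column:
ALTER TABLE test ALTER COLUMN whatever SET STATISTICS 1000;
-- The new target takes effect only after the table is re-analyzed:
ANALYZE test;
```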

One last note – as I showed earlier, PostgreSQL doesn't always find the perfect threshold for switching from one plan to another. You can influence this by modifying settings like cpu_*_cost, effective_cache_size and random_page_cost. By default random_page_cost is set to 4 – which means that fetching a random page is treated as 4 times as expensive as fetching a page sequentially. This might seem too harsh, but it's actually a pretty good approximation. If you do have really cool storage (think: lots of disks in RAID10, or SSD storage), you can lower random_page_cost to better suit your hardware.
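A sketch of how to experiment with this (the value 2 is purely illustrative):

```sql
-- Change the cost for the current session only, then re-check the plan:
SET random_page_cost = 2;
EXPLAIN SELECT * FROM test WHERE whatever < 5500000;
-- To make a change permanent, set it in postgresql.conf and reload.
```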

As for the other parameters – Josh Berkus once said that we should lower all cpu_*_cost parameters by 80%, due to the fact that new CPUs are much faster. I did test it, and haven't seen improvements, but your mileage may vary.

effective_cache_size – this should simply be set to the real value: how much of your data the operating system keeps in its disk cache. For example, on Linux you can use the program "free":

```
=$ free
             total       used       free     shared    buffers     cached
Mem:       3088560     745012    2343548          0     121588     373008
-/+ buffers/cache:     250416    2838144
Swap:      1952760          0    1952760
```


And here we see that right now my laptop is using ~370MB for disk cache. Remember to always check the value with all the necessary programs running (like httpd, monitoring, or any other tools you run on your DB server), and after some usage of the database/system. This is important, as just after a reboot the value will (usually) be much lower than after a week of normal work.

I hope this post clears up some confusion, and helps others diagnose their PostgreSQL installations.