This document showcases many of the latest developments in PostgreSQL 9.0, compared to the last major release – PostgreSQL 8.4. There are more than 200 improvements in this release, so this wiki page covers many of the more important changes in detail. The full list of changes is itemised in Release Notes.

The two features you can't ignore

Hot Standby and Streaming Replication are the two new features that mark Version 9.0 as a landmark in PostgreSQL's development and the motivation for allocating a full version number to this release – 9.0 (instead of 8.5). Combined, these two features add built-in, "binary replication" to PostgreSQL.

There is further documention on how to use Hot Standby on its page, and an extensive Binary Replication Tutorial is in progress.

Hot Standby

This feature allows users to create a 'Standby' database – that is, a second database instance (normally on a separate server) replaying the primary's binary log, while making that standby server available for read-only queries. It is similar to the standby database features of proprietary databases, such as Oracle's Active DataGuard. Queries execute normally while the standby database continuously replays the stream of binary modifications coming from the primary database. Visibility of new data changes follows the MVCC model, so that new changes do not lock out queries.

Enabling Hot Standby is a simple process. On the primary database server add this to the postgresql.conf file:

wal_level = 'hot_standby' # Adds the required data in the WAL logs

And on the standby server, add this to its postgresql.conf file:

hot_standby = on

Hot Standby works well with the new Streaming Replication feature, though it can also be used with file-based log shipping as available in previous versions and also to create standalone copies that receive no updates at all.

In some cases, changes from the primary database can conflict with queries on the standby. A simple example is when DROP TABLE executes on the master, but the standby is still executing a query against that table. The standby cannot process that DROP statement without first canceling the running query, and the longer it delays doing that the further behind current replication the standby will become. The two options here are to pause the replay or cancel the read-only queries and move forward.

A variety of parameters allow adjusting the conflict resolution mechanism used.

max_standby_archive_delay = 30s # -1= always wait, 0= never wait, else wait for this max_standby_streaming_delay = 30s # -1= always wait, 0= never wait, else wait for this

The two max_standby_{archive,streaming}_delay settings determine the behaviour of the standby database when conflicts between replay and read-only queries occur. In this situation, the standby database will wait at most until it's lagging behind the primary database by that amount before canceling the conflicting read-only queries. The two parameters allow different lag time tolerance levels for files appearing via regular file archive shipping vs. ones that are streamed via the new 9.0 feature for streaming replication.

On the master it is also possible to avoid conflicts by increasing this parameter

vacuum_defer_cleanup_age = 10000 # Adjust updwards slowly to reduce conflicts

This feature is rich and complex, so it's advisable to read the documentation before planning your server deployments.

Streaming Replication

Complementing Hot Standby, Streaming Replication is the second half of the "great leap forward" for PostgreSQL. While there are several third-party replication solutions available for PostgreSQL that meet a range of specific needs, this new release brings a simple, sturdy and integrated version that will probably be used as a default in most High Availability installations using PostgreSQL.

This time, the goal is improving the archiving mechanism to make it as continuous as possible and to not rely on log file shipping. Standby servers can now connect to the master/primary and get sent, whenever they want, what they are missing from the Write Ahead Log, not in terms of complete files ('wal segments'), but in terms of records in the WAL (you can think of them as fragments of these files).

Streaming Replication is an asynchronous mechanism; the standby server lags behind the master. But unlike other replication methods, this lag is very short, and can be as little as a single transaction, depending on network speed, database activity, and Hot Standby settings. Also, the load on the master for each slave is minimal, allowing a single master to support dozens of slaves.

Primary and standby databases are identical at the binary level (well, almost; but don't worry if your datafiles don't have the same checksum).

For Streaming Replication, wal_level should be 'archive' or 'hot standby'.

postgresql.conf , Primary:

max_wal_senders = 5 # Maximum 'wal_senders', processes responsible for managing a connection with a standby server wal_keep_segments # How many WAL segments (=files) should be kept on the primary, whatever may happen (you won't have to copy them manually on the standby if the standby gets too far behind)

On the standby server:

recovery.conf , Standby:

standby_mode = 'on' primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass' # connection string to reach the primary database

postgresql.conf , Secondary:

wal_level # same value as on the primary (you'll need this after a failover, to build a new standby) hot_standby=on/off # Do you want to use Hot Standby at the same time ?

pg_hba.conf file:

There must be an entry here for the replication connections. The fake database is 'replication', the designated user should be superuser. Be careful not to give broad access to this account: a lot of privileged data can be extracted from WAL records.

pg_hba.conf , Primary:

host replication foo 192.168.1.100/32 md5

As for Hot Standby, this feature is rich and complex. It's advised to read the documentation. And to perform failover and switchover tests when everything is in place.

One thing should be stressed about these two features: you can use them together. This means you can have a near-realtime standby database, and run read-only queries on it, such as reporting queries. You can also use them independently; a standby database can be Hot Standby with file shipping only, and a Streaming Replication database can stream without accepting queries.

Other New features

There are literally hundreds of improvements, updates, and new features in 9.0 ... enough to make it a major release even without binary replication. We'll tour a few of them below, by category, with details on how to use them.

Security and Authentication

Of course, as the most secure SQL database (according to the SQL Hacker's Handbook) we're always eager to improve our data security. 9.0 adds several new features in this realm.

GRANT/REVOKE IN SCHEMA

One annoying limitation in PostgreSQL has been the lack of global GRANT/REVOKE capabilities. With 9.0 it's now possible to set privileges on all tables, sequences and functions within a schema using without having to write a script or a stored procedure:

GRANT SELECT ON ALL TABLES IN SCHEMA public TO toto;

And reverting this:

REVOKE SELECT ON ALL TABLES IN SCHEMA public FROM toto;

See the GRANT documentation page for further details.

Note that the above only works for existing objects. However, it's now also possible to define default permissions for new objects:

ALTER DEFAULT PRIVILEGES

This feature also makes permission management more efficient.

ALTER DEFAULT PRIVILEGES FOR ROLE marc GRANT SELECT ON TABLES TO public; CREATE TABLE test_priv (a int); \z test_priv Access privileges Schema | Name | Type | Access privileges | Column access privileges --------+------------+-------+-------------------+-------------------------- public | test_priv | table | =r/marc +| | | | marc=arwdDxt/marc |

This new information is stored in the pg_default_acl system table.

passwordcheck

This contrib module can check passwords, and prevent the worst of them from getting in. After having it installed and set up as described in the documentation, here is the result:

marc=# ALTER USER marc password 'marc12'; <marc%marc> ERROR: password is too short <marc%marc> STATEMENT: ALTER USER marc password 'marc12'; ERROR: password is too short marc=# ALTER USER marc password 'marc123456'; <marc%marc> ERROR: password must not contain user name <marc%marc> STATEMENT: ALTER USER marc password 'marc123456'; ERROR: password must not contain user name

This module has limitations, mostly due to PostgreSQL accepting already encrypted passwords to be declared, making correct verification impossible.

Moreover, its code is well documented, and can be easily adapted to suit specific needs (one can activate cracklib very easily, for instance)





SQL Features

SQL03 has a huge array of functionality, more than any one DBMS currently implements. But we keep adding SQL features, as well as extending SQL in a compatible way with various little things to make writing queries easier and more powerful.

Column triggers

Column triggers fire only when a specific column is explicitly UPDATED. They allow you to avoid adding lots of conditional logic and value comparisons in your trigger code.

Example:

CREATE TRIGGER foo BEFORE UPDATE OF some_column ON table1 FOR EACH ROW EXECUTE PROCEDURE my_trigger();

This trigger fires only when ' some_column ' column of ' table1 ' table has been updated.

Column triggers are not executed if columns are set to DEFAULT.

WHEN Triggers

Completing PostgreSQL's effort to limit IF ... THEN code in triggers, conditional triggers define simple conditions under which the trigger will be executed. This can dramatically decrease the number of trigger executions and reduce CPU load on the database server.

For example, this trigger would check that an account was correctly balanced only when the balance changes:

CREATE TRIGGER check_update BEFORE UPDATE ON accounts FOR EACH ROW WHEN (OLD.balance IS DISTINCT FROM NEW.balance) EXECUTE PROCEDURE check_account_update();

And this trigger will only log a row update when the row actually changes. It's very helpful with framework or ORM applications, which may attempt to save unchanged rows:

CREATE TRIGGER log_update AFTER UPDATE ON accounts FOR EACH ROW WHEN (OLD.* IS DISTINCT FROM NEW.*) EXECUTE PROCEDURE log_account_update();

You could even go further than this and decide not to save a row at all if it hasn't changed:

CREATE TRIGGER log_update BEFORE UPDATE ON accounts FOR EACH ROW WHEN (OLD.* IS NOT DISTINCT FROM NEW.*) EXECUTE PROCEDURE no_op();

DEFERRABLE UNIQUE CONSTRAINTS

This feature will also be very useful. Here is an example, using a primary key instead of a simple unique key:

marc=# CREATE TABLE test (a int primary key); marc=# INSERT INTO test values (1), (2); marc=# UPDATE test set a = a+1; ERROR: duplicate key value violates unique constraint "test_pkey" DETAIL: Key (a)=(2) already exists.

That's a pity: at the end of the statement, my data would have been consistent, so far as this constraint is concerned. Even worse, if the table had been physically sorted by descending order, the query would have worked! With 8.4, there was no easy way out; we had to find a trick to update the records in the right order.

We can now do this:

marc=# CREATE TABLE test (a int primary key deferrable); marc=# INSERT INTO test values (1), (2); marc=# UPDATE test set a = a+1; UPDATE 2

With a DEFERRABLE unique index, uniqueness is enforced as of the end of the statement, rather than after each row update as with a simple index. This is a bit slower sometimes but is a lifesaver if you need to do this sort of update.

It is also possible to have the uniqueness check enforced as of the end of the transaction, rather than after each statement. This helps if you need to do "conflicting" updates that require more than one SQL statement to complete. For example:

marc=# CREATE TABLE test (a int primary key deferrable, b text); marc=# INSERT INTO test values (1, 'x'), (2, 'y'); marc=# BEGIN; marc=# SET CONSTRAINTS ALL DEFERRED; marc=# UPDATE test SET a = 2 WHERE b = 'x'; marc=# UPDATE test SET a = 1 WHERE b = 'y'; marc=# COMMIT;

If one doesn't want to perform a SET CONSTRAINTS each time, the constraint can also be declared as INITIALLY DEFERRED:

CREATE TABLE test (a int PRIMARY KEY DEFERRABLE INITIALLY DEFERRED);

Keep in mind that the list of records to be checked at the end of the statement or transaction has to be stored somewhere. So be careful of not doing this for millions of records at once. This is one of the reasons that unique indexes aren't DEFERRABLE by default, even though a strict reading of the SQL spec would require it.

New frame options for window functions

If you don't know window functions yet, you'd better learn about them. You can start here : waiting-for-84-window-functions. They make writing certain kind of queries much easier.

New options have been added for declaring frames of windowing functions. Let's use this table (not having a better example…)

marc=# SELECT * FROM salary ; entity | name | salary | start_date -----------+-----------+---------+--------------- R&D | marc | 700.00 | 2010-02-15 Accounting | jack | 800.00 | 2010-05-01 R&D | maria | 700.00 | 2009-01-01 R&D | kevin | 500.00 | 2009-05-01 R&D | john | 1000.00 | 2008-07-01 R&D | tom | 1100.00 | 2005-01-01 Accounting | millicent | 850.00 | 2006-01-01

Here is a window function example, without declaring the frame:

marc=# SELECT entity, name, salary, start_date, avg(salary) OVER (PARTITION BY entity ORDER BY start_date) FROM salary; entity | name | salary | start_date | avg -----------+-----------+---------+---------------+----------------------- Accounting | millicent | 850.00 | 2006-01-01 | 850.0000000000000000 Accounting | jack | 800.00 | 2010-05-01 | 825.0000000000000000 R&D | tom | 1100.00 | 2005-01-01 | 1100.0000000000000000 R&D | john | 1000.00 | 2008-07-01 | 1050.0000000000000000 R&D | maria | 700.00 | 2009-01-01 | 933.3333333333333333 R&D | kevin | 500.00 | 2009-05-01 | 825.0000000000000000 R&D | marc | 700.00 | 2010-02-15 | 800.0000000000000000

The frame is the group of records over which the window function is run. Of course, if the frame isn't explicitly declared, there is a default one.

Here is the same query, with an explicit frame:

marc=# SELECT entity, name, salary, start_date, avg(salary) OVER (PARTITION BY entity ORDER BY start_date RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) FROM salary; entity | name | salary | start_date | avg -----------+-----------+---------+---------------+----------------------- Accounting | millicent | 850.00 | 2006-01-01 | 850.0000000000000000 Accounting | jack | 800.00 | 2010-05-01 | 825.0000000000000000 R&D | tom | 1100.00 | 2005-01-01 | 1100.0000000000000000 R&D | john | 1000.00 | 2008-07-01 | 1050.0000000000000000 R&D | maria | 700.00 | 2009-01-01 | 933.3333333333333333 R&D | kevin | 500.00 | 2009-05-01 | 825.0000000000000000 R&D | marc | 700.00 | 2010-02-15 | 800.0000000000000000

In this example, the frame is a 'range' frame, between the start of the partition (the group of similar rows) and the current row (not exactly the current row, but let's put that aside for now, read the documentation if you want to learn more). One can see, the average (avg) function is evaluated from the frame's first row (grouped together records) and the current row.

First new feature: as of 9.0, the frame can be declared to be between the current row and the end of the partition:

marc=# SELECT entity, name, salary, start_date, avg(salary) OVER (PARTITION BY entity ORDER BY start_date RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) FROM salary; entity | name | salary | start_date | avg -----------+-----------+---------+---------------+---------------------- Accounting | millicent | 850.00 | 2006-01-01 | 825.0000000000000000 Accounting | jack | 800.00 | 2010-05-01 | 800.0000000000000000 R&D | tom | 1100.00 | 2005-01-01 | 800.0000000000000000 R&D | john | 1000.00 | 2008-07-01 | 725.0000000000000000 R&D | maria | 700.00 | 2009-01-01 | 633.3333333333333333 R&D | kevin | 500.00 | 2009-05-01 | 600.0000000000000000 R&D | marc | 700.00 | 2010-02-15 | 700.0000000000000000

Second new feature: frames can be declared as 'x previous records to y next records'. There is no point with this example, but let's do it anyway::

marc=# SELECT entity, name, salary, start_date, avg(salary) OVER (PARTITION BY entity ORDER BY start_date ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) FROM salary; entity | name | salary | start_date | avg -----------+-----------+---------+---------------+----------------------- Accounting | millicent | 850.00 | 2006-01-01 | 825.0000000000000000 Accounting | jack | 800.00 | 2010-05-01 | 825.0000000000000000 R&D | tom | 1100.00 | 2005-01-01 | 1050.0000000000000000 R&D | john | 1000.00 | 2008-07-01 | 933.3333333333333333 R&D | maria | 700.00 | 2009-01-01 | 733.3333333333333333 R&D | kevin | 500.00 | 2009-05-01 | 633.3333333333333333 R&D | marc | 700.00 | 2010-02-15 | 600.0000000000000000

The frame is still limited to the partition (see tom's record, for instance: jack's record isn't use for it's average).

If one wanted the same query, with a moving average on three rows, not reset on each partition switch (still no practical use):

marc=# SELECT entity, name, salary, start_date, avg(salary) OVER (ORDER BY entity, start_date ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) FROM salary; entity | name | salary | start_date | avg -----------+-----------+---------+---------------+---------------------- Accounting | millicent | 850.00 | 2006-01-01 | 825.0000000000000000 Accounting | jack | 800.00 | 2010-05-01 | 916.6666666666666667 R&D | tom | 1100.00 | 2005-01-01 | 966.6666666666666667 R&D | john | 1000.00 | 2008-07-01 | 933.3333333333333333 R&D | maria | 700.00 | 2009-01-01 | 733.3333333333333333 R&D | kevin | 500.00 | 2009-05-01 | 633.3333333333333333 R&D | marc | 700.00 | 2010-02-15 | 600.0000000000000000

In short, a powerful tool to be mastered, even if I couldn't provide a good example.

Sort in aggregates

This feature is a subtle one: the result of an aggregate function may depend on the order it receives the data.

Of course, we're not talking about count, avg, but of array_agg, string_agg…

This is nice, as this will showcase string_agg, which is another 9.0 feature, killing two birds with one stone.

Let's start again with our salary table. We want the list of employees, concatenated as a single value, grouped by entity. It's going into a spreadsheet…

marc=# SELECT entity,string_agg(name,', ') FROM salary GROUP BY entity; entity | string_agg -----------+------------------------------- Accounting | stephanie, etienne R&D | marc, maria, kevin, john, tom

That's already nice. But I want them sorted in alphabetical order, because I don't know how to write a macro in my spreadsheet to sort this data.

marc=# SELECT entity,string_agg(name,', ' ORDER BY name) FROM salary GROUP BY entity; entity | string_agg -----------+------------------------------- Accounting | etienne, stephanie R&D | john, kevin, marc, maria, tom

To use this new feature, the sort clause must be inserted inside the aggregate function, without a comma to separate it from the parameters.

Database Administration

DBA is a hard and often thankless job -- especially if that's not your job title. 9.0 includes new and improved features to make that job a bit easier.

Better VACUUM FULL

Until now, VACUUM FULL was very slow. This statement can recover free space from a table to reduce its size, mostly when VACUUM itself hasn't been run frequently enough.

It was slow because of the way it operated: records were read and moved one by one from their source bloc to a bloc closer to the beginning of the table. Once the end of the table was emptied, this empty part was removed.

This strategy was very inefficient: moving records one by one creates a lot of random IO. Moreover, during this reorganization, indexes had to be maintained, making everything even more costly, and fragmenting indexes. It was therefore advised to reindex a table just after a VACUUM FULL.

The VACUUM FULL statement, as of version 9.0, creates a new table from the current one, copying all the records sequentially. Once all records are copied, index are created back, and the old table is destroyed and replaced.

This has the advantage of being much faster. VACUUM FULL still needs an AccessExclusiveLock while running though. The only drawback of this method compared to the old one, is that VACUUM FULL can use as much as two times the size of the table on disk, as it is creating a new version of it.

Let's now compare the runtimes of the two methods. In both cases, we prepare the test data as follows (for 8.4 and 9.0)

marc=# CREATE TABLE test (a int); CREATE TABLE marc=# CREATE INDEX idxtsta on test (a); CREATE INDEX marc=# INSERT INTO test SELECT generate_series(1,1000000); INSERT 0 1000000 marc=# DELETE FROM test where a%3=0; -- making holes everywhere DELETE 333333 marc=# VACUUM test; VACUUM

With 8.4:

marc=# \timing Timing is on. marc=# VACUUM FULL test; VACUUM Time: 6306,603 ms marc=# REINDEX TABLE test; REINDEX Time: 1799,998 ms

So around 8 seconds. With 9.0:

marc=# \timing Timing is on. marc=# VACUUM FULL test; VACUUM Time: 2563,467 ms

That still doesn't mean that VACUUM FULL is a good idea in production. If you need it, it's probably because your VACUUM policy isn't appropriate.

application_name in pg_stat_activity

In a monitoring session:

marc=# SELECT * from pg_stat_activity where procpid= 5991; datid | datname | procpid | usesysid | usename | application_name | client_addr | client_port | backend_start | xact_start | query_start | waiting | current_query ------+---------+---------+----------+---------+------------------+-------------+-------------+-------------------------------+------------+-------------+---------+---------------- 16384 | marc | 5991 | 10 | marc | psql | | -1 | 2010-05-16 13:48:10.154113+02 | | | f | <IDLE> (1 row)

In the '5991' session:

marc=# SET application_name TO 'my_app'; SET

Back to the monitoring session:

>marc=# SELECT * from pg_stat_activity where procpid= 5991; datid | datname | procpid | usesysid | usename | application_name | client_addr | client_port | backend_start | xact_start | query_start | waiting | current_query ------+---------+---------+----------+---------+------------------+-------------+-------------+-------------------------------+------------+-------------+---------+-----------------+---------------- 16384 | marc | 5991 | 10 | marc | my_app | | -1 | 2010-05-16 13:48:10.154113+02 | | 2010-05-16 13:49:13.107413+02 | f | <IDLE> (1 row)

It's your job to set this up correctly in your program or your sessions. Your DBA will thank you for this, at last knowing who runs what on the database easily.

Per database+role configuration

Instead of being able to set up configuration variables per database or per user, one can now set them up for a certain user in a certain database:

marc=# ALTER ROLE marc IN database marc set log_statement to 'all'; ALTER ROLE

To know who has which variables set-up in which user+database, there is a new psql command:

marc=# \drds List of settings role | database | settings -----+----------+----------------- marc | marc | log_statement=all (1 row)

There was a catalog change to store this:

Table "pg_catalog.pg_db_role_setting" Column | Type | Modifier ------------+--------+---------- setdatabase | oid | not null setrole | oid | not null setconfig | text |

Log all changed parameters on a postgresql.conf reload

Here is an example, the log_line_prefix parameter has been changed:

LOG: received SIGHUP, reloading configuration files <%> LOG: parameter "log_line_prefix" changed to "<%u%%%d> "

Better unique constraints error messages

With 8.4:

marc=# INSERT INTO test VALUES (1); ERROR: duplicate key value violates unique constraint "test_a_key"

With 9.0:

marc=# INSERT INTO test VALUES (1); ERROR: duplicate key value violates unique constraint "test_a_key" DETAIL: Key (a)=(1) already exists.

This will make diagnosing constraint violation errors much easier.

vacuumdb --analyze-only

As the parameter indicates, one can now use vacuumdb to run analyze only. It may be useful for cronjobs for instance.





Performance

It wouldn't be a new version of PostgreSQL if it didn't get faster, now would it? While 9.0 is not a "performance release", it does add new features which make some specific operations up to 1000% faster.

64 bit binaries for Windows

It is now possible to compile PostgreSQL for Windows as a 64-bit binary, and the PostgreSQL project is releasing 64-bit packages.

This has a number of advantages for Windows users: better performance on 64-bit number operations (like BIGINT and BIGSERIAL), the ability to use over 2GB of work_mem, and enhanced compatibility with 64-bit versions of PostgreSQL running on Linux. This last is particularly important given Hot Standby.

Note, however, that there is no evidence for now the 500MB shared_buffers size limit before performance degrades seen on the 32 bits version for Windows is solved with this 64 bit version, though. There is also the limitation that many 3rd-party open-source libraries are not available in 64-bit for Windows, so you may not be able to add all PostgreSQL extensions. Test reports welcome!

Join Removal

This new optimization allows us to remove unnecessary joins from SQL execution plans.

When using automatically generated SQL, such as from ORM (Object Relation Mapping) tools it is possible for the SQL to be sub-optimal. Removing unnecessary joins can improve query plans by an order of magnitude in some cases.

This is particularly important for databases that use many joins and nested views.

marc=# CREATE TABLE t1 (a int); CREATE TABLE marc=# CREATE TABLE t2 (b int); CREATE TABLE marc=# CREATE TABLE t3 (c int); CREATE TABLE

We put a little bit of data with a generate_series…

marc=# EXPLAIN SELECT t1.a,t2.b from t1 join t2 on (t1.a=t2.b) left join t3 on (t1.a=t3.c); QUERY PLAN ------------------------------------------------------------------------------ Merge Right Join (cost=506.24..6146.24 rows=345600 width=8) Merge Cond: (t3.c = t1.a) -> Sort (cost=168.75..174.75 rows=2400 width=4) Sort Key: t3.c -> Seq Scan on t3 (cost=0.00..34.00 rows=2400 width=4) -> Materialize (cost=337.49..853.49 rows=28800 width=8) -> Merge Join (cost=337.49..781.49 rows=28800 width=8) Merge Cond: (t1.a = t2.b) -> Sort (cost=168.75..174.75 rows=2400 width=4) Sort Key: t1.a -> Seq Scan on t1 (cost=0.00..34.00 rows=2400 width=4) -> Sort (cost=168.75..174.75 rows=2400 width=4) Sort Key: t2.b -> Seq Scan on t2 (cost=0.00..34.00 rows=2400 width=4)

For now, everything is normal, and we have the same behavior in 8.4. But let's imagine that on t3, there is a UNIQUE constraint on the 'c' column. In this case, the join on t3 doesn't serve any purpose, theoretically speaking: the number of rows returned won't change, neither will their content. It's because the column is UNIQUE, the join is a LEFT JOIN, and no column of t3 is retrieved. If the column wasn't UNIQUE, the join could bring more rows. If that wasn't a LEFT JOIN, the join could ignore some rows.

With 9.0:

marc=# ALTER TABLE t3 ADD UNIQUE (c); NOTICE: ALTER TABLE / ADD UNIQUE will create implicit index "t3_c_key" for table "t3" ALTER TABLE marc=# EXPLAIN SELECT t1.a,t2.b from t1 join t2 on (t1.a=t2.b) left join t3 on (t1.a=t3.c); QUERY PLAN ------------------------------------------------------------------ Merge Join (cost=337.49..781.49 rows=28800 width=8) Merge Cond: (t1.a = t2.b) -> Sort (cost=168.75..174.75 rows=2400 width=4) Sort Key: t1.a -> Seq Scan on t1 (cost=0.00..34.00 rows=2400 width=4) -> Sort (cost=168.75..174.75 rows=2400 width=4) Sort Key: t2.b -> Seq Scan on t2 (cost=0.00..34.00 rows=2400 width=4) (8 rows)

IS NOT NULL can now use indexes

For this demonstration, we will compare the 8.4 and 9.0 versions (the table I created contains mostly nulls):

With 8.4:

marc=# EXPLAIN ANALYZE SELECT max(a) from test; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------ Result (cost=0.03..0.04 rows=1 width=0) (actual time=281.320..281.321 rows=1 loops=1) InitPlan 1 (returns $0) -> Limit (cost=0.00..0.03 rows=1 width=4) (actual time=281.311..281.313 rows=1 loops=1) -> Index Scan Backward using idxa on test (cost=0.00..29447.36 rows=1001000 width=4) (actual time=281.307..281.307 rows=1 loops=1) Filter: (a IS NOT NULL) Total runtime: 281.360 ms (6 rows)

With 9.0:

marc=# EXPLAIN ANALYZE SELECT max(a) from test; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------- Result (cost=0.08..0.09 rows=1 width=0) (actual time=0.100..0.102 rows=1 loops=1) InitPlan 1 (returns $0) -> Limit (cost=0.00..0.08 rows=1 width=4) (actual time=0.092..0.093 rows=1 loops=1) -> Index Scan Backward using idxa on test (cost=0.00..84148.06 rows=1001164 width=4) (actual time=0.089..0.089 rows=1 loops=1) Index Cond: (a IS NOT NULL) Total runtime: 0.139 ms (6 rows)

The difference is that 9.0 only scans the not-null keys in the index. 8.4 has to go check in the table (Filter step, when 9.0 uses an index condition). In this precise use case, the gain is really big.

Use of index to get better statistics on the fly

Before starting to explain this new feature, let's talk about histograms: PostgreSQL, like some other databases, uses a statistical optimizer. This means that when planning a query it has (or should have) an approximately correct idea of how many records each step of the query will bring back. In order to do this, it uses statistics, such as the approximate number of records in a table, its size, most common values, and histograms. PostgreSQL use these to get estimates about the number of records brought back by a WHERE clause on a column, depending on the value or range asked in this WHERE clause.

In some cases, these histograms are rapidly out of date, and become a problem, for certain SQL queries. For instance, a log table in which timestamped records would be inserted, and from which we would most of the time want to get the records from the last 5 minutes.

In this specific case, it was impossible before 9.0 to get correct statistics. Now, when PostgreSQL detects while planning that a query asks for a 'range scan' on a value larger than the largest of the histogram (or smaller than the smallest), that is, the largest detected value during the last statistics calculation, and this column has an index, it gets the max (or min) value for this column using the index BEFORE really executing the query, in order to get more realistic statistics. As PostgreSQL uses an index for this, there HAS to be an index, of course.

Here comes an example. The a column of the test table has already been filled with a lot of dates, all in the past. It's statistics are up to date.

It's 13:37, and I haven't inserted anything after 13:37 yet.

marc=# EXPLAIN ANALYZE select * from test where a > '2010-06-03 13:37:00'; QUERY PLAN -------------------------------------------------------------------------------------------------------------- Index Scan using idxtsta on test (cost=0.00..8.30 rows=1 width=8) (actual time=0.007..0.007 rows=0 loops=1) Index Cond: (a > '2010-06-03 13:37:00'::timestamp without time zone) Total runtime: 0.027 ms (3 rows)

Everything's normal. The upper boundary of the histogram is '2010-06-03 13:36:16.830007' (this information comes from pg_stats). There is no way of guessing how many records are larger than 13:37, and with 8.4, PostgreSQL would have continued estimating '1' until the next analyze.

marc=# DO LANGUAGE plpgsql $$ DECLARE i int; BEGIN FOR i IN 1..10000 LOOP INSERT INTO test VALUES (clock_timestamp()); END LOOP; END $$ ;

(I must say I really like 'DO'). We just inserted 10000 records with a date larger than 13:37.

marc=# EXPLAIN ANALYZE SELECT * FROM test WHERE a > '2010-06-03 13:37:00'; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------- Index Scan using idxtsta on test (cost=0.00..43.98 rows=1125 width=8) (actual time=0.012..13.590 rows=10000 loops=1) Index Cond: (a > '2010-06-03 13:37:00'::timestamp without time zone) Total runtime: 23.567 ms (3 rows)

The estimated rows isn't 0 or 1 anymore. The statistics haven't been updated, though:

marc=# SELECT last_autoanalyze FROM pg_stat_user_tables WHERE relname = 'test'; last_autoanalyze ------------------------------- 2010-06-03 13:36:21.553477+02 (1 row)

We still have a one magnitude error in the evaluation (10 times). But it's not that bad: without this enhancement, it would be of four magnitudes (10,000). Anyway, a much smaller error makes it more likely we'll get a good plan out of this kind of queries.

Explain buffers, hashing statistics, xml, json, yaml, new optional explain syntax

Here is EXPLAIN ANALYZE as we all know it:

marc=# EXPLAIN ANALYZE SELECT a, sum(c) FROM pere JOIN fils ON (pere.a = fils.b) WHERE b BETWEEN 1000 AND 300000 GROUP BY a; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------- HashAggregate (cost=905.48..905.86 rows=31 width=8) (actual time=0.444..0.453 rows=6 loops=1) -> Nested Loop (cost=10.70..905.32 rows=31 width=8) (actual time=0.104..0.423 rows=6 loops=1) -> Bitmap Heap Scan on fils (cost=10.70..295.78 rows=31 width=8) (actual time=0.040..0.154 rows=30 loops=1) Recheck Cond: ((b >= 1000) AND (b <= 300000)) -> Bitmap Index Scan on fils_pkey (cost=0.00..10.69 rows=31 width=0) (actual time=0.023..0.023 rows=30 loops=1) Index Cond: ((b >= 1000) AND (b <= 300000)) -> Index Scan using pere_pkey on pere (cost=0.00..19.65 rows=1 width=4) (actual time=0.005..0.005 rows=0 loops=30) Index Cond: (pere.a = fils.b) Total runtime: 0.560 ms (9 rows)

To get access to the new available information, use the new syntax::

EXPLAIN [ ( { ANALYZE boolean | VERBOSE boolean | COSTS boolean | BUFFERS boolean | FORMAT { TEXT | XML | JSON | YAML } } [, ...] ) ] instruction

For instance:

marc=# EXPLAIN (ANALYZE true, VERBOSE true, BUFFERS true) SELECT a, sum(c) FROM pere JOIN fils ON (pere.a = fils.b) WHERE b BETWEEN 1000 AND 300000 GROUP BY a; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------- HashAggregate (cost=905.48..905.86 rows=31 width=8) (actual time=1.326..1.336 rows=6 loops=1) Output: pere.a, sum(fils.c) Buffers: shared hit=58 read=40 -> Nested Loop (cost=10.70..905.32 rows=31 width=8) (actual time=0.278..1.288 rows=6 loops=1) Output: pere.a, fils.c Buffers: shared hit=58 read=40 -> Bitmap Heap Scan on public.fils (cost=10.70..295.78 rows=31 width=8) (actual time=0.073..0.737 rows=30 loops=1) Output: fils.b, fils.c Recheck Cond: ((fils.b >= 1000) AND (fils.b <= 300000)) Buffers: shared hit=4 read=28 -> Bitmap Index Scan on fils_pkey (cost=0.00..10.69 rows=31 width=0) (actual time=0.030..0.030 rows=30 loops=1) Index Cond: ((fils.b >= 1000) AND (fils.b <= 300000)) Buffers: shared hit=3 -> Index Scan using pere_pkey on public.pere (cost=0.00..19.65 rows=1 width=4) (actual time=0.013..0.014 rows=0 loops=30) Output: pere.a Index Cond: (pere.a = fils.b) Buffers: shared hit=54 read=12 Total runtime: 1.526 ms (18 rows)

VERBOSE displays the 'Output' lines (it already existed on 8.4).

BUFFERS displays data about buffers (input-output operations performed by the query): hit is the number of blocks obtained directly from shared_buffers, read is the number of blocs asked to the operating system. Here, there was very little data in shared_buffers.

One can also ask for another formatting than plain text. For a user, it's not useful. For people developing GUIs over EXPLAIN, it simplifies development as they can get rid of an 'explain' parser (and its potential bugs), and use a more standard one, such as XML.

Costs display can also be deactivated with COSTS false.

Per tablespace seq_page_cost/random_page_cost

marc=# ALTER TABLESPACE pg_default SET ( random_page_cost = 10, seq_page_cost=5); ALTER TABLESPACE

We just changed random_page_cost and seq_page_cost for all the objects contained in pg_default. What for ?

The use case is when different tablespaces have different performance: for instance, you have some critical data on a SSD drive, or historical data on an older disk array, slower than the brand new array you use for active data. This makes it possible to tell PostgreSQL that all your tablespaces don't always behave the same way, from a performance point of view. This is only useful, of course, for quite big databases.

Force distinct statistics on a column

This makes it possible to set the number of different values for a column. This mustn't be used lightly, but only when ANALYZE on this column can't get a good value.

Here's how to do this:

marc=# ALTER TABLE test ALTER COLUMN a SET (n_distinct = 2); ALTER TABLE

ANALYZE has to be run again for this to be taken into account:

marc=# ANALYZE test; ANALYZE

Let's try now:

marc=# EXPLAIN SELECT distinct * from test; QUERY PLAN ------------------------------------------------------------------ HashAggregate (cost=6263.00..6263.02 rows=2 width=8) -> Seq Scan on test (cost=0.00..5338.00 rows=370000 width=8) (2 rows)

This is an example of what SHOULDN'T be done : there REALLY is 370 000 distinct values in my table. Now my execution plans may be very bad.

If n_distinct is positive, it's the number of distinct values.

If it's negative (between 0 and -1), it's the multiplying factor regarding the number of estimated records in the table: for instance, -0.2 means that there is a distinct value for each 5 records of the table.

0 brings the behavior back to normal (ANALYZE estimates distinct by itself).

Don't change this parameter, unless you are completely sure you have correctly diagnosed your problem. Else, be assured performance will be degraded.

Statement logged by auto_explain

auto_explain contrib module will now print the statement with its plan, which will make it much easier to use.

Buffers accounting for pg_stat_statements

This already very useful contrib module now also provides data about buffers. pg_stat_statements, as a reminder, collects statistics on the queries run on the database. Until now, it stored the query's code, number of executions, accumulated runtime, accumulated returned records. It now collects buffer operations too.

marc=# SELECT * from pg_stat_statements order by total_time desc limit 2; -[ RECORD 1 ]-------+--------------------- userid | 10 dbid | 16485 query | SELECT * from table1 ; calls | 2 total_time | 0.491229 rows | 420000 shared_blks_hit | 61 shared_blks_read | 2251 shared_blks_written | 0 local_blks_hit | 0 local_blks_read | 0 local_blks_written | 0 temp_blks_read | 0 temp_blks_written | 0 -[ RECORD 2 ]-------+--------------------- userid | 10 dbid | 16485 query | SELECT * from table2; calls | 2 total_time | 0.141445 rows | 200000 shared_blks_hit | 443 shared_blks_read | 443 shared_blks_written | 0 local_blks_hit | 0 local_blks_read | 0 local_blks_written | 0 temp_blks_read | 0 temp_blks_written | 0

When this contrib is installed, one can now answer these questions:

Which query has the biggest accumulated runtime ?

Which query generates the most IO operations ? (we still can't know if data has been found in the Operating System's cache)

Which query uses mostly the cache (and hence won't be faster if we make it bigger) ?

Which query modifies the most blocks ?

Who does sorting ?

'local' and 'temp' are the buffer operations relative to temporary tables and other local operations (sorts, hashes) to a database backend. If there are many reads and writes in them, you might be better to increase temp_buffers (for 'local') or work_mem (for 'temp').

Stored Procedures

PostgreSQL isn't just a database, it's a whole application platform. Many of our users write entire applications using stored procedures and functions. So, it's no surprise that 9.0 brings a number of improvements in database procedural code:

PL/pgSQL by default

You won't have to add PL/pgSQL in databases, as it will be installed by default. This has been requested for a long time.

Many improvements on PL languages.

Many languages have been vastly improved, PLPerl for instance. Read the release notes if you want more details, there are too many to detail here.

Anonymous Functions (aka Anonymous Blocks)

This new feature is for creating run-once functions. Effectively, this allows you to run stored procedure code on the command line or dynamically as you can on SQL Server and Oracle. Unlike those, however, PostgreSQL allows you to run an anonymous function in any procedural language which is installed of the more than a dozen which PostgreSQL supports.

This feature will be very useful for schema upgrade scripts for instance. Here is a slightly different version of the 'GRANT SELECT ON ALL TABLES' that will be seen later in this document, giving SELECT rights to a bunch of tables, depending on the table owner, and excluding two schemas:

DO language plpgsql $$ DECLARE vr record; BEGIN FOR vr IN SELECT tablename FROM pg_tables WHERE tableowner = 'marc' AND schemaname NOT IN ('pg_catalog','information_schema') LOOP EXECUTE 'GRANT SELECT ON ' || vr.tablename || ' TO toto'; END LOOP; END $$;

As of 8.4, this would have required creating a function (with CREATE FUNCTION), running it, then removing it (with DROP FUNCTION). All of this requiring having rights to do this. 9.0 simplifies performing this kind of procedures.

Anonymous functions are also called "anonymous code blocks" in the software industry.

Named Parameter Calls

Combined with the Default Parameters introduced in version 8.4, named parameters allow for dynamic calling of functions with variable numbers of arguments, much as they would be inside a programming language. Named parameters are familiar to users of SQL Server or Sybase, but PostgreSQL does one better by supporting both named parameter calls and function overloading.

The chosen syntax to name parameters is the following:

CREATE FUNCTION test (a int, b text) RETURNS text AS $$ DECLARE value text; BEGIN value := 'a is ' || a::text || ' and b is ' || b; RETURN value; END; $$ LANGUAGE plpgsql;

Until now, we wrote:

SELECT test(1,'foo'); test ------------------------- a is 1 and b is foo (1 row)

Now this explicit syntax can be used:

SELECT test( b:='foo', a:=1); test ------------------------- a is 1 and b is foo (1 row)

Named parameters should eliminate the need to write many overloaded "wrapper" functions. Note that this does add a backwards compatibility issue; you are no longer able to rename function parameters using a REPLACE command, but must now drop and recreate the function.

ALIAS keyword

ALIAS can now be used. As its name suggests, it can be used to alias variable names to other names.

The syntax is new_name ALIAS FOR old_name . This is put in the DECLARE section of PL/pgSQL code.

It has two main use cases:

to give names to PL function variables:

myparam ALIAS FOR $0

to rename potentially conflicting variables. In a trigger for instance:

new_value ALIAS FOR new

(without this, we might have conflicted with the NEW variable in the trigger function).

Advanced Features

Some features in PostgreSQL are cutting-edge database features which are pretty much "PostgreSQL only". It's why we're the "most advanced database". These features enable new types of applications.

Exclusion constraints

Exclusion constraints are very similar to unique constraints. They could be seen as unique constraints using other operators than '=': A unique constraint defines a set of columns for which two records in the table cannot be identical.

To illustrate this, we will use the example provided by this feature's author, using the temporal data type, that he also developed. This datatype stores time ranges, that is 'the time range from 10:15 to 11:15'.

First, we need to retrieve the temporal module here: http://pgfoundry.org/projects/temporal/ , then compile and install it as a contrib (run the provided sql script). We may also need to install the btree_gist module as a contrib. From source, one can run 'make install' in contrib/btree_gist directory for the same.

CREATE TABLE reservation ( room TEXT, professor TEXT, during PERIOD);

ALTER TABLE reservation ADD CONSTRAINT test_exclude EXCLUDE USING gist (room WITH =,during WITH &&);

Doing this, we declare that a record should be rejected (exclusion constraint) if there already is one verifying the two conditions 'the same room' and 'be in intersection for the time range' (the && operator).

marc=# INSERT INTO reservation (professor,room,during) VALUES ( 'mark', 'tech room', period('2010-06-16 09:00:00', '2010-06-16 10:00:00')); INSERT 0 1 marc=# INSERT INTO reservation (professor,room,during) VALUES ( 'john', 'chemistry room', period('2010-06-16 09:00:00', '2010-06-16 11:00:00')); INSERT 0 1 marc=# INSERT INTO reservation (professor,room,during) VALUES ( 'mark', 'chemistry room', period('2010-06-16 10:00:00', '2010-06-16 11:00:00')); ERROR: conflicting key value violates exclusion constraint "test_exclude" DETAIL: Key (room, during)=(chemistry room, [2010-06-16 10:00:00+02, 2010-06-16 11:00:00+02)) conflicts with existing key (room, during)=(chemistry room, [2010-06-16 09:00:00+02, 2010-06-16 11:00:00+02)).

The insert is forbidden, as the chemistry room is already reserved from 9 to 11.

Exclusion constraints may also be used with arrays, geographic data, or other non-scalar data in order to implement advanced scientific and calendaring applications. No other database system has this feature.

Message passing in NOTIFY/pg_notify

Messages can now be passed using NOTIFY. Here is how:

Subscribe in session 1 to the 'instant_messenging' queue.

Session 1:

marc=# LISTEN instant_messenging; LISTEN

Send a notification through 'instant_messenging', from another session

Session 2:

marc=# NOTIFY instant_messenging, 'You just received a message'; NOTIFY

Check the content of the queue in the first session

Session 1:

marc=# LISTEN instant_messenging; LISTEN Asynchronous notification "instant_messenging" with payload "You just received a message" received from server process with PID 5943.

So we can now associate messages (payloads) with notifications, making NOTIFY even more useful.

Let's also mention the new pg_notify function. With it, the second session's code can also be:

SELECT pg_notify('instant_messenging','You just received a message');

This can simplify some code, in the case of a program managing a lot of different queues.

Hstore contrib enhancements

This already powerful contrib module has become even more powerful:

Keys and values size limit has been removed.

GROUP BY and DISTINCT can now be used.

New operators and functions have been added.

An example would take too long, this module has a lot of features. Read the documentation at once !

Unaccent filtering dictionary

Filtering dictionaries can now be set up. This is about Full Text Search dictionaries.

These dictionaries' purpose it applying a first filter on words before lexemizing them. The module presented here is the first one to use this mechanism. Filtering can consist in removing words or modifying them.

Unaccent doesn't remove words, it removes accents (all diacritic signs, as a matter of fact), replacing accentuated characters with non-accentuated ones (many people, at least in French, don't type them). Unaccent is a contrib module.

Installing it, as all contrib modules, is as easy as

psql mydb < contribs_path/unaccent.sql.

We'll now follow unaccent's documentation, the example being filtering french words.

Let's create a new 'fr' dictionary (keeping standard 'french' dictionary clean):

marc=# CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french ); CREATE TEXT SEARCH CONFIGURATION

The next statement alters the 'fr' setup for word and alike lexemes. These now have to go through unaccent and french_stem instead of only french_stem.

marc=# ALTER TEXT SEARCH CONFIGURATION fr >ALTER MAPPING FOR hword, hword_part, word >WITH unaccent, french_stem; >ALTER TEXT SEARCH CONFIGURATION SELECT to_tsvector('fr','Hôtels de la Mer'); to_tsvector ------------------- 'hotel':1 'mer':4 (1 row) marc=# SELECT to_tsvector('fr','Hôtel de la Mer') @@ to_tsquery('fr','Hotels'); ?column? ---------- t (1 row)

It's now easy, without changing even one line of code in the client application, and keeping accentuated characters in the database, to look up words without taking accents into account.









get_bit and set_bit for bit strings

Here is a very simple example. This tool can manipulate bits in a bit() independently.

marc=# SELECT set_bit('1111'::bit(4),2,0); set_bit --------- 1101 (1 row)





marc=# SELECT get_bit('1101'::bit(4),2); get_bit --------- 0 (1 row)

Backwards Compatibility and Upgrade Issues

The PostgreSQL project has a commitment not to break backwards compatibility when we can possibly avoid doing so. Sometimes, however, we have to break things in order to add new features or fix longstanding problematic behavior. Some of these issues are documented below.

PL/pgSQL changes which may cause regressions

There are two changes in PL/pgSQL which may break code which works in 8.4 or earlier, meaning PL/pgSQL functions should be audited before before migrating to 9.0 to prevent possible runtime errors. A lot of these come about due to uniting the lexer for SQL and PL/pgSQL, which is an important architectural improvement which has made several new features possible.

Removal of column/variable name ambiguity

In 8.4 and earlier, PL/PgSQL variables will take preference over a table or view column with the same name. While this behaviour is consistent, it is a potential source of coding errors. 9.0 will throw a runtime error if this situation occurs:

marc=# DO LANGUAGE plpgsql $$ DECLARE a int; BEGIN SELECT a FROM test; END $$ ; ERROR: column reference "a" is ambiguous LINE 1: select a from test DETAIL: It could refer to either a PL/pgSQL variable or a table column. QUERY: select a from test CONTEXT: PL/pgSQL function "inline_code_block" line 4 at SQL statement

This behaviour can be altered globally in postgresql.conf, or on a per function basis by inserting one of these three options in the function declaration:

#variable_conflict error (default) #variable_conflict use_variable (variable name name takes precedence - pre-9.0 behaviour) #variable_conflict use_column (column name takes precedence)

The manual contains more details.

Reserved words

From 9.0, use of unquoted reserved words as PL/PgSQL variable names is no longer permitted:

marc=# DO LANGUAGE plpgsql $$ DECLARE table int; BEGIN table :=table+1; END $$ ; ERROR: syntax error at or near "table" LINE 6: table :=table+1;

The correct syntax is:

marc=# DO LANGUAGE plpgsql $$ DECLARE "table" int; BEGIN "table" :="table"+1; END $$ ; DO

Best practice is of course to avoid reserved words completely.