"Big data" features coming in PostgreSQL 9.5


PostgreSQL 9.5 Alpha 2 is due to be released on August 6. Not only does the new version support UPSERT, more JSON functionality, and other new features we looked at back in July, it also has some major enhancements for "big data" workloads. Among these are faster sorts, TABLESAMPLE, GROUPING SETS and CUBE, BRIN indexes, and Foreign Data Wrapper improvements. Taken together, these features strengthen the case for using PostgreSQL for data warehouses, and enable users to continue using it as their databases grow.

This release is still called an "alpha" because the PostgreSQL developers don't feel that it's entirely stable yet, and want to reserve the right to remove features that can't be fixed before the beta release. Among others, this includes the TABLESAMPLE feature described below; as of this writing, there are still problems with that patch and it could be dropped from 9.5.

PostgreSQL as a big data database

Some readers may not think of PostgreSQL for big data, but the database system has a long track record of use in data warehouses. For a decade before Hadoop launched, PostgreSQL was the only pure open-source option for large data volumes and complex analytics. Today, it is still heavily used in mid-sized data warehouses and "data marts", meaning databases in the one to ten terabyte range. Big data and analytics users are a major subset of the PostgreSQL community, and thus its contributor base, resulting in the addition of new features for large databases with each release.

More prominently, the European Union has chosen PostgreSQL as a key part of a publicly funded analytics technology project in the EU's Seventh Framework Programme, or "FP7". This project aims to enhance the technology base of the EU. Funding from FP7, channeled through the consulting firm 2nd Quadrant, paid for two of the features covered below, as well as performance testing and review work on others.

In addition to core features, the PostgreSQL project has been a source of code for many big data startups over the years. These include Netezza, Greenplum, ParAccel, Truviso, Aster Data Systems, and CitusDB. Both Yahoo! Everest and Amazon Redshift are PostgreSQL forks as well. In July, startup PipelineDB released an open-source streaming database that was forked from PostgreSQL 9.4. Streaming databases are used to process huge amounts of incoming data. These startups are generally attracted to PostgreSQL because of its sophisticated query planner and executor, liberal license, and well-documented code.

As hardware gets bigger and faster, the definition of "mid-sized data warehouse" keeps getting bigger and user expectations continue to grow. The PostgreSQL project continuously adds new features and better performance for big data use cases in order to keep up. Version 9.5 includes a lot of "goodies" for users with big databases.

Faster sorts

One thing relational databases do a lot of is sorting data. The bigger the database, the more the performance of those sorts becomes the dominating factor in overall response time. Sort time is critical not just for query results, but also for building indexes, grouping data, and certain kinds of joins. To help with this, PostgreSQL 9.5 comes with an across-the-board speedup in sorting text and long numeric values, yielding 3X to 12X speed improvements for many users.

Peter Geoghegan of Heroku wrote a patch in January that made text sorts faster by using a technique called "abbreviated keys". Previously, PostgreSQL sorted all text purely by using glibc's strcoll() function, which allows the database to sort in whatever locale the user has chosen for their data. The problem with this is that strcoll() is expensive, taking hundreds of times more CPU time than comparing integers.

Converting the text values to binary keys using strxfrm() and sorting them is much faster, except that those binary keys can be very large. So Geoghegan hit on the technique of taking only the first eight bytes of the keys produced by strxfrm(), comparing those, and then breaking any ties with a full strcoll() comparison. This speeds up sorts of all kinds, both in-memory quicksorts and on-disk tape sorts. One user tested 9.5 with their data warehouse, and the new version cut the time to index eighteen million text values from almost eleven minutes to less than one minute.

After Geoghegan's patch was committed, contributor Andrew Gierth wrote a patch that uses a similar technique to speed up sorts on PostgreSQL's NUMERIC data type. In Postgres, NUMERIC is an arbitrary-precision number value, supporting over 100,000 digits. This means that it can't be sorted as an integer or float, so Gierth's improvement was welcome.

However, there still may be some issues with the sort improvements. Some platforms, especially older versions of Solaris, have buggy versions of the strxfrm() function that prevent use of this optimization. The developers are discussing whether these platforms are old enough, or obscure enough, to disregard so that most users can get enhanced performance, or whether platform-specific code might be required.

TABLESAMPLE

One feature funded by FP7 was written by Petr Jelinek; it adds the SQL-standard TABLESAMPLE clause. This clause returns a pseudo-random sample of the rows in a table. When used with large tables, TABLESAMPLE lets users get a quick "glimpse" of the data so that they can do further analysis.

For example, imagine that I have a table of 10 million user profiles in JSON. I want to take a look at a handful of rows from this table so that I can find out what kinds of keys the JSON has, but I don't want just the rows at the beginning of the table. In 9.5, I can do this:

SELECT * FROM user_profiles TABLESAMPLE SYSTEM ( 0.001 );

This will return around 0.001% of the rows in the table, or about 100 of them. In the query above, SYSTEM is the name of the chosen sampling algorithm. The SYSTEM algorithm chooses a set of pseudo-random data pages and then returns all rows on those pages; it has the advantage of running in constant time regardless of the size of the table. PostgreSQL 9.5 will also ship with the BERNOULLI sampling method, which is more rigorously random, but takes longer the larger the table is.
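The sampling clause also accepts an optional REPEATABLE modifier, which fixes the random seed so that repeated queries return the same sample. A sketch, using the same hypothetical user_profiles table as above:

```sql
-- BERNOULLI inspects every row individually, giving a statistically
-- cleaner sample at the cost of a full scan of the table.
-- REPEATABLE ( 42 ) fixes the seed so the same rows come back each run.
SELECT * FROM user_profiles
  TABLESAMPLE BERNOULLI ( 0.001 ) REPEATABLE ( 42 );
```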

Currently, however, the TABLESAMPLE patch is under some intense review and rewriting. Tom Lane criticized the quality of the patch, and it remains to be seen if it will stay in 9.5.

CUBE, ROLLUP, and GROUPING SETS

In the SQL99 update of the SQL standard, the ANSI committee added a collection of grouping and reporting clauses to SQL to support the technology known as OLAP, for "OnLine Analytical Processing". OLAP technologies, still used today, are ways to summarize and explore large data sets by grouping them in a multi-dimensional space. The three features added to SQL, ROLLUP, CUBE, and GROUPING SETS, provide different ways of summarizing data sets on several axes at once.

ROLLUP is the easiest to understand, since it simply provides subtotals and master totals for the whole set provided. For example, this query would provide counts for each city, then subtotals for each country, and finally a grand total of subscribers.

SELECT country, city, count(*) FROM lwn_subscribers
    GROUP BY ROLLUP ( country, city );

  country  |     city      | count
-----------+---------------+-------
 Australia | Brisbane      |   561
 Australia | Melbourne     |   302
 Australia | Perth         |   219
 Australia |               |  1082
 USA       | New York      |   591
 USA       | Portland      |   617
 USA       | San Francisco |   610
 USA       |               |  1818
           |               |  2900

CUBE is less immediately understandable, because it's mainly intended to feed multi-dimensional data into an external OLAP tool like Mondrian. It expresses all possible totals for all combinations of specified columns, otherwise known as a "combinatorial explosion". For example, this query would provide counts for each combination of country and subscription level, subtotals for each country and for each level, and a grand total.

SELECT country, level, count(*) FROM lwn_subscribers
    GROUP BY CUBE ( country, level );

  country  |    level     | count
-----------+--------------+-------
 Australia | leader       |   140
 Australia | professional |   490
 Australia | starving     |   394
 Australia | supporter    |    58
 Australia |              |  1082
 USA       | leader       |   301
 USA       | professional |   765
 USA       | starving     |   602
 USA       | supporter    |   150
 USA       |              |  1818
           |              |  2900
           | leader       |   441
           | professional |  1255
           | starving     |   996
           | supporter    |   208

Finally, GROUPING SETS is the "parent class" of both ROLLUP and CUBE. The clause allows you to choose manually which subtotal, total, and combination groups to display, in case you need more specific output than ROLLUP or CUBE provide.
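In fact, the SQL standard defines ROLLUP and CUBE as shorthand for particular grouping sets; the clauses in the queries above could have been written equivalently as:

```sql
-- ROLLUP ( country, city ) expands to:
GROUP BY GROUPING SETS ((country, city), (country), ())

-- CUBE ( country, level ) expands to every combination:
GROUP BY GROUPING SETS ((country, level), (country), (level), ())
```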

For example, if I wanted totals for each combination of city and level, then for each level, and then a grand total, I would write:

SELECT city, level, count(*) FROM lwn_subscribers
    GROUP BY GROUPING SETS ((city, level),(level),());

     city      |    level     | count
---------------+--------------+-------
 Brisbane      | leader       |    94
 Melbourne     | leader       |    44
 New York      | leader       |   105
 Perth         | leader       |     2
 Portland      | leader       |    94
 San Francisco | leader       |   102
               | leader       |   441
 Brisbane      | professional |   236
 Melbourne     | professional |   121
 New York      | professional |   250
 ...

The benefit of these query constructs is the ability to get totals, subtotals, and combinations in one query and one scan over the table, where in the past multiple queries against the database would have been required. The other advantage is that PostgreSQL can now be targeted by sophisticated "business intelligence" tools that previously worked only with proprietary SQL databases.

This feature was developed by Gierth, and has caused some discussion because of issues with the PostgreSQL review process that almost caused it to not get committed. It's now in and ready for users to try out.

Note: the above subscription data is fictitious and not related to LWN's actual subscription data.

BRIN indexes

Chilean committer Álvaro Herrera developed another feature funded by FP7. He wrote it to solve one of the major problems with really big tables: indexing them. PostgreSQL supports large tables easily, especially append-only tables, where the user may not even notice that the table has grown to several terabytes. However, standard B+ tree indexes grow increasingly inefficient at large numbers of leaf nodes, primarily because they no longer fit in memory.

Herrera figured out that "close" counts in indexes as much as it does with horseshoes and grenades. He devised a new index type that would index values down to an arbitrarily large "block" of data rather than to an individual data page. Instead of having a leaf node pointing to each individual row, the new index has a range of values to be found in each block. When the table is queried, the index is searched for ranges that correspond to the supplied filter conditions, and then the individual blocks are sequentially scanned. This approach is similar to how data is organized in column-store tables.

This new index type is called "BRIN" for "Block-Range INdex". These indexes are much smaller than conventional B+ trees — as little as 1% of the size of the standard indexes. This makes them much faster, since they usually fit in memory, and sometimes even in the CPU cache.

There are some drawbacks to BRIN indexes, though. They can't be used to enforce uniqueness or support keys. BRIN works much better on values that are incrementally increasing with each row inserted, like timestamps and incrementing integers. Most of all, BRIN indexes handle updates and deletes on the underlying table inefficiently, so they're best used with append-only tables. These restrictions would seem to make BRIN indexes a niche tool, except that large, continuously increasing append-only tables are a common use case. For example, any table used to hold a website log or an audit history can benefit.
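Creating one looks like creating any other index, with USING brin selecting the index type; the table and column names here are hypothetical:

```sql
-- An append-only log table whose timestamp increases with each insert,
-- the ideal case for a block-range index.
CREATE TABLE access_log (
    logged_at timestamptz NOT NULL,
    message   text
);

-- pages_per_range controls how many data pages each index entry
-- summarizes; 128 is the default.
CREATE INDEX access_log_logged_at_brin ON access_log
    USING brin (logged_at) WITH (pages_per_range = 128);
```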

Foreign Data Wrappers

In the last couple years, PostgreSQL has gained adoption as part of hybrid solutions involving several databases because of its Foreign Data Wrapper (FDW) feature. These wrappers allow users to link to external data in other databases, whether PostgreSQL or other platforms, and query them as if they were local tables. This has been called "data federation". For version 9.5, several developers have been hard at work on enhancements to the FDW capability.

One such enhancement is primarily administrative: IMPORT FOREIGN SCHEMA. In prior versions, if you wanted to link every table in a target remote database, you had to write individual CREATE FOREIGN TABLE statements for each one, which was tedious to do and annoying to maintain. Ronan Dunklau and Michael Paquier fixed this by implementing a command that creates all tables in the remote database as foreign tables so that you can link "the whole database" in one command.
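Usage follows the SQL standard; the server and schema names below are hypothetical:

```sql
-- Link every table in the remote "public" schema as foreign tables
-- in the local schema "remote_copy".
IMPORT FOREIGN SCHEMA public
    FROM SERVER remote_db INTO remote_copy;

-- LIMIT TO (or EXCEPT) restricts which tables are imported.
IMPORT FOREIGN SCHEMA public LIMIT TO (users, orders)
    FROM SERVER remote_db INTO remote_copy;
```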

The other major FDW improvement, JOIN pushdown, was introduced by KaiGai Kohei of NEC. Together with the new "custom scan" API, this allows FDWs to perform parts of queries, like index scans and joins between two tables, on the remote database before returning the result. Pushdown makes data federation much more scalable because it no longer requires sending entire tables across the wire for many common operations.

To take advantage of either of these features, though, the FDW driver for the external data source needs to support them. The postgres_fdw driver, which supports PostgreSQL-to-PostgreSQL connections, will support both, but many of the FDW drivers won't get them added for a while, if at all. Regardless, the improvements in FDWs make it more possible for PostgreSQL to be the interface for large data stores even where it is not the backend for them.

Other features

Several other features in PostgreSQL 9.5 will be of interest to users with large databases. PostgreSQL now makes more effective use of large "shared_buffers" allocations by shrinking the array of page references kept by each process. GiST indexes, used for geographic searches, now support "index-only scans" for faster queries over large append-only tables. The vacuumdb utility now has a parallel, multi-process mode for doing maintenance on multiple large tables at once. Finally, UNLOGGED tables, which are non-durable tables that are not written to the transaction log and are often used in data loading, can now be changed into regular tables, allowing users to build a data set and make it durable as the last step. This last feature was a Google Summer of Code project.
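That last conversion is done through ALTER TABLE; a minimal sketch of the load-then-harden pattern, with hypothetical table and file names:

```sql
-- Build the data set without write-ahead-log overhead...
CREATE UNLOGGED TABLE measurements (
    taken_at timestamptz,
    reading  numeric
);

-- ... bulk-load it ...
COPY measurements FROM '/tmp/measurements.csv' WITH (FORMAT csv);

-- ... then make it durable as the final step.
ALTER TABLE measurements SET LOGGED;
```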

Outside of the core project, two tools have been updated that increase the size of databases PostgreSQL can handle. CitusData released version 1.3 of its cstore_fdw project, which adds a column-store option for tables. Column stores both compress data and make large aggregation queries, like counts and totals, much faster. The company also released version 1.2 of pg_shard, a tool that "shards" a table by distributing it across a set of database nodes using a hash key. This allows table scans to be run on multiple machines in parallel.

Native parallel query, the ability to use multiple cores on the same machine to execute a large query, has been an active project for two years. Unfortunately, the technical challenges proved more difficult than the team, led by EnterpriseDB programmer Robert Haas, had expected, and no parallel query features have been committed yet. Parallel table scan is under active development for 9.6.

Overall, PostgreSQL continues to hold its own as the "good enough" big data database for many users, by both improving performance and adding new analytical features. We can expect to see larger PostgreSQL databases under 9.5 than we have before. The project also shows some possibility of expanding its ability to handle even bigger databases through federation and sharding, and via numerous forks. We'll have to wait to find out what 9.6 offers in this area.

PostgreSQL 9.5 Alpha 2 will be available on August 6th from PostgreSQL.org.