Performance since PostgreSQL 7.4 / pgbench

In the introduction post I briefly described the motivation for this whole effort, the hardware and benchmarks used, etc. So let's look at the results for the first benchmark - the well-known pgbench - and how they evolved since PostgreSQL 7.4.

If there's one chart in this blog post that you should remember, it's probably the following one - it shows throughput (transactions per second, y-axis) for various numbers of clients (x-axis), for PostgreSQL releases since 7.4.

Note: If a version is not shown, it means "same (or almost the same) as the previous version" - for example 9.3 and 9.4 give almost the same performance as 9.2 in this particular test, so only 9.2 is on the chart.

So on 7.4 we could do up to 10k transactions per second (although the scalability was poor, and with more clients the throughput quickly dropped to ~3k tps). Since then the performance gradually improved, and on 9.2 we can do more than 70k tps, and keep this throughput even for higher client counts. Not bad ;-)

Also note that up to and including PostgreSQL 9.1, we've seen a significant performance drop once we exceeded the number of CPUs (the machine has 8 cores in total, which is exactly the point where the number of transactions per second starts decreasing on the chart).

Of course, these are the results of just one test (read-only on a medium dataset), and gains in other tests may be more modest, but it nicely illustrates the amount of improvement that happened since PostgreSQL 7.4.

This post is a bit long, but don't worry - most of it is pretty images ;-)

Update: There's a hackernews thread with a discussion about the benchmark, results etc.

pgbench

But first let's talk a bit more about pgbench, because the short description in the intro post was a bit too brief. So, what is the "select-only" mode, "scale", client count and so on?

If you know these pgbench basics, you may just skip to the "Results" section, presenting the results of the benchmarks.

Mode

Pgbench has two main modes - read-write and read-only (there's also a third mode, and an option to use custom scripts, but I won't talk about those).

read-write (default)

The read-write mode executes a simple transaction, simulating a withdrawal from an account - updating the balance in the "accounts" table, selecting the current balance, updating tables representing a branch and a teller, and recording a row into a history table.

BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime)
  VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;

As you can see, all the conditions are on a PK column, always work with a single row, etc. Simple.

read-only (SELECT-only)

The read-only mode is even simpler - it simply executes only the "SELECT" query from the read-write mode, so it's pretty much this:

SELECT abalance FROM pgbench_accounts WHERE aid = :aid;

Again, access through PK column - very simple.

Scale

In short, scale determines the size of the database as a number of rows in the main "accounts" table - the supplied value gets multiplied by 100,000 and that's how many rows you get in that table. This of course determines the size on disk, as every 100,000 rows correspond to about 15MB on disk (including indexes etc.).
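As a quick sanity check, that arithmetic can be sketched in shell (the scale value 300 is just an example, not one of the benchmarked parameters):

```shell
# Rows and approximate on-disk size for a given pgbench scale factor,
# using the ~15MB per 100,000 rows estimate from the text.
scale=300
rows=$((scale * 100000))
size_mb=$((scale * 15))
echo "scale=$scale -> $rows rows, ~${size_mb}MB on disk"
```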

When choosing the scale for your benchmark, you have three basic choices, each testing something slightly different.

small

usually scale between 1-10 (15-150MB databases)

only a small fraction of RAM (assuming regular hardware)

usually exposes locking contention, problems with CPU caches and similar issues not visible with larger scales (where it gets overshadowed by other kinds of overhead - most often I/O)

medium

scales corresponding to ~50% of RAM (e.g. 200-300 on systems with 8GB RAM)

the database fits into RAM (assuming there's enough free memory for queries)

often exposes issues with CPU utilization (especially on read-only workloads) or locking

large

scales corresponding to ~200% of RAM, or more (so 1000 on systems with 8GB RAM)

the database does not fit into RAM, so both modes (read-only and read-write) hit the I/O subsystem

exposes issues with inefficient disk access (e.g. because of missing index) and I/O bottlenecks
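Putting the three categories together: for a given amount of RAM you can derive suitable scales with a bit of arithmetic. A minimal sketch, assuming an 8GB system and the ~15MB-per-scale-unit estimate from above:

```shell
# Derive "medium" and "large" pgbench scales from RAM size (8GB assumed).
ram_mb=8192
medium_scale=$(( ram_mb / 2 / 15 ))   # ~50% of RAM, each scale unit is ~15MB
large_scale=$(( ram_mb * 2 / 15 ))    # ~200% of RAM
echo "medium: pgbench -i -s $medium_scale"
echo "large:  pgbench -i -s $large_scale"
```

For 8GB this yields scales of roughly 273 and 1092, which matches the 200-300 and ~1000 values mentioned above.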

Client counts

The third important parameter I'd like to discuss is client count, determining the number of concurrent connections used to execute queries.

Let's assume you measure performance with a single client, and you get e.g. 1000 transactions per second. Now, what performance will you get with 2 clients? In an ideal world, you'd get 2000 tps, twice the performance with a single client.

For N clients you'd get N-times the performance of a single client, but it's clear it doesn't work like that, because sooner or later you'll run into some bottleneck. Either the CPUs will become fully utilized, the I/O subsystem will be unable to handle more I/O requests, you'll saturate network or memory bandwidth, or hit some other bottleneck.

As the number of clients is increased, we usually see three segments on the charts:

linear scalability at the beginning (so N clients really give nearly N-times the throughput of a single client), until saturating at least one of the resources (CPU, disk, ...)

after saturating one of the resources, the system should maintain constant throughput, i.e. adding more clients should result in a linear increase of latencies

eventually the system runs into even worse issues, either at the DB or kernel level (excessive locking, process management, context switching etc.), usually resulting in a quick performance drop (where latencies grow exponentially)
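The three segments can be sketched with a toy model - all the numbers here (8 cores, 1000 tps per client, a drop after 24 clients) are made up for illustration only:

```shell
# Idealized throughput curve: linear ramp-up, plateau, then a drop.
cores=8
per_client_tps=1000
for clients in 1 4 8 16 32; do
  if [ "$clients" -le "$cores" ]; then
    tps=$((clients * per_client_tps))   # linear segment
  elif [ "$clients" -le 24 ]; then
    tps=$((cores * per_client_tps))     # plateau after saturating the CPUs
  else
    tps=$((cores * per_client_tps / 2)) # contention-induced drop
  fi
  echo "$clients clients -> $tps tps"
done
```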

On a simple chart, it might look like this:

Of course - this is how it would look in an ideal world. In practice, the initial growth is not linear, and the throughput in the second part is not constant but drops over time - sometimes fast (bad), sometimes gradually (better). We'll see examples of this.

For the sake of clarity and brevity, I've left the latencies out of the charts in this blog post. A proper latency analysis would require a considerable amount of space (to show averages, percentiles, ...), but the results turned out to be especially dull and boring in this case. You'll have to trust me that the transaction rate (and the implied average latency) is a sufficient description of the results here.
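The "implied average latency" follows from Little's law - in a closed-loop benchmark like pgbench, each client has exactly one transaction in flight at a time, so average latency is simply clients divided by throughput. A sketch with assumed numbers:

```shell
# Little's law for a closed-loop benchmark: latency = clients / throughput.
clients=16
tps=8000                              # assumed throughput, for illustration
latency_ms=$((clients * 1000 / tps))  # average latency in milliseconds
echo "$clients clients at $tps tps -> ${latency_ms}ms average latency"
```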

Jobs count (pgbench threads)

By default, all the pgbench clients (connections) are handled by a single thread - the thread submits new queries, collects results, writes statistics etc. When the queries are long, or when the number of connections is low, this is fine. But as the queries get shorter (e.g. with read-only test on small dataset), or when the number of connections grows (on many-core systems), the management overhead becomes significant.

That's why pgbench has the --jobs N (or -j N) option, to specify the number of threads. The value has to be a divisor of the number of clients, so for example with 8 clients you may use 1, 2, 4 or 8 threads, but not 5 (because 8 % 5 = 3). There are various smart strategies to set the value, but the easiest and most straightforward way is to use the same value for both options, and that's what I used for this benchmark.
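The divisor rule is easy to check - a quick sketch for -c 8:

```shell
# -j (threads) must evenly divide -c (clients); check a few values for -c 8.
clients=8
for jobs in 1 2 4 5 8; do
  if [ $((clients % jobs)) -eq 0 ]; then
    echo "-c $clients -j $jobs: ok"
  else
    echo "-c $clients -j $jobs: invalid ($clients % $jobs = $((clients % jobs)))"
  fi
done
```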

pgbench tweaks

The pgbench tool evolves together with PostgreSQL, gaining new features along the way, which means new pgbench versions may not work with older PostgreSQL releases (e.g. releases not supporting the fillfactor storage option, which pgbench uses when initializing tables). There are two ways to overcome this - either use a pgbench version matching the PostgreSQL release, or use the new version but patch it so that it works with older releases (by skipping features not present in that PostgreSQL version).

I've used the latter approach, mostly because I wanted to use some of the features available only in recent pgbench versions (especially --aggregate-interval ).

The patch I used is available here. It changes three things:

adds a -X option to disable IF EXISTS when dropping tables (because that's not available on old releases)

adds a -Q option to entirely skip the cleanup (because the script simply recreates the database anyway)

disables fillfactor on all the tables (so the default value is used)

None of these changes should have any impact on the results, because the first two just remove new syntax (which is not necessary), and after removing the explicit fillfactor option the default gets used (which is 100 on all versions).

If you're interested in the script driving the testing, it's available here. It's a bit hackish, but it should give you an idea of how the testing was done.

Benchmarked combinations

So, given all the possible parameters (mode, scale, client count), what combinations were benchmarked?

scales: small, medium, large

modes: read-only, read-write

clients: 1, 2, 3, 4, 6, 8, 10, 12, 14, 16, 20, 24, 28, 32 (14 values)

That gives 84 combinations in total, and for each combination there were 3 runs, 30 minutes each. That's ~126 hours per version (not including initialization and warmup), so ~6 days per PostgreSQL version. There were 12 versions benchmarked (7.4, 8.0, 8.1, 8.2, 8.3, 8.4, 9.0, 9.1, 9.2, 9.3, 9.4b2, head) so the total runtime is close to 70 days.
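The runtime math, recomputed as a quick sanity check (initialization and warmup not included, per the text, which is why the actual total came closer to 70 days):

```shell
# Total benchmark runtime from the parameters listed above.
scales=3; modes=2; client_counts=14
combinations=$((scales * modes * client_counts))                # 84
runs=3; run_minutes=30
hours_per_version=$((combinations * runs * run_minutes / 60))   # 126
versions=12
days_total=$((hours_per_version * versions / 24))               # before init/warmup overhead
echo "$combinations combinations, ${hours_per_version}h/version, ~${days_total} days total"
```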

Hopefully that convinces you this was not a rushed unreliable benchmark.

postgresql.conf

There were only minimal changes to the configuration file:

# machine has 16GB of RAM in total
shared_buffers = 2GB
work_mem = 64MB
maintenance_work_mem = 512MB
checkpoint_segments = 64
checkpoint_completion_target = 0.9
effective_cache_size = 8GB

# when executed on the SSD (otherwise 4)
random_page_cost = 2

Of course, there were some tweaks (e.g. because older versions don't know checkpoint_completion_target ) but no extensive per-version tuning was performed, as that'd make the testing much more time consuming (to the point that it'd be impossible).

I did, however, run a limited number of tests to see how big an impact such tuning may have - I'll show you a chart illustrating that later (spoiler: the older versions may be much faster with a bit of tuning, but the new ones are still faster).

Results

So, let's see the results! I'll only show some of the charts - the rest of them are usually quite similar.

medium, read-only

First, the chart we've already seen:

This is a read-only workload, with a dataset that completely fits into RAM. So this is CPU-bound, with locking etc. Yet PostgreSQL 9.4 is 7-20x faster than 7.4. Not bad, I guess.

The other interesting observation is that it scales linearly with the number of cores - in this case it's actually a bit better than linear (the performance on 8 cores is more than 2x that on 4 cores), most likely thanks to caching effects. During the 9.2 development this was thoroughly tested (by Robert Haas), and the conclusion was that we scale linearly up to 64 cores (and probably beyond, but we didn't have machines with more cores).

There's also an excellent talk by Heikki Linnakangas about all the improvements done in PostgreSQL 9.2, presented at FOSDEM 2012.

This is of course dependent on workload - the improvements in locking are most visible on read-only workloads on small datasets, because those are mostly cached and shouldn't do much I/O. Once you start hitting any kind of I/O, it will probably overshadow these improvements. It's also related to improvements made in the Linux kernel, for example the lseek scalability improvements committed into kernel 3.2. And finally all this is of course dependent on the exact CPU type - on older CPUs the improvements may not be as great.

large, read-only

Let's see how the read-only workload performs on a large dataset (thus not fitting into RAM), on an SSD drive:

This is clearly I/O bound, because pgbench generates a lot of random I/O (although the SSD handles that much better than rotational drives). Because of the I/O bottleneck the difference is not as significant as with the medium dataset, but 2-4x speedup is still really nice.

There were two releases that significantly improved the performance - first 8.0, then 8.1 (and that's mostly what we get on 9.4, performance-wise).

large, read-write

So, what happens when we keep the large dataset, but do a read-write test?

Interesting. 7.4 is doing rather poorly (just ~500 transactions per second on an SSD). The 8.x releases perform much better, with roughly 2x the performance, but the throughput is mostly constant after reaching 8 clients (which is the number of CPUs in the machine).

But since 9.1, we see quite interesting behavior - the throughput keeps growing even after 8 clients (where it's mostly equal to 8.x), up to ~2500 tps with 32 clients. And the chart suggests it would probably grow even further with more clients (had they been part of the benchmark). This can happen because SSDs are quite good at handling parallel requests (thanks to their internal architecture, which uses multiple independent channels), and since 9.1 we can saturate them much better.

Note 1: The read-write results on medium dataset are very similar to this.

Note 2: If you're looking for a nice but thorough explanation of how SSDs work, I recommend this article on codecapsule.com.

large, read-write on SAS drives

If you ask how would the comparison look on traditional rotational drives, this is the answer:

For the record, these are 10k SAS drives, connected to a HP P400 controller (RAID10 on 6 drives).

It's interesting that 8.4 is slightly faster than 9.4 for higher client counts. We're still doing better than 7.4, but clearly the I/O is a significant bottleneck that's difficult to beat at the code level.

large, read-only / Xeon vs. i5-2500k

Another interesting question is how much this depends on hardware, so here is a comparison with results from the machine with i5-2500k CPU.

Wow! On PostgreSQL 7.4 the difference is "only" 2x, but on 9.4 the i5 really crushes the Xeon, giving nearly 4x the performance. Also, the i5 has only 4 cores while the Xeon has 8 (in 2 sockets), so that's ~8x the performance per socket, while the frequencies are about the same (3 GHz vs. 3.3 GHz).

Just for the record, this difference is not only thanks to the CPU - the i5 system has a better memory throughput, the SSD is not connected using a lousy PCIe controller (which limits the bandwidth), etc. But clearly, modern hardware can make a huge difference.

large, read-only / postgresql.conf tweaks

I also mentioned that each PostgreSQL version may have different "optimal" configuration values. That makes sense, because the development happens in the context of the current hardware. When 7.4 was released 10 years ago, most machines had only 4GB of RAM (not to mention that everything was still 32-bit and thus capped to 2GB), so there was no point in optimizing for large shared_buffers values, for example.

So, what happens if we use much lower settings for shared_buffers (only 128MB instead of the 2GB used in all the previous benchmarks)? This:

On 7.4, using shared_buffers=128MB improved performance by a factor of ~3x compared to shared_buffers=2GB. On 8.0 there's still some difference, but it's much smaller (~5%). The nice thing is that 9.4b1 is faster even with the 2GB.

Note: There are also other reasons to use small shared_buffers values on older versions (absence of checkpoint_completion_target and usually running them on older filesystems like ext3, with poor fsync behavior).

small, read-write

Those were the medium and large datasets, but what about the small one? The read-only results look quite similar to the medium dataset, but the read-write results seem interesting on their own:

Wow! 8.2 really improved this, boosting the performance 5x, compared to 8.1. The following releases further improved the results by ~25%, but 8.2 was clearly a huge jump forward.

small / Xeon vs. i5-2500k

And finally, let's do two more comparisons of the Xeon and i5 based systems, with read-only and read-write benchmarks on the small dataset.

The solid lines are Xeon, the dashed lines are i5. This time the Xeon actually wins over i5 (unlike with the large dataset), at least on 9.4 - having 2x the cores matters here, apparently. It's also nice to see that 9.4 fixed the performance degradation on the Xeon, with client counts higher than the number of CPU cores (on the i5 this did not happen at all).

On read-write workload, i5 wins again (despite the smaller number of cores).

Summary

Let me summarize the main points we've learned from the benchmark results: