Over the last few years, the software development community’s love affair with the popular open-source relational database has reached a bit of a fever pitch. This Hacker News thread covering a piece titled “PostgreSQL is the world’s best database”, bursting at the seams with fawning sycophants lavishing unconditional praise, is a perfect example of this phenomenon.

While much of this praise is certainly well-deserved, the lack of meaningful dissent left me a bit bothered. No software is perfect, so exactly what are PostgreSQL’s imperfections?

I’ve been hands-on with PostgreSQL in production since 2003, with deployments ranging from small (gigabytes) to very large (~petabyte). My perspective is largely from building and running systems that are at least intended to be continuously available. Needless to say, I have gained first-hand experience with PostgreSQL’s particular idiosyncrasies through some painful production issues over the years.

#1: Disastrous XID Wraparound

Read more here. Suffice it to say, this one can bite hard. There are many stories of multi-day outages caused by this issue. Go ahead, Google it and you’ll find numerous poor souls writing about the time they stepped on this landmine. Pretty much any non-trivial PostgreSQL install that isn’t staffed with a top expert will run into it eventually.

It’s likely that, at some point in the future, XIDs will transition to use 64-bit integers, but until then, we’re stuck with it. I guess at least we can be thankful that there is a process which prevents it from happening as a matter of course, unlike some airplane software.
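In the meantime, the least you can do is keep an eye on how close each database is to the cliff. A monitoring query along these lines is cheap insurance (the thresholds mentioned in the comments are the stock defaults):

```sql
-- Age, in transactions, of the oldest unfrozen XID in each database.
-- Autovacuum forces an aggressive anti-wraparound pass as this nears
-- autovacuum_freeze_max_age (200 million by default); if it ever gets
-- within about a million XIDs of 2^31, the server stops accepting
-- commands to protect itself.
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;
```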

#2: Failover Will Probably Lose Data

The run-of-the-mill streaming replication setup will almost certainly lose committed data if the active master suddenly fails. “Such is the price of asynchronous replication,” some might say, but it doesn’t have to be this way. PostgreSQL supports synchronous replication with quorum commit for fault-tolerant durability, but it has a much tighter performance envelope that complicates its application.

Waiting doesn’t utilize system resources, but transaction locks continue to be held until the transfer is confirmed. As a result, incautious use of synchronous replication will reduce performance for database applications because of increased response times and higher contention.

This bolt-on quorum replication is useful in a pinch, but I hesitate to recommend it for general-purpose use cases. It is similar to Kafka’s ISR replication with acks=all and a quorum min_isr, but with all the nuanced complexity of a transactional relational database running arbitrary queries. I’m not currently aware of a successful application of quorum commit for highly-available, high-durability replication at non-trivial scale. If you know of one, reach out!
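For reference, enabling quorum commit is only a couple of settings on the primary. The standby names below are illustrative and must match each standby’s application_name:

```ini
# postgresql.conf on the primary (sketch)
synchronous_commit = on
# Each commit waits for acknowledgement from ANY 2 of the 3 standbys:
synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
```

The simplicity of the configuration is part of the trap: the performance and availability consequences described above don’t show up until the system is under load or a standby goes away.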

In terms of relational databases, Galera Cluster’s group replication is also imperfect, but closer to the ideal. They even encourage geo-distributed replication, which would very likely be disastrous for a PostgreSQL replication setup using quorum commit.

#3: Inefficient Replication That Spreads Corruption

Streaming Replication is by far the most utilized replication mechanism in production deployments. It is a form of physical replication, meaning that it replicates changes in the on-disk binary data itself.

Every time an on-disk database page (8KB) needs to be modified by a write operation, even just a single byte, a copy of the entire page, edited with the requested changes, is written to the write-ahead log (WAL). Physical streaming replication leverages this existing WAL infrastructure as a log of changes it streams to replicas.

Update: Some folks have pointed out that PostgreSQL only needs to do this full-page write once per WAL checkpoint. This is true, but in most real-world systems most writes will end up on a unique page between checkpoints, following a power law distribution. More importantly though: when anticipating system behavior, the right approach is assuming the more expensive case, particularly if it depends on hard-to-predict and highly dynamic behavior of the application.
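To get a feel for this amplification in your own system, you can bracket a workload with WAL positions. A sketch using current function names (the LSN values shown are placeholders, not real output):

```sql
-- Record the WAL insert position before and after a workload, then
-- diff them to see how much WAL it generated. Even a single-byte
-- UPDATE contributes a full 8KB page image if it is the first write
-- to that page since the last checkpoint.
SELECT pg_current_wal_lsn();   -- note the returned LSN, e.g. 0/3000140

-- ... run the workload ...

-- Substitute the two observed LSNs:
SELECT pg_size_pretty(pg_wal_lsn_diff('0/3A80000', '0/3000140'));
```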

With physical replication, a large index build, for instance, creates a huge flood of WAL entries which can easily bottleneck the replication stream. The read-modify-replicate process at page-granularity can lead to hardware-induced data corruption on the master much more easily propagating to replicas, which I’ve personally witnessed several times in production.

This is in contrast to logical replication, which only replicates logical data changes. At least theoretically, a large index build would only result in a single command replicated across the network. While PostgreSQL has supported logical replication for quite some time, most deployments use physical streaming replication because it is more robust, more broadly supported, and much easier to use.
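The logical alternative has been built in since PostgreSQL 10. The names and connection details below are illustrative:

```sql
-- On the publisher:
CREATE PUBLICATION app_pub FOR TABLE likes;

-- On the subscriber (the table must already exist with a compatible
-- schema; DDL changes are not replicated):
CREATE SUBSCRIPTION app_sub
    CONNECTION 'host=primary.example.com dbname=app user=replicator'
    PUBLICATION app_pub;
```

The caveat in that comment is a big part of why physical replication remains the default choice: schema changes, sequences, and large objects all need separate handling.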

#4: MVCC Garbage Frequently Painful

Like most mainstream databases, PostgreSQL uses multi-version concurrency control (MVCC) to implement concurrent transactions. However, its particular implementation often introduces operational pain around garbage row versions and their cleanup (VACUUM). Generally speaking, UPDATE operations create new copies (or “row versions”) of any modified rows, leaving the old versions on disk until they can be cleaned up.

While this situation has steadily improved over the years, it’s a complex system that is a bit of a black box for anyone approaching the problem for the first time. For instance, knowing about Heap-Only Tuples (HOT) and when it kicks in can be make-or-break for heavy update-in-place workloads like maintaining a consistent counter column in a row. The default autovacuum setup does work most of the time, but when it doesn’t, good lord.
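Two things worth reaching for before an incident rather than during one. The table name in the ALTER is illustrative:

```sql
-- Find tables accumulating the most dead row versions:
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

-- Leaving slack space in each page makes HOT updates more likely,
-- since the replacement row version has to fit on the same page:
ALTER TABLE counters SET (fillfactor = 70);
```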

In contrast, MySQL and Oracle use redo and undo logs. They don’t need a similar background garbage collection process. The trade-off they make is mostly additional latency for transactional commit and rollback operations.

It could be that at some point in the distant future, zheap saves us all.

#5: Process-Per-Connection = Pain at Scale

PostgreSQL forks a process for every connection, whereas most other databases use a more efficient connection concurrency model. This makes for a difficult tuning problem: there is a relatively low threshold at which adding more connections degrades performance (roughly 2x the core count) and, eventually, a higher threshold (hard to estimate, highly workload-dependent) beyond which performance plummets.

The standard recipe of using a connection pooler certainly kicks the can down the road, but introduces significant additional architectural complexity. On a particularly large deployment, I eventually had to layer in a second pgbouncer tier: one tier ran on the application servers and another on the database servers. Altogether it aggregated connections for around one million client processes. Tuning it was 40% dark art, 40% brute force, and 20% pure luck.
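For the curious, a single-tier version of that setup is only a handful of lines of pgbouncer.ini. Every value here is illustrative, not a recommendation:

```ini
[databases]
app = host=10.0.0.5 port=5432 dbname=app

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction    ; share server connections between transactions
max_client_conn = 10000    ; client connections pgbouncer will accept
default_pool_size = 32     ; real PostgreSQL connections per db/user pair
```

The dark art is in the interaction between pool_mode, per-pool sizes, and application behavior: transaction pooling breaks session-level features like prepared statements and advisory locks, which is usually where the pain starts.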

Process scalability has been getting incrementally better every major release, but ultimately there is a somewhat hard limit to the performance of this architecture compared to something like thread-per-connection, which is used in MySQL.

For more technical depth, see https://brandur.org/postgres-connections.

#6: Primary Key Index is a Space Hog

Tables in PostgreSQL have an index for the primary key and separate row storage called the heap. Other databases integrate these together or support “index-organized tables”. In this arrangement, the primary key lookup process leads directly to the row data without a secondary fetch to get the full row and the requisite additional CPU and I/O utilization.

The CLUSTER command in PostgreSQL reorganizes a table according to an index to improve performance, but it doesn’t really work for most real-world OLTP cases. It rewrites the entire table under an exclusive lock, blocking any reads or writes. PostgreSQL doesn’t maintain the clustered layout for new data, so the operation must be run periodically. It is really only useful if you can take your database offline for long periods of time on a regular basis.

But more critically, index-organized tables save space as the index doesn’t require a separate copy of the row data. For tables with small rows that are mostly covered by the primary key, such as join tables, this can easily cut the table’s storage footprint in half.

Consider the following table which stores social “likes” for arbitrary objects:

```sql
CREATE TABLE likes (
    object_type INTEGER NOT NULL,
    object_id BIGINT NOT NULL,
    user_id BIGINT NOT NULL,
    created_at TIMESTAMP WITH TIME ZONE NOT NULL,
    PRIMARY KEY(object_type, object_id, user_id)
);
```

PostgreSQL will maintain an index for the primary key which is separate from the base table storage. This index will contain a full copy of the object_type, object_id, and user_id columns for every row. 20 out of 28 bytes in each row (~70%) will be duplicated. If PostgreSQL supported index-organized tables, it wouldn’t consume all that additional space.
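You can watch the duplication directly, assuming the default name PostgreSQL generates for the primary key index:

```sql
-- Heap vs. primary key index footprint for the table above:
SELECT pg_size_pretty(pg_relation_size('likes'))      AS heap_size,
       pg_size_pretty(pg_relation_size('likes_pkey')) AS pkey_size;
```

On a populated table of this shape, expect the two numbers to be in the same ballpark, which is exactly the waste an index-organized layout would avoid.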

#7: Major Version Upgrades Can Require Downtime

Some major version upgrades require hours of downtime to convert the data for a large database. Using the typical streaming replication mechanism, it is not possible to do this gracefully by upgrading a replica and failing over: the on-disk binary format is incompatible across major versions, and since streaming replication ships that format more or less verbatim, replication between different major versions is effectively impossible.
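For context, the standard in-place path is pg_upgrade. With --link it hard-links data files instead of copying them, which shrinks the window considerably but still requires the cluster to be down. Paths and versions here are illustrative:

```shell
pg_upgrade \
    --old-datadir=/var/lib/postgresql/12/main \
    --new-datadir=/var/lib/postgresql/13/main \
    --old-bindir=/usr/lib/postgresql/12/bin \
    --new-bindir=/usr/lib/postgresql/13/bin \
    --link
```

Note that once the new cluster has been started after a --link upgrade, the old cluster is unusable, so there is no cheap rollback.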

Hopefully logical replication will eventually completely replace streaming replication, which would enable an online, rolling upgrade strategy. When I worked on a large-scale horizontally-scaled deployment, we made a significant engineering investment in custom infrastructure to do these upgrades without downtime using an additional trigger-based replication system that was also used for shard migration.

#8: Somewhat Cumbersome Replication Setup

To be fair, MySQL’s out-of-the-box replication is much more cumbersome. But compared to some NoSQL stores like MongoDB and Redis, or cluster-oriented replication systems like MySQL Group Replication and Galera Cluster, setting up replication in PostgreSQL leaves a lot to be desired from an ease-of-use and sharp-edge-avoidance perspective. While logical replication theoretically provides far more flexibility for third-party solutions to paper over these gaps, so far there are some pretty big caveats to using it in place of streaming replication.
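The happy path is genuinely short; the sharp edges are everything around it (creating the replication role, pg_hba.conf entries, slot management, promotion tooling). Hostname and paths are illustrative, and PostgreSQL 12+ is assumed:

```shell
# Clone the primary and write standby configuration (-R creates
# standby.signal and fills in primary_conninfo):
pg_basebackup \
    -h primary.example.com -U replicator \
    -D /var/lib/postgresql/13/main \
    -X stream -R -P
```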

#9: Ridiculous No-Planner-Hints Dogma

Planner hints allow queries to direct the query planner to use strategies it wouldn’t otherwise use on its own. In what seems to be a form of the sufficiently smart compiler argument, the PostgreSQL development team has held the line for years in refusing to support query planner hints.

I do understand their reasoning, which largely is about preventing users from attacking problems using query hints that should be fixed by writing proper queries. However, this philosophy seems brutally paternalistic when you’re watching a production database spiral into a full meltdown under a sudden and unexpected query plan shift.

In many of these situations, a hint to the planner can mitigate the issue in minutes, buying the engineering team the hours or days they need to make a proper fix to the query. While there are some indirect workarounds that involve disabling certain query planner strategies, they are blunt, session-wide instruments that carry real risks of their own.
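The levers that do exist look like this; note that they steer the planner for everything in scope, not just the one misbehaving query:

```sql
-- Session-wide: discourages every nested loop join, not just the bad one.
SET enable_nestloop = off;

-- Safer: scope the override to a single transaction.
BEGIN;
SET LOCAL enable_nestloop = off;
-- ... run the misbehaving query ...
COMMIT;
```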

That ivory tower sure must be nice.

#10: No Block Compression

InnoDB’s Page Compression in MySQL commonly cuts storage footprint in half and is pretty much “free” from a performance perspective. PostgreSQL will automatically compress large values (via TOAST), but this isn’t useful for the most common ways data is stored in relational databases. For most RDBMS use cases, a row is typically a few hundred bytes or less, which means compression can only really be effective when applied across multiple rows, or in blocks.

Block compression is indeed quite difficult to implement for the kind of data structures at the heart of PostgreSQL, but the “hole punch” strategy employed by MySQL’s InnoDB storage engine seems to work quite well in practice, despite some drawbacks.

2020–04–07 Update: Mark Callaghan, of “MySQL at Facebook” fame, questioned my statement here that hole-punching compression works “quite well in practice.” It turns out that the largest MySQL installation in the world never used hole-punching compression, as I previously thought. They did, however, successfully use a slightly modified version of the older generation of InnoDB compression before migrating to MyRocks a few years ago. While hole-punching compression does seem to work for some folks, there are some caveats that make it less of a home run. If you’re running Percona’s version of MySQL, MyRocks is a better bet. If not, the classic InnoDB table compression seems the safer choice for very read-heavy workloads on flash storage. Mark didn’t point to any specific instances of major production issues, but does point out that he “doubts filesystems have been designed for hole punch per page and I would be afraid of obscure failures.”

The only general-purpose block compression setup widely used in the PostgreSQL world leverages ZFS, which does seem to work quite well for people. ZFS is a production-grade reality in Linux these days, but definitely brings with it some administrative overhead that doesn’t exist for the more “out-of-the-box” filesystems like XFS or ext4.
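A minimal sketch of that setup, assuming an existing pool named tank (a ZFS system is obviously required; matching recordsize to the 8KB PostgreSQL page is a common starting point, though opinions on the ideal value vary):

```shell
zfs create -o compression=lz4 -o recordsize=8k tank/pgdata
zfs get compressratio tank/pgdata    # observed compression ratio
```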

All That Said…

You should probably still use PostgreSQL and not something else for storing data that you’d ideally like to, you know, keep around for some time. In general I’d recommend starting with PostgreSQL and then trying to figure out why it won’t work for your use case.

PostgreSQL is very mature, well-engineered, richly featured, generally free of sharp edges, and quite performant for the vast majority of use cases. It is also unencumbered by a dominant corporate sponsor, includes fantastic documentation, and has a professional, inclusive community.

The good news is that the pain caused by many of the issues brought up in this post can be reduced or eliminated by using a managed database service like Heroku PostgreSQL, Compose PostgreSQL, Amazon RDS for PostgreSQL, or Google Cloud SQL for PostgreSQL. If you can use one of these services, for the love of all that is holy, please do!

I’m proud to say I’ve built software on top of PostgreSQL for almost two decades and remain a strong advocate, despite its flaws. Given the progress I’ve witnessed over the years by its incredible development team, I can say that most, if not all of these issues will be resolved in due time.