This blog is about my work on the Postgres open source database, and is published on Planet PostgreSQL. PgLife allows monitoring of all Postgres community activity.

Tuesday, December 12, 2017

It has been a year since my last blog post about sharding. There are many sharding improvements in Postgres 10, and this PGConf.Asia talk from ntt staff covers the advances. The slides go into great detail about what has been accomplished, and what remains. This whiteboard image from the PGConf.Asia Unconference gives a single view of the necessary pieces.

What I wrote about a year ago felt like a far-distant goal — now it feels like we are almost there. Yes, it will take years to fully complete this feature, like all our complex features, but it feels like we are very near to proof-of-concept and production deployments, at least for certain workloads. Postgres pulls off these amazing feats regularly, and you would think I would get used to it and not be surprised, but I still am.

Friday, November 24, 2017

I have been with the project long enough to remember how the project got the name "PostgreSQL". In 1996, we inherited the name "Postgres95" from the Berkeley team's remaining member Jolly Chen. Obviously the Postgres95 name wasn't going to last long-term, so a new name had to be chosen. We could revert to the original Berkeley name "Postgres", or go with a name that more closely identified us as an sql database and call it "PostgreSQL". To add complexity to the discussion, Berkeley Postgres (before Postgres95) used the quel query language, so there were valid arguments that we needed to advertise that the database now used sql. After heated debate, "PostgreSQL" was chosen, and "Postgres" was later accepted as a valid alias.

I was always in the "Postgres" camp, but over time I have learned to appreciate the arguments of those who prefer "PostgreSQL". The two strong arguments for "PostgreSQL" are:

Visually better than "Postgres"

Identifies it to new people as an sql database

The strong argument for "Postgres" is that "Postgres" is simpler to say, while "PostgreSQL" is complex and has several verbal options, e.g., "Postgres-Q-L," "Postgres Sequel," "Postgre Sequel."

What has really cemented my appreciation for both names is that the companies I have worked for have cycled through the two names because they saw value in each one. What that tells me is that both names have unique value and that we have to do our best to leverage the strengths of both.

Also, from the further obscure closet, here is the first email suggesting that Postgres use an elephant as its mascot:

but if you want an animal-based logo, how about some sort of elephant? After all, as the Agatha Christie title read, elephants can remember …

Wednesday, November 22, 2017

Postgres 9.4 added the server variable huge_pages. The Postgres documentation does a good job explaining huge pages, "The use of huge pages results in smaller page tables and less CPU time spent on memory management, increasing performance."

Specifically, all modern operating systems use virtual memory to map process-visible memory to physical ram. Typically, the default granularity of virtual pages is four kilobytes. For a process using one megabyte of virtual memory, that is 256 page table entries mapping virtual addresses to physical addresses. For a process using one gigabyte of virtual memory, that is 256k virtual page table entries! While modern cpus have translation lookaside buffers (tlb) to speed virtual/physical memory mapping, there is still a lot of overhead for processes that use large memory spaces. Huge pages allow for less granular virtual pages, meaning fewer page table entries and fewer lookups. The x86-64 cpu architecture supports 2MB and 1GB-sized huge pages.

Unfortunately Postgres only supports huge pages on Linux, and to make use of huge pages, you must enable huge pages in the operating system. The Postgres documentation goes into detail on how to do this. You can see how much memory is allocated for huge pages on Linux by looking in /proc/meminfo:

AnonHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

The last line indicates that huge pages are 2MB in size. The first line indicates the anonymous or transparent huge pages in use in your kernel. Transparent huge pages can cause significant overhead when a background kernel thread tries to rearrange memory to coalesce it into 2MB chunks. It is recommended that transparent huge pages be disabled when using Postgres, and only explicit huge pages be used. Fortunately, most modern Linux kernels have transparent huge pages disabled by default.
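To tie this together, here is a sketch of enabling explicit huge pages for Postgres on Linux; the page count of 2048 is illustrative and should be sized to cover shared_buffers plus some slack:

```
# /etc/sysctl.conf: reserve 2048 huge pages (2048 x 2MB = 4GB);
# the count is illustrative, so size it for your shared_buffers setting
vm.nr_hugepages = 2048

# postgresql.conf: 'try' (the default) falls back to normal pages if the
# reservation fails; 'on' refuses to start without huge pages
huge_pages = try
```

After applying the sysctl change and restarting Postgres, the HugePages_Rsvd and HugePages_Free lines in /proc/meminfo should reflect the reservation.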

Tuesday, November 21, 2017

The Postgres community is often asked to provide automatic tracking of the creation and last modification times for objects, e.g., tables, functions. File systems track file last modification times, and some record creation time, so the logic goes: why can't Postgres do this? (Unix's inode modification time does not represent the creation time.)

For object creation-time tracking, we have to consider when creation time should be set. It is obvious that the creation time should be set when the object is first created, but what about:

restore of pg_dump

restoring a base backup for pitr or streaming replication

pg_upgrade in copy mode

For object modification, things become more complicated. There are several layers of modification:

object schema changes via alter

storage rewrites via cluster, vacuum full

user data row changes, e.g., via insert, update, delete

internal changes, e.g., hint bits, vacuum freeze

Tracking all of these times is reasonable. However, there is significant code and performance overhead in doing it, and a given user is probably only interested in a few of these. A detailed email by Tom Lane also explains that even if we tracked all of this, it only records the change time, not what was changed.

A more holistic solution is auditing, which would contain change information as well as tracking of multiple change times. Effort is probably better spent adding auditing to Postgres than adding fine-grained tracking of creation and modification times. Users who want to track such times are best served writing custom event triggers.
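As a sketch of the event-trigger approach (covering only ddl changes; the table and function names here are illustrative):

```sql
-- Record each DDL command's target and timestamp in a history table.
CREATE TABLE ddl_history (
    object_identity TEXT,
    command_tag     TEXT,
    modified        TIMESTAMP WITH TIME ZONE DEFAULT current_timestamp
);

CREATE OR REPLACE FUNCTION log_ddl() RETURNS event_trigger AS $$
DECLARE
    r RECORD;
BEGIN
    -- pg_event_trigger_ddl_commands() is only callable from a
    -- ddl_command_end event trigger
    FOR r IN SELECT * FROM pg_event_trigger_ddl_commands()
    LOOP
        INSERT INTO ddl_history (object_identity, command_tag)
        VALUES (r.object_identity, r.command_tag);
    END LOOP;
END;
$$ LANGUAGE plpgsql;

CREATE EVENT TRIGGER track_ddl ON ddl_command_end
    EXECUTE PROCEDURE log_ddl();
```

The most recent ddl_history row for an object approximates its last schema-modification time; data changes and internal changes are, as discussed above, better handled by auditing.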

Monday, November 13, 2017

I previously mentioned my visit to Hangzhou, China. A video interview from that trip is now available, as well as my presentation in English and Chinese (starts at 153 minutes).

The most interesting part of my visit to Shenzhen, China was an evening meeting with 30 Postgres community members discussing how to increase Postgres adoption in China.

Monday, November 6, 2017

An age-old question is whether it is better to put data in a database or a file system. Of course, the answer is "it depends," but let's look at what it depends on.

First, Postgres stores its data in the file system, so the file system must be good for something. Postgres also stores some of its configuration files in the file system, e.g., pg_hba.conf, so it isn't just an issue that once you have a database, everything is better in a database.

It usually comes down to evaluating database positives vs. negatives — first the database storage positives:

Do you need to view the data in multiple ways?

Is synchronizing database data and file system data going to be difficult?

Do you need multi-object commit synchronization?

Do applications need a single, guaranteed-consistent view of their data?

Is a file system API unreasonable or inaccessible to clients?

and these database storage negatives:

Are the objects very large, meaning that using bytea or large objects is too cumbersome?

Is the data so small that the per-row 28-byte overhead is too large?

Is direct file system access useful?

It is often a question of object size. At the multi-megabyte size, having multiple copies of the data in the database while a query is executing is not a problem, and the auto-synchronization of data in the database is a great benefit. At the hundreds of megabytes and larger, the overhead of moving around multiple copies of the object in the database can become a performance problem. Of course, these problems only happen if you access the large data object in a query — if you don't query the bytea column or access the large object in the query, there is no overhead except storage. Updates of rows containing large objects also don't suffer a performance penalty if the update doesn't modify the large object. (The same is true for any long value, e.g., character strings, json.)

Also, if the object is not a binary blob to the database but has a structure the database can understand, then storing it in the database has other advantages. For example, doing full text search or json lookups on the data can be very beneficial.

Some of the same questions come up when deciding whether some of your data would be best stored in a different database technology and using foreign data wrappers to unify data access. Again, the answer is still "it depends," but hopefully this helps you judge the "depends" better.

Friday, November 3, 2017

I have attended 323 Postgres events in my career. While I have enjoyed almost all of them, many had different focuses, so I thought I would share my experiences. First, there are a variety of conference types:

Vendor conferences: often in big cities, which focus on company-produced products

Business conferences: also often in big cities, which focus on business challenges and discussions, often with high attendance prices

Community conferences: led by people who care about open-source software and focus on software knowledge transfer

Hobbyist conferences: often in smaller cities, which focus on interpersonal relationship building with technology as a catalyst, often free

It would be nice if I could say which conference types are good or bad, but that isn't possible. Each conference targets an audience whose needs it seeks to fulfill. Let's look at the needs that each fulfills:

Vendor conferences: If you are new to a technology and need people to help you navigate purchase options, these conferences are for you.

Business conferences: If you are frequently challenged to make business decisions, but feel you have no one to share options with or brainstorm, this type of conference can give you a framework to help you make your next complex business decision.

Community conferences: If you spend significant time solving technological problems, you can gain great insight and new approaches by attending this type of conference.

Hobbyist conferences: If you are looking for emotional connections to people who share similar interests, this type of conference can be personally rewarding.

Ideally everyone would go to conferences which match their interests, but what happens when they don't match? Here are some examples:

Vendor conferences: "Wow, this is boring. The booth staff don't even know about the technology they are selling. When will this be over?"

Business conferences: "People are very passionate about the problems they are trying to solve. I am glad I don't have these problems — they seem unsolvable."

Community conferences: "These people really care about the minutia of the software. When are they going to get a life?"

Hobbyist conferences: "Does this end with everyone sitting in a circle and roasting marshmallows over a cpu fan?"

Ninety-five percent of Postgres conferences are community conferences. They have some of the other aspects, but that is not the focus. Open source vendor conferences, e.g., LinuxWorld, used to be popular but are almost extinct. They were really focused on vendors, and when attendees realized this and vendors didn't get the desired sales opportunities, demand collapsed. Postgres has few business-focused conferences — Postgres Vision is an example. Hobbyist conferences still exist, and probably always will, though they are, by necessity, small.

For me, conferences allow me to hear from and talk to people knowledgeable in a field I want to study. These interactions convey information I can't get from reading books and articles. The best interactions are a fire hose of information that I can absorb and probe. Such interactions give me information it would take me weeks or months to learn.

This Slashdot discussion covers some of the value in attending conferences. Hopefully this blog post and that discussion will help you navigate the many conferences offered and help you get the best value from them.

Wednesday, November 1, 2017

Postgres is advertised as "reading never blocks writing and writing never blocks reading." While this is true for selects and dml (insert, update, delete), it isn't true for ddl.

Ddl, like alter table, can block reads and writes. Postgres has continued to reduce locking requirements for alter and other ddl commands, and will continue to improve things. However, ultimately there is some ddl that will always require, at least briefly, the blocking of reads and writes.

Tuesday, October 17, 2017

I had the pleasure of speaking at Alibaba's Computing Conference last week in Hangzhou, China. I gave two presentations. The first covered Postgres 10 features (images, video). The second was a generic talk about the future of open source databases (images, video in English and Chinese (starts at 153 minutes)).

This week I am attending a community Postgres conference in Shenzhen, China. Postgres users here have a strong desire to increase Postgres adoption in China in the coming years.

Wednesday, October 4, 2017

You might be aware that Postgres uses invisible columns to track concurrent row access and modifications. My mvcc talk covers much of this.

Someone at a conference mentioned they were using the invisible xmin column to perform updates. I was initially skeptical of this approach, but once he explained the purpose, it made sense, e.g.:

CREATE TABLE mvcc_test (x INTEGER PRIMARY KEY, name TEXT);

INSERT INTO mvcc_test VALUES (1, 'Sal');

SELECT * FROM mvcc_test;
 x | name
---+------
 1 | Sal

SELECT xmin, * FROM mvcc_test;
 xmin | x | name
------+---+------
  715 | 1 | Sal

UPDATE mvcc_test SET name = 'Sally' WHERE x = 1 AND xmin = 715;

SELECT xmin, * FROM mvcc_test;
 xmin | x | name
------+---+-------
  716 | 1 | Sally

Why would you do this? Normally this would be done using select … for update. However, what if you want to do the update without locking the row between select and update, and are willing to discard the update if the row has already been updated by another session? This is a case where using xmin in updates makes sense.
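A sketch of wrapping this pattern in PL/pgSQL (the function name is illustrative); the caller passes the xmin it read earlier, and a false return means another session got there first:

```sql
CREATE OR REPLACE FUNCTION update_if_unchanged(
    p_x INTEGER, p_name TEXT, p_xmin XID) RETURNS BOOLEAN AS $$
DECLARE
    rows_updated INTEGER;
BEGIN
    -- matches zero rows if the row's xmin no longer equals the one we read
    UPDATE mvcc_test
    SET name = p_name
    WHERE x = p_x AND xmin = p_xmin;
    GET DIAGNOSTICS rows_updated = ROW_COUNT;
    RETURN rows_updated = 1;
END;
$$ LANGUAGE plpgsql;
```

On a false return, the application can re-read the row and retry, or simply discard its update.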

Monday, October 2, 2017

During research for my Postgres Window Magic talk, I studied the unusual behavior of percent_rank and cume_dist (cumulative distribution). The Postgres documentation wasn't helpful. I finally came up with this paragraph to add to the Postgres 10 docs:

cume_dist computes the fraction of partition rows that are less than or equal to the current row and its peers, while percent_rank computes the fraction of partition rows that are less than the current row, assuming the current row does not exist in the partition.

There are some key phrases in there. First, cume_dist computes the percentage of rows that are "less than or equal," while percent_rank computes a similar percentage that are "less than" the current row, except it also assumes the current row isn't in the partition.

That last issue of not counting the current row struck me as odd. I talked to some statisticians about this and did some web searches. The best I can understand is that this special percent_rank behavior is designed to handle cases where there is a single maximum value in the partition. For example:

SELECT x,
       (PERCENT_RANK() OVER w)::numeric(10, 2),
       (CUME_DIST() OVER w)::numeric(10, 2)
FROM generate_series(1, 5) AS f(x)
WINDOW w AS (ORDER BY x);

 x | percent_rank | cume_dist
---+--------------+-----------
 1 |         0.00 |      0.20
 2 |         0.25 |      0.40
 3 |         0.50 |      0.60
 4 |         0.75 |      0.80
 5 |         1.00 |      1.00

In this example, what are the ideal ratios for the last line? Cume_dist is easy since it is "less than or equal" to the current row, so that is clearly 1.00. What about percent_rank, which is "less than?" Without the "assume the current row isn't in the partition" exclusion, it would show 0.80. That would be a silly answer since all the rows are less than the last row, except the last row itself. I think this is why that exclusion was added.

This issue becomes even more complex when there are many duplicates. For example, imagine this data set:

WITH cte (x) AS (
    SELECT 0
    UNION ALL
    SELECT 1 FROM generate_series(1, 5)
    UNION ALL
    SELECT 2
)
SELECT x,
       (PERCENT_RANK() OVER w)::numeric(10, 2),
       (CUME_DIST() OVER w)::numeric(10, 2)
FROM cte
WINDOW w AS (ORDER BY x);

 x | percent_rank | cume_dist
---+--------------+-----------
 0 |         0.00 |      0.14
 1 |         0.17 |      0.86
 1 |         0.17 |      0.86
 1 |         0.17 |      0.86
 1 |         0.17 |      0.86
 1 |         0.17 |      0.86
 2 |         1.00 |      1.00

Frankly, neither percent_rank nor cume_dist look like they produce reasonable results. If I got a score of 1, is saying that a ratio of 0.17 rows are less than mine, or that 0.86 are equal or less than mine, useful? Imagine this was an sat test and many people got 650. Wouldn't it be reasonable to count half of the people with matching scores as less than or equal to mine, which would compute 0.50, rather than including all or none of the matching scores?

Right now, there is no way in Postgres to do that, but I think it would be a nice thing to add. I assume a server-side function could be written to do this. Computing the average of percent_rank and cume_dist would yield a similar result.
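For example, a sketch of the averaging idea against the data set above:

```sql
WITH cte (x) AS (
    SELECT 0
    UNION ALL
    SELECT 1 FROM generate_series(1, 5)
    UNION ALL
    SELECT 2
)
SELECT x,
       ((PERCENT_RANK() OVER w + CUME_DIST() OVER w) / 2)::numeric(10, 2)
           AS mid_rank
FROM cte
WINDOW w AS (ORDER BY x);
-- the x = 1 rows land near 0.51, roughly "half of the ties counted,"
-- between the 0.17 and 0.86 reported above
```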

Wednesday, September 27, 2017

In a previous blog entry I suggested storing the original time zone offset in a separate column if clients need to know the stored time in the original time zone. There is some more complexity to this issue that I would like to cover.

First, when I suggested using select extract(timezone from current_timestamp), I assumed the user was also storing the current_timestamp value in the database. If they were storing a past or future timestamp in the database, they would need to use that value in extract, instead of current_timestamp.

Second, as was pointed out by Miha Vrhovnik in a blog comment, things get more complicated if a future time is stored in the database and the future time zone rules change after the data is stored. You might think this concern is only theoretical, but Russia made such changes in 2014 and 2016.

To get the proper behavior, you have to ask yourself, if you are storing a future timestamp, do you want to retain the same wall-clock time? If you were recording a future doctor's appointment, odds are you would want to adjust the stored value to reflect the same time of day (wall-clock time). If you were recording a future astronomical event, you would want to keep the same instant in time, even if the visible time changes. The default timestamp with time zone behavior is to retain the same instant in time, i.e., astronomical behavior.

To retain the same future wall-clock time after a time zone rule change, you would need to store the timestamp using without time zone and store the time zone name in a separate column. You would then need to combine the timestamp without time zone and the time zone name in each query to compute the instant in time based on the current time zone rules. Any indexes that did such computations would also need to be reindexed when time zone rules change.
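A sketch of that schema and query (the table and column names are illustrative):

```sql
CREATE TABLE appointment (
    id         SERIAL PRIMARY KEY,
    local_time TIMESTAMP WITHOUT TIME ZONE,  -- future wall-clock time
    tz_name    TEXT                          -- e.g., 'Europe/Moscow'
);

-- Compute the instant in time on demand, using the current time zone
-- rules; a future rule change shifts the instant, not the wall-clock time.
SELECT local_time AT TIME ZONE tz_name AS instant
FROM appointment;
```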

This method allows the data to adjust to future time zone rule changes by computing the time zone offset on demand, rather than being locked in the time zone rules which were current at the time the data was entered, e.g.:

-- controls the OUTPUT time zone of AT TIME ZONE when passed a
-- WITHOUT TIME ZONE value
SET TIME ZONE 'Europe/Moscow';

-- AT TIME ZONE specifies the time zone for the literal value
SELECT TIMESTAMP WITHOUT TIME ZONE '2012-03-08 09:00:00' AT TIME ZONE 'Europe/Moscow';
        timezone
------------------------
 2012-03-08 09:00:00+04

SELECT TIMESTAMP WITHOUT TIME ZONE '2022-03-08 09:00:00' AT TIME ZONE 'Europe/Moscow';
        timezone
------------------------
 2022-03-08 09:00:00+03

Basically, if the future time zone rules change, the output time of day would be the same, but the instant in time compared to utc would change, e.g., the +03. Of course, if all events are in the same time zone, and you don't want "astronomical behavior," then there is no need for time zone adjustments. Miha Vrhovnik is right that using time zones can be more trouble than they are worth, especially if all entries are in the same time zone and future time zone changes are a possibility.

Monday, September 25, 2017

I recently wrote about PL/pgSQL's good alignment with the scripting needs of the sql language. There have been several email threads in the past few months exploring this in much greater detail so I wanted to share them.

The first email thread defines "impedance mismatch" as a way of measuring how closely a language's features and syntax fit the requirements of server-side functions. It cautions against always choosing familiar languages because sometimes using a language with a lower "impedance mismatch", even if its syntax is unfamiliar, is better in the long run. It also has a long discussion about when server-side logic is wise.

The second email thread talks about the mechanics of using server-side languages, and mentions PL/Perl and PL/v8 as good for compute-heavy server-side functions.

The final thread is a wish-list of how to improve PL/pgSQL. While there are many requests, they center on cases where PL/pgSQL is overly-verbose or the defaults are non-optimal. (I was actually happy to see that the scope of requested improvements was so limited.)

I will close with a particularly-glowing post about the community's accomplishments. As someone who sees complaints and bug reports all day, it is good to remember how much goes well in our community and development processes.

Wednesday, September 20, 2017

I last blogged about Postgres's handling of case sensitivity in 2012. A recent email thread requested that case sensitivity be configurable in Postgres.

Certainly, allowing case sensitivity to be configurable would help those porting from other databases. However, Postgres tools and libraries operate inside other applications, so if there was a case sensitivity setting, they couldn't change it. They would need to support the case sensitivity set by the client, meaning that every identifier would have to be double-quoted. It was concluded that this seems to be a bigger negative than the portability enhancement is a positive.

Monday, September 18, 2017

The Postgres hackers list is a steady-stream of great ideas and discussion, but occasionally something comes along that really makes you sit back and think, "Wow, where did that come from?" Such was a February 2017 email from Konstantin Knizhnik presenting a proof-of-concept vectorization optimization for the executor.

In May Andres Freund presented a plan for speeding up the executor using Just In Time Compilation (jit) and llvm. This work is pending for Postgres 11. (In fact, it was almost committed to Postgres 10 on April Fool's Day.)

Konstantin's work adds vectorization to the executor, which can be revolutionary. Once Andres's work is in, we can research how to make the executor even faster using vectorization. This would open up Postgres to an entirely new class of big-data applications.

Friday, September 15, 2017

When setting up a database schema, indexing is always a consideration. While Postgres supports traditional btree indexes for most data types, btree indexes can be quite heavy, often requiring a significant percentage of the table size for index storage.

There are two options for schemas that need to index a large percentage of columns. The first is brin indexes which allow for small, low-overhead indexes that can be created on many columns. One downside of brin indexes is that they must be defined on each column that needs it.

A more generic solution is to place all data in a jsonb column and create a gin index. The gin index indexes every json key and value, and only stores the keys and values once in the index. This is a great "index everything" option for administrators that aren't sure of their indexing needs.
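A minimal sketch of the jsonb approach (names illustrative):

```sql
CREATE TABLE events (doc JSONB);

-- one index covers every key and value in the documents
CREATE INDEX events_doc_idx ON events USING GIN (doc);

-- containment queries on any key/value pair can use the index
SELECT * FROM events WHERE doc @> '{"status": "active"}';
```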

Wednesday, September 13, 2017

The Postgres optimizer has improved dramatically over the years. However, there are still a few areas where it can be improved.

First, it would be nice if users could be informed when an increase of default_statistics_target is wise. Second, while multivariate statistics will be added in Postgres 10, there are still other multivariate optimizations to add.

A more sophisticated improvement, which bypasses the problem of inaccurate statistics, would be to have the executor send feedback on selectivity found during query execution to the optimizer to improve the future query plans.

Monday, September 11, 2017

Postgres 10 is adding partitioning syntax to Postgres. This ends the previous Frankenstein-style partitioning setup of having to configure child tables, check constraints, and triggers or rules.

You would think that the partitioning feature is now complete. Oh, no! It seems there are now more partitioning enhancement requests than we had before adding the syntax. I knew there would be new optimizations once there was a canonical representation of partitions in the backend, but I didn't expect this feature-request growth.

Ottawa's PGCon conference had a talk about partitioning, but the major partition feature list was created during its unconference. The good news is that Amit Langote recorded all the requests. I expect it will be a few years until we can say we are done improving partitioning in Postgres.

Monday, September 4, 2017

Postgres has a well-deserved reputation for flexibility. You can often combine two features to get something very powerful. For example, it was recognized several years ago that combining Pgpool and streaming replication creates a powerful, easy-to-administer read-scaling solution.

When asked about creating a read-scaling solution for multi-terabyte systems, I came up with an enhanced option. The problem with using streaming replication for read scaling is that you need a full copy of the entire database cluster on each standby. For multi-terabyte clusters, that requires a lot of storage. Also, odds are you don't access the cluster's data uniformly — some data is current and frequently accessed, some is archival and rarely accessed.

An interesting solution to this problem is to combine four Postgres features:

Pgpool

Streaming replication

Partitioning

Foreign data wrappers

First, use Pgpool and add a partitioned table on the primary, where some of the partitions are local and some are foreign data wrapper references. The local partitions contain current data and are copied to the top-level streaming standbys. Archival data is placed on a secondary set of streaming-replication servers and accessed via foreign data wrappers from the top-level primary and its standbys. For read scaling, you might need many top-level streaming replicas with current data, but only a few streaming replicas to handle archival-data requests.
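A rough sketch of the primary's table layout (server and table names are illustrative, and this assumes the archive server and its foreign data wrapper are already configured); note that in Postgres 10 rows cannot be routed to foreign partitions on insert, so archival partitions are populated on the archive servers themselves:

```sql
CREATE TABLE measurement (logdate DATE, reading NUMERIC)
    PARTITION BY RANGE (logdate);

-- current data: a local partition, replicated to the top-level standbys
CREATE TABLE measurement_2017 PARTITION OF measurement
    FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');

-- archival data: a foreign-table partition on the archive servers
CREATE FOREIGN TABLE measurement_old PARTITION OF measurement
    FOR VALUES FROM ('2000-01-01') TO ('2017-01-01')
    SERVER archive_server OPTIONS (table_name 'measurement_old');
```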

This solution uses four Postgres tools to provide flexible read scaling for large data sets. It is also transparent to applications so administrators can rearrange the data without modifying applications.

Friday, September 1, 2017

You might know that Postgres supports materialized views and foreign data wrappers (fdw). Briefly, materialized views allow for queries to be materialized and refreshed on demand. Foreign data wrappers allow data to be pulled from foreign data sources, like Nosql stores and other Postgres servers.

What you might not have considered is that materialized views and foreign data wrappers can be used together. Materialized views speed data access by summarizing data so it isn't necessary to query the base tables that are referenced by materialized views. Foreign data wrappers, because they are remote, can be slow to access. By combining the two features, you can get fast access to remote data.

Let's see this in action! First, let's set up the foreign table:

CREATE DATABASE fdw_test;
\connect fdw_test
CREATE TABLE world (greeting TEXT);

\connect test
CREATE EXTENSION postgres_fdw;

CREATE SERVER postgres_fdw_test
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'localhost', dbname 'fdw_test');

CREATE USER MAPPING FOR PUBLIC
    SERVER postgres_fdw_test
    OPTIONS (password '');

CREATE FOREIGN TABLE other_world (greeting TEXT)
    SERVER postgres_fdw_test
    OPTIONS (table_name 'world');

\det
          List of foreign tables
 Schema |    Table    |      Server
--------+-------------+-------------------
 public | other_world | postgres_fdw_test

populate it with some data:

INSERT INTO other_world SELECT * FROM generate_series(1, 100000);

and create a materialized view on the foreign table:

CREATE MATERIALIZED VIEW mat_view (first_letter, count) AS
    SELECT left(greeting, 1), COUNT(*)
    FROM other_world
    GROUP BY left(greeting, 1);

Now we can compare select times for foreign tables and materialized views:

\timing

SELECT left(greeting, 1) AS first_letter, COUNT(*)
FROM other_world
GROUP BY left(greeting, 1);
 first_letter | count
--------------+-------
 1            | 11112
 2            | 11111
 3            | 11111
 4            | 11111
 5            | 11111
 6            | 11111
 7            | 11111
 8            | 11111
 9            | 11111
Time: 354.571 ms

SELECT * FROM mat_view;
 first_letter | count
--------------+-------
 1            | 11112
 2            | 11111
 3            | 11111
 4            | 11111
 5            | 11111
 6            | 11111
 7            | 11111
 8            | 11111
 9            | 11111
Time: 0.783 ms

The materialized view is much faster, but also a contrived example. It is interesting that refreshing the materialized view takes a similar time to selecting from the foreign table:

REFRESH MATERIALIZED VIEW mat_view;
Time: 364.889 ms

The above output is from Postgres 9.6. Thanks to this improvement in Postgres 10:

Push aggregates to foreign data wrapper servers, where possible (Jeevan Chalke, Ashutosh Bapat)

This reduces the amount of data that must be passed from the foreign data wrapper server, and offloads aggregate computation from the requesting server. The postgres_fdw fdw is able to perform this optimization. There are also improvements in pushing down joins involving extensions.

Postgres 10 gets faster foreign table aggregate selects than in 9.6, but still slower than using materialized views:

SELECT left(greeting, 1) AS first_letter, COUNT(*)
FROM other_world
GROUP BY left(greeting, 1);
 first_letter | count
--------------+-------
 1            | 11112
 2            | 11111
 3            | 11111
 4            | 11111
 5            | 11111
 6            | 11111
 7            | 11111
 8            | 11111
 9            | 11111
Time: 55.052 ms

You don't even need to use aggregates in materialized views — you can just copy the contents of a remote table into a materialized view for faster access, and refresh the materialized view occasionally (though logical replication in Postgres 10 does this better):

CREATE MATERIALIZED VIEW mat_view2 AS SELECT * FROM other_world;

Now we can run a performance test on the foreign table and its local copy:

\o /dev/null

SELECT * FROM other_world;
Time: 317.428 ms

SELECT * FROM mat_view2;
Time: 34.861 ms

In summary, materialized views and foreign data wrappers are two features that work well together. Materialized views allow remote data to be cached locally, either entire tables or aggregate summarizations. The cache can be refreshed using refresh materialized view. Also, Postgres 10 speeds up aggregate queries on foreign tables.

Wednesday, August 30, 2017

I blogged about pgindent in 2011. After much discussion, an improved BSD indent binary has been created by Piotr Stefaniak and Tom Lane to fix various problems that were impossible to fix using the old Perl post-processing method. The Postgres 10 tree has been reformatted with the new tool.

Monday, August 28, 2017

Like many open-source projects, Postgres is written in the C programming language. However, with the code base being 31 years old, a lot of specialized C infrastructure has been added over the years to simplify server programming, e.g., memory management, caching, system catalog access, tuple access, error handling.

While this infrastructure helps experienced developers be productive, it often confuses people studying the Postgres source code because you have to understand the infrastructure to understand the code. Robert Haas makes this point clear in an email post:

A system like PostgreSQL is almost a language of its own; we don't really code for PostgreSQL in C, but in "PG-C". Learning the PG-specific idioms is arguably more work than learning C itself …

Friday, August 25, 2017

Autovacuum simplifies Postgres administration by automating the cleanup of updated (expired) and deleted rows, and of rows created by aborted transactions. It also updates optimizer statistics. It uses information gathered by the statistics collector to launch vacuum and analyze operations at the appropriate time. Autovacuum can respond much more promptly than an administrator.

However, there are cases where autovacuum isn't optimal, and for such cases there are two options. First, autovacuum can be controlled at the global level by settings in postgresql.conf. A lesser-known option is controlling autovacuum parameters at the table level via create/alter table storage options. The latter are more fine-grained and handle cases where global options are too coarse. And, of course, these settings can all be adjusted to be more aggressive during idle periods.
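As a sketch of the table-level approach (the table name and threshold values are hypothetical examples, not recommendations):

```sql
-- Make autovacuum more aggressive on a heavily-updated table
ALTER TABLE busy_table SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_vacuum_cost_delay   = 0
);

-- Inspect which tables have per-table overrides
SELECT relname, reloptions
FROM pg_class
WHERE reloptions IS NOT NULL;
```

The per-table settings are stored in pg_class.reloptions and override the postgresql.conf values for that table only.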

Wednesday, August 23, 2017

Linux uses an out-of-memory killer (oom), which is designed to kill processes:

If your memory is exhaustively used up by the processes to the extent which can possibly threaten the stability of the system, then the oom killer comes into picture. It is the task of the oom killer to kill the processes until enough memory is freed for the smooth functioning of the rest of the process.

While killing processes is never good, it is better than having the system halt due to memory exhaustion. Sometimes the oom kills Postgres, and that isn't a good thing either. This email thread explains how to use ulimit to cause Postgres sessions that consume a lot of memory to fail due to excessive memory requests. This avoids having them continue and be killed by the oom killer, which causes the entire database server to restart. The Postgres documentation also explains the behavior of the oom killer.
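For example, a hedged sketch (the limit value is arbitrary and must be tuned to your workload), a ulimit call can be added to the script that starts the postmaster so each backend's virtual memory is capped and large allocations fail cleanly instead of attracting the oom killer:

```
# in the Postgres server start script, before starting the postmaster
# (4 GB, expressed in kilobytes, is an example value only)
ulimit -v 4194304
```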

Monday, August 21, 2017

It is common knowledge that tablespaces allow you to place frequently-accessed data on fast storage, and archive-level data on slower storage. This is even mentioned in our documentation.

One additional aspect I had never considered is that you can store archive-level data on slower and compressed file systems. Most file systems support some kind of compression. While it doesn't make sense for frequently-accessed data, it seems perfect for archive-level data.
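A sketch of how this looks in SQL, assuming a compressed file system is already mounted at a hypothetical /mnt/compressed:

```sql
-- Create a tablespace on the compressed mount point
CREATE TABLESPACE archive_space LOCATION '/mnt/compressed/pgdata';

-- Move an archive-level table there
ALTER TABLE old_orders SET TABLESPACE archive_space;
```

The compression is entirely the file system's job; Postgres just sees a slower tablespace.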

Wednesday, August 16, 2017

Docker uses Linux containers (lxc) to allow application deployment in a pre-configured environment. Multiple such isolated environments can run on top of the same operating system. Docker is ideal for rapid, reproducible deployments.

How that relates to database deployments is an open question. The full power of Docker is that everything is in the container and the container can be easily destroyed and recreated. Because databases require a persistent state, a hybrid approach is necessary. Typical Docker database deployments use Docker for database installation and setup, and persistent storage for persistent state. This email thread explains the benefits of Docker for databases:

Isolation between environments

Deployment of tested and verified Docker images

Allows developers to use the same Docker images as production

and explains how to add persistent state:

Run one container on one node

Use bind mounts

Use --net=host

Hopefully this helps users evaluate Docker for future Postgres deployments.

Monday, August 14, 2017

On the server side, high availability means having the ability to quickly failover to standby hardware, hopefully with no data loss. Failover behavior on the client side is more nuanced. For example, when failover happens, what happens to connected clients? If no connection pooler is being used, clients connected to the failed machine will need to reconnect to the new server to continue their database work. Failover procedures should guarantee that all connections to the failed server are terminated and that no new connections happen. (Reconnecting to the failed server could produce incorrect results and lost changes.) If a client is connected to a standby that is promoted to primary, existing client connections and new connections are read/write.

Clients connect to the new primary via operating-system-specific methods, usually either virtual IP addresses (vip, good blog entry) or dns entries with a short time to live (ttl). This is normally accomplished using dedicated high-availability or clustering software. Postgres 10 will also allow multiple host names to be tried by clients.
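For example, the multi-host libpq syntax in Postgres 10 lets a client list several servers and insist on a read-write one (host names here are hypothetical):

```
psql 'postgresql://db1.example.com,db2.example.com/mydb?target_session_attrs=read-write'
```

libpq tries the hosts in order and skips any server that is not accepting read-write sessions, e.g., a standby.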

For clients using a connection pooler, things are even more complicated. Logically, you would think that, since clients didn't connect directly to the failed server, they should be able to continue their queries in the same session uninterrupted. Generally, this is not the case.

First, the client might have been running a query during the failure — therefore, a query error code needs to be returned to the client so the client can retry the query (assuming query retry logic is built into the application). Second, the session might have been in the middle of a transaction block, meaning it has to resubmit all the queries that were part of the transaction (again, assuming the client has such retry logic). Third, several server configuration parameters might have been modified at the session level. The client might be depending on these settings for proper operation, so they would need to be re-modified in the new session. Fourth, there is additional session state that might need to be recreated. You can get an idea of possible session state modifications by reviewing the discard manual page.
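A sketch of the kinds of session state involved (table and channel names are hypothetical), and the command that clears them all:

```sql
-- Session-level state a pooled client might depend on
SET work_mem = '256MB';                                      -- session setting
PREPARE get_user(int) AS SELECT * FROM users WHERE id = $1;  -- prepared statement
LISTEN order_events;                                         -- notification registration
CREATE TEMP TABLE scratch (x int);                           -- temporary table

-- DISCARD ALL resets all of the above in one step
DISCARD ALL;
```

None of this state survives a failover, which is why poolers cannot transparently move a session to a new server.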

If this sounds like a boat-load of challenges to allow clients using a connection pooler to continue uninterrupted, it is. None of the Postgres pooling tools allows uninterrupted failover. Pgpool automatically disconnects all clients and allows them to reconnect. (In pgpool 3.6+, clients connected to still-running standby servers are not affected.) PgBouncer does not detect failover.

To implement even a limited case where failover appears uninterrupted to clients, connection poolers would have to record session state changes, client queries, and even client results because many queries depend on the results of previous queries, e.g., select … for update. Even if all this tracking was performed, changed query results would make uninterrupted behavior impossible. Given all this complexity, it isn't surprising that the standard approach to failover is for clients to be programmed to handle such cases by reconnecting, resetting their session state, and re-running any queries that future queries depend on.

Friday, August 11, 2017

With the addition of logical replication in Postgres 10, we get a whole new set of replication capabilities. First, instead of having to replicate an entire cluster, as streaming replication does, you can replicate specific tables. With this granularity, you can broadcast a single table to multiple Postgres databases, or aggregate tables from multiple servers on a single server. This provides new data management opportunities.

Another big advantage of logical replication is migrating between major Postgres versions. If both major Postgres versions support logical replication, you can set up logical replication between them and then switch over to the new major-version Postgres server with only seconds of downtime. It also allows you to downgrade back to the old major version and replay any changes made on the upgraded server, assuming logical replication is still working properly.

Quicker upgrade switching and the ability to downgrade in case of problems have been frequent feature requests that pg_upgrade has been unable to fulfill. For users who need this, setting up logical replication for major version upgrades will certainly be worth it.
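A minimal sketch of the upgrade setup (names are hypothetical, and the schema must already be loaded on the new server, e.g., via pg_dump --schema-only):

```sql
-- On the old-version primary
CREATE PUBLICATION upgrade_pub FOR ALL TABLES;

-- On the new-version server
CREATE SUBSCRIPTION upgrade_sub
    CONNECTION 'host=oldserver dbname=mydb user=repuser'
    PUBLICATION upgrade_pub;
```

Once the subscriber has caught up, clients can be switched over; reversing the publication/subscription direction is what enables the downgrade path.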

Wednesday, August 9, 2017

For wal archiving, e.g., archive_command, you are going to need to store your wal files somewhere, and, depending on how often you take base backups, the archive might be very large.
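For reference, a minimal postgresql.conf sketch (the destination path is an example; production setups usually use a more robust copy tool than cp):

```
# postgresql.conf
archive_mode = on
archive_command = 'cp %p /mnt/archive/%f'   # %p = path to the WAL file, %f = its file name
```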

Most sites that require high availability have both a local standby in the same data center as the primary, and a remote standby in case of data center failure. This brings up the question of where to store the wal archive files. If you store them in the local data center, you get fast recovery because the files are quickly accessible, but if the entire data center goes down, you can't access them from the remote standby, which is now probably the new primary. If you store your wal archive files remotely, it is difficult to quickly transfer the many files needed to perform point-in-time recovery.

My guess is that most sites assume that they are only going to be using wal archive files for local point-in-time recovery because if you are running on your backup data center, doing point-in-time recovery is probably not something you are likely to do soon — you probably only want the most current data, which is already on the standby. However, this is something to consider, because if the wal archive is lost, you will need to take a base backup soon so that you can do point-in-time recovery in the future.

Monday, August 7, 2017

When the ability to run queries on standby servers (called hot_standby) was added to Postgres, the community was well aware that there were going to be tradeoffs between replaying wal cleanup records on the standby and canceling standby queries that relied on records that were about to be cleaned up. The community added max_standby_streaming_delay and max_standby_archive_delay to control this tradeoff. To eliminate the tradeoff entirely by delaying cleanup on the primary, hot_standby_feedback was added.

So, in summary, there is no cost-free way to have the primary and standby stay in sync. The cost will be either:

1. Standby queries canceled due to the replay of wal cleanup records
2. Stale standby data caused by the delay of wal replay due to cleanup records
3. Delay of cleanup on the primary

The default is a mix of numbers 1 and 2, i.e., to wait for 30 seconds before canceling queries that conflict with about-to-be-applied wal records.
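The three options map onto these settings (values shown are the defaults, as a sketch):

```
# postgresql.conf on the standby
max_standby_streaming_delay = 30s   # options 1 and 2: wait, then cancel conflicting queries
max_standby_archive_delay   = 30s   # same tradeoff, for WAL restored from archive

hot_standby_feedback = off          # set to on for option 3: delay cleanup on the primary
```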

Unfortunately, there is always a cost in running hot standby queries, but at least with Postgres, you can control it. Fortunately, you can also monitor replay delay using pg_stat_replication.

Monday, July 31, 2017

In streaming replication, wal is streamed from the primary to the standby. For a variety of reasons, including max_standby_streaming_delay, there can be a significant delay for a commit on the primary to appear in read-only queries running on the standby. The pg_stat_replication view shows the possible delays:

    \d pg_stat_replication
           View "pg_catalog.pg_stat_replication"
          Column      |           Type           | Modifiers
    ------------------+--------------------------+-----------
     pid              | integer                  |
     usesysid         | oid                      |
     usename          | name                     |
     application_name | text                     |
     client_addr      | inet                     |
     client_hostname  | text                     |
     client_port      | integer                  |
     backend_start    | timestamp with time zone |
     backend_xmin     | xid                      |
     state            | text                     |
     sent_location    | pg_lsn                   |
     write_location   | pg_lsn                   |
     flush_location   | pg_lsn                   |
     replay_location  | pg_lsn                   |
     sync_priority    | integer                  |
     sync_state       | text                     |

These columns clearly show the typical order of wal from primary to replay:

wal is sent from the primary to the standby ("sent")

wal is written to the standby's file system ("write")

wal is flushed to the standby's durable storage ("flushed")

wal is replayed and is then visible to standby read-only queries ("replay")

Postgres 10 will add interval delay information to the view:

    \d pg_stat_replication
                View "pg_catalog.pg_stat_replication"
          Column      |           Type           | Collation | Nullable | Default
    ------------------+--------------------------+-----------+----------+---------
     …
     write_lag        | interval                 |           |          |
     flush_lag        | interval                 |           |          |
     replay_lag       | interval                 |           |          |
     …
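The new columns can then be queried directly on the primary, e.g.:

```sql
-- Run on the primary; one row per connected standby
SELECT application_name, state,
       write_lag, flush_lag, replay_lag
FROM pg_stat_replication;
```

Each lag column reports how far behind the standby is, as an interval, at the corresponding stage of wal processing.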

Tuesday, July 25, 2017

I was interviewed on July 14 during the PGConf Local: Philly 2017, and a 25-minute recording was just published.

Tuesday, July 11, 2017

Hstore was added to Postgres in 8.2 (2006). I didn't appreciate the purpose of hstore when it was first introduced. I knew it was a per-field key/value store, but not much else. People would come up to me at conferences and say "Hstore is the best," but I could only reply with "Yeah." Only later did someone explain to me the usefulness of being able to store unstructured data in a database.

Of course, with the popularity of NoSQL, everyone now understands the value of storing unstructured data. They probably also know that being able to store structured (relational) and unstructured data in the same database, like Postgres, is a big win.

Unfortunately, hstore, the Postgres extension that started it, hasn't kept up. When jsonb was added in Postgres 9.4 (2014) it superseded the capabilities of hstore. Jsonb is a key/value store, but allows for hierarchical storage too, and uses the popular json syntax. Some of the performance improvements made to the jsonb data type could not be backpatched to hstore, due to compatibility requirements.

At this point, unless you have a backward-compatible need for hstore, you should be using jsonb. Existing users of hstore would probably benefit from moving to jsonb, when convenient.

Monday, July 10, 2017

In conjunction with PGConf Local: Philly, I will be hosting a Postgres Party at my home near Philadelphia this Friday, July 14. You don't have to attend the conference to attend the party.

Wednesday, July 5, 2017

Now that everyone is using IPv6 (wink), it might be time to start playing with it. Postgres has had full IPv6 support for years, so Postgres is a good place to start, particularly with IPv6-aware data types.

Since IPv6 addresses are 128-bits instead of IPv4's 32-bits, they can be quite long, e.g., 2001:0db8:85a3:0000:0000:8a2e:0000:7334. As you can see, it is made up of eight quad-hex segments, separated by colons. To shorten the text representation, leading zeros in any quad-hex segment can be removed, though an all-zero quad still requires a zero. In addition, the longest string of all-zero quads can be abbreviated with double colons. This can be illustrated in Postgres:

    SELECT '2001:0db8:85a3:0000:0000:8a2e:0000:7334'::inet;
                inet
    ----------------------------
     2001:db8:85a3::8a2e:0:7334

In the output, 0db8 became db8, and :0000:0000: became ::, and the final 0000 became 0. An address with many leading zeros, e.g., localhost (0000:0000:0000:0000:0000:0000:0000:0001), gets dramatically shortened using these rules:

    SELECT '0000:0000:0000:0000:0000:0000:0000:0001'::inet;
     inet
    ------
     ::1

This highlights perhaps the most confusing aspect of IPv6 addresses — you can't just visually compare two IPv6 addresses to check for equality, like you can for IPv4. You must use the IPv6 rules for comparisons.
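For example, letting Postgres apply the rules avoids misreading abbreviations by eye:

```sql
SELECT '2001:db8::1'::inet = '2001:0db8:0000:0000:0000:0000:0000:0001'::inet;
 ?column?
----------
 t
```

The two strings look nothing alike, but both expand to the same 128-bit address.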

Use of colons is strictly for IPv6 addresses, i.e., 1::127 represents a 128-bit IPv6 value, not a 32-bit IPv4 one, as illustrated by the IP family() function:

    SELECT family('1::127');
     family
    --------
          6

    SELECT family('1.0.0.127');
     family
    --------
          4

    SELECT '1::127'::inet = '1.0.0.127'::inet;
     ?column?
    ----------
     f

Postgres is a fun, interactive way to play with the IPv6 address rules, because we are all going to have to learn them eventually.

Wednesday, June 21, 2017

You might be aware of the interval data type, which allows mathematical operations on date, time, and timestamp values. This can lead to odd behavior, but this email posting showed me a new oddity when dealing with months containing a different number of days. First, let's summarize how many days are in the first five months of 2017:

    2017-01  31
    2017-02  28
    2017-03  31
    2017-04  30
    2017-05  31

Let's add four months to the last day of January, 2017 in two different ways:

    SELECT '2017-01-31'::date + '4 month'::interval;
          ?column?
    ---------------------
     2017-05-31 00:00:00

    SELECT '2017-01-31'::date + ('1 month'::interval + '3 month'::interval);
          ?column?
    ---------------------
     2017-05-31 00:00:00

Looks good — the results are the same, and they make sense. But look what happens when the calculation takes a stop-over in February (operations are performed left to right):

    SELECT '2014-01-31'::date + '1 month'::interval + '3 month'::interval;
          ?column?
    ---------------------
     2014-05-28 00:00:00

and a stop-over in April:

    SELECT '2014-01-31'::date + '3 month'::interval + '1 month'::interval;
          ?column?
    ---------------------
     2014-05-30 00:00:00

It seems that once the calculation lands on a lower-numbered day of a month, because the month doesn't have as many days as the previous stop, it never advances past that day number, even if later months have more days. Let's be more overt about this using parentheses:

    SELECT ('2014-01-31'::date + '1 month'::interval) + '3 month'::interval;
          ?column?
    ---------------------
     2014-05-28 00:00:00

    SELECT '2014-01-31'::date + ('1 month'::interval + '3 month'::interval);
          ?column?
    ---------------------
     2014-05-31 00:00:00

    SELECT ('2014-01-31'::date + '3 month'::interval) + '1 month'::interval;
          ?column?
    ---------------------
     2014-05-30 00:00:00

All three queries produce different results. This shows that addition with such values is neither associative (queries one and two) nor commutative (queries one and three). Real number addition is both associative and commutative, so it is confusing that addition with intervals is not for some values, though it is for others. (Other interval operations have similar issues, e.g., subtraction.)

While the results are surprising, it is unclear how Postgres could do any better without adding some very complex logic — so, developers beware.

Monday, June 19, 2017

If you often use the timestamp data type, you might not be making full use of it. In these queries:

    CREATE TABLE tztest (x TIMESTAMP);
    INSERT INTO tztest VALUES (CURRENT_TIMESTAMP);

    SELECT * FROM tztest;
                  x
    ----------------------------
     2016-10-25 18:49:20.220891

    SHOW timezone;
      TimeZone
    ------------
     US/Eastern

    SET timezone = 'Asia/Tokyo';

    SELECT * FROM tztest;
                  x
    ----------------------------
     2016-10-25 18:49:20.220891

Notice that the time has not changed even though the session time zone has changed. A timestamp data type specification defaults to timestamp without time zone, to match the sql specification. Contrast the above output with using timestamp with time zone:

    CREATE TABLE tztest2 (x TIMESTAMP WITH TIME ZONE);
    INSERT INTO tztest2 VALUES (CURRENT_TIMESTAMP);

    SELECT * FROM tztest2;
                   x
    -------------------------------
     2016-10-25 18:57:04.096335-04

    SHOW timezone;
      TimeZone
    ------------
     US/Eastern

    SET timezone = 'Asia/Tokyo';

    SELECT * FROM tztest2;
                   x
    -------------------------------
     2016-10-26 07:57:04.096335+09

Notice that when the session time zone changed the date and hours were adjusted and the suffix time zone offset changed. This allows users from multiple time zones to insert into the same table and for viewers to see all rows in their local time zone.

Timestamp with time zone is implemented by adjusting all values to utc, meaning there is no storage of the original time zone name or its offset. If you need to record the original time zone offset, you can store the output of SELECT EXTRACT(timezone FROM CURRENT_TIMESTAMP) in a separate column.
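A sketch of recording the offset alongside the timestamp (the table and column names are hypothetical):

```sql
CREATE TABLE event (
    happened_at TIMESTAMP WITH TIME ZONE,
    tz_offset   INTEGER   -- seconds offset from utc at insert time
);

INSERT INTO event
VALUES (CURRENT_TIMESTAMP,
        EXTRACT(timezone FROM CURRENT_TIMESTAMP));
```

This records only the offset, not the time zone name; if you need the name, e.g., US/Eastern, it must be captured separately from the timezone setting.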

Friday, June 16, 2017

PL/Java has been around since 2005, but it has regularly struggled to gain users. Unfortunately, a lot of these problems are specific to the Java language and hamper its adoption.

First, there are a limited number of people who know both Java and the Postgres backend code. Fortunately Chapman Flack has recently resumed PL/Java development.

Second, there is the deployment complexity of binding PL/Java to a Java Runtime Environment (jre) in an easily-deployed way. This 2014 email thread discusses the problems of jre packaging and installation.

Third, PL/Java functions must be compiled into jar files and installed on the server, similar to server-side C functions. While this isn't a huge hurdle, it does hamper PL/Java when competing against interpreted languages like PL/pgSQL and PL/Perl.

Fourth, every stored procedure language has to compete with PL/pgSQL, which has good alignment with the scripting needs of the sql language.

However, the problems go deeper. Because PL/Java isn't interpreted, it must be compiled, but compiling doesn't produce a binary that can be run on the cpu. Instead, a Java Virtual Machine (jvm) is required and it must be installed in a place that can easily receive function arguments and return results. An early attempt was pl/j, which uses a single jvm process — each backend connects to the jvm process via ipc and runs in its own thread. Unfortunately ipc overhead, context switching, and security concerns doomed this attempt.

PL/Java takes a different approach by placing a jvm in each database backend process and using the Java Native Interface (jni) to pass in function arguments and return results. While this reduces communication overhead, it causes backend processes to use much more memory because each backend has its own jvm, and each jvm must be started the first time a PL/Java function is called. (Connection pooling can reduce the startup overhead.) The bottom line is that stored procedures have short run-times, and the heavy nature of Java isn't a good match.

Ultimately, with all these headwinds against PL/Java, it is hard for it to get traction. A lot of people like the idea of PL/Java because they can reuse their Java skills and for portability with other databases that support Java stored procedures. The problem is that once they hit these headwinds, they start looking around. They realize that other languages are a better match for sql, that there isn't that much code reuse from the client-side, and they switch to another server-side language. I am sure there are some things that only Java can do well, but those use cases are not common in server-side functions. (Contrast that with pl/r, which definitely does things no other server-side language can do.)

In summary, the slow adoption of PL/Java isn't an accident, but the result of multiple challenges. If PL/Java is to get more popular, these headwinds must be addressed.

Wednesday, June 14, 2017

The use of server-side logic, particularly stored procedures, has been a highly contentious topic among database professionals for decades. The question has always been what amount of logic should be encoded in the database vs. in client applications or application servers.

Beyond using the database as a simple data store, there are three levels in which logic can be added to Postgres:

Let's look at each of these in turn:

Database constraints are often seen as optional by database application developers. They don't really do anything except prevent invalid data from being entered into the database. The overhead of performing these checks is often seen as negative, particularly foreign key checks. (Default clauses that call stored procedures are similar to triggers; the serial data type uses this.)

Triggers are stored procedures assigned to tables that are executed automatically during insert, update, delete, or truncate commands. They can be triggered before the command runs, usually for complex constraint checks or to modify incoming data, e.g., capitalization. When triggered after the command, they perform post-command operations, such as adding entries to a log table.

Manually-called stored procedures are functions called usually in where clauses or in the target list of select queries. Stored procedures, including ones used as triggers, can be written in many languages.
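A hedged sketch of all three levels on one hypothetical table:

```sql
-- Level 1: a constraint
CREATE TABLE account (
    id      SERIAL PRIMARY KEY,
    balance NUMERIC CHECK (balance >= 0)
);

-- Level 2: a trigger that logs every update
CREATE TABLE account_log (id INTEGER, logged_at TIMESTAMPTZ DEFAULT now());

CREATE FUNCTION log_account() RETURNS trigger AS $$
BEGIN
    INSERT INTO account_log (id) VALUES (NEW.id);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER account_audit
    AFTER UPDATE ON account
    FOR EACH ROW EXECUTE PROCEDURE log_account();

-- Level 3: a manually-called stored procedure used in a WHERE clause
CREATE FUNCTION needs_review(a account) RETURNS boolean AS $$
    SELECT a.balance > 10000;
$$ LANGUAGE sql;

SELECT * FROM account WHERE needs_review(account);
```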

Now, on to the question of when to use these features. Some Postgres users use none of these features for reasons of performance and portability. Others use all of them, and require applications to call stored procedures to perform tasks, rather than issuing sql statements. This allows database administrators to control every aspect of database access. For most users, the best solution is something in the middle, and figuring out when to use what can be tricky.

For database constraints, it is often possible to perform constraint checks in the application, rather than in the database. While this is possible for most check constraints, it is hard to do for unique constraints because multi-user applications rarely share state except inside the database. If database constraints are not used, applications and manually-issued sql queries must perform the checks, and any mistakes must be detected later and corrected. If multiple applications access the database, they must perform the checks in the same way — this is particularly difficult if they are written in different languages. Application upgrades also require constraint synchronization in multi-application environments.

Triggers can be avoided by having applications perform the checks and queries that triggers would have performed. Performing data checks application-side has the same downsides as avoiding database constraints. The additional application queries necessary when triggers are avoided can lead to slower performance due to network round-trip delays.

Avoiding manually-called stored procedures requires all logic to be in the application. This can lead to serious slowdowns because if a function cannot be used in a where clause, the entire data set must be transferred to the client application where filtering can be done.

Ultimately, the decision of when to use server-side logic revolves around efficiency — efficiency of hardware utilization, efficiency of development, and efficiency of data management. Your use of server-side logic will be dependent on which area of efficiency is most important to you. This email thread outlines many of the tradeoffs seen by Postgres users.

Update: This email explains some of the operational reasons to avoid server-side logic. 2018-09-05

Monday, June 12, 2017

Oh, how I love the title of this 2014 Slashdot request, "Which NoSQL Database For New Project?" The user already knows the type of clients (iPhones and Android phones) and the middleware (Php/Symfony or Ruby/Rails) and then comes the zinger, "I would like to start with a NoSQL solution for scaling …." In addition to the question of whether Ruby on Rails is a scalable solution, the replies are illuminating, and even funny, e.g., "I'm working on a new independent project. It will soon become the new Facebook, and I'll be billionaire next quarter. The only problem is that I don't know which luxury yacht to buy with all this money."

OK, now on to the serious replies, which are many, seem to come from seasoned sql and NoSQL veterans, and all seem to fall under the heading of premature optimization and scaling:

* NoSQL solutions can be ridiculously fast and scale beautifully over billions of rows. Under a billion rows, though, and they're just different from normal databases in various arguably-broken ways. By the time you need a NoSQL database, you'll be successful enough to have a well-organized team to manage the transition to a different backend. For a new project, use a rdbms, and enjoy the ample documentation and resources available.

* However, like a lot of other posters, I'm very sceptical that NoSQL is the place to start. SQL databases can do a LOT for you, are very robust and can scale very considerably. As your requirements grow you might find yourself wanting things like indexes, transactions, referential integrity, the ability to manually inspect and edit data using SQL and the ability to store and access more complex structures. You're likely to give yourself a lot of pain if you go straight for NoSQL, and even if you DO need to scale later combining existing SQL and new NoSQL data stores can be a useful way to go.

* So many developers start with the phrase "I need NoSQL so I can scale" and almost all of them are wrong. The chances are your project will never ever ever scale to the kind of size where the NoSQL design decision will win. Its far more likely that NoSQL design choice will cause far more problems (performance etc), than the theoretical scaling issues. … NoSQL does not guarantee scaling, in many cases it scales worse than an SQL based solution. Workout what your scaling problems will be for your proposed application and workout when they will become a problem and will you ever reach that scale. Being on a bandwagon can be fun, but you would be in a better place if you really think through any potential scaling issues. NoSQL might be the right choice but in many places I've seen it in use it was the wrong choice, and it was chosen base on one developers faith that NoSQL scales better rather than think through the scaling issues.

* Given 3 trillion users your options are pretty much limited to horizontal scaling, no SQL etc. but most people never get that far with their applications and in that case, storing the data in a noSQL database and then getting actionable information out of it (which is the hardest part IMO) is a lot of effort spent for something much cheaper and easier done with an rdbms.

* "Why not" is because the cost/benefit analysis is not in NoSQL's favor. NoSQL's downsides are a steeper learning curve (to do it right), fewer support tools, and a more specialized skill set. Its primary benefits don't apply to you. You don't need ridiculously fast writes, you don't need schema flexibility, and you don't need to run complex queries on previously-unknown keys.

* It's a mistake to think that "NoSQL" is a silver bullet for scalability. You can scale just fine using MySQL (FlockDB) or Postresgl if you know what you're doing. On the other, if you don't know what you're doing, NoSQL may create problems where you didn't have them.

* Databases don't scale for people who don't understand SQL, don't understand data normalization, indexing and want to use them as flat files. Unfortunately, a way too common anti-pattern :(

* So the default answer to "Which NoSQL database should I use?" is always "Don't use NoSQL."

There were many positive comments about Postgres, both from a relational database perspective and touting its scalability and NoSQL-like features. One poster even wrote about their accounting system using Mongo that they are porting to Postgres.

Friday, June 9, 2017

In a February blog post I talked about the mismatch between what people expect from Postgres in terms of hints, and what exists. In this blog post I would like to cover the more general case of when people should expect feature parity with their previous database, and when such expectations are unreasonable.

First, imagine if every database had the features of every other database — that would be great for compatibility but terrible for usability. You would have so many ways of doing something, with slightly different behaviors, that development and administration would be much more difficult. Of course, Perl has its "There's more than one way to do it," but at least that allows tradeoffs between clarity and conciseness, and allows different programming styles. Having all features in every database would have few of those benefits. Also consider that some features are mutually exclusive, so this would be impossible. Therefore, we have to make tradeoffs in the features each database system supports.

Let's think of Postgres compatibility at three levels: sql, tuning, and monitoring. At the sql level, you should expect parity between Postgres and your previous database. The syntax might be different, but all the capabilities should be similar, e.g., if your application used savepoints with the previous database, Postgres should support that, and in almost all cases, it does.
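As an illustration of that sql-level parity, here is a minimal savepoint sketch (the accounts table and its columns are hypothetical):

```sql
BEGIN;
INSERT INTO accounts VALUES (1, 100);
SAVEPOINT before_adjustment;
UPDATE accounts SET balance = 0;          -- a change we might regret
ROLLBACK TO SAVEPOINT before_adjustment;  -- undo only the update
COMMIT;                                   -- the insert is preserved
```

The same pattern works whether the application was written for Oracle, db2, or ms sql, even if the previous system spelled it slightly differently.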

For tuning, parity is more murky. The internals of database systems differ greatly, so the tuning requirements will differ — and let's be frank — the less tuning you have to do to attain good performance, the better, so having a tuning item missing in Postgres might be a good thing. Postgres might have good performance without requiring that tuning knob. Postgres might require tuning that wasn't required in your previous database system — that is a negative. So, if Postgres doesn't have a tuning knob you had in your previous database, that might be good (Postgres auto-tunes it), or it might be bad (you can't adjust Postgres to achieve good performance for your workload). (I previously covered the tradeoffs of adding performance settings.)

Monitoring has a similar murky outlook. Sometimes monitoring is required to meet organizational needs, e.g., what are the currently running queries. For most organizational monitoring needs, Postgres has the required features, whether it is log_line_prefix, pg_stat_statements, or external tools like pgBadger or PGAudit. Often, Postgres tooling isn't as polished or as easy to use as tools in more-established database systems, but the tools exist, and are often more flexible.

Just like tuning, sometimes monitoring that was required in your previous database system isn't required in Postgres. For example, Postgres's streaming replication is rock-solid, so there is no need to monitor for streaming replication corruption. Similarly, many organizations built elaborate monitoring of their previous database systems to check for optimizer plan and statistics changes. These are often to avoid optimizer bugs in their previous systems that caused problems. However, you can't assume this kind of monitoring is necessary for Postgres. Postgres isn't bug-free, but it might require different monitoring than what was required in the previous system.

In summary, Postgres is different from your previous database system. Hopefully it supports all the necessary features your applications require, but it will not exactly match your previous tuning and monitoring requirements. It is necessary to approach Postgres with an open mind to make the best use of your new database system.

Wednesday, June 7, 2017

Historically, most relational database systems supported raw devices, i.e., the ability to write data directly to the storage subsystem to avoid file system overhead. Raw device support was added to databases 20–30 years ago when cpus were much slower, and file systems were much less efficient.

Modern recommendations discourage the use of raw devices unless every last bit of performance is required. Raw devices improve performance perhaps a few percentage points, but are difficult to administer and resize. The other problem is that raw devices have to be configured when the database cluster is initialized, meaning you can't easily remove raw devices or add them later. This makes it difficult to know if raw devices would even help your workload.

Postgres has never supported raw devices, and probably never will. Modern file systems are so good, and the flexibility of file system storage so powerful, that raw devices just don't make sense for modern Postgres.

Monday, June 5, 2017

Postgres does all heap and index page operations in a shared memory area called shared buffers. Data is read into shared buffers by requesting 8k blocks from the kernel, which will either satisfy reads from the kernel cache or retrieve them from storage devices. Writes from shared buffers are sent to the kernel, which are eventually written to permanent storage. The write-ahead log (wal) allows writes to be performed asynchronously; this is illustrated in this presentation.

New users are often surprised that Postgres uses the kernel cache for reads and writes, but there are advantages. While the shared buffer size is fixed at server start, the kernel cache is resized based on the amount of unused memory in the system. This blog post explains how to determine the size of the Linux kernel cache.

Many database systems use direct I/O to read and write data, which bypasses the kernel cache. This has two advantages:

Avoids the double-copy of data from storage to the kernel cache and then from the kernel cache to shared buffers

Avoids double buffering (storing) of data in the kernel cache and shared buffers

However direct I/O has some disadvantages:

Prevents the kernel from reordering reads and writes to optimize performance

Does not allow free memory to be used as kernel cache

In summary, direct I/O would improve performance in cases where shared buffers is sized properly, but it would dramatically decrease performance for workloads where shared buffers is missized. This email report confirms this analysis. Postgres does use direct I/O for wal writes (if supported by the operating system) because wal must be flushed immediately to storage and is read only during crash recovery, so kernel buffering is useless. (It is also read by the walsender and archive_command.)

Someday Postgres might support an option to use direct I/O for data files but the downsides make it unlikely it would be enabled by default.

Friday, June 2, 2017

Postgres has a flexible tablespace feature that allows you to place data in any filesystem accessible from the database server, whether it is directly attached storage (das), on a nas or san, a directory in /tmp, or on a ramdisk in memory.

You might be tempted to put data in any of these places, with the idea that if one of the storage areas goes away, either through hardware failure or server reboot, Postgres will continue to function. Unfortunately, that is not the case. If a tablespace disappears or is erased, Postgres will have problems because system tables will now point to nonexistent files. Regular maintenance operations, like autovacuum, will throw errors. The wal might also still contain references to the missing files, preventing crash recovery from completing.

Perhaps someday Postgres will have the concept of transient tablespaces, but right now it doesn't, so only create tablespaces on durable storage. Even temp_tablespaces (used to store temporary objects) cannot be placed on transient storage.

Friday, May 19, 2017

Effective_io_concurrency controls how many concurrent requests can be made by bitmap heap scans. The default is 1, which means no concurrency. For magnetic disks, this can be increased, perhaps to 8, but not much higher because of the physical limitations of magnetic disks. However, for ssds, this can be increased dramatically, perhaps in the hundreds.

This is documented in Postgres 9.6, as well as the ability to set effective_io_concurrency at the tablespace level. I have updated my Database Hardware Selection Guidelines presentation to mention this too.
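For example, a higher value can be set cluster-wide or for a single ssd-backed tablespace (the tablespace name is an example):

```sql
-- cluster-wide default
ALTER SYSTEM SET effective_io_concurrency = 200;
SELECT pg_reload_conf();

-- override for one tablespace (Postgres 9.6 and later)
ALTER TABLESPACE fast_ssd SET (effective_io_concurrency = 200);
```

This lets you keep a conservative default for magnetic storage while taking advantage of ssd parallelism where it exists.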

Wednesday, May 17, 2017

I have reorganized my twenty active presentations to be easier to find. I was getting confused, so I assume others were as well. The new categories seem clearer and more closely match the categories I use for Postgres blog entries. I have reordered items within categories. I have also indicated more presentations that can be given in pairs.

Also, this web page now defaults to using https/tls.

Monday, May 15, 2017

Disk array controllers gained popularity in a world of magnetic disks by providing:

Hardware raid

Large read/write cache

Write reordering

Instant durable storage of writes using a bbu

With the popularity of ssds, particularly those with bbus (which are the only kind that should be used with Postgres), the usefulness of raid controllers is now unclear:

Modern cpus are very fast and can easily do raid in software

Modern servers have tons of ram

Write reordering is unnecessary for ssds, which have good random write performance

Bbus on ssds make raid bbus redundant

And raid controllers introduce problems of their own.

There is no question that there was a time that raid controllers were necessary for good database performance, but that period might be over for direct-attached storage (das).

Friday, May 12, 2017

Having given the first presentation of my new window function talk, Postgres Window Magic, I have now put the slides online.

Wednesday, May 10, 2017

I did my first presentation covering Postgres 10 features, so my slides are now online.

Wednesday, April 26, 2017

I have completed the draft version of the Postgres 10 release notes. Consisting of 180 items, I think you will be impressed with the substance of the improvements. The release notes will be continually updated until the final release, which is expected in September or October of this year. (I am speaking tomorrow about Postgres 10 in Boston.)

Fyi, future major versions of Postgres will consist of a single number, e.g., Postgres 11 will be released in 2018. This is mentioned in the release notes as well.

Monday, April 24, 2017

Modern systems offer several storage options and databases are very sensitive to the I/O characteristics of these options. The simplest is direct-attached storage (das), e.g., sata, sas. It is the simplest because it uses a dedicated connection between the server and storage.

Another storage option is a storage area network (san), e.g., fibre channel, iscsi. Like das, it offers a sequential set of blocks upon which a local file system can be built.

A third option is network-attached storage (nas), e.g., nfs. While nas is networked storage like san, it offers a remote file system to the server instead of remote storage blocks. (The fact that the san/nas acronym letters are just reversed only adds to the confusion. I remember it as (block) "storage" being the first letter of san.)

So, which is best for you? Well, das is the simplest topology because all storage is connected to a single server, and it is also often the fastest. Sans add flexibility by replacing the direct connection with a network, e.g., fcp, tcp/ip. This allows a larger number of devices to be attached than das, and if a server fails, another server can access the device's block storage and restart.

Nas is quite different because it exports a file system to the server. While nas-exported file systems can be mounted by multiple servers simultaneously, that doesn't help Postgres because only a single server can safely access the Postgres data directory. Nas usually includes sophisticated management tooling, backup capabilities, and caching layers. Because multiple servers usually access a nas, contention can be a problem, though ease of administration can be a big benefit for large organizations.

Friday, April 21, 2017

Connection poolers have two advantages:

They reduce the number of active connections, allowing the server to operate more efficiently, without the overhead of managing many idle connections.

They speed startup by allowing previously started sessions to be used as soon as a connection request arrives.

Postgres has two popular external connection poolers, PgBouncer and pgpool. External poolers have several advantages compared to poolers built into the database:

They can be placed closer to applications, perhaps on application servers.

They allow the pooling load to be handled by a dedicated server.

If the database server goes down, poolers can be easily re-pointed to new database servers.

But external connection poolers have disadvantages over internal ones:

Authentication not integrated with the database server

Additional network delay going through a pooler

While there are some significant advantages of external pools, there are a few advantages of internal poolers, so someday Postgres might decide to implement a simple connection pooler internally.
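To make this concrete, a minimal PgBouncer configuration might look like the following sketch (the database name, addresses, and pool sizes are examples, not recommendations):

```
[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 500
default_pool_size = 20
```

Applications connect to port 6432 instead of 5432, and PgBouncer multiplexes the 500 allowed client connections onto 20 server connections per database/user pair.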

Wednesday, April 19, 2017

With streaming replication, Postgres allows sophisticated setups of primary and standby servers. There are two ways to promote a standby to be the new primary. A switchover is when the change happens in a planned way:

1. All clients are disconnected from the master to prevent writes
2. A sufficient delay allows the final write-ahead log (wal) records to be transferred to all standbys (also performed by step 3)
3. The primary is cleanly shut down
4. The standby is promoted to be the primary

A failover happens when the steps above can't be performed, usually because the primary has failed in some catastrophic way. The major difficulty with failover is the possibility that some of the final database changes contained in the wal are not transferred to standbys, unless synchronous_standby_names was used. When a standby is promoted to primary after a failover, the final missing wal records can cause problems:

Some transactions on the old primary that were acknowledged to clients might be lost

If the old master needs to be reconnected as a standby without reimaging, it might be necessary to use pg_rewind

Make sure you practice both methods of promoting a standby so, when you have to do the promotion in production, you are ready.
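For reference, a planned switchover can be driven with commands like these (the data directory paths are examples; details vary by version and setup):

```
$ pg_ctl stop -D /var/lib/pgsql/data -m fast      # cleanly shut down the primary
$ pg_ctl promote -D /var/lib/pgsql/standby        # promote the standby
```

Practicing with a throwaway cluster pair is far cheaper than learning these steps during an outage.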

Monday, April 17, 2017

When using continuous archiving, you must restore a file system backup before replaying the wal. If the file system backup was taken long ago, wal replay might take a long time. One way to avoid this is to take file system backups frequently.

Another option is to perform an incremental file system backup that can be laid over the original file system backup, then replay wal over that. This reduces restore time because you only need to replay wal from the start of the incremental backup, not the start of the full backup. This also reduces the amount of wal that must be retained.

However, Postgres doesn't natively support incremental backup. The best you can do is to use a tool like pgBackRest or Barman that supports incremental backup at the file level. The only problem is that the database files are potentially one gigabyte in size, so the granularity of the incremental backup isn't great. Ideally solutions will be developed that do page-level (8k) incremental backups, which would be much smaller. The trick is finding an efficient way to record which 8k pages have been changed since the last file system backup.

Friday, April 14, 2017

You probably have heard the term "checkpoint" before, or seen it mentioned in the postgresql.conf file. A checkpoint is a usually-invisible cleanup feature present in most database systems, but it is useful to know what it does.

This diagram illustrates checkpoints. At the top are three Postgres database sessions. Each session reads and writes to the shared buffer cache. Every modification to shared buffers also causes a change record to be written to the write-ahead log (wal, blog entry). Over time the wal would grow unbounded in size if it were not trimmed occasionally — that is what checkpoints do.

A checkpoint writes previously-dirtied shared buffers to durable storage over a period of several minutes, at which point the wal representing those writes is no longer needed for crash recovery. (Hopefully continuous archiving and streaming replication have also processed those wal files.) Therefore, the old wal can then be removed or recycled.

This diagram illustrates the process. In the diagram, 1 marks three dirty buffers in the shared buffer cache at the start of the checkpoint. During the checkpoint, additional buffers are dirtied, 2, and the wal pointer is advanced. At the end of the checkpoint all dirty 1 buffers have been written to durable storage and the old wal files are moved to the end, to be reused.

This all happens automatically, though it can be tuned and monitored. Fortunately, Postgres doesn't have any pathological behavior related to checkpoints, so most administrators never have to think about it.

Wednesday, April 12, 2017

The write-ahead log (wal) file format changes with every major release. Also, initdb, which is required to install the new major version, starts wal file numbers at 000000010000000000000001.

For these reasons, if you are archiving wal, it is wise to use the major version number in the name of the wal archive directory, e.g., /archive/pgsql/9.6. This avoids the problem of wal from an old Postgres major version conflicting with wal files from a new major version.
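A sketch of such an archive_command in postgresql.conf (the archive directory is an example):

```
archive_mode = on
archive_command = 'test ! -f /archive/pgsql/9.6/%f && cp %p /archive/pgsql/9.6/%f'
```

The test guards against overwriting an existing archived file, and the 9.6 directory component keeps each major version's wal separate.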

Monday, April 10, 2017

The write-ahead log (wal) does many things.

It is great that a single facility is used for so much. Fortunately, it is possible to look inside of the wal. This is particularly useful for setting recovery_target_xid in recovery.conf.

The binary portion of the wal can be viewed using pg_xlogdump. This is the output of transaction 558, which was an insert into a table with one index:

rmgr: Heap        len (rec/tot):  3/ 3760, tx: 558, lsn: 0/01510D60, prev 0/01510D28, desc: INSERT off 102, …
rmgr: Btree       len (rec/tot):  2/ 2133, tx: 558, lsn: 0/01511C10, prev 0/01510D60, desc: INSERT_LEAF off 1, …
rmgr: Transaction len (rec/tot): 20/   46, tx: 558, lsn: 0/01512480, prev 0/01511C10, desc: COMMIT 2016-12-26 …

You can see the insert into the heap, the index insert, and the transaction commit record. When viewing pg_xlogdump output, remember that while transactions are assigned in start order, higher-numbered shorter transactions can commit before lower-numbered longer transactions (see this slide). For example, transaction 32 commits before transaction 30 because transaction 30 runs much longer than 32. (This can make choosing recovery_target_xid particularly tricky.)
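Once you have identified the last good transaction, a recovery.conf sketch might look like this (the archive path and xid are examples):

```
restore_command = 'cp /archive/pgsql/9.6/%f %p'
recovery_target_xid = '557'    # stop replay once this transaction commits
```

Because of the commit-ordering caveat above, verify with pg_xlogdump that the target transaction's commit record really precedes the damage you are trying to avoid.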

To generate more detailed wal activity, set wal_level to logical and use test_decoding to view logical wal information:

SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'include-xids', '0');
 location  | xid |                   data
-----------+-----+-------------------------------------------
 0/1510D60 | 558 | BEGIN
 0/1510D60 | 558 | table public.test: INSERT: x[integer]:100
 0/15124B0 | 558 | COMMIT

This information will be used for logical replication in Postgres 10.

Friday, April 7, 2017

You might be aware that the sql standard reserves certain identifiers that cannot be used for naming user objects. Postgres follows that standard, with slight modifications. For example, you cannot create a table called all:

CREATE TABLE all (x INTEGER);
ERROR:  syntax error at or near "all"
LINE 1: CREATE TABLE all (x INTEGER);
                     ^

It is actually very easy to find what identifiers are reserved because they are listed in the documentation. More interestingly, they are also accessible via the sql function pg_get_keywords():

SELECT * FROM pg_get_keywords() ORDER BY 1;
       word        | catcode |                   catdesc
-------------------+---------+----------------------------------------------
 abort             | U       | unreserved
 absolute          | U       | unreserved
 access            | U       | unreserved
 action            | U       | unreserved
…

You can also do statistical analysis on it:

SELECT catcode, COUNT(*)
FROM pg_get_keywords()
GROUP BY catcode
ORDER BY 2 DESC;
 catcode | count
---------+-------
 U       |   276
 R       |    77
 C       |    48
 T       |    23

You can even join pg_get_keywords() to system tables to identify the use of system keywords by user objects, e.g., a check for column names:

SELECT nspname, relname, attname
FROM pg_namespace
     JOIN pg_class ON (pg_namespace.oid = relnamespace)
     JOIN pg_attribute ON (attrelid = pg_class.oid)
     JOIN pg_get_keywords() ON (word = attname)
WHERE nspname NOT IN ('pg_catalog', 'information_schema');

Read the documentation to understand the types of system keywords, i.e., using them is not always a problem.

Wednesday, April 5, 2017

There are four ways to cast a value to another data type:

SELECT int4 '123' + 1;
 ?column?
----------
      124

SELECT CAST('123' AS int4);
 int4
------
  123

SELECT '123'::int4 + 1;
 ?column?
----------
      124

SELECT int4('123') + 1;
 ?column?
----------
      124

The first two are sql standard, the third one is a Postgres-ism that is often convenient, and the final one relies on the existence of named functions to do the conversion. Some of the more complex data type specifications have shortcuts, e.g., timestamp with time zone can use ::timestamptz.

Monday, April 3, 2017

While the sql standard allows multiple nulls in a unique column, and that is how Postgres behaves, some database systems (e.g., ms sql) allow only a single null in such cases. Users migrating from other database systems sometimes want to emulate this behavior in Postgres. Fortunately, this can be done. First, let me show the default Postgres behavior:

CREATE TABLE nulltest (x INTEGER UNIQUE);
INSERT INTO nulltest VALUES (NULL);
INSERT INTO nulltest VALUES (NULL);

A column can be constrained to a single-null value by creating a partial expression index that indexes only null values (the partial part), and uses is null to store true in the unique index (the expression part):

DELETE FROM nulltest;
CREATE UNIQUE INDEX i_nulltest ON nulltest ((x IS NULL)) WHERE x IS NULL;
INSERT INTO nulltest VALUES (NULL);
INSERT INTO nulltest VALUES (NULL);
ERROR:  duplicate key value violates unique constraint "i_nulltest"
DETAIL:  Key ((x IS NULL))=(t) already exists.

This method can also be used to constrain a column to allow only a single null for each non-null composite indexed value:

CREATE TABLE nulltest2 (x INTEGER, y INTEGER);
CREATE UNIQUE INDEX i_nulltest2 ON nulltest2 (x, (y IS NULL)) WHERE y IS NULL;
INSERT INTO nulltest2 VALUES (1, NULL);
INSERT INTO nulltest2 VALUES (2, NULL);
INSERT INTO nulltest2 VALUES (2, NULL);
ERROR:  duplicate key value violates unique constraint "i_nulltest2"
DETAIL:  Key (x, (y IS NULL))=(2, t) already exists.
INSERT INTO nulltest2 VALUES (2, 3);
INSERT INTO nulltest2 VALUES (2, 3);

The i_nulltest2 index allows only one y null value for each x value. This can actually be useful in certain data models. This illustrates how expression and partial index features can be combined for some interesting effects.

Monday, March 27, 2017

Many database administrators use databases as simple data stores. However, relational systems can do much more, with advanced querying, analysis, and transaction control capabilities. Another often-overlooked area is constraints. Constraints check new and updated data against defined rules, and prevent changes that would violate those rules.

Constraints are odd in that they don't do anything if the data is consistent — it is more like an insurance policy against invalid data being entered into the database. If constraints are missing, there often are no initial problems, but over time erroneous or unexpected data gets in, causing problems with applications and reporting.

Do yourself a favor the next time you create a table — take the insurance and create useful check, not null, default, unique, primary key, and foreign key constraints. If your tables are already created, you can use alter table to add constraints to existing tables.
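A sketch of what "taking the insurance" looks like (the table, column names, and the referenced customers table are hypothetical):

```sql
CREATE TABLE orders (
    id          integer PRIMARY KEY,
    customer_id integer NOT NULL REFERENCES customers (id),
    quantity    integer NOT NULL DEFAULT 1 CHECK (quantity > 0),
    created_at  timestamptz NOT NULL DEFAULT now()
);

-- taking out the insurance later, on an existing table
ALTER TABLE orders ADD CONSTRAINT orders_quantity_max CHECK (quantity <= 10000);
```

Note that adding a constraint to an existing table validates the existing rows, so it also doubles as an audit of the data you already have.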

Friday, March 24, 2017

Ever wonder why using select column aliases in where clauses fails?

SELECT random() AS confusion
FROM generate_series(1,10)
WHERE confusion > 0.5;
ERROR:  column "confusion" does not exist
LINE 3: WHERE confusion > 0.5;

It is because the order in which select clauses are evaluated is specified by the sql standard, and it isn't top to bottom. Tom Lane's excellent email post goes into the details. The thread also mentions the unfortunate fact that users of union often actually want the behavior of union all.
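One common workaround is to wrap the query in a subquery, so the alias exists by the time the outer where clause is evaluated:

```sql
SELECT *
FROM (SELECT random() AS confusion
      FROM generate_series(1, 10)) AS sub
WHERE confusion > 0.5;
```

This also guarantees random() is evaluated only once per row, which the original query could not have done even if the alias were visible.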

Wednesday, March 22, 2017

Understanding how character sets, encodings, and collations work together can be confusing. I would like to explain them in this blog post:

Character Sets

Postgres databases can be initialized to support a variety of character sets. A character set is the full set of characters or glyphs that can be represented. Popular character sets are ascii (127 characters), Latin1 (ISO8859-1, 255 characters), and Unicode (128k+ characters).

Encodings

Encoding is the way a character set is stored. For single-byte character sets like ascii and Latin1, there is only a single way to encode the character set, i.e., as a single byte. For more complex multi-byte character sets there can be several ways to encode the character set. For example, Unicode can be encoded as UTF-8 (8-bit granularity), UTF-16 (16-bit), or UTF-32 (32-bit). (Postgres only supports UTF-8 for Unicode, server-side.) Asian languages also often support multiple encodings for a single character set, e.g., Big5, GB2312, Shift-JIS, EUC-JP.

Collations

Collations specify the order of characters in a character set. Again, for single-byte character sets, there is usually only one possible collation, which uses encoded byte values to provide the ordering, though this can lead to odd orderings, e.g., Z (uppercase) ordered before a (lowercase), or z ordered before á. For a complex character set like Unicode, the user can often select the desired collation.

It is possible to illustrate different collations of the same character set using Unicode:

SHOW lc_collate;
 lc_collate
-------------
 en_US.UTF-8

SELECT * FROM (VALUES ('a'), ('_b'), ('_C')) AS f(x) ORDER BY x;
 x
----
 a
 _b
 _C

-- the same, with the collation explicitly stated
SELECT * FROM (VALUES ('a'), ('_b'), ('_C')) AS f(x) ORDER BY x COLLATE "en_US.utf8";
 x
----
 a
 _b
 _C

-- with a collation based on byte values, matching ASCII ordering
SELECT * FROM (VALUES ('a'), ('_b'), ('_C')) AS f(x) ORDER BY x COLLATE "C";
 x
----
 _C
 _b
 a

You can see in the first two queries that collation en_US ignores case in letter comparisons, and ignores underscores. The last query, using C collation, reverses the order because the byte values for uppercase letters (C) are less than lowercase letters (b), and the underscore character is lower than lowercase letters (a).

This can be seen at the operating system level too:

$ echo $LANG
en_US.UTF-8
$ (echo 'a'; echo '_b'; echo '_C') | sort
a
_b
_C
$ (echo 'a'; echo '_b'; echo '_C') | LANG="en_US.UTF-8" sort
a
_b
_C
$ (echo 'a'; echo '_b'; echo '_C') | LANG="C" sort
_C
_b
a

The handling of whitespace can also be controlled by sql-level collations:

SELECT * FROM (VALUES (' '), ('a'), (' b')) AS f(x) ORDER BY x COLLATE "en_US.utf8";
 x
----
  
 a
  b

SELECT * FROM (VALUES (' '), ('a'), (' b')) AS f(x) ORDER BY x COLLATE "C";
 x
----
  
  b
 a

In the first query, while a single space is ordered before a, it does not cause space-b to order before a; it does in the C collation. Things get more complicated with letters with accents, tildes, and pictographic languages.

Postgres relies on the operating system locale to support character sets, encodings, and collations. On Linux, you can see the list of supported locales via locale -a. While initdb sets the default locale for the cluster based on environment variables, the locale can be overridden in a variety of ways, including by CREATE TABLE.

Monday, March 20, 2017

Postgres supports both traditional join syntax, which uses the where clause to specify joined columns, and ansi join syntax, which uses the word join in the from clause. While both syntaxes can be used for inner joins, only the ansi join syntax supports outer joins in Postgres.

Because column restrictions like col = 4 are not related to joins, you would think that restrictions have the same effect whether they appear in the where clause or as part of a join clause, e.g., a outer join b on a.x = b.x and col = 4. However, this is not always true. Restrictions in the join clause are processed during joins, while where clause restrictions are processed after joins.

This is only significant in outer joins (and cross joins) because columns from unjoined rows are manufactured by outer joins. Here is an example:

CREATE TABLE test1 (x INTEGER);
CREATE TABLE test2 (x INTEGER);
INSERT INTO test1 VALUES (1), (2), (3);
INSERT INTO test2 VALUES (1), (2);

SELECT * FROM test1 LEFT JOIN test2 ON test1.x = test2.x;
 x |   x
---+--------
 1 |      1
 2 |      2
 3 | (null)

SELECT * FROM test1 LEFT JOIN test2 ON test1.x = test2.x AND test2.x <= 2;
 x |   x
---+--------
 1 |      1
 2 |      2
 3 | (null)

SELECT * FROM test1 LEFT JOIN test2 ON test1.x = test2.x WHERE test2.x <= 2;
 x | x
---+---
 1 | 1
 2 | 2

As you can see, the first select performs the outer join with no column restrictions. The second select returns the same result because the column restriction matches all rows in test2 (before the outer join happens). The final select operates on the result of the join and since null <= 2 returns null, which behaves like false, the third output row is suppressed.

What is even more interesting is seeing how a column restriction can cause a query that would normally be fully joined to return unjoined column values:

INSERT INTO test2 VALUES (3);

SELECT * FROM test1 LEFT JOIN test2 ON test1.x = test2.x;
 x | x
---+---
 1 | 1
 2 | 2
 3 | 3

SELECT * FROM test1 LEFT JOIN test2 ON test1.x = test2.x AND test2.x <= 2;
 x |   x
---+--------
 1 |      1
 2 |      2
 3 | (null)

SELECT * FROM test1 LEFT JOIN test2 ON test1.x = test2.x WHERE test2.x <= 2;
 x | x
---+---
 1 | 1
 2 | 2

In the second select, the test2.x value of 3 is excluded from the join, resulting in an outer-join value of null for the column.

In summary, be careful of the placement of restriction clauses on inner-side tables in outer-join queries and clearly determine whether you want restrictions to happen at join time or post-join.

Friday, March 17, 2017

As open source gains popularity in the enterprise, there is increased study of open source communities and how they function. Those studying such things often ask about Postgres because of its unusually healthy community and recent successes.

While I was in India in February, I was interviewed by an open source magazine; an excerpt from that interview is now online. It covers open source leadership, encouraging new developers, and healthy software ecosystems.

Wednesday, March 15, 2017

Postgres uses native cpu alignment to store values in a row. This allows blocks to be copied unchanged from disk into shared buffers and accessed as local variables, as outlined in this presentation.

This can be illustrated by pg_column_size(). First, an empty row size:

SELECT pg_column_size(row());
 pg_column_size
----------------
             24

Predictably, 2-byte, 4-byte, and 8-byte integers each increase the length by their own size:

SELECT pg_column_size(row(1::int2));
 pg_column_size
----------------
             26

SELECT pg_column_size(row(1::int4));
 pg_column_size
----------------
             28

SELECT pg_column_size(row(1::int8));
 pg_column_size
----------------
             32

You can also see alignment effects — the first query has no alignment padding, while the second query has a two-byte padding because the row length is unchanged:

SELECT pg_column_size(row(1::int2, 1::int2, 1::int4));
 pg_column_size
----------------
             32

SELECT pg_column_size(row(1::int2, 1::int4));
 pg_column_size
----------------
             32

You can see the alignment requirements for each data type by querying the system catalogs:

SELECT typalign, typname FROM pg_type ORDER BY 1, 2;
 typalign | typname
----------+---------
 c        | bool
 c        | char
 c        | cstring
 c        | name
 c        | unknown
 c        | uuid
 d        | _box
 d        | _circle
 d        | _float8
…

The Postgres source code file src/include/catalog/pg_type.h documents the meaning of the letters:

'c' = CHAR alignment, ie no alignment needed.
's' = SHORT alignment (2 bytes on most machines).
'i' = INT alignment (4 bytes on most machines).
'd' = DOUBLE alignment (8 bytes on many machines, but by no means all).

It is possible to define table columns in an order that minimizes padding. Someday Postgres might do this automatically.
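For example, placing a narrow column before a wider one forces padding out to the wider column's alignment boundary, so ordering columns from widest to narrowest alignment packs more tightly (the exact sizes depend on the platform, so no figures are asserted here):

```sql
-- int2 first forces padding before the int8
SELECT pg_column_size(row(1::int2, 1::int8, 1::int4));

-- widest-to-narrowest ordering needs no padding between columns
SELECT pg_column_size(row(1::int8, 1::int4, 1::int2));
```

The same principle applies to CREATE TABLE column order, where the savings are multiplied by the row count.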

The 24-byte row header includes an 8-bit mask to record null values. You can see below that the 8-bit mask is sufficient for eight nulls, but the ninth null requires the null bit mask to be expanded, with additional alignment:

SELECT pg_column_size(row(NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL));
 pg_column_size
----------------
             24

SELECT pg_column_size(row(NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL));
 pg_column_size
----------------
             32
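The 24-to-32 jump can be reproduced with a little arithmetic (a sketch based on the heap tuple header layout, where the fixed part of the header ends at byte 23 and the whole header is rounded up to an 8-byte boundary):

```python
# Sketch: heap tuple header size as the null bitmap grows.
# Assumes the fixed header occupies 23 bytes (as in HeapTupleHeaderData)
# and MAXALIGN is 8, which is typical on 64-bit platforms.

def header_size(ncols, maxalign=8):
    """Header bytes for a row with ncols nullable columns."""
    bitmap_bytes = (ncols + 7) // 8 if ncols else 0  # one bit per column
    raw = 23 + bitmap_bytes
    return (raw + maxalign - 1) // maxalign * maxalign  # round up

print(header_size(0))   # 24 -- empty row()
print(header_size(8))   # 24 -- one bitmap byte still fits within padding
print(header_size(9))   # 32 -- second bitmap byte forces another MAXALIGN
```

This matches the query results: eight nulls fit in the padding the header already carries, while the ninth forces an extra aligned chunk.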

These things all happen automatically, but sometimes it is interesting to see it working.

Monday, March 13, 2017

Application_name might be one of those Postgres settings that you have seen in various places but whose purpose you never understood. It is true that setting application_name doesn't change the behavior of Postgres (with one small exception), but it is very useful for monitoring. Its value appears in pg_stat_activity and can be prepended to every log line with log_line_prefix.

Applications can set application_name when they connect as part of the connection string. Psql, pg_dump, and other tools set application_name. (They actually set fallback_application_name so application_name can be overridden by users.)
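For example, a client could include it in a libpq-style keyword/value connection string (the host, database, and application names below are hypothetical):

```python
# Sketch: building a libpq-style DSN that sets application_name.
# libpq connection strings are space-separated key=value keywords.
params = {
    "host": "localhost",
    "dbname": "appdb",
    "application_name": "nightly_batch",  # shows up in pg_stat_activity
}
dsn = " ".join(f"{k}={v}" for k, v in params.items())
print(dsn)  # host=localhost dbname=appdb application_name=nightly_batch
```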

A more interesting use of application_name is to change it while the application is running. pgAdmin updates the application_name when a user changes screens so administrators can know exactly what screen users are on. While you can't change application_name while a query is running, you can change it between queries, which allows you to monitor the progress of long-running batch jobs, e.g.:

SELECT application_name FROM pg_stat_activity WHERE pid = pg_backend_pid();
 application_name
------------------
 psql

SET application_name = 'demo';

SELECT application_name FROM pg_stat_activity WHERE pid = pg_backend_pid();
 application_name
------------------
 demo

Some people might complain about the overhead of a separate query to update application_name. There are two solutions. One solution is to send two queries in a single string, e.g., SET application_name = 'demo2'; SELECT 100. Some libraries like libpq support this, but psql sends these as two separate queries — you have to enable log_statement to see how queries are sent to the server.

Another approach is to bundle the setting of application_name inside the query:

SELECT 100 FROM set_config('application_name', 'demo3', false);

SELECT application_name FROM pg_stat_activity WHERE pid = pg_backend_pid();
 application_name
------------------
 demo3

Unfortunately, there is no way to guarantee that set_config() will be run first — below, set_config() runs first in the first query but second in the second query, because the optimizer is free to evaluate from-clause expressions in any order:

SELECT 100 FROM set_config('application_name', 'demo4', false), pg_sleep(5);

SELECT 100 FROM pg_sleep(5), set_config('application_name', 'demo5', false);

As you can see, application_name gives you a way to monitor what Postgres is doing, not only at connection time but during the life of the connection.

Friday, March 10, 2017

Postgres has many logging options, e.g., log_statement, log_min_duration_statement, and log_line_prefix. These values can be set at various levels, which I have already covered. One level that has particular flexibility is postgresql.conf. Settings in this file affect all running sessions, unless the values are overridden at lower levels.

When debugging, it is often hard to know which log settings to enable. One approach is to enable all settings beforehand, but that can generate lots of log traffic and be hard to interpret. Ideally you can enable just the settings you need at the time you need them — this is where postgresql.conf comes in. With postgresql.conf or alter system, you can change any logging settings you want and then signal a reload by either sending a sighup signal to the server, running "pg_ctl reload", or executing "SELECT 