The log

Why not force writing to disk the data pages themselves, why duplicate the work instead? It appears to be more efficient.



First, WAL is a sequential stream of append-only data. Even HDD disks do the job of sequential writing fine. However, the data themselves are written in a random fashion since pages are spread across the disk more or less in disorder.



Second, a WAL record can be way smaller than the page.



Third, when writing to disk we do not need to take care of maintaining the consistency of data on disk at every point in time (this requirement makes life really difficult).



And fourth, as we will see later, WAL (once it is available) can be used not only for recovery, but also for backup and replication.



Changes to pages in the buffer cache (mostly table and index pages) — since it takes some time for the page changed to get to disk.

Transactions' commits and aborts — since a change of the status is done in XACT buffers and it also takes some time for the change to get to disk.

File operations (creation and deletion of files and directories, such as creation of files during creation of a table) — since these operations must be synchronous with the changes to data.

Operations with unlogged tables — their name is self-explanatory.

Operations with temporary tables — logging makes no sense since the lifetime of such tables does not exceed the lifetime of the session that created them.

Logical structure

The ID of the transaction that the record relates to.

The resource manager — the system component responsible for the record.

The checksum (CRC) — permits to detect data corruption.

The length of the record and link to the preceding record.

pg_waldump -r list

Physical structure

$PGDATA/pg_wal

--wal-segsize

pg_current_wal_lsn

pg_current_wal_insert_lsn

=> SELECT pg_current_wal_lsn(), pg_current_wal_insert_lsn();

pg_current_wal_lsn | pg_current_wal_insert_lsn --------------------+--------------------------- 0/331E4E64 | 0/331E4EA0 (1 row)

pg_lsn

=> SELECT file_name, upper(to_hex(file_offset)) file_offset FROM pg_walfile_name_offset('0/331E4E64');

file_name | file_offset --------------------------+------------- 000000010000000000000033 | 1E4E64 \ /\ / time 0/331E4E64 line

$PGDATA/pg_wal/

=> SELECT * FROM pg_ls_waldir() WHERE name = '000000010000000000000033';

name | size | modification --------------------------+----------+------------------------ 000000010000000000000033 | 16777216 | 2019-07-08 20:24:13+03 (1 row)

Write-ahead logging

=> CREATE TABLE wal(id integer); => INSERT INTO wal VALUES (1);

=> CREATE EXTENSION pageinspect;

=> BEGIN; => SELECT pg_current_wal_insert_lsn();

pg_current_wal_insert_lsn --------------------------- 0/331F377C (1 row)

=> UPDATE wal set id = id + 1;

=> SELECT pg_current_wal_insert_lsn();

pg_current_wal_insert_lsn --------------------------- 0/331F37C4 (1 row)

=> SELECT lsn FROM page_header(get_raw_page('wal',0));

lsn ------------ 0/331F37C4 (1 row)

pg_current_wal_insert_lsn

=> COMMIT;

=> SELECT pg_current_wal_insert_lsn();

pg_current_wal_insert_lsn --------------------------- 0/331F37E8 (1 row)

=> SELECT pg_current_wal_lsn(), pg_current_wal_insert_lsn();

pg_current_wal_lsn | pg_current_wal_insert_lsn --------------------+--------------------------- 0/331F37E8 | 0/331F37E8 (1 row)

pg_lsn

=> SELECT '0/331F37E8'::pg_lsn - '0/331F377C'::pg_lsn;

?column? ---------- 108 (1 row)

pg_waldump

postgres

postgres$ /usr/lib/postgresql/11/bin/pg_waldump -p /var/lib/postgresql/11/main/pg_wal -s 0/331F377C -e 0/331F37E8 000000010000000000000033

rmgr: Heap len (rec/tot): 69/ 69, tx: 101085, lsn: 0/331F377C, prev 0/331F3014, desc: HOT_UPDATE off 1 xmax 101085 ; new off 2 xmax 0, blkref #0: rel 1663/16386/33081 blk 0

rmgr: Transaction len (rec/tot): 34/ 34, tx: 101085, lsn: 0/331F37C4, prev 0/331F377C, desc: COMMIT 2019-07-08 20:24:13.945435 MSK

blkref

=> SELECT pg_relation_filepath('wal');

pg_relation_filepath ---------------------- base/16386/33081 (1 row)

Recovery

$PGDATA/global/pg_control

pg_controldata

postgres$ /usr/lib/postgresql/11/bin/pg_controldata -D /var/lib/postgresql/11/main | grep state

Database cluster state: in production

But there are exceptions. Certain records are created as FPI (full page image), which overrides page contents and can therefore be applied to the page regardless of its state. A change to the transaction status can be applied to any version of a XACT page, so there is no need to store LSN inside such pages.



postmaster