Tuple header

xmin

xmax

xmin

xmax

xmax

xmax

xmin

xmax

xmin

xmax

infomask — several bits that determine the properties of a given tuple. There are quite a few of them, and we will discuss each over time.

— several bits that determine the properties of a given tuple. There are quite a few of them, and we will discuss each over time. ctid — a reference to the next, more recent, version of the same row. ctid of the newest, up-to-date, row version references that very version. The number is in the (x,y) form, where x is the number of the page and y is the order number of the pointer in the array.

— a reference to the next, more recent, version of the same row. of the newest, up-to-date, row version references that very version. The number is in the form, where is the number of the page and is the order number of the pointer in the array. The NULLs bitmap, which marks those columns of a given version that contain a NULL. NULL is not a regular value of data types, and therefore, we have to store this characteristic separately.

Insert

=> CREATE TABLE t( id serial, s text ); => CREATE INDEX ON t(s);

=> BEGIN; => INSERT INTO t(s) VALUES ('FOO');

=> SELECT txid_current();

txid_current -------------- 3664 (1 row)

heap_page_items

=> SELECT * FROM heap_page_items(get_raw_page('t',0)) \gx

-[ RECORD 1 ]------------------- lp | 1 lp_off | 8160 lp_flags | 1 lp_len | 32 t_xmin | 3664 t_xmax | 0 t_field3 | 0 t_ctid | (0,1) t_infomask2 | 2 t_infomask | 2050 t_hoff | 24 t_bits | t_oid | t_data | \x0100000009464f4f

=> SELECT '(0,'||lp||')' AS ctid, CASE lp_flags WHEN 0 THEN 'unused' WHEN 1 THEN 'normal' WHEN 2 THEN 'redirect to '||lp_off WHEN 3 THEN 'dead' END AS state, t_xmin as xmin, t_xmax as xmax, (t_infomask & 256) > 0 AS xmin_commited, (t_infomask & 512) > 0 AS xmin_aborted, (t_infomask & 1024) > 0 AS xmax_commited, (t_infomask & 2048) > 0 AS xmax_aborted, t_ctid FROM heap_page_items(get_raw_page('t',0)) \gx

-[ RECORD 1 ]-+------- ctid | (0,1) state | normal xmin | 3664 xmax | 0 xmin_commited | f xmin_aborted | f xmax_commited | f xmax_aborted | t t_ctid | (0,1)

Added a zero to the pointer number to make it look like a t_ctid : (page number, pointer number).

: (page number, pointer number). Interpreted the status of the lp_flags pointer. It is «normal» here, which means that the pointer actually references a row version. We will discuss other values later.

pointer. It is «normal» here, which means that the pointer actually references a row version. We will discuss other values later. Of all information bits, we selected only two pairs so far. xmin_committed and xmin_aborted bits show whether the transaction with the ID xmin is committed (rolled back). A pair of similar bits relates to the transaction with the ID xmax .

xmin

xmin_committed

xmin_aborted

ctid

xmax

xmax_aborted

=> CREATE FUNCTION heap_page(relname text, pageno integer) RETURNS TABLE(ctid tid, state text, xmin text, xmax text, t_ctid tid) AS $$ SELECT (pageno,lp)::text::tid AS ctid, CASE lp_flags WHEN 0 THEN 'unused' WHEN 1 THEN 'normal' WHEN 2 THEN 'redirect to '||lp_off WHEN 3 THEN 'dead' END AS state, t_xmin || CASE WHEN (t_infomask & 256) > 0 THEN ' (c)' WHEN (t_infomask & 512) > 0 THEN ' (a)' ELSE '' END AS xmin, t_xmax || CASE WHEN (t_infomask & 1024) > 0 THEN ' (c)' WHEN (t_infomask & 2048) > 0 THEN ' (a)' ELSE '' END AS xmax, t_ctid FROM heap_page_items(get_raw_page(relname,pageno)) ORDER BY lp; $$ LANGUAGE SQL;

=> SELECT * FROM heap_page('t',0);

ctid | state | xmin | xmax | t_ctid -------+--------+------+-------+-------- (0,1) | normal | 3664 | 0 (a) | (0,1) (1 row)

xmin

xmax

=> SELECT xmin, xmax, * FROM t;

xmin | xmax | id | s ------+------+----+----- 3664 | 0 | 1 | FOO (1 row)

Commit

Was the transaction xmin completed? If not, the created tuple must not be visible.

This is checked by looking through another structure, which is located in the shared memory of the instance and called ProcArray. This structure holds a list of all active processes, along with the ID of the current (active) transaction for each. If the transaction was completed, then was it committed or rolled back? If it was rolled back, the tuple must not be visible either.

This is just what XACT is needed for. But it is expensive to check XACT each time, although last pages of XACT are stored in buffers in the shared memory. Therefore, once figured out, the transaction status is written to xmin_committed and xmin_aborted bits of the tuple. If any of these bits is set, the transaction status is treated as known and the next transaction will not need to check XACT.

=> COMMIT;

=> SELECT * FROM heap_page('t',0);

ctid | state | xmin | xmax | t_ctid -------+--------+------+-------+-------- (0,1) | normal | 3664 | 0 (a) | (0,1) (1 row)

xmin

=> SELECT * FROM t;

id | s ----+----- 1 | FOO (1 row)

=> SELECT * FROM heap_page('t',0);

ctid | state | xmin | xmax | t_ctid -------+--------+----------+-------+-------- (0,1) | normal | 3664 (c) | 0 (a) | (0,1) (1 row)

Delete

xmax

xmax_aborted

xmax

xmax

=> BEGIN; => DELETE FROM t; => SELECT txid_current();

txid_current -------------- 3665 (1 row)

xmax

=> SELECT * FROM heap_page('t',0);

ctid | state | xmin | xmax | t_ctid -------+--------+----------+------+-------- (0,1) | normal | 3664 (c) | 3665 | (0,1) (1 row)

Abort

=> ROLLBACK; => SELECT * FROM heap_page('t',0);

ctid | state | xmin | xmax | t_ctid -------+--------+----------+------+-------- (0,1) | normal | 3664 (c) | 3665 | (0,1) (1 row)

xmax_aborted

xmax

=> SELECT * FROM t;

id | s ----+----- 1 | FOO (1 row)

=> SELECT * FROM heap_page('t',0);

ctid | state | xmin | xmax | t_ctid -------+--------+----------+----------+-------- (0,1) | normal | 3664 (c) | 3665 (a) | (0,1) (1 row)

Update

=> BEGIN; => UPDATE t SET s = 'BAR'; => SELECT txid_current();

txid_current -------------- 3666 (1 row)

=> SELECT * FROM t;

id | s ----+----- 1 | BAR (1 row)

=> SELECT * FROM heap_page('t',0);

ctid | state | xmin | xmax | t_ctid -------+--------+----------+-------+-------- (0,1) | normal | 3664 (c) | 3666 | (0,2) (0,2) | normal | 3666 | 0 (a) | (0,2) (2 rows)

xmax

xmax_aborted

xmax

=> COMMIT;

Indexes

ctid

xmin

xmax

=> SELECT itemoffset, ctid FROM bt_page_items('t_s_idx',1);

itemoffset | ctid ------------+------- 1 | (0,2) 2 | (0,1) (2 rows)

Virtual transactions

=> BEGIN; => SELECT txid_current_if_assigned();

txid_current_if_assigned -------------------------- (1 row)

=> UPDATE accounts SET amount = amount - 1.00; => SELECT txid_current_if_assigned();

txid_current_if_assigned -------------------------- 3667 (1 row)

=> COMMIT;

Subtransactions

Savepoints

=> TRUNCATE TABLE t; => BEGIN; => INSERT INTO t(s) VALUES ('FOO'); => SELECT txid_current();

txid_current -------------- 3669 (1 row)

=> SELECT xmin, xmax, * FROM t;

xmin | xmax | id | s ------+------+----+----- 3669 | 0 | 2 | FOO (1 row)

=> SELECT * FROM heap_page('t',0);

ctid | state | xmin | xmax | t_ctid -------+--------+------+-------+-------- (0,1) | normal | 3669 | 0 (a) | (0,1) (1 row)

=> SAVEPOINT sp; => INSERT INTO t(s) VALUES ('XYZ'); => SELECT txid_current();

txid_current -------------- 3669 (1 row)

txid_current

=> SELECT xmin, xmax, * FROM t;

xmin | xmax | id | s ------+------+----+----- 3669 | 0 | 2 | FOO 3670 | 0 | 3 | XYZ (2 rows)

=> SELECT * FROM heap_page('t',0);

ctid | state | xmin | xmax | t_ctid -------+--------+------+-------+-------- (0,1) | normal | 3669 | 0 (a) | (0,1) (0,2) | normal | 3670 | 0 (a) | (0,2) (2 rows)

=> ROLLBACK TO sp; => INSERT INTO t VALUES ('BAR'); => SELECT xmin, xmax, * FROM t;

xmin | xmax | id | s ------+------+----+----- 3669 | 0 | 2 | FOO 3671 | 0 | 4 | BAR (2 rows)

=> SELECT * FROM heap_page('t',0);

ctid | state | xmin | xmax | t_ctid -------+--------+----------+-------+-------- (0,1) | normal | 3669 | 0 (a) | (0,1) (0,2) | normal | 3670 (a) | 0 (a) | (0,2) (0,3) | normal | 3671 | 0 (a) | (0,3) (3 rows)

=> COMMIT; => SELECT xmin, xmax, * FROM t;

xmin | xmax | id | s ------+------+----+----- 3669 | 0 | 2 | FOO 3671 | 0 | 4 | BAR (2 rows)

=> SELECT * FROM heap_page('t',0);

ctid | state | xmin | xmax | t_ctid -------+--------+----------+-------+-------- (0,1) | normal | 3669 (c) | 0 (a) | (0,1) (0,2) | normal | 3670 (a) | 0 (a) | (0,2) (0,3) | normal | 3671 (c) | 0 (a) | (0,3) (3 rows)

=> BEGIN;

BEGIN

=> BEGIN;

WARNING: there is already a transaction in progress BEGIN

=> COMMIT;

COMMIT

=> COMMIT;

WARNING: there is no transaction in progress COMMIT

Errors and operation atomicity

=> BEGIN; => SELECT * FROM t;

id | s ----+----- 2 | FOO 4 | BAR (2 rows)

=> UPDATE t SET s = repeat('X', 1/(id-4));

ERROR: division by zero

=> SELECT * FROM t;

ERROR: current transaction is aborted, commands ignored until end of transaction block

=> COMMIT;

ROLLBACK

=> SELECT * FROM heap_page('t',0);

ctid | state | xmin | xmax | t_ctid -------+--------+----------+-------+-------- (0,1) | normal | 3669 (c) | 3672 | (0,4) (0,2) | normal | 3670 (a) | 0 (a) | (0,2) (0,3) | normal | 3671 (c) | 0 (a) | (0,3) (0,4) | normal | 3672 | 0 (a) | (0,4) (4 rows)

=> \set ON_ERROR_ROLLBACK on => BEGIN; => SELECT * FROM t;

id | s ----+----- 2 | FOO 4 | BAR (2 rows)

=> UPDATE t SET s = repeat('X', 1/(id-4));

ERROR: division by zero

=> SELECT * FROM t;

id | s ----+----- 2 | FOO 4 | BAR (2 rows)

=> COMMIT;

Well, we've already discussed isolation and made a digression regarding the low-level data structure . And we've finally reached the most fascinating thing, that is, row versions (tuples).As already mentioned, several versions of each row can be simultaneously available in the database. And we need to somehow distinguish one version from another one. To this end, each version is labeled with its effective «time» () and expiration «time» (). Quotation marks denote that a special incrementing counter is used rather than the time itself. And this counter is the transaction identifier.(As usual, in reality this is more complicated: the transaction ID cannot always increment due to a limited bit depth of the counter. But we will explore more details of this when our discussion reaches freezing.)When a row is created, the value ofis set equal to the ID of the transaction that performed the INSERT command, whileis not filled in.When a row is deleted, thevalue of the current version is labeled with the ID of the transaction that performed DELETE.An UPDATE command actually performs two subsequent operations: DELETE and INSERT. In the current version of the row,is set equal to the ID of the transaction that performed UPDATE. Then a new version of the same row is created, in which the value ofis the same asof the previous version.andfields are included in the header of a row version. In addition to these fields, the tuple header contains other ones, such as:As a result, the header appears pretty large: 23 bytes per each tuple at a minimum, but usually larger because of the NULLs bitmap. If a table is «narrow» (that is, it contains few columns), the overhead bytes can occupy more space that the useful information.Let's look in more detail at how the operations on rows are performed at a low level, and we start with an insert.To experiment, we will create a new table with two columns and an index on one of them:We start a transaction to insert a row.This is the ID of our current transaction:Let's look into the contents of the page. Thefunction from the «pageinspect» extension enables us to get information on the pointers and row versions:Note that the word «heap» in PostgreSQL denotes tables. This is one more weird usage of a term: a heap is a known data structure , which has nothing to do with a table. This word is used here in the sense that «all is heaped up», unlike in ordered indexes.This function shows the data «as is», in a format that is difficult to comprehend. To clarify the things, we leave only part of the information and interpret it:We did the following:What do we observe? When a row is inserted, in the table page a pointer appears that has number 1 and references the first and the only version of the row.Thefield in the tuple is filled with the ID of the current transaction. Because the transaction is still active, bothandbits are unset.Thefield of the row version references the same row. It means that no newer version is available.Thefield is filled with the conventional number 0 since the tuple is not deleted, that is, up-to-date. Transactions will ignore this number because of thebit set.Let's move one more step to improving the readability by appending information bits to transaction IDs. And let's create the function since we will need the query more than once:What is happening in the header of the row version it is much clearer in this form:We can get similar information, but far less detailed, from the table itself by usingandpseudo-columns:When a transaction is successful, its status must be remembered, that is, the transaction must be marked as committed. To this end, the XACT structure is used. (Before version 10 it was called CLOG (commit log), and you are still likely to come across this name.)XACT is not a table of the system catalog, but files in the PGDATA/pg_xact directory. Two bits are allocated in these files for each transaction — «committed» and «aborted» — exactly the same way as in the tuple header. This information is spread across several files only for convenience; we will get back to this when we discuss freezing. PostgreSQL works with these files page by page, as with all others.So, when a transaction is committed, the «committed» bit is set for this transaction in XACT. And this is all that happens when the transaction is committed (although we do not mention the write-ahead log yet).When some other transaction accesses the table page we were just looking at, the former will have to answer a few questions.Why does not the transaction that performs the insert set these bits? When an insert is being performed, the transaction is yet unaware of whether it will be completed successfully. And at the commit time it's already unclear which rows and in which pages were changed. There can be a lot of such pages, and it is impractical to keep track of them. Besides, some of the pages can be evicted to disk from the buffer cache; to read them again in order to change the bits would mean a considerable slowdown of the commit.The reverse side of the cost saving is that after the updates, any transaction (even the one performing SELECT) can begin changing data pages in the buffer cache.So, we commit the change.Nothing has changed in the page (but we know that the transactions status is already written to XACT):Now a transaction that first accesses the page will need to determine the status of the transactionand will write it to the information bits:When a row is deleted, the ID of the current deleting transaction is written to thefield of the up-to-date version and thebit is reset.Note that the value ofcorresponding to the active transaction works as a row lock. If another transaction is going to update or delete this row, it will have to wait until thetransaction completes. We will talk about locks in more detail later. At this point, only note that the number of row locks is not limited at all. They do not occupy memory, and the system performance is not affected by that number. However, long lasting transactions have other drawbacks, which will also be discussed later.Let's delete a row.We see that the transaction ID is written to thefield, but information bits are unset:Abort of a transaction works similarly to commit, except that the «aborted» bit is set in XACT. An abort is done as fast as a commit. Although the command is called ROLLBACK, the changes are not rolled back: everything that the transaction has already changed, remains untouched.When accessing the page, the status will be checked and the hint bitwill be set. Although the numberitself will be still in the page, it will not be looked at.An update works as if the current version is deleted first and then a new one is inserted.The query returns one row (the new version):But we can see both versions in the page:The deleted version is labeled with the ID of the current transaction in thefield. Moreover, this value has overwritten the old one since the previous transaction was rolled back. And thebit is reset since the status of the current transaction is unknown yet.The first version of the row is now referencing the second, as a newer one.The index page now contains the second pointer and second row, which references the second version in the table page.The same way as for a delete, the value ofin the first version indicates that the row is locked.Lastly, we commit the transaction.We were talking only about table pages so far. But what happens inside indexes?Information in index pages highly depends on the specific index type. Moreover, even one type of indexes can have different kinds of pages. For example: a B-tree has the metadata page and «normal» pages.Nevertheless, an index page usually has an array of pointers to the rows and rows themselves (just like table pages). Besides, some space at the end of a page is allocated for special data.Rows in indexes can also have different structures depending on the index type. For example: in an B-tree, rows pertinent to leaf pages contain the value of the indexing key and a reference () to the appropriate table row. In general, an index can be structured quite a different way.The main point is that in indexes of any type there are no row versions. Or we can consider each row to be represented by only one version. In other words, the header of the index row does not contain theandfields. For now we can assume that references from the index point to all versions of table rows. So to make out which of the row versions are visible to a transaction, PostgreSQL needs to look into the table. (As usual, this is not the whole story. Sometimes the visibility map enables optimizing the process, but we will discuss this later.)Here, in the index page, we find pointers to both versions: the up-to-date and previous:In practice, PostgreSQL takes advantage of an optimization that permits to «sparingly» expends transaction IDs.If a transaction only reads data, it does not affect the visibility of tuple at all. Therefore, first the backend process assigns a virtual ID (virtual xid) to the transaction. This ID consists of the process identifier and a sequential number.Assignment of this virtual ID does not require synchronization between all the processes and is therefore performed very quickly. We will learn another reason of using virtual IDs when we discuss freezing.Data snapshots do not take into account virtual ID at all.At different points in time, the system can have virtual transactions with IDs that were already used, and this is fine. But this ID cannot be written to data pages since when the page is accessed next time, the ID can become meaningless.But if a transaction begins to change data, it receives a true, unique, transaction ID.In SQL, savepoints are defined, which permit rolling back some operations of the transaction without its complete abortion. But this is incompatible with the above model since the transaction status is one for all the changes and no data is physically rolled back.To implement this functionality, a transaction with a savepoint is divided into several separate subtransactions whose statuses can be managed separately.Subtrabsactions have their own IDs (greater than the ID of the main transaction). The statuses of subtransactions are written to XACT in a usual way, but the final status depends on the status of the main transaction: if it is rolled back, all subtransactions are rolled back as well.Information about subtransactions nesting is stored in files of the PGDATA/pg_subtrans directory. These files are accessed through buffers in the shared memory of the instance, which are structured the same way as XACT buffers.Do not confuse subtransactions with autonomous transactions. Autonomous transactions in no way depend on one another, while subtransactions do depend. There are no autonomous transactions in the regular PostgreSQL, which is, perhaps, for the better: they are actually needed extremely rarely, and their availability in other DBMS invites abuse, which everyone suffers.Let's clear the table, start a transaction and insert a row:Now we establish a savepoint and insert another row.Note that thefunction returns the ID of the main transaction rather than of the subtransaction.Let's rollback to the savepoint and insert the third row.In the page, we continue to see the row that was added by the rolled back subtransaction.Committing the changes.It is clearly seen now that each subtransaction has its own status.Note that SQL does not permit explicit use of subtransactions, that is, you cannot start a new transaction before you complete the current one. This technique gets implicitly involved when savepoints are used and also when handling PL/pgSQL exceptions, as well as in some other, more exotic situations.What happens if an error occurs while the operation is being performed? For example, like this:An error occurred. Now the transaction is treated as aborted and no operations are permitted in it:And even if we try to commit the changes, PostgreSQL will report the rollback:Why is it impossible to continue execution of the transaction after a failure? The thing is that the error could occur so that we would get access to part of the changes, that is, the atomicity would be broken not only for the transaction, but even for a single operator. For instance, in our example the operator could have updated one row before the error occurred:It's worth noting that psql has a mode that allows continuing the transaction after failure, as if the effects of the erroneous operator were rolled back.It's easy to figure out that in this mode, psql actually establishes an implicit savepoint before each command and initiates a rollback to it in the event of failure. This mode is not used by default since establishing savepoints (even without a rollback to them) entails a significant overhead.