Patch: Write Amplification Reduction Method (WARM)

Hi All,

As previously discussed [1], WARM is a technique to reduce write

amplification when an indexed column of a table is updated. HOT fails to

handle such updates and ends up inserting a new index entry in all indexes

of the table, irrespective of whether the index key has changed or not for

a specific index. The problem was highlighted by Uber's blog post [2], but

it was a well-known problem and affects many workloads.

Simon brought up the idea originally within 2ndQuadrant and I developed it

further with inputs from my other colleagues and community members.

There were two important problems identified during the earlier discussion.

This patch addresses those issues in a simplified way. There are other

complex ideas to solve those issues, but as the results demonstrate, even a

simple approach will go a long way in improving the performance characteristics
of many workloads, while keeping the code complexity relatively low.

Two problems have so far been identified with the WARM design.

“*Duplicate Scan*” - Claudio Freire brought up a design flaw which may lead

an IndexScan to return the same tuple twice or more, thus impacting the
correctness of the solution.

“*Root Pointer Search*” - Andres raised the point that it could be

inefficient to find the root line pointer for a tuple in the HOT or WARM

chain since it may require us to scan through the entire page.

The Duplicate Scan problem has correctness issues and so could block WARM
completely. We propose the following solution.

We discussed a few ideas to address the "Duplicate Scan" problem. For

example, we can teach Index AMs to discard any duplicate (key, CTID) insert

requests. Or we could guarantee uniqueness by only allowing updates

in one lexical order. While the former is a more complete solution to avoid

duplicate entries, searching through a large number of keys for non-unique

indexes could be a drag on performance. The latter approach may not be

sufficient for many workloads. Also tracking increment/decrement for many

indexes will be non-trivial.

There is another problem with allowing many index entries pointing to the

same WARM chain. It will be non-trivial to know how many index entries are

currently pointing to the WARM chain and index/heap vacuum will throw up

more challenges.

Instead, what I would like to propose and the patch currently implements is

to restrict WARM update to once per chain. So the first non-HOT update to a

tuple or a HOT chain can be a WARM update. The chain can then be further HOT
updated any number of times, but it cannot be WARM updated again. This

might look too restrictive, but it can still bring down the number of

regular updates by almost 50%. Further, if we devise a strategy to convert

a WARM chain back to HOT chain, it can again be WARM updated. (This part is

currently not implemented). A good side effect of this simple strategy is

that we know there can be at most two index entries pointing to any given WARM
chain.
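For illustration, the once-per-chain rule boils down to update-kind decision
logic along these lines (a rough sketch only: the names here are hypothetical
and this is not the patch's actual code):

#include <stdbool.h>

/* Hypothetical sketch of the update-kind decision; names are invented. */
typedef enum
{
    UPDATE_REGULAR,             /* insert new entries into every index */
    UPDATE_HOT,                 /* no new index entries at all (classic HOT) */
    UPDATE_WARM                 /* new entries only in changed indexes */
} UpdateKind;

static UpdateKind
choose_update_kind(bool fits_on_same_page,   /* room for the new version? */
                   bool index_keys_changed,  /* any index column modified? */
                   bool chain_already_warm)  /* chain was WARM updated before? */
{
    if (!fits_on_same_page)
        return UPDATE_REGULAR;  /* HOT and WARM need a same-page update */
    if (!index_keys_changed)
        return UPDATE_HOT;      /* classic HOT update */
    if (chain_already_warm)
        return UPDATE_REGULAR;  /* at most one WARM update per chain */
    return UPDATE_WARM;
}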

The other problem Andres brought up can be solved by storing the root line
pointer offset in the t_ctid field of the last tuple in the update chain.

Barring some aborted update case, usually it's the last tuple in the update

chain that will be updated, hence it seems logical and sufficient if we can

find the root line pointer while accessing that tuple. Note that the t_ctid

field in the latest tuple is usually useless and is made to point to

itself. Instead, I propose to use a bit from t_infomask2 to identify the

LATEST tuple in the chain and use the OffsetNumber field in t_ctid to store the
root line pointer offset. For the rare aborted-update case, we can scan the
heap page and find the root line pointer the hard way.
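As a rough sketch of what this could look like (the bit value and helper names
are assumptions for illustration, not the patch's actual definitions):

#include "postgres.h"
#include "access/htup_details.h"
#include "storage/itemptr.h"

/* One of the free t_infomask2 bits, assumed here for illustration. */
#define HEAP_TUPLE_LATEST 0x0800

/* Mark 'tup' as the latest tuple in its chain and record the root offset. */
static inline void
HeapTupleHeaderSetRootOffset(HeapTupleHeader tup, OffsetNumber rootoff)
{
    tup->t_infomask2 |= HEAP_TUPLE_LATEST;
    ItemPointerSetOffsetNumber(&tup->t_ctid, rootoff);
}

/* Valid only when the tuple is marked as the latest in its chain. */
static inline OffsetNumber
HeapTupleHeaderGetRootOffset(HeapTupleHeader tup)
{
    Assert(tup->t_infomask2 & HEAP_TUPLE_LATEST);
    return ItemPointerGetOffsetNumber(&tup->t_ctid);
}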

Index Recheck
--------------------

As the original proposal explains, while doing an index scan we must recheck

if the heap tuple matches the index keys. This has to be done only when the

chain is marked as a WARM chain. Currently we do that by setting the last

free bit in t_infomask2 to HEAP_WARM_TUPLE. The bit is set on the tuple

that gets WARM updated and all subsequent tuples in the chain. But the

information can subsequently be copied to root line pointer when it's

converted to an LP_REDIRECT line pointer.

Since each index AM has its own view of the index tuples, each AM must

implement its "amrecheck" routine. This routine to used to confirm that a

tuple returned from a WARM chain indeed satisfies the index keys. If the

index AM does not implement "amrecheck" routine, WARM update is disabled on

a table which uses such an index. The patch currently implements

"amrecheck" routines for hash and btree indexes. Hence a table with GiST or

GIN index will not honour WARM updates.
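For btree, the recheck essentially recomputes the indexed values from the heap
tuple and compares them with the keys stored in the index tuple. A bare-bones
sketch follows (the signature is made up for illustration, it handles only
plain column references, and binary datum comparison stands in for the
opclass-aware equality a real routine would need):

#include "postgres.h"
#include "access/itup.h"
#include "access/htup_details.h"
#include "utils/datum.h"
#include "utils/rel.h"

/* Hypothetical recheck sketch; the patch's actual signature differs. */
static bool
btrecheck_sketch(Relation indexRel, IndexTuple itup,
                 Relation heapRel, HeapTuple htup)
{
    TupleDesc   itupdesc = RelationGetDescr(indexRel);
    TupleDesc   htupdesc = RelationGetDescr(heapRel);
    int         i;

    for (i = 0; i < itupdesc->natts; i++)
    {
        Form_pg_attribute att = TupleDescAttr(itupdesc, i);
        /* heap column backing this index column (0 for expressions) */
        AttrNumber  heapattno = indexRel->rd_index->indkey.values[i];
        bool        inull,
                    hnull;
        Datum       ival,
                    hval;

        /* this sketch handles only plain column references */
        if (heapattno == InvalidAttrNumber)
            return false;

        ival = index_getattr(itup, i + 1, itupdesc, &inull);
        hval = heap_getattr(htup, heapattno, htupdesc, &hnull);

        if (inull != hnull)
            return false;

        /* binary comparison; ignores detoasting for brevity */
        if (!inull &&
            !datumIsEqual(ival, hval, att->attbyval, att->attlen))
            return false;
    }

    return true;
}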

Results
----------

We used a customised pgbench workload to test the feature. In particular,

the pgbench_accounts table was widened to include many more columns and

indexes. We also added an index on "abalance" field which gets updated in

every transaction. This replicates a workload where there are many indexes

on a table and an update changes just one index key.

CREATE TABLE pgbench_accounts (

aid bigint,

bid bigint,

abalance bigint,

filler1 text DEFAULT md5(random()::text),

filler2 text DEFAULT md5(random()::text),

filler3 text DEFAULT md5(random()::text),

filler4 text DEFAULT md5(random()::text),

filler5 text DEFAULT md5(random()::text),

filler6 text DEFAULT md5(random()::text),

filler7 text DEFAULT md5(random()::text),

filler8 text DEFAULT md5(random()::text),

filler9 text DEFAULT md5(random()::text),

filler10 text DEFAULT md5(random()::text),

filler11 text DEFAULT md5(random()::text),

filler12 text DEFAULT md5(random()::text)

);

CREATE UNIQUE INDEX pgb_a_aid ON pgbench_accounts(aid);

CREATE INDEX pgb_a_abalance ON pgbench_accounts(abalance);

CREATE INDEX pgb_a_filler1 ON pgbench_accounts(filler1);

CREATE INDEX pgb_a_filler2 ON pgbench_accounts(filler2);

CREATE INDEX pgb_a_filler3 ON pgbench_accounts(filler3);

CREATE INDEX pgb_a_filler4 ON pgbench_accounts(filler4);

These tests were run on c3.4xlarge AWS instances, with 30GB of RAM, 16 vCPUs

and 2x160GB SSDs. Data and WAL were mounted on separate SSDs.

The scale factor of 700 was chosen to ensure that the database does not fit
in memory and the implications of additional write activity are evident.

The actual transactional test would just update the pgbench_accounts table:

\set aid random(1, 100000 * :scale)

\set delta random(-5000, 5000)

BEGIN;

UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;

SELECT abalance FROM pgbench_accounts WHERE aid = :aid;

END;

The tests were run for a long duration of 16 hrs each with 16 pgbench

clients to ensure that the effects of the patch are captured correctly.

Headline TPS numbers:

Master:

transaction type: update.sql

scaling factor: 700

query mode: simple

number of clients: 16

number of threads: 8

duration: 57600 s

number of transactions actually processed: 65552986

latency average: 14.059 ms

*tps = 1138.072117 (including connections establishing)*

tps = 1138.072156 (excluding connections establishing)

WARM:

transaction type: update.sql

scaling factor: 700

query mode: simple

number of clients: 16

number of threads: 8

duration: 57600 s

number of transactions actually processed: 116168454

latency average: 7.933 ms

*tps = 2016.812924 (including connections establishing)*

tps = 2016.812997 (excluding connections establishing)

So WARM shows about a *77% increase* in TPS. Note that these are fairly long

running tests with nearly 100M transactions and the tests show a steady

performance.

We also measured the amount of WAL generated by Master and WARM per

transaction. While master generated 34967 bytes of WAL per transaction,

WARM generated 18421 bytes of WAL per transaction.

We plotted a moving average of TPS against time and also against the

percentage of WARM updates. Clearly, the higher the number of WARM updates,
the higher the TPS. A graph showing the percentage of WARM updates is also
plotted, and it shows a steady convergence to the 50% mark over time.

We repeated the same tests starting with a 90% heap fill factor so that

there are many more WARM updates. With a 90% fill factor, in combination
with HOT pruning, most initial updates will be WARM updates, and that
impacts TPS positively. WARM shows nearly a *150% increase* in TPS for
that workload.

Master:

transaction type: update.sql

scaling factor: 700

query mode: simple

number of clients: 16

number of threads: 8

duration: 57600 s

number of transactions actually processed: 78134617

latency average: 11.795 ms

*tps = 1356.503629 (including connections establishing)*

tps = 1356.503679 (excluding connections establishing)

WARM:

transaction type: update.sql

scaling factor: 700

query mode: simple

number of clients: 16

number of threads: 8

duration: 57600 s

number of transactions actually processed: 196782770

latency average: 4.683 ms

*tps = 3416.364822 (including connections establishing)*

tps = 3416.364949 (excluding connections establishing) In this case, master produced ~49000 bytes of WAL per transaction where as

WARM produced ~14000 bytes of WAL per transaction.

I concede that we haven't yet done many tests to measure the overhead of the
technique, especially in circumstances where WARM may not be very useful.
What I have in mind are a couple of tests:

- With many indexes and a good percentage of them requiring updates

- A mix of read-write workload

Any other ideas to do that are welcome.

Concerns:

--------------

The additional heap recheck may have a negative impact on performance. We

tried to measure this by running a SELECT-only workload for 1 hr after the
16-hr test finished. But the TPS did not show any negative impact. The impact

could be more if the update changes many index keys, something these tests

don't test.

The patch also changes things such that index tuples are always returned

because they may be needed for recheck. It's not clear if this is something

to be worried about, but we could try to further fine-tune this change.

There seem to be some modularity violations since the index AM needs to access

some of the executor stuff to form index datums. If that's a real concern,

we can look at improving the amrecheck signature so that it gets index datums

from the caller.

The patch uses the remaining 2 free bits in t_infomask, thus closing any

further improvements which may need to use heap tuple flags. During the

patch development we tried several other approaches such as reusing

3 high-order bits in OffsetNumber, since the current max BLCKSZ limits the

MaxOffsetNumber to 8192 and that can be represented in 13 bits. We finally

reverted that change to keep the patch simple. But there is clearly a way

to free up more bits if required.

Converting WARM chains back to HOT chains (VACUUM?)
---------------------------------------------------

The current implementation of WARM allows only one WARM update per chain. This
simplifies the design and addresses certain issues around duplicate scans. But
this also implies that the benefit of WARM will be no more than 50%, which is
still significant, but if we could return WARM chains back to normal status, we
could do far more WARM updates.

A distinct property of a WARM chain is that at least one index has more than

one live index entry pointing to the root of the chain. In other words, if we
can remove the duplicate entry from every index or conclusively prove that there
are no duplicate index entries for the root line pointer, the chain can again
be marked as HOT.

Here is one idea, but more thoughts/suggestions are most welcome.

A WARM chain has two parts, separated by the tuple that caused the WARM update.

All
tuples in each part have matching index keys, but certain index keys may not
match between these two parts. Let's say we mark heap tuples in each part with a
special Red-Blue flag. The same flag is replicated in the index tuples. For
example, when new rows are inserted in a table, they are marked with the Blue
flag and the index entries associated with those rows are also marked with the
Blue flag. When a row is WARM updated, the new version is marked with the Red
flag and the new index entry created by the update is also marked with the Red
flag.

Heap chain: lp  [1]              [2]              [3]              [4]
    [aaaa, 1111]B -> [aaaa, 1111]B -> [bbbb, 1111]R -> [bbbb, 1111]R

Index1: (aaaa)B points to 1 (satisfies only tuples marked with B)
        (bbbb)R points to 1 (satisfies only tuples marked with R)

Index2: (1111)B points to 1 (satisfies both B and R tuples)

It's clear that for indexes with Red and Blue pointers, a heap tuple with the
Blue flag will be reachable from the Blue pointer and one with the Red flag will
be reachable from the Red pointer. But for indexes which did not create a new
entry, both Blue and Red tuples will be reachable from the Blue pointer (there
is no Red pointer in such indexes). So, as a side note, matching Red and Blue
flags is not enough from an index scan perspective.

During the first heap scan of VACUUM, we look for tuples with HEAP_WARM_TUPLE

set.
If all live tuples in the chain are either marked with the Blue flag or the Red
flag (but no mix of Red and Blue), then the chain is a candidate for HOT
conversion. We remember the root line pointer and the Red-Blue flag of the WARM
chain in a separate array.

If we have a Red WARM chain, then our goal is to remove Blue pointers and

vice
versa. But there is a catch. For Index2 above, there is only a Blue pointer
and that must not be removed. IOW, we should remove the Blue pointer iff a Red
pointer exists. Since index vacuum may visit Red and Blue pointers in any
order, I think we will need another index pass to remove dead index pointers.
So in the first index pass we check which WARM candidates have 2 index
pointers. In the second pass, we remove the dead pointer and reset the Red
flag if the surviving index pointer is Red. During the second heap scan, we
fix the WARM chain by clearing the HEAP_WARM_TUPLE flag and also resetting
the Red flag to Blue.
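For illustration, the bookkeeping that this two-pass scheme implies could look
like the sketch below (types and names are invented here; this machinery is
not part of the posted patches):

#include "postgres.h"
#include "storage/off.h"

/* Hypothetical state tracked for one WARM chain within one index. */
typedef struct WarmChainState
{
    OffsetNumber root_offnum;   /* root line pointer of the WARM chain */
    bool         chain_is_red;  /* surviving heap tuples carry the Red flag */
    bool         saw_red_ptr;   /* pass 1: found a Red index pointer */
    bool         saw_blue_ptr;  /* pass 1: found a Blue index pointer */
} WarmChainState;

/*
 * Pass 2 decision: a pointer of the "wrong" color is removable only if this
 * index has pointers of both colors.  This protects the single Blue pointer
 * of Index2 in the example above, which must survive even for a Red chain.
 */
static bool
warm_index_ptr_removable(const WarmChainState *chain, bool ptr_is_red)
{
    if (!(chain->saw_red_ptr && chain->saw_blue_ptr))
        return false;           /* only one pointer: it must survive */
    return ptr_is_red != chain->chain_is_red;
}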

There are some more problems around aborted vacuums. For example, if a vacuum
aborts after changing the Red index flag to Blue but before removing the other

Blue
pointer, we will end up with two Blue pointers to a Red WARM chain. But since
the HEAP_WARM_TUPLE flag on the heap tuple is still set, further WARM updates
to the chain will be blocked. I guess we will need some special handling for
the case with multiple Blue pointers. We can either leave these WARM chains
alone and let them die with a subsequent non-WARM update, or we must apply
heap-recheck logic during index vacuum to find the dead pointer. Given that
vacuum aborts are not common, I am inclined to leave this case unhandled. We
must still check for the presence of multiple Blue pointers and ensure that we
don't accidentally remove any of the Blue pointers, and not clear the WARM
chains either.

Of course, the idea requires one bit each in the index and heap tuple. There is

already a free bit in index tuple and I've some ideas to free up additional

bits in the heap tuple (as mentioned above).

Further Work
------------------

1. The patch currently disables WARM updates on system relations. This is

mostly to keep the patch simple, but in theory we should be able to support

WARM updates on system tables too. It's not clear if it's worth the
complexity though.

2. AFAICS both CREATE INDEX and CIC should just work fine, but this needs
validation.

3. GiST and GIN indexes are currently disabled for WARM. I don't see a

fundamental reason why they won't work once we implement "amrecheck"

method, but I don't understand those indexes well enough.

4. There are some modularity invasions I am worried about (is the amrecheck
signature ok?). There are also a couple of hacks around getting access to

index tuples during scans and I hope to get them correct during review

process, with some feedback.

5. The patch does not implement the machinery to convert WARM chains into HOT
chains. I would give it a go unless someone finds a problem with the idea or
has a better idea.

Thanks,
Pavan

[1] - https://www.postgresql.org/message-id/CABOikdMop5Rb_RnS2xF

dAXMZGSqcJ-P-BY2ruMd%2BbuUkJ4iDPw(at)mail(dot)gmail(dot)com

[2] - https://eng.uber.com/mysql-migration/

--

Pavan Deolasee http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Training & Services



Attachments:
- image/png (19.8 KB)
- image/png (11.3 KB)
- image/png (17.7 KB)
- 0001_track_root_lp_v2.patch (application/octet-stream, 25.4 KB)
- 0002_warm_updates_v2.patch (application/octet-stream, 90.2 KB)

On Wed, Aug 31, 2016 at 1:45 PM, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>

wrote:

> We discussed a few ideas to address the "Duplicate Scan" problem. For

> example, we can teach Index AMs to discard any duplicate (key, CTID) insert

> requests. Or we could guarantee uniqueness by either only allowing updates

> in one lexical order. While the former is a more complete solution to avoid

> duplicate entries, searching through large number of keys for non-unique

> indexes could be a drag on performance. The latter approach may not be

> sufficient for many workloads. Also tracking increment/decrement for many

> indexes will be non-trivial.

>

> There is another problem with allowing many index entries pointing to the

> same WARM chain. It will be non-trivial to know how many index entries are

> currently pointing to the WARM chain and index/heap vacuum will throw up

> more challenges.

>

> Instead, what I would like to propose and the patch currently implements

> is to restrict WARM update to once per chain. So the first non-HOT update

> to a tuple or a HOT chain can be a WARM update. The chain can further be

> HOT updated any number of times. But it can no further be WARM updated.

> This might look too restrictive, but it can still bring down the number of

> regular updates by almost 50%. Further, if we devise a strategy to convert

> a WARM chain back to HOT chain, it can again be WARM updated. (This part is

> currently not implemented). A good side effect of this simple strategy is

> that we know there can maximum two index entries pointing to any given WARM

> chain.

>

We should probably think about coordinating with my btree patch.

From the description above, the strategy is quite readily "upgradable" to

one in which the indexam discards duplicate (key,ctid) pairs and that would

remove the limitation of only one WARM update... right?



On Wed, Aug 31, 2016 at 10:38 PM, Claudio Freire <klaussfreire(at)gmail(dot)com>

wrote:

>

> On Wed, Aug 31, 2016 at 1:45 PM, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>

> wrote:

>

>> We discussed a few ideas to address the "Duplicate Scan" problem. For

>> example, we can teach Index AMs to discard any duplicate (key, CTID) insert

>> requests. Or we could guarantee uniqueness by either only allowing updates

>> in one lexical order. While the former is a more complete solution to avoid

>> duplicate entries, searching through large number of keys for non-unique

>> indexes could be a drag on performance. The latter approach may not be

>> sufficient for many workloads. Also tracking increment/decrement for many

>> indexes will be non-trivial.

>>

>> There is another problem with allowing many index entries pointing to the

>> same WARM chain. It will be non-trivial to know how many index entries are

>> currently pointing to the WARM chain and index/heap vacuum will throw up

>> more challenges.

>>

>> Instead, what I would like to propose and the patch currently implements

>> is to restrict WARM update to once per chain. So the first non-HOT update

>> to a tuple or a HOT chain can be a WARM update. The chain can further be

>> HOT updated any number of times. But it can no further be WARM updated.

>> This might look too restrictive, but it can still bring down the number of

>> regular updates by almost 50%. Further, if we devise a strategy to convert

>> a WARM chain back to HOT chain, it can again be WARM updated. (This part is

>> currently not implemented). A good side effect of this simple strategy is

>> that we know there can maximum two index entries pointing to any given WARM

>> chain.

>>

>

> We should probably think about coordinating with my btree patch.

>

> From the description above, the strategy is quite readily "upgradable" to

> one in which the indexam discards duplicate (key,ctid) pairs and that would

> remove the limitation of only one WARM update... right?

>

Yes, we should be able to add further optimisations on the lines you're working

on, but what I like about the current approach is that a) it reduces

complexity of the patch and b) having thought about cleaning up WARM

chains, limiting number of index entries per root chain to a small number

will simplify that aspect too.

Thanks,

Pavan

--

Pavan Deolasee http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Training & Services



On Wed, Aug 31, 2016 at 10:15 PM, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>

wrote:

> Hi All,

>

> As previously discussed [1], WARM is a technique to reduce write

> amplification when an indexed column of a table is updated. HOT fails to

> handle such updates and ends up inserting a new index entry in all indexes

> of the table, irrespective of whether the index key has changed or not for

> a specific index. The problem was highlighted by Uber's blog post [2], but

> it was a well known problem and affects many workloads.

>

>

I realised that the patches were bit-rotten because of 8e1e3f958fb. Rebased

patches on the current master are attached. I also took this opportunity to

correct some whitespace errors and improve the formatting of the README.

Thanks,

Pavan

--

Pavan Deolasee http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Training & Services



Attachments:
- 0001_track_root_lp_v3.patch (application/octet-stream, 25.4 KB)
- 0002_warm_updates_v3.patch (application/octet-stream, 89.9 KB)

On Wed, Aug 31, 2016 at 10:15:33PM +0530, Pavan Deolasee wrote:

> Instead, what I would like to propose and the patch currently implements is to

> restrict WARM update to once per chain. So the first non-HOT update to a tuple

> or a HOT chain can be a WARM update. The chain can further be HOT updated any

> number of times. But it can no further be WARM updated. This might look too

> restrictive, but it can still bring down the number of regular updates by

> almost 50%. Further, if we devise a strategy to convert a WARM chain back to

> HOT chain, it can again be WARM updated. (This part is currently not

> implemented). A good side effect of this simple strategy is that we know there

> can maximum two index entries pointing to any given WARM chain.

I like the simplified approach, as long as it doesn't block further

improvements.

> Headline TPS numbers:

>

> Master:

>

> transaction type: update.sql

> scaling factor: 700

> query mode: simple

> number of clients: 16

> number of threads: 8

> duration: 57600 s

> number of transactions actually processed: 65552986

> latency average: 14.059 ms

> tps = 1138.072117 (including connections establishing)

> tps = 1138.072156 (excluding connections establishing)

>

>

> WARM:

>

> transaction type: update.sql

> scaling factor: 700

> query mode: simple

> number of clients: 16

> number of threads: 8

> duration: 57600 s

> number of transactions actually processed: 116168454

> latency average: 7.933 ms

> tps = 2016.812924 (including connections establishing)

> tps = 2016.812997 (excluding connections establishing)

These are very impressive results.

> Converting WARM chains back to HOT chains (VACUUM ?)

> ---------------------------------------------------------------------------------

>

> The current implementation of WARM allows only one WARM update per chain. This

> simplifies the design and addresses certain issues around duplicate scans. But

> this also implies that the benefit of WARM will be no more than 50%, which is

> still significant, but if we could return WARM chains back to normal status, we

> could do far more WARM updates.

>

> A distinct property of a WARM chain is that at least one index has more than

> one live index entries pointing to the root of the chain. In other words, if we

> can remove duplicate entry from every index or conclusively prove that there

> are no duplicate index entries for the root line pointer, the chain can again

> be marked as HOT.

I had not thought of how to convert from WARM to HOT yet.

> Here is one idea, but more thoughts/suggestions are most welcome.

>

> A WARM chain has two parts, separated by the tuple that caused WARM update. All

> tuples in each part has matching index keys, but certain index keys may not

> match between these two parts. Lets say we mark heap tuples in each part with a

> special Red-Blue flag. The same flag is replicated in the index tuples. For

> example, when new rows are inserted in a table, they are marked with Blue flag

> and the index entries associated with those rows are also marked with Blue

> flag. When a row is WARM updated, the new version is marked with Red flag and

> the new index entry created by the update is also marked with Red flag.

>

>

> Heap chain: lp [1] [2] [3] [4]

> [aaaa, 1111]B -> [aaaa, 1111]B -> [bbbb, 1111]R -> [bbbb, 1111]R

>

> Index1: (aaaa)B points to 1 (satisfies only tuples marked with B)

> (bbbb)R points to 1 (satisfies only tuples marked with R)

>

> Index2: (1111)B points to 1 (satisfies both B and R tuples)

>

>

> It's clear that for indexes with Red and Blue pointers, a heap tuple with Blue

> flag will be reachable from Blue pointer and that with Red flag will be

> reachable from Red pointer. But for indexes which did not create a new entry,

> both Blue and Red tuples will be reachable from Blue pointer (there is no Red

> pointer in such indexes). So, as a side note, matching Red and Blue flags is

> not enough from index scan perspective.

>

> During first heap scan of VACUUM, we look for tuples with HEAP_WARM_TUPLE set.

> If all live tuples in the chain are either marked with Blue flag or Red flag

> (but no mix of Red and Blue), then the chain is a candidate for HOT conversion.

Uh, if the chain is all blue, then there are no WARM entries, so it is
already a HOT chain, so there is nothing to do, right?

> We remember the root line pointer and Red-Blue flag of the WARM chain in a

> separate array.

>

> If we have a Red WARM chain, then our goal is to remove Blue pointers and vice

> versa. But there is a catch. For Index2 above, there is only Blue pointer

> and that must not be removed. IOW we should remove Blue pointer iff a Red

> pointer exists. Since index vacuum may visit Red and Blue pointers in any

> order, I think we will need another index pass to remove dead

> index pointers. So in the first index pass we check which WARM candidates have

> 2 index pointers. In the second pass, we remove the dead pointer and reset Red

> flag is the surviving index pointer is Red.

Why not just remember the tid of chains converted from WARM to HOT, then

use "amrecheck" on an index entry matching that tid to see if the index

matches one of the entries in the chain. (It will match all of them or

none of them, because they are all red.) I don't see a point in

coloring the index entries as reds as later you would have to convert to

blue in the WARM-to-HOT conversion, and a vacuum crash could lead to

inconsistencies. Consider that you can just call "amrecheck" on the few

chains that have converted from WARM to HOT. I believe this is more

crash-safe too. However, if you have converted WARM to HOT in the heap,

but crash during the index entry removal, you could potentially have
duplicates in the index later, which is bad.

--

Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us

EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +

+ Ancient Roman grave inscription +

On Wed, Aug 31, 2016 at 04:03:29PM -0400, Bruce Momjian wrote:

> Why not just remember the tid of chains converted from WARM to HOT, then

> use "amrecheck" on an index entry matching that tid to see if the index

> matches one of the entries in the chain. (It will match all of them or

> none of them, because they are all red.) I don't see a point in

> coloring the index entries as reds as later you would have to convert to

> blue in the WARM-to-HOT conversion, and a vacuum crash could lead to

> inconsistencies. Consider that you can just call "amrecheck" on the few

> chains that have converted from WARM to HOT. I believe this is more

> crash-safe too. However, if you have converted WARM to HOT in the heap,

> but crash during the index entry removal, you could potentially have

> duplicates in the index later, which is bad.

I think Pavan had the "crash during the index entry removal" fixed via:

> During the second heap scan, we fix WARM chain by clearing HEAP_WARM_TUPLE flag

> and also reset Red flag to Blue.

so the marking from WARM to HOT only happens after the index has been cleaned.

--

Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us

EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +

+ Ancient Roman grave inscription +

On Thu, Sep 1, 2016 at 1:33 AM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:

> On Wed, Aug 31, 2016 at 10:15:33PM +0530, Pavan Deolasee wrote:

> > Instead, what I would like to propose and the patch currently implements

> is to

> > restrict WARM update to once per chain. So the first non-HOT update to a

> tuple

> > or a HOT chain can be a WARM update. The chain can further be HOT

> updated any

> > number of times. But it can no further be WARM updated. This might look

> too

> > restrictive, but it can still bring down the number of regular updates by

> > almost 50%. Further, if we devise a strategy to convert a WARM chain

> back to

> > HOT chain, it can again be WARM updated. (This part is currently not

> > implemented). A good side effect of this simple strategy is that we know

> there

> > can maximum two index entries pointing to any given WARM chain.

>

> I like the simplified approach, as long as it doesn't block further

> improvements.

>

>

Yes, the proposed approach is simple yet does not stop us from improving

things further. Moreover it has shown good performance characteristics and

I believe it's a good first step.

>

> > Master:

> > tps = 1138.072117 (including connections establishing)

> >

> > WARM:

> > tps = 2016.812924 (including connections establishing)

>

> These are very impressive results.

>

>

Thanks. What's also interesting and something that headline numbers don't

show is that WARM TPS is as much as 3 times the master TPS when the

percentage of WARM updates is very high. Notice the spike in TPS in the

comparison graph.

Results with non-default heap fill factor are even better. In both cases,

the improvement in TPS stays constant over long periods.

>

> >

> > During first heap scan of VACUUM, we look for tuples with

> HEAP_WARM_TUPLE set.

> > If all live tuples in the chain are either marked with Blue flag or Red

> flag

> > (but no mix of Red and Blue), then the chain is a candidate for HOT

> conversion.

>

> Uh, if the chain is all blue, then there is are WARM entries so it

> already a HOT chain, so there is nothing to do, right?

>

For aborted WARM updates, the heap chain may be all blue, but there may

still be a red index pointer which must be cleared before we allow further

WARM updates to the chain.

>

> > We remember the root line pointer and Red-Blue flag of the WARM chain in

> a

> > separate array.

> >

> > If we have a Red WARM chain, then our goal is to remove Blue pointers

> and vice

> > versa. But there is a catch. For Index2 above, there is only Blue pointer

> > and that must not be removed. IOW we should remove Blue pointer iff a Red

> > pointer exists. Since index vacuum may visit Red and Blue pointers in any

> > order, I think we will need another index pass to remove dead

> > index pointers. So in the first index pass we check which WARM

> candidates have

> > 2 index pointers. In the second pass, we remove the dead pointer and

> reset Red

> > flag is the surviving index pointer is Red.

>

> Why not just remember the tid of chains converted from WARM to HOT, then

> use "amrecheck" on an index entry matching that tid to see if the index

> matches one of the entries in the chain.

That will require random access to the heap during the index vacuum phase,

something I would like to avoid. But we can have that as a fallback
solution for handling aborted vacuums.

> (It will match all of them or

> none of them, because they are all red.) I don't see a point in

> coloring the index entries as reds as later you would have to convert to

> blue in the WARM-to-HOT conversion, and a vacuum crash could lead to

> inconsistencies.

Yes, that's a concern since the conversion of red to blue will also need to
be WAL-logged to ensure that a crash doesn't leave us in an inconsistent state. I

still think that this will be an overall improvement as compared to

allowing one WARM update per chain.

> Consider that you can just call "amrecheck" on the few

> chains that have converted from WARM to HOT. I believe this is more

> crash-safe too. However, if you have converted WARM to HOT in the heap,

> but crash during the index entry removal, you could potentially have

> duplicates in the index later, which is bad.

>

>

As you probably already noted, we clear heap flags only after all indexes

are cleared of duplicate entries and hence a crash in between should not

cause any correctness issue. As long as heap tuples are marked as warm,

amrecheck will ensure that only valid tuples are returned to the caller.

Thanks,

Pavan

--

Pavan Deolasee http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Training & Services



On Thu, Sep 1, 2016 at 02:37:40PM +0530, Pavan Deolasee wrote:

> I like the simplified approach, as long as it doesn't block further

> improvements.

>

>

>

> Yes, the proposed approach is simple yet does not stop us from improving things

> further. Moreover it has shown good performance characteristics and I believe

> it's a good first step.

Agreed. This is BIG. Do you think it can be done for PG 10?

> Thanks. What's also interesting and something that headline numbers don't show

> is that WARM TPS is as much as 3 times of master TPS when the percentage of

> WARM updates is very high. Notice the spike in TPS in the comparison graph.

>

> Results with non-default heap fill factor are even better. In both cases, the

> improvement in TPS stays constant over long periods.

Yes, I expect the benefits of this to show up in better long-term

performance.

> > During first heap scan of VACUUM, we look for tuples with HEAP_WARM_TUPLE

> set.

> > If all live tuples in the chain are either marked with Blue flag or Red

> flag

> > (but no mix of Red and Blue), then the chain is a candidate for HOT

> conversion.

>

> Uh, if the chain is all blue, then there is are WARM entries so it

> already a HOT chain, so there is nothing to do, right?

>

>

> For aborted WARM updates, the heap chain may be all blue, but there may still

> be a red index pointer which must be cleared before we allow further WARM

> updates to the chain.

Ah, understood now. Thanks.

> Why not just remember the tid of chains converted from WARM to HOT, then

> use "amrecheck" on an index entry matching that tid to see if the index

> matches one of the entries in the chain.

>

>

> That will require random access to heap during index vacuum phase, something I

> would like to avoid. But we can have that as a fall back solution for handling

> aborted vacuums.

Yes, that is true. So the challenge is figuring out which of the index

entries pointing to the same tid is valid, and coloring helps with that? > (It will match all of them or

> none of them, because they are all red.) I don't see a point in

> coloring the index entries as reds as later you would have to convert to

> blue in the WARM-to-HOT conversion, and a vacuum crash could lead to

> inconsistencies.

>

>

> Yes, that's a concern since the conversion of red to blue will also need to WAL

> logged to ensure that a crash doesn't leave us in inconsistent state. I still

> think that this will be an overall improvement as compared to allowing one WARM

> update per chain.

OK. I will think some more on this to see if I can come up with another

approach.

>

>

> Consider that you can just call "amrecheck" on the few

> chains that have converted from WARM to HOT. I believe this is more

> crash-safe too. However, if you have converted WARM to HOT in the heap,

> but crash durin the index entry removal, you could potentially have

> duplicates in the index later, which is bad.

>

> As you probably already noted, we clear heap flags only after all indexes are

> cleared of duplicate entries and hence a crash in between should not cause any

> correctness issue. As long as heap tuples are marked as warm, amrecheck will

> ensure that only valid tuples are returned to the caller.

OK, got it.

--

Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us

EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +

+ Ancient Roman grave inscription +

On Thu, Sep 1, 2016 at 9:44 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:

> On Thu, Sep 1, 2016 at 02:37:40PM +0530, Pavan Deolasee wrote:

> > I like the simplified approach, as long as it doesn't block further

> > improvements.

> >

> >

> >

> > Yes, the proposed approach is simple yet does not stop us from improving

> things

> > further. Moreover it has shown good performance characteristics and I

> believe

> > it's a good first step.

>

> Agreed. This is BIG. Do you think it can be done for PG 10?

>

I definitely think so. The patches as submitted are fully functional and

sufficient. Of course, there are XXX and TODOs that I hope to sort out

during the review process. There are also further tests needed to ensure

that the feature does not cause significant regression in the worst cases.

Again something I'm willing to do once I get some feedback on the broader

design and test cases. What I am looking for at this stage is to know if I've

missed something important in terms of design or if there is some show

stopper that I overlooked.

The latest patches rebased with the current master are attached. I also added a few

more comments to the code. I forgot to give a brief about the patches, so

including that as well.

0001_track_root_lp_v4.patch: This patch uses a free t_infomask2 bit to

track latest tuple in an update chain. The t_ctid.ip_posid is used to track

the root line pointer of the update chain. We do this only in the latest

tuple in the chain because most often that tuple will be updated and we

need to quickly find the root only during an update.

0002_warm_updates_v4.patch: This patch implements the core of the WARM logic.

During WARM update, we only insert new entries in the indexes whose key has

changed. But instead of indexing the real TID of the new tuple, we index

the root line pointer and then use additional recheck logic to ensure only

correct tuples are returned from such potentially broken HOT chains. Each

index AM must implement an amrecheck method to support WARM. The patch

currently implements this for hash and btree indexes.

Thanks,

Pavan

--

Pavan Deolasee http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Training & Services



Attachments:
- 0001_track_root_lp_v4.patch (application/octet-stream, 27.2 KB)
- 0002_warm_updates_v4.patch (application/octet-stream, 94.0 KB)

On Mon, Sep 5, 2016 at 1:53 PM, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com> wrote:

> 0001_track_root_lp_v4.patch: This patch uses a free t_infomask2 bit to track

> latest tuple in an update chain. The t_ctid.ip_posid is used to track the

> root line pointer of the update chain. We do this only in the latest tuple

> in the chain because most often that tuple will be updated and we need to

> quickly find the root only during update.

>

> 0002_warm_updates_v4.patch: This patch implements the core of WARM logic.

> During WARM update, we only insert new entries in the indexes whose key has

> changed. But instead of indexing the real TID of the new tuple, we index the

> root line pointer and then use additional recheck logic to ensure only

> correct tuples are returned from such potentially broken HOT chains. Each

> index AM must implement a amrecheck method to support WARM. The patch

> currently implements this for hash and btree indexes.

Moved to next CF. I was surprised to see that it is not *that* large:

43 files changed, 1539 insertions(+), 199 deletions(-)

--

Michael

On 09/05/2016 06:53 AM, Pavan Deolasee wrote:

>

>

> On Thu, Sep 1, 2016 at 9:44 PM, Bruce Momjian <bruce(at)momjian(dot)us

> <mailto:bruce(at)momjian(dot)us>> wrote:

>

> On Thu, Sep 1, 2016 at 02:37:40PM +0530, Pavan Deolasee wrote:

> > I like the simplified approach, as long as it doesn't block further

> > improvements.

> >

> >

> >

> > Yes, the proposed approach is simple yet does not stop us from improving things

> > further. Moreover it has shown good performance characteristics and I believe

> > it's a good first step.

>

> Agreed. This is BIG. Do you think it can be done for PG 10?

>

>

> I definitely think so. The patches as submitted are fully functional and

> sufficient. Of course, there are XXX and TODOs that I hope to sort out

> during the review process. There are also further tests needed to ensure

> that the feature does not cause significant regression in the worst

> cases. Again something I'm willing to do once I get some feedback on the

> broader design and test cases. What I am looking at this stage is to

> know if I've missed something important in terms of design or if there

> is some show stopper that I overlooked.

>

> Latest patches rebased with current master are attached. I also added a

> few more comments to the code. I forgot to give a brief about the

> patches, so including that as well.

>

> 0001_track_root_lp_v4.patch: This patch uses a free t_infomask2 bit to

> track latest tuple in an update chain. The t_ctid.ip_posid is used to

> track the root line pointer of the update chain. We do this only in the

> latest tuple in the chain because most often that tuple will be updated

> and we need to quickly find the root only during update.

>

> 0002_warm_updates_v4.patch: This patch implements the core of WARM

> logic. During WARM update, we only insert new entries in the indexes

> whose key has changed. But instead of indexing the real TID of the new

> tuple, we index the root line pointer and then use additional recheck

> logic to ensure only correct tuples are returned from such potentially

> broken HOT chains. Each index AM must implement a amrecheck method to

> support WARM. The patch currently implements this for hash and btree

> indexes.

>

Hi,

I've been looking at the patch over the past few days, running a bunch

of benchmarks etc. I can confirm the significant speedup, often by more

than 75% (depending on number of indexes, whether the data set fits into

RAM, etc.). Similarly for the amount of WAL generated, although that's a

bit more difficult to evaluate due to full_page_writes.

I'm not going to send detailed results, as that probably does not make

much sense at this stage of the development - I can repeat the tests

once the open questions get resolved.

There's a lot of useful and important feedback in the thread(s) so far,

particularly the descriptions of various failure cases. I think it'd be

very useful to collect those examples and turn them into regression

tests - that's something the patch should include anyway.

I don't really have many comments regarding the code, but during the
testing I noticed slightly strange behavior when updating statistics.
Consider a table like this:

create table t (a int, b int, c int) with (fillfactor = 10);

insert into t select i, i from generate_series(1,1000) s(i);

create index on t(a);

create index on t(b);

and update:

update t set a = a+1, b=b+1;

which has to update all indexes on the table, but:

select n_tup_upd, n_tup_hot_upd from pg_stat_user_tables

 n_tup_upd | n_tup_hot_upd

-----------+---------------

      1000 |          1000

So it's still counted as "WARM" - does it make sense? I mean, we're

creating a WARM chain on the page, yet we have to add pointers into all

indexes (so not really saving anything). Doesn't this waste the one WARM

update per HOT chain without actually getting anything in return?

The way this is piggy-backed on the current HOT statistics seems a bit

strange for another reason, although WARM is a relaxed version of HOT.

Until now, HOT was "all or nothing" - we've either added index entries

to all indexes or none of them. So the n_tup_hot_upd was fine.

But WARM changes that - it allows adding index entries only to a subset

of indexes, which means the "per row" n_tup_hot_upd counter is not

sufficient. When you have a table with 10 indexes, and the counter

increases by 1, does that mean the update added index tuple to 1 index

or 9 of them?

So I think we'll need two counters to track WARM - the number of index
tuples we've added, and the number of index tuples we've skipped; something
like blks_hit and blks_read.
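For illustration, something along these lines (names invented here, not an
actual pgstat interface):

#include "postgres.h"

/* Hypothetical per-index counters, analogous to blks_read/blks_hit. */
typedef struct WarmIndexCounters
{
    int64   tuples_inserted;    /* index entries actually added by UPDATEs */
    int64   tuples_skipped;     /* index entries avoided thanks to WARM */
} WarmIndexCounters;

/* fraction of potential index inserts that WARM managed to skip */
static double
warm_skip_fraction(const WarmIndexCounters *c)
{
    int64   total = c->tuples_inserted + c->tuples_skipped;

    return (total > 0) ? ((double) c->tuples_skipped / total) : 0.0;
}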

I'm not sure whether we should
replace the n_tup_hot_upd entirely, or keep it for backwards

compatibility (and to track perfectly HOT updates).

regards

--

Tomas Vondra http://www.2ndQuadrant.com

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

On Wed, Oct 5, 2016 at 1:43 PM, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>

wrote:

>

>

> I've been looking at the patch over the past few days, running a bunch of

> benchmarks etc.

Thanks for doing that.

> I can confirm the significant speedup, often by more than 75% (depending

> on number of indexes, whether the data set fits into RAM, etc.). Similarly

> for the amount of WAL generated, although that's a bit more difficult to

> evaluate due to full_page_writes.

>

> I'm not going to send detailed results, as that probably does not make

> much sense at this stage of the development - I can repeat the tests once

> the open questions get resolved.

>

Sure. Anything that stands out? Any regression that you see? I'm not sure

if your benchmarks exercise the paths which might show overheads without

any tangible benefits. For example, I wonder if a test with many indexes

where most of them get updated and then querying the table via those

updated indexes could be one such test case.

>

> There's a lot of useful and important feedback in the thread(s) so far,

> particularly the descriptions of various failure cases. I think it'd be

> very useful to collect those examples and turn them into regression tests -

> that's something the patch should include anyway.

>

Sure. I added only a handful of test cases which I knew regression isn't

covering. But I'll write more of them. One good thing is that the code gets

heavily exercised even during regression. I caught and fixed multiple bugs

running regression. I'm not saying that's enough, but it certainly gives

some confidence.

>

> and update:

>

> update t set a = a+1, b=b+1;

>

> which has to update all indexes on the table, but:

>

> select n_tup_upd, n_tup_hot_upd from pg_stat_user_tables

>

> n_tup_upd | n_tup_hot_upd

> -----------+---------------

> 1000 | 1000

>

> So it's still counted as "WARM" - does it make sense?

No, it does not. The code currently just marks any update as a WARM update

if the table supports it and there is enough free space in the page. And

yes, you're right. It's worth fixing that because of one-WARM update per

chain limitation. Will fix.

>

> The way this is piggy-backed on the current HOT statistics seems a bit

> strange for another reason,

Agree. We could add a similar n_tup_warm_upd counter.

> But WARM changes that - it allows adding index entries only to a subset of

> indexes, which means the "per row" n_tup_hot_upd counter is not sufficient.

> When you have a table with 10 indexes, and the counter increases by 1, does

> that mean the update added index tuple to 1 index or 9 of them?

>

How about having counters similar to n_tup_ins/n_tup_del for indexes as

well? Today it does not make sense because every index gets the same number

of inserts, but WARM will change that.

For example, we could have idx_tup_insert and idx_tup_delete that show up

in pg_stat_user_indexes. I don't know if idx_tup_delete adds any value, but

one can then look at idx_tup_insert for various indexes to get a sense of
which indexes receive more inserts than others. The indexes which receive
more inserts are the ones being frequently updated as compared to other
indexes.

This also relates to vacuuming strategies. Today, HOT updates do not count

for triggering vacuum (or to be more precise, HOT pruned tuples are

discounted while counting dead tuples). WARM tuples get the same treatment

as far as pruning is concerned, but since they cause fresh index inserts, I

wonder if we need some mechanism to cleanup the dead line pointers and dead

index entries. This will become more important if we do something to

convert WARM chains into HOT chains, something that only VACUUM can do in

the design I've proposed so far.

Thanks,

Pavan

--

Pavan Deolasee http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Training & Services





On 10/06/2016 07:36 AM, Pavan Deolasee wrote:

>

>

> On Wed, Oct 5, 2016 at 1:43 PM, Tomas Vondra

> <tomas(dot)vondra(at)2ndquadrant(dot)com <mailto:tomas(dot)vondra(at)2ndquadrant(dot)com>> wrote:

>

...

> I can confirm the significant speedup, often by more than 75%

> (depending on number of indexes, whether the data set fits into RAM,

> etc.). Similarly for the amount of WAL generated, although that's a

> bit more difficult to evaluate due to full_page_writes.

>

> I'm not going to send detailed results, as that probably does not

> make much sense at this stage of the development - I can repeat the

> tests once the open questions get resolved.

>

>

> Sure. Anything that stands out? Any regression that you see? I'm not

> sure if your benchmarks exercise the paths which might show overheads

> without any tangible benefits. For example, I wonder if a test with many

> indexes where most of them get updated and then querying the table via

> those updated indexes could be one such test case.

>

No, nothing that would stand out. Let me explain what benchmark(s) I've

done. I've made some minor mistakes when running the benchmarks, so I

plan to rerun them and post the results after that. So let's take the

data with a grain of salt.

My goal was to compare current non-HOT behavior (updating all indexes)

with WARM (updating only indexes on modified columns), and I've
taken two approaches:

1) fixed number of indexes, update variable number of columns

Create a table with 8 secondary indexes and then run a bunch of
benchmarks updating an increasing number of columns. So the first run did

UPDATE t SET c1 = c1+1 WHERE id = :id;

while the second did

UPDATE t SET c1 = c1+1, c2 = c2+1 WHERE id = :id;

and so on, up to updating all the columns in the last run. I've used

multiple scripts to update all the columns / indexes uniformly

(essentially using multiple "-f" flags with pgbench). The runs were

fairly long (2h, enough to get stable behavior).

For a small data set (fits into RAM), the results look like this:

    master   patched   diff

1 5994 8490 +42%

2 4347 7903 +81%

3 4340 7400 +70%

4 4324 6929 +60%

5 4256 6495 +52%

6 4253 5059 +19%

7 4235 4534 +7%

8 4194 4237 +1%

and the amount of WAL generated (after correction for the tps difference)
looks like this (numbers are MBs):

    master   patched   diff

1 27257 18508 -32%

2 21753 14599 -33%

3 21912 15864 -28%

4 22021 17135 -22%

5 21819 18258 -16%

6 21929 20659 -6%

7 21994 22234 +1%

8 21851 23267 +6%

So this is quite a significant difference. I'm pretty sure the minor WAL

increase for the last two runs is due to full page writes (which also

affects the preceding runs, making the WAL reduction smaller than the

tps increase).

I do have results for larger data sets (>RAM); the results are very
similar although the speedup seems a bit smaller. But I need to rerun those.

2) single-row update, adding indexes between runs

This is kinda the opposite of the previous approach, i.e. transactions

always update a single column (multiple scripts to update the columns

uniformly), but there are new indexes added between runs. The results

(for a large data set, exceeding RAM) look like this:

    master   patched   diff

0 954 1404 +47%

1 701 1045 +49%

2 484 816 +70%

3 346 683 +97%

4 248 608 +145%

5 190 525 +176%

6 152 397 +161%

7 123 315 +156%

8 123 270 +119%

So this looks really interesting.

>

> There's a lot of useful and important feedback in the thread(s) so

> far, particularly the descriptions of various failure cases. I think

> it'd be very useful to collect those examples and turn them into

> regression tests - that's something the patch should include anyway.

>

>

> Sure. I added only a handful test cases which I knew regression isn't

> covering. But I'll write more of them. One good thing is that the code

> gets heavily exercised even during regression. I caught and fixed

> multiple bugs running regression. I'm not saying that's enough, but it

> certainly gives some confidence.

>

I don't see any changes to src/test in the patch, so I'm not sure what

you mean when you say you added a handful of test cases?

>

>

> and update:

>

> update t set a = a+1, b=b+1;

>

> which has to update all indexes on the table, but:

>

> select n_tup_upd, n_tup_hot_upd from pg_stat_user_tables

>

> n_tup_upd | n_tup_hot_upd

> -----------+---------------

> 1000 | 1000

>

> So it's still counted as "WARM" - does it make sense?

>

>

> No, it does not. The code currently just marks any update as a WARM

> update if the table supports it and there is enough free space in the

> page. And yes, you're right. It's worth fixing that because of one-WARM

> update per chain limitation. Will fix.

>

Hmmm, so this makes monitoring of %WARM during benchmarks less reliable

than I hoped for :-(

>

> The way this is piggy-backed on the current HOT statistics seems a

> bit strange for another reason,

>

>

> Agree. We could add a similar n_tup_warm_upd counter.

>

Yes, although HOT is a special case of WARM. But it probably makes sense

to differentiate them, I guess.

>

> But WARM changes that - it allows adding index entries only to a

> subset of indexes, which means the "per row" n_tup_hot_upd counter

> is not sufficient. When you have a table with 10 indexes, and the

> counter increases by 1, does that mean the update added index tuple

> to 1 index or 9 of them?

>

>

> How about having counters similar to n_tup_ins/n_tup_del for indexes

> as well? Today it does not make sense because every index gets the

> same number of inserts, but WARM will change that.

>

> For example, we could have idx_tup_insert and idx_tup_delete that shows

> up in pg_stat_user_indexes. I don't know if idx_tup_delete adds any

> value, but one can then look at idx_tup_insert for various indexes to

> get a sense which indexes receives more inserts than others. The indexes

> which receive more inserts are the ones being frequently updated as

> compared to other indexes.

>

Hmmm, I'm not sure that'll work. I mean, those metrics would be useful

(although I can't think of a use case for idx_tup_delete), but I'm not

sure it's enough to measure WARM.

We need to compute

    index_tuples_inserted / index_tuples_total

where (index_tuples_total - index_tuples_inserted) is the number of

index tuples we've been able to skip thanks to WARM. So we'd also need

to track the number of index tuples that we skipped for the index, and

I'm not sure that's a good idea.

Also, we really don't care about inserted tuples - what matters for WARM

are updates, so idx_tup_insert is either useless (because it also

includes non-UPDATE entries) or the naming is misleading.

> This also relates to vacuuming strategies. Today HOT updates do not

> count for triggering vacuum (or to be more precise, HOT pruned tuples

> are discounted while counting dead tuples). WARM tuples get the same

> treatment as far as pruning is concerned, but since they cause fresh

> index inserts, I wonder if we need some mechanism to cleanup the dead

> line pointers and dead index entries. This will become more important if

> we do something to convert WARM chains into HOT chains, something that

> only VACUUM can do in the design I've proposed so far.

> True.

regards

--

Tomas Vondra http://www.2ndQuadrant.com

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Thanks for the patch. This shows a very good performance improvement.

I started reviewing the patch, and during this process I ran the
regression tests on the WARM patch. I observed a failure in the
create_index test. This may be a bug in the code, or expected behaviour
that needs a corrected test output.

Regards,

Hari Babu

Fujitsu Australia



On Tue, Nov 8, 2016 at 9:13 AM, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>

wrote:

> Thanks for the patch. This shows a very good performance improvement.

>

>

Thank you. Can you please share the benchmark you ran, results and

observations?

> I started reviewing the patch, during this process and I ran the regression

> test on the WARM patch. I observed a failure in create_index test.

> This may be a bug in code or expected that needs to be corrected.

Can you please share the diff? I ran regression after applying the patch on
the current master and did not find any failure. Does it happen
consistently?

I'm also attaching a fresh set of patches. The first patch hasn't changed at

all (though I changed the name to v5 to keep it consistent with the other

patch). The second patch has the following changes:

1. WARM updates are now tracked separately. We still don't count the number of

index inserts separately as suggested by Tomas.

2. We don't do a WARM update if all columns referenced by all indexes have

changed. Ideally, we should check if all indexes will require an update and

avoid WARM. So there is still some room for improvement here.

3. I added a very minimal regression test case. But really, it just

contains one test case which I specifically wanted to test.

So not a whole lot of changes since the last version. I'm still waiting for

some serious review of the design/code before I spend a lot more time on

the patch. I hope the patch receives some attention in this CF.

Thanks,
Pavan

--

Pavan Deolasee http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Training & Services



Attachments:
  0001_track_root_lp_v5.patch (application/octet-stream, 27.2 KB)
  0002_warm_updates_v5.patch (application/octet-stream, 103.5 KB)

On Sat, Nov 12, 2016 at 10:12 PM, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>

wrote:

>

> On Tue, Nov 8, 2016 at 9:13 AM, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>

> wrote:

>

>>

>> Thanks for the patch. This shows a very good performance improvement.

>>

>>

> Thank you. Can you please share the benchmark you ran, results and

> observations?

I just ran a performance test on my laptop with minimal configuration; it
didn't show much improvement. Currently I don't have access to a big
machine to test the performance.

> I started reviewing the patch, during this process and I ran the regression

>> test on the WARM patch. I observed a failure in create_index test.

>> This may be a bug in code or expected that needs to be corrected.

>>

>

> Can you please share the diff? I ran regression after applying patch on

> the current master and did not find any change? Does it happen consistently?

Yes, it is happening consistently. I ran make installcheck. Attached

the regression.diffs file with the failed test.

I applied the previous warm patch on this commit -

e3e66d8a9813d22c2aa027d8f373a96d4d4c1b15.

Regards,

Hari Babu

Fujitsu Australia



Attachments:
  regression.diffs (application/octet-stream, 1.0 KB)

On Tue, Nov 15, 2016 at 5:58 PM, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>

wrote:

>

> On Sat, Nov 12, 2016 at 10:12 PM, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com

> > wrote:

>

>>

>>

>> On Tue, Nov 8, 2016 at 9:13 AM, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>

>> wrote:

>>

>>>

>>> Thanks for the patch. This shows a very good performance improvement.

>>>

>>>

>> Thank you. Can you please share the benchmark you ran, results and

>> observations?

>>

>

> I just ran a performance test on my laptop with minimal configuration, it

> didn't show much

> improvement, currently I don't have access to a big machine to test the

> performance.

>

>

>

>> I started reviewing the patch, during this process and I ran the

>>> regression

>>> test on the WARM patch. I observed a failure in create_index test.

>>> This may be a bug in code or expected that needs to be corrected.

>>>

>>

>> Can you please share the diff? I ran regression after applying patch on

>> the current master and did not find any change? Does it happen consistently?

>>

>

>

> Yes, it is happening consistently. I ran the make installcheck. Attached

> the regression.diffs file with the failed test.

> I applied the previous warm patch on this commit

> - e3e66d8a9813d22c2aa027d8f373a96d4d4c1b15

Are you able to reproduce the issue?

Currently the patch is moved to the next CF with "needs review" state.

Regards,

Hari Babu

Fujitsu Australia



On Fri, Dec 2, 2016 at 8:34 AM, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>

wrote:

>

> On Tue, Nov 15, 2016 at 5:58 PM, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>

> wrote:

>

>>

>>

>>

>> Yes, it is happening consistently. I ran the make installcheck. Attached

>> the regression.diffs file with the failed test.

>> I applied the previous warm patch on this commit

>> - e3e66d8a9813d22c2aa027d8f373a96d4d4c1b15

>>

>

>

> Are you able to reproduce the issue?

>

>

Apologies for the delay. I could reproduce this on a different environment.

It was a case of an uninitialised variable, hence the inconsistent results.

I've updated the patches after fixing the issue. Multiple rounds of

regression passes for me without any issue. Please let me know if it works

for you.

> Currently the patch is moved to next CF with "needs review" state.

>

>

Thanks,

Pavan

--

Pavan Deolasee http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Training & Services



Attachments:
  0001_track_root_lp_v6.patch (application/octet-stream, 27.2 KB)
  0002_warm_updates_v6.patch (application/octet-stream, 106.7 KB)

I noticed that this patch changes HeapSatisfiesHOTAndKeyUpdate() by

adding one more set of attributes to check, and one more output boolean

flag. My patch to add indirect indexes also modifies that routine to

add the same set of things. I think after committing both these

patches, the API is going to be fairly ridiculous. I propose to use a

different approach.

With your WARM and my indirect indexes, plus the additions for for-key

locks, plus identity columns, there is no longer a real expectation that

we can exit early from the function. In your patch, as well as mine,

there is a semblance of optimization that tries to avoid computing the

updated_attrs output bitmapset if the pointer is not passed in, but it's

effectively pointless because the only interesting use case is from

ExecUpdate() which always activates the feature. Can we just agree to

drop that?

If we do drop that, then the function can become much simpler: compare

all columns in new vs. old, return output bitmapset of changed columns.

Then "satisfies_hot" and all the other boolean output flags can be

computed simply in the caller by doing bms_overlap(), roughly along the
lines of the sketch below. Thoughts?
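A minimal sketch of the caller-side pattern being proposed (the names here
are illustrative, not necessarily what a final patch would use):

    Bitmapset  *modified_attrs;

    /* compute the set of changed columns once ... */
    modified_attrs = HeapDetermineModifiedColumns(relation, interesting_attrs,
                                                  &oldtup, newtup);

    /* ... then derive each feature's flag with a simple overlap test */
    use_hot_update = !bms_overlap(modified_attrs, hot_attrs);
    key_intact = !bms_overlap(modified_attrs, key_attrs);
    id_has_changed = bms_overlap(modified_attrs, id_attrs);

--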

Álvaro Herrera https://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Alvaro Herrera wrote:

> With your WARM and my indirect indexes, plus the additions for for-key

> locks, plus identity columns, there is no longer a real expectation that

> we can exit early from the function. In your patch, as well as mine,

> there is a semblance of optimization that tries to avoid computing the

> updated_attrs output bitmapset if the pointer is not passed in, but it's

> effectively pointless because the only interesting use case is from

> ExecUpdate() which always activates the feature. Can we just agree to

> drop that?

I think the only case that gets worse is the path that does

simple_heap_update, which is used for DDL. I would be very surprised if

a change there is noticeable, when compared to the rest of the stuff

that goes on for DDL commands.

Now, after saying that, I think that a table with a very large number of

columns is going to be affected by this. But we don't really need to

compute the output bits for every single column -- we only care about

those that are covered by some index. So we should pass an input

bitmapset comprising all such columns, and the output bitmapset only

considers those columns, and ignores columns not indexed. My patch for

indirect indexes already does something similar (though it passes a

bitmapset of columns indexed by indirect indexes only, so it needs a

tweak there.)

--

Álvaro Herrera https://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

On 2 December 2016 at 07:36, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com> wrote:

>

> I've updated the patches after fixing the issue. Multiple rounds of

> regression passes for me without any issue. Please let me know if it works

> for you.

Hi Pavan,

Today I was playing with your patch and running some tests, and found
some problems I wanted to report before I forget them ;)

* You need to add a prototype in src/backend/utils/adt/pgstatfuncs.c:
extern Datum pg_stat_get_tuples_warm_updated(PG_FUNCTION_ARGS);

* The isolation test for partial_index fails (attached the regression.diffs)

* running a home-made test I have at hand I got this assertion:

"""

TRAP: FailedAssertion("!(buf_state & (1U << 24))", File: "bufmgr.c", Line: 837)

LOG: server process (PID 18986) was terminated by signal 6: Aborted

"""

To reproduce:

1) run prepare_test.sql

2) then run the following pgbench command (sql scripts attached):

pgbench -c 24 -j 24 -T 600 -n -f inserts(dot)sql(at)15 -f updates_1(dot)sql(at)20 -f

updates_2(dot)sql(at)20 -f deletes(dot)sql(at)45 db_test

* sometimes when I have made the server crash, the recovery attempt

fails with this assertion:

"""

LOG: database system was not properly shut down; automatic recovery in progress

LOG: redo starts at 0/157F970

TRAP: FailedAssertion("!(!warm_update)", File: "heapam.c", Line: 8924)

LOG: startup process (PID 14031) was terminated by signal 6: Aborted

LOG: aborting startup due to startup process failure

"""

still cannot reproduce this one consistently, but it happens often enough

will continue playing with it...

--

Jaime Casanova www.2ndQuadrant.com

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Attachments:
  regression.diffs (application/octet-stream, 6.7 KB)
  deletes.sql (application/sql, 253 bytes)
  inserts.sql (application/sql, 98 bytes)
  prepare_test.sql (application/sql, 453 bytes)
  updates_1.sql (application/sql, 263 bytes)
  updates_2.sql (application/sql, 286 bytes)

Jaime Casanova wrote:

> * The isolation test for partial_index fails (attached the regression.diffs)

Hmm, I had a very similar (if not identical) failure with indirect

indexes; in my case it was a bug in RelationGetIndexAttrBitmap() -- I

was missing to have HOT considerate the columns in index predicate, that

is, the second pull_varattnos() call.

--

Álvaro Herrera https://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Alvaro Herrera wrote:

> Jaime Casanova wrote:

>

> > * The isolation test for partial_index fails (attached the regression.diffs)

>

> Hmm, I had a very similar (if not identical) failure with indirect

> indexes; in my case it was a bug in RelationGetIndexAttrBitmap() -- I

> was missing to have HOT considerate the columns in index predicate, that

> is, the second pull_varattnos() call.

Sorry, I meant:

Hmm, I had a very similar (if not identical) failure with indirect

indexes; in my case it was a bug in RelationGetIndexAttrBitmap() -- I

was missing to have HOT [take into account] the columns in index predicate, that

is, the second pull_varattnos() call.

--

Álvaro Herrera https://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

On Sat, Dec 24, 2016 at 1:18 AM, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>

wrote:

> Alvaro Herrera wrote:

>

> > With your WARM and my indirect indexes, plus the additions for for-key

> > locks, plus identity columns, there is no longer a real expectation that

> > we can exit early from the function. In your patch, as well as mine,

> > there is a semblance of optimization that tries to avoid computing the

> > updated_attrs output bitmapset if the pointer is not passed in, but it's

> > effectively pointless because the only interesting use case is from

> > ExecUpdate() which always activates the feature. Can we just agree to

> > drop that?

>

>

Yes, I agree. As you noted below, the only case that may be affected is

simple_heap_update() which does a lot more and hence this function will be

least of the worries.

> I think the only case that gets worse is the path that does

> simple_heap_update, which is used for DDL. I would be very surprised if

> a change there is noticeable, when compared to the rest of the stuff

> that goes on for DDL commands.

>

> Now, after saying that, I think that a table with a very large number of

> columns is going to be affected by this. But we don't really need to

> compute the output bits for every single column -- we only care about

> those that are covered by some index. So we should pass an input

> bitmapset comprising all such columns, and the output bitmapset only

> considers those columns, and ignores columns not indexed. My patch for

> indirect indexes already does something similar (though it passes a

> bitmapset of columns indexed by indirect indexes only, so it needs a

> tweak there.)

>

>

Yes, that looks like a good compromise. This would require us to compare

only those columns that any caller of the function might be interested in.

Thanks,
Pavan

--

Pavan Deolasee http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Training & Services



On Mon, Dec 26, 2016 at 11:49 AM, Jaime Casanova <

jaime(dot)casanova(at)2ndquadrant(dot)com> wrote:

> On 2 December 2016 at 07:36, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>

> wrote:

> >

> > I've updated the patches after fixing the issue. Multiple rounds of

> > regression passes for me without any issue. Please let me know if it

> works

> > for you.

> >

>

> Hi Pavan,

>

> Today i was playing with your patch and running some tests and found

> some problems i wanted to report before i forget them ;)

>

>

Thanks Jaime for the tests and bug reports. I'm attaching an add-on patch

which fixes these issues for me. I'm deliberately not sending a fresh

revision because the changes are still minor.

> * You need to add a prototype in src/backend/utils/adt/pgstatfuncs.c:

> extern Datum pg_stat_get_tuples_warm_updated(PG_FUNCTION_ARGS);

Added.

> * The isolation test for partial_index fails (attached the

> regression.diffs)

Fixed. Looks like I forgot to include attributes from predicates and
expressions in the list of index attributes (as pointed out by Alvaro).
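For reference, the missing piece is essentially a second pull_varattnos()
pass over the index's expressions and predicate when building the
attribute bitmap (a sketch; the variable names follow the usual
RelationGetIndexAttrBitmap() style and are not verbatim from the patch):

    /* collect columns referenced by index expressions and predicates too */
    pull_varattnos((Node *) indexInfo->ii_Expressions, 1, &indexattrs);
    pull_varattnos((Node *) indexInfo->ii_Predicate, 1, &indexattrs);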

> * running a home-made test i have at hand i got this assertion:

> """

> TRAP: FailedAssertion("!(buf_state & (1U << 24))", File: "bufmgr.c", Line:

> 837)

> LOG: server process (PID 18986) was terminated by signal 6: Aborted

> """

> To reproduce:

> 1) run prepare_test.sql

> 2) then run the following pgbench command (sql scripts attached):

> pgbench -c 24 -j 24 -T 600 -n -f inserts(dot)sql(at)15 -f updates_1(dot)sql(at)20 -f

> updates_2(dot)sql(at)20 -f deletes(dot)sql(at)45 db_test

>

>

Looks like the patch was failing to set the block number correctly in the

t_ctid field, leading to these strange failures. There were also a couple of

instances where the t_ctid field was being accessed directly, instead of

the newly added macro. I think we need some better mechanism to ensure that

we don't miss out on such things. But I don't have a very good idea about

doing that right now.

> * sometimes when i have made the server crash the attempt to recovery

> fails with this assertion:

> """

> LOG: database system was not properly shut down; automatic recovery in

> progress

> LOG: redo starts at 0/157F970

> TRAP: FailedAssertion("!(!warm_update)", File: "heapam.c", Line: 8924)

> LOG: startup process (PID 14031) was terminated by signal 6: Aborted

> LOG: aborting startup due to startup process failure

> """

> still cannot reproduce this one consistently but happens often enough

>

>

This could be a case of an uninitialised variable in log_heap_update().
What surprises me, though, is that none of the compilers I tried so far
could catch

that. In the following code snippet, if the condition evaluates to false

then "warm_update" may remain uninitialised, leading to wrong xlog entry,

which may later result in an assertion failure during redo recovery
(around line 7846 of heapam.c):

    if (HeapTupleIsHeapWarmTuple(newtup))
        warm_update = true;
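The fix, presumably, is simply to initialise the flag at its declaration;
a minimal sketch, assuming the variable is local to log_heap_update():

    bool        warm_update = false;    /* never read while undefined */

    if (HeapTupleIsHeapWarmTuple(newtup))
        warm_update = true;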

Thanks,

Pavan

--

Pavan Deolasee http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Training & Services



Attachments:
  0003_warm_fixes_v6.patch (application/octet-stream, 3.7 KB)

On Tue, Dec 27, 2016 at 6:51 PM, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>

wrote:

>

> Thanks Jaime for the tests and bug reports. I'm attaching an add-on patch

> which fixes these issues for me. I'm deliberately not sending a fresh

> revision because the changes are still minor.

>

Per Alvaro's request in another thread, I've rebased these patches on his
patch to refactor HeapSatisfiesHOTandKeyUpdate(). I've also attached that
patch here for easy reference.

The fixes based on bug reports by Jaime are also included in this patch

set. Other than that there are not any significant changes. The patch still

disables WARM on system tables, something I would like to fix. But I've

been delaying that because it will require changes at several places since

indexes on system tables are managed separately. In addition to that, the

patch only works with btree and hash indexes. We must implement the recheck

method for other index types so as to support them. Thanks,

Pavan --

Pavan Deolasee http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Training & Services



Attachments:
  interesting-attrs-2.patch (application/octet-stream, 11.7 KB)
  0001_track_root_lp_v7.patch (application/octet-stream, 27.4 KB)
  0002_warm_updates_v7.patch (application/octet-stream, 103.7 KB)

On Tue, Jan 3, 2017 at 9:43 AM, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>

wrote:

> The patch still disables WARM on system tables, something I would like to

> fix. But I've been delaying that because it will require changes at several

> places since indexes on system tables are managed separately.

Here is another version which fixes a bug that I discovered while adding

support for system tables. The patch set now also includes a patch to

enable WARM on system tables. I'm attaching that as a separate patch

because while the changes to support WARM on system tables are many, almost

all of them are purely mechanical. We need to pass additional information

to CatalogUpdateIndexes()/CatalogIndexInsert(). We need to tell these

routines whether the update leading to them was a WARM update and which

columns were modified so that it can correctly avoid adding new index

tuples for indexes for which index keys haven't changed.

I wish I could find another way of passing this information instead of

making changes at so many places, but the only other way I could think of

was tracking that information as part of the HeapTuple itself, which

doesn't seem nice and may also require changes at many call sites where

tuples are constructed. One minor improvement could be that instead of two,

we could just pass "modified_attrs" and a NULL value may imply non-WARM

update. Other suggestions are welcome though.
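To picture the simpler variant, the signature could be something like the
following sketch (hypothetical, not the patch's final form):

    /* hypothetical: NULL modified_attrs means "not a WARM update" */
    void
    CatalogIndexInsert(CatalogIndexState indstate, HeapTuple heapTuple,
                       Bitmapset *modified_attrs);

I'm quite happy that all tests pass even after adding support for system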

tables. One reason for testing support for system tables was to ensure some

more code paths get exercised. As before, I've included Alvaro's

refactoring patch too.

Thanks,
Pavan

--

Pavan Deolasee http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Training & Services



Attachments:
  0001_track_root_lp_v8.patch (application/octet-stream, 27.4 KB)
  0002_warm_updates_v8.patch (application/octet-stream, 102.7 KB)
  0003_warm_fixes_v6.patch (application/octet-stream, 3.7 KB)
  interesting-attrs-2.patch (application/octet-stream, 11.7 KB)

Reading through the track_root_lp patch now.

> + /*

> + * For HOT (or WARM) updated tuples, we store the offset of the root

> + * line pointer of this chain in the ip_posid field of the new tuple.

> + * Usually this information will be available in the corresponding

> + * field of the old tuple. But for aborted updates or pg_upgraded

> + * databases, we might be seeing the old-style CTID chains and hence

> + * the information must be obtained by hard way

> + */

> + if (HeapTupleHeaderHasRootOffset(oldtup.t_data))

> + root_offnum = HeapTupleHeaderGetRootOffset(oldtup.t_data);

> + else

> + heap_get_root_tuple_one(page,

> + ItemPointerGetOffsetNumber(&(oldtup.t_self)),

> + &root_offnum);

Hmm. So the HasRootOffset tests the HEAP_LATEST_TUPLE bit, which is
reset temporarily during an update. So that case shouldn't occur often.

Oh, I just noticed that HeapTupleHeaderSetNextCtid also clears the flag.

> @@ -4166,10 +4189,29 @@ l2:

> HeapTupleClearHotUpdated(&oldtup);

> HeapTupleClearHeapOnly(heaptup);

> HeapTupleClearHeapOnly(newtup);

> + root_offnum = InvalidOffsetNumber;

> }

>

> - RelationPutHeapTuple(relation, newbuf, heaptup, false); /* insert new tuple */

> + /* insert new tuple */

> + RelationPutHeapTuple(relation, newbuf, heaptup, false, root_offnum);

> + HeapTupleHeaderSetHeapLatest(heaptup->t_data);

> + HeapTupleHeaderSetHeapLatest(newtup->t_data);

>

> + /*

> + * Also update the in-memory copy with the root line pointer information

> + */

> + if (OffsetNumberIsValid(root_offnum))

> + {

> + HeapTupleHeaderSetRootOffset(heaptup->t_data, root_offnum);

> + HeapTupleHeaderSetRootOffset(newtup->t_data, root_offnum);

> + }

> + else

> + {

> + HeapTupleHeaderSetRootOffset(heaptup->t_data,

> + ItemPointerGetOffsetNumber(&heaptup->t_self));

> + HeapTupleHeaderSetRootOffset(newtup->t_data,

> + ItemPointerGetOffsetNumber(&heaptup->t_self));

> + }

This is repetitive. I think after RelationPutHeapTuple it'd be better

to assign root_offnum = &heaptup->t_self, so that we can just call

SetRootOffset() on each tuple without the if().

> + HeapTupleHeaderSetHeapLatest((HeapTupleHeader) item);

> + if (OffsetNumberIsValid(root_offnum))

> + HeapTupleHeaderSetRootOffset((HeapTupleHeader) item,

> + root_offnum);

> + else

> + HeapTupleHeaderSetRootOffset((HeapTupleHeader) item,

> + offnum);

Just a matter of style, but this reads nicer IMO:

    HeapTupleHeaderSetRootOffset((HeapTupleHeader) item,
        OffsetNumberIsValid(root_offnum) ? root_offnum : offnum);

> @@ -740,8 +742,9 @@ heap_page_prune_execute(Buffer buffer,

> * holds a pin on the buffer. Once pin is released, a tuple might be pruned

> * and reused by a completely unrelated tuple.

> */

> -void

> -heap_get_root_tuples(Page page, OffsetNumber *root_offsets)

> +static void

> +heap_get_root_tuples_internal(Page page, OffsetNumber target_offnum,

> + OffsetNumber *root_offsets)

> {

> OffsetNumber offnum,

I think this function deserves more/better/updated commentary.

> @@ -439,7 +439,9 @@ rewrite_heap_tuple(RewriteState state,

> * set the ctid of this tuple to point to the new location, and

> * insert it right away.

> */

> - new_tuple->t_data->t_ctid = mapping->new_tid;

> + HeapTupleHeaderSetNextCtid(new_tuple->t_data,

> + ItemPointerGetBlockNumber(&mapping->new_tid),

> + ItemPointerGetOffsetNumber(&mapping->new_tid));

I think this would be nicer:

    HeapTupleHeaderSetNextTid(new_tuple->t_data, &mapping->new_tid);

AFAICS all the callers are doing ItemPointerGetFoo for a TID, so this is

overly verbose for no reason. Also, the "c" in Ctid stands for

"current"; I think we can omit that. > @@ -525,7 +527,9 @@ rewrite_heap_tuple(RewriteState state,

> new_tuple = unresolved->tuple;

> free_new = true;

> old_tid = unresolved->old_tid;

> - new_tuple->t_data->t_ctid = new_tid;

> + HeapTupleHeaderSetNextCtid(new_tuple->t_data,

> + ItemPointerGetBlockNumber(&new_tid),

> + ItemPointerGetOffsetNumber(&new_tid));

Did you forget to SetHeapLatest here, or ..? (If not, a comment is
warranted).

> diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c

> index 32bb3f9..466609c 100644

> --- a/src/backend/executor/execMain.c

> +++ b/src/backend/executor/execMain.c

> @@ -2443,7 +2443,7 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,

> * As above, it should be safe to examine xmax and t_ctid without the

> * buffer content lock, because they can't be changing.

> */

> - if (ItemPointerEquals(&tuple.t_self, &tuple.t_data->t_ctid))

> + if (HeapTupleHeaderIsHeapLatest(tuple.t_data, tuple.t_self))

> {

> /* deleted, so forget about it */

> ReleaseBuffer(buffer);

This is the place where this patch would have an effect. To test this
bit I think we're going to need an ad-hoc stress-test harness.

> +/*

> + * If HEAP_LATEST_TUPLE is set in the last tuple in the update chain. But for

> + * clusters which are upgraded from pre-10.0 release, we still check if c_tid

> + * is pointing to itself and declare such tuple as the latest tuple in the

> + * chain

> + */

> +#define HeapTupleHeaderIsHeapLatest(tup, tid) \

> +( \

> + ((tup)->t_infomask2 & HEAP_LATEST_TUPLE) || \

> + ((ItemPointerGetBlockNumber(&(tup)->t_ctid) == ItemPointerGetBlockNumber(&tid)) && \

> + (ItemPointerGetOffsetNumber(&(tup)->t_ctid) == ItemPointerGetOffsetNumber(&tid))) \

> +)

Please add a "!= 0" to the first arm of the ||, so that we return a
boolean.
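Something like this, presumably:

    #define HeapTupleHeaderIsHeapLatest(tup, tid) \
    ( \
        (((tup)->t_infomask2 & HEAP_LATEST_TUPLE) != 0) || \
        ((ItemPointerGetBlockNumber(&(tup)->t_ctid) == ItemPointerGetBlockNumber(&tid)) && \
         (ItemPointerGetOffsetNumber(&(tup)->t_ctid) == ItemPointerGetOffsetNumber(&tid))) \
    )

> +/*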

> + * Get TID of next tuple in the update chain. Traditionally, we have stored

> + * self TID in the t_ctid field if the tuple is the last tuple in the chain. We

> + * try to preserve that behaviour by returning self-TID if HEAP_LATEST_TUPLE

> + * flag is set.

> + */

> +#define HeapTupleHeaderGetNextCtid(tup, next_ctid, offnum) \

> +do { \

> + if ((tup)->t_infomask2 & HEAP_LATEST_TUPLE) \

> + { \

> + ItemPointerSet((next_ctid), ItemPointerGetBlockNumber(&(tup)->t_ctid), \

> + (offnum)); \

> + } \

> + else \

> + { \

> + ItemPointerSet((next_ctid), ItemPointerGetBlockNumber(&(tup)->t_ctid), \

> + ItemPointerGetOffsetNumber(&(tup)->t_ctid)); \

> + } \

> +} while (0)

This is a really odd macro, I think. Is any of the callers really

depending on the traditional behavior? If so, can we change them to

avoid that? (I think the "else" can be more easily written with

ItemPointerCopy). In any case, I think the documentation of the macro

leaves a bit to be desired -- I don't think we really care all that much

what we used to do, except perhaps as a secondary comment, but we do

care very much about what it actually does, which the current comment

doesn't really explain.

--

Álvaro Herrera https://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Hi Alvaro,

On Tue, Jan 17, 2017 at 8:41 PM, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
wrote:

> Reading through the track_root_lp patch now.

>

>

Thanks for the review.

> > + /*

> > + * For HOT (or WARM) updated tuples, we store the offset

> of the root

> > + * line pointer of this chain in the ip_posid field of the

> new tuple.

> > + * Usually this information will be available in the

> corresponding

> > + * field of the old tuple. But for aborted updates or

> pg_upgraded

> > + * databases, we might be seeing the old-style CTID chains

> and hence

> > + * the information must be obtained by hard way

> > + */

> > + if (HeapTupleHeaderHasRootOffset(oldtup.t_data))

> > + root_offnum = HeapTupleHeaderGetRootOffset(o

> ldtup.t_data);

> > + else

> > + heap_get_root_tuple_one(page,

> > + ItemPointerGetOffsetNumber(&(

> oldtup.t_self)),

> > + &root_offnum);

>

> Hmm. So the HasRootOffset tests the HEAP_LATEST_TUPLE bit, which is

> reset temporarily during an update. So that case shouldn't occur often.

> Right. The root offset is stored only in those tuples where

HEAP_LATEST_TUPLE is set. This flag should generally be set on the tuples

that are being updated, except for the case when the last update failed and

the flag was cleared. The other common case is going to be a pg-upgraded

cluster where none of the existing tuples will have this flag set. So in

those cases, we must find the root line pointer the hard way.

> Oh, I just noticed that HeapTupleHeaderSetNextCtid also clears the flag.

Yes, but this should happen only during updates, and unless the update

fails, the next-to-be-updated tuple should have the flag set.

> > @@ -4166,10 +4189,29 @@ l2:

> > HeapTupleClearHotUpdated(&oldtup);

> > HeapTupleClearHeapOnly(heaptup);

> > HeapTupleClearHeapOnly(newtup);

> > + root_offnum = InvalidOffsetNumber;

> > }

> >

> > - RelationPutHeapTuple(relation, newbuf, heaptup, false);

> /* insert new tuple */

> > + /* insert new tuple */

> > + RelationPutHeapTuple(relation, newbuf, heaptup, false,

> root_offnum);

> > + HeapTupleHeaderSetHeapLatest(heaptup->t_data);

> > + HeapTupleHeaderSetHeapLatest(newtup->t_data);

> >

> > + /*

> > + * Also update the in-memory copy with the root line pointer

> information

> > + */

> > + if (OffsetNumberIsValid(root_offnum))

> > + {

> > + HeapTupleHeaderSetRootOffset(heaptup->t_data,

> root_offnum);

> > + HeapTupleHeaderSetRootOffset(newtup->t_data, root_offnum);

> > + }

> > + else

> > + {

> > + HeapTupleHeaderSetRootOffset(heaptup->t_data,

> > + ItemPointerGetOffsetNumber(&h

> eaptup->t_self));

> > + HeapTupleHeaderSetRootOffset(newtup->t_data,

> > + ItemPointerGetOffsetNumber(&h

> eaptup->t_self));

> > + }

>

> This is repetitive. I think after RelationPutHeapTuple it'd be better

> to assign root_offnum = &heaptup->t_self, so that we can just call

> SetRootOffset() on each tuple without the if().

> Fixed. I actually ripped off HeapTupleHeaderSetRootOffset() completely and

pushed setting of root line pointer into the HeapTupleHeaderSetHeapLatest().

That seems much cleaner because the system expects to find root line

pointer whenever HEAP_LATEST_TUPLE flag is set. Hence it makes sense to set

them together.
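Just to illustrate the shape this takes -- a sketch, not the patch's exact
text -- the merged macro would do both things at once:

    #define HeapTupleHeaderSetHeapLatest(tup, offnum) \
    do { \
        (tup)->t_infomask2 |= HEAP_LATEST_TUPLE; \
        ItemPointerSetOffsetNumber(&(tup)->t_ctid, (offnum)); \
    } while (0)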

>

> > + HeapTupleHeaderSetHeapLatest((HeapTupleHeader) item);

> > + if (OffsetNumberIsValid(root_offnum))

> > + HeapTupleHeaderSetRootOffset((HeapTupleHeader)

> item,

> > + root_offnum);

> > + else

> > + HeapTupleHeaderSetRootOffset((HeapTupleHeader)

> item,

> > + offnum);

>

> Just a matter of style, but this reads nicer IMO:

>

> HeapTupleHeaderSetRootOffset((HeapTupleHeader) item,

> OffsetNumberIsValid(root_offnum) ? root_offnum : offnum);

> Understood. This code no longer exists in the new patch since

HeapTupleHeaderSetRootOffset is merged with HeapTupleHeaderSetHeapLatest. >

>

> > @@ -740,8 +742,9 @@ heap_page_prune_execute(Buffer buffer,

> > * holds a pin on the buffer. Once pin is released, a tuple might be

> pruned

> > * and reused by a completely unrelated tuple.

> > */

> > -void

> > -heap_get_root_tuples(Page page, OffsetNumber *root_offsets)

> > +static void

> > +heap_get_root_tuples_internal(Page page, OffsetNumber target_offnum,

> > + OffsetNumber *root_offsets)

> > {

> > OffsetNumber offnum,

>

> I think this function deserves more/better/updated commentary.

> Sure. I added more commentary. I also reworked the function so that the

caller can pass just one item array when it's interested in finding root

line pointer for just one item. Hopefully that will save a few bytes on the

stack. >

> > @@ -439,7 +439,9 @@ rewrite_heap_tuple(RewriteState state,

> > * set the ctid of this tuple to point to the new

> location, and

> > * insert it right away.

> > */

> > - new_tuple->t_data->t_ctid = mapping->new_tid;

> > + HeapTupleHeaderSetNextCtid(new_tuple->t_data,

> > + ItemPointerGetBlockNumber(&ma

> pping->new_tid),

> > + ItemPointerGetOffsetNumber(&m

> apping->new_tid));

>

> I think this would be nicer:

> HeapTupleHeaderSetNextTid(new_tuple->t_data, &mapping->new_tid);

> AFAICS all the callers are doing ItemPointerGetFoo for a TID, so this is

> overly verbose for no reason. Also, the "c" in Ctid stands for

> "current"; I think we can omit that.

> Yes, fixed. I realised that all callers where anyways calling the macro

with the block/offset of the same TID. So it makes sense to just pass TID

to the macro. >

> > @@ -525,7 +527,9 @@ rewrite_heap_tuple(RewriteState state,

> > new_tuple = unresolved->tuple;

> > free_new = true;

> > old_tid = unresolved->old_tid;

> > - new_tuple->t_data->t_ctid = new_tid;

> > + HeapTupleHeaderSetNextCtid(ne

> w_tuple->t_data,

> > +

> ItemPointerGetBlockNumber(&new_tid),

> > +

> ItemPointerGetOffsetNumber(&new_tid));

>

> Did you forget to SetHeapLatest here, or ..? (If not, a comment is

> warranted).

> Umm probably not. The way I see it, new_tuple is not actually the new tuple

when this is called, but it's changed to the unresolved tuple (see the

start of the hunk). So what we're doing is setting next CTID in the

previous tuple in the chain. SetHeapLatest is called on the new tuple

inside raw_heap_insert(). I did not add any more comments, but please let

me know if you think it's still confusing or if I'm missing something. >

> > diff --git a/src/backend/executor/execMain.c

> b/src/backend/executor/execMain.c

> > index 32bb3f9..466609c 100644

> > --- a/src/backend/executor/execMain.c

> > +++ b/src/backend/executor/execMain.c

> > @@ -2443,7 +2443,7 @@ EvalPlanQualFetch(EState *estate, Relation

> relation, int lockmode,

> > * As above, it should be safe to examine xmax and t_ctid

> without the

> > * buffer content lock, because they can't be changing.

> > */

> > - if (ItemPointerEquals(&tuple.t_self,

> &tuple.t_data->t_ctid))

> > + if (HeapTupleHeaderIsHeapLatest(tuple.t_data,

> tuple.t_self))

> > {

> > /* deleted, so forget about it */

> > ReleaseBuffer(buffer);

>

> This is the place where this patch would have an effect. To test this

> bit I think we're going to need an ad-hoc stress-test harness.

> Sure. I did some pgbench tests and ran consistency checks during and at the

end of tests. I chose a small scale factor and many clients so that same

tuple is often concurrently updated. That should exercise the new

chain-following code rigorously. But I'll do more of those on a bigger box.

Do you have other suggestions for ad-hoc tests?

>

> > +/*

> > + * If HEAP_LATEST_TUPLE is set in the last tuple in the update chain.

> But for

> > + * clusters which are upgraded from pre-10.0 release, we still check if

> c_tid

> > + * is pointing to itself and declare such tuple as the latest tuple in

> the

> > + * chain

> > + */

> > +#define HeapTupleHeaderIsHeapLatest(tup, tid) \

> > +( \

> > + ((tup)->t_infomask2 & HEAP_LATEST_TUPLE) || \

> > + ((ItemPointerGetBlockNumber(&(tup)->t_ctid) ==

> ItemPointerGetBlockNumber(&tid)) && \

> > + (ItemPointerGetOffsetNumber(&(tup)->t_ctid) ==

> ItemPointerGetOffsetNumber(&tid))) \

> > +)

>

> Please add a "!= 0" to the first arm of the ||, so that we return a

> boolean.

>

>

Done. Also rebased onto the new master, where similar changes have been made.

> > +/*

> > + * Get TID of next tuple in the update chain. Traditionally, we have

> stored

> > + * self TID in the t_ctid field if the tuple is the last tuple in the

> chain. We

> > + * try to preserve that behaviour by returning self-TID if

> HEAP_LATEST_TUPLE

> > + * flag is set.

> > + */

> > +#define HeapTupleHeaderGetNextCtid(tup, next_ctid, offnum) \

> > +do { \

> > + if ((tup)->t_infomask2 & HEAP_LATEST_TUPLE) \

> > + { \

> > + ItemPointerSet((next_ctid), ItemPointerGetBlockNumber(&(tup)->t_ctid),

> \

> > + (offnum)); \

> > + } \

> > + else \

> > + { \

> > + ItemPointerSet((next_ctid), ItemPointerGetBlockNumber(&(tup)->t_ctid),

> \

> > + ItemPointerGetOffsetNumber(&(tup)->t_ctid));

> \

> > + } \

> > +} while (0)

>

> This is a really odd macro, I think. Is any of the callers really

> depending on the traditional behavior? If so, can we change them to

> avoid that? (I think the "else" can be more easily written with

> ItemPointerCopy). In any case, I think the documentation of the macro

> leaves a bit to be desired -- I don't think we really care all that much

> what we used to do, except perhaps as a secondary comment, but we do

> care very much about what it actually does, which the current comment

> doesn't really explain.

>

>

I reworked this quite a bit and I believe the new code does what you

suggested. The HeapTupleHeaderGetNextTid macro is now much simpler (it

just copies the TID) and we leave it to the caller to ensure they don't

call this on a tuple which is already at the end of the chain (i.e has

HEAP_LATEST_TUPLE set, but we don't look for old-style end-of-the-chain

markers). The callers can choose to return the same TID back if their

callers rely on that behaviour. But inside this macro, we now assert that

HEAP_LATEST_TUPLE is not set.
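In essence, the simplified macro looks roughly like this (a sketch based on
the description above; the exact text is in the attached patch):

    #define HeapTupleHeaderGetNextTid(tup, next_ctid) \
    do { \
        AssertMacro(((tup)->t_infomask2 & HEAP_LATEST_TUPLE) == 0); \
        ItemPointerCopy(&(tup)->t_ctid, (next_ctid)); \
    } while (0)

One thing that worried me is whether there exists a path which sets the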

t_infomask (and hence HEAP_LATEST_TUPLE) during redo recovery and if we

will fail to set the root line pointer correctly along with that. But

AFAICS the interesting cases of insert, multi-insert and update are being

handled ok. The only other places where I saw t_infomask being copied as-is

from the WAL record is DecodeXLogTuple() and DecodeMultiInsert(), but those

should not cause any problem AFAICS.

Revised patch is attached. All regression tests, isolation tests and
pgbench tests with -c40 -j10 pass on my laptop.

Thanks,
Pavan

--

Pavan Deolasee http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Training & Services



Attachments:
  0001_track_root_lp_v9.patch (application/octet-stream, 37.3 KB)

On Thu, Jan 19, 2017 at 6:35 PM, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>

wrote: >

>

> Revised patch is attached.

> I've now also rebased the main WARM patch against the current master

(3eaf03b5d331b7a06d79 to be precise). I'm attaching Alvaro's patch to get

interesting attributes (prefixed with 0000 since the other two patches are

based on that). The changes to support system tables are now merged with

the main patch. I could separate them if it helps in review. I am also including a stress test workload that I am currently running to

test WARM's correctness since Robert raised a valid concern about that. The

idea is to include a few more columns in the pgbench_accounts table and

have a few more indexes. The additional columns with indexes kind of share

a relationship with the "aid" column. But instead of a fixed value, values

for these columns can vary within a fixed, non-overlapping range. For

example, for aid = 1, aid1's original value will be 10 and it can vary

between 8 to 12. Similarly, aid2's original value will be 20 and it can

vary between 16 to 24. This setup allows us to update these additional

columns (thus force WARM), but still ensure that we can do some sanity

checks on the results. The test contains a bunch of UPDATE, FOR UPDATE, FOR SHARE transactions.

Some of these transactions commit and some rollback. The checks are

in-place to ensure that we always find exactly one tuple irrespective of

which column we use to fetch the row. Of course, when the aid[1-4] columns

are used to fetch tuples, we need to scan with a range instead of an

equality. Then we do a bunch of operations like CREATE INDEX, DROP INDEX,

CIC, run long transactions, VACUUM FULL etc while the tests are running and

ensure that the sanity checks always pass. We could do a few other things

like, may be marking these indexes as UNIQUE or keeping a long transaction

open while doing updates and other operations. I'll add some of those to

the test, but suggestions are welcome. I do see a problem with CREATE INDEX CONCURRENTLY with these tests, though

everything else has run ok so far (I am yet to do very long running tests.

Probably just a few hours of tests today).

I'm trying to understand why CIC fails to build a consistent index. I think

I've some clue now why it could be happening. With HOT, we don't need to

worry about broken chains since at the very beginning we add the index

tuple and all subsequent updates will honour the new index while deciding

on HOT updates i.e. we won't create any new broken HOT chains once we start

building the index. Later during validation phase, we only need to insert

tuples that are not already in the index. But with WARM, I think the check

needs to be more elaborate. So even if the TID (we always look at its root

line pointer etc) exists in the index, we will need to ensure that the

index key matches the heap tuple we are dealing with. That looks a bit

tricky. May be we can lookup the index using key from the current heap

tuple and then see if we get a tuple with the same TID back. Of course, we

need to do this only if the tuple is a WARM tuple. The other option is that

we collect not only TIDs but also keys while scanning the index. That might

increase the size of the state information for wildly wide indexes. Or may

be to just turn WARM off if there exists a build-in-progress index; a rough
sketch of the first idea follows.

Suggestions/reviews/tests welcome.
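(Sketch only; index_probe_returns_tid() is a hypothetical helper, not
existing code, and the surrounding names merely follow validate_index()'s
style:)

    if (HeapTupleIsHeapWarmTuple(heapTuple) && in_index)
    {
        /* re-derive the key from the current heap tuple ... */
        FormIndexDatum(indexInfo, slot, estate, values, isnull);

        /*
         * ... and only treat the entry as present if probing with that
         * key finds this tuple's root TID; otherwise let the normal
         * path insert the missing entry.
         */
        if (!index_probe_returns_tid(indexRelation, values, isnull, &root_tid))
            in_index = false;
    }

Thanks,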

Pavan

--

Pavan Deolasee http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Training & Services



Attachments:
  warm_stress_test.tar.gz (application/x-gzip, 3.0 KB)
  0000_interesting_attrs.patch (application/octet-stream, 11.6 KB)
  0001_track_root_lp_v9.patch (application/octet-stream, 37.3 KB)
  0002_warm_updates_v9.patch (application/octet-stream, 253.2 KB)

Reading 0001_track_root_lp_v9.patch again:

> +/*

> + * We use the same HEAP_LATEST_TUPLE flag to check if the tuple's t_ctid field

> + * contains the root line pointer. We can't use the same

> + * HeapTupleHeaderIsHeapLatest macro because that also checks for TID-equality

> + * to decide whether a tuple is at the end of the chain

> + */

> +#define HeapTupleHeaderHasRootOffset(tup) \

> +( \

> + ((tup)->t_infomask2 & HEAP_LATEST_TUPLE) != 0 \

> +)

>

> +#define HeapTupleHeaderGetRootOffset(tup) \

> +( \

> + AssertMacro(((tup)->t_infomask2 & HEAP_LATEST_TUPLE) != 0), \

> + ItemPointerGetOffsetNumber(&(tup)->t_ctid) \

> +) Interesting stuff; it took me a bit to see why thes