Proposal: "Causal reads" mode for load balancing reads without stale data

Hi hackers,

Many sites use hot standby servers to spread read-heavy workloads over more
hardware, or at least would like to.  This works well today if your
application can tolerate some time lag on standbys.  The problem is that
there is no guarantee of when a particular commit will become visible for
clients connected to standbys.  The existing synchronous commit feature is
no help here because it guarantees only that the WAL has been flushed on
another server before commit returns.  It says nothing about whether it has
been applied or whether it has been applied on the standby that you happen
to be talking to.

A while ago I posted a small patch[1] to allow synchronous_commit to wait
for remote apply on the current synchronous standby, but (as Simon Riggs
rightly pointed out in that thread) that part isn't the main problem.  It
seems to me that the main problem for a practical 'writer waits' system is
how to support a dynamic set of servers, gracefully tolerating failures and
timing out laggards, while also providing a strong guarantee during any
such transitions.  Now I would like to propose something to do that, and
share a proof-of-concept patch.

=== PROPOSAL ===

The working name for the proposed feature is "causal reads", because it
provides a form of "causal consistency"[2] (and "read-your-writes"
consistency) no matter which server the client is connected to.  There is a
similar feature by the same name in another product (albeit implemented
differently -- 'reader waits'; more about that later).  I'm not wedded to
the name.

The feature allows arbitrary read-only transactions to be run on any hot
standby, with a specific guarantee about the visibility of preceding
transactions.  The guarantee is that if you set a new GUC "causal_reads =
on" in any pair of consecutive transactions (tx1, tx2) where tx2 begins
after tx1 successfully returns, then tx2 will either see tx1 or fail with a
new error "standby is not available for causal reads", no matter which
server it runs on.  A discovery mechanism is also provided, giving an
instantaneous snapshot of the set of standbys that are currently available
for causal reads (ie won't raise the error), in the form of a new column in
pg_stat_replication.

For example, a web server might run tx1 to insert a new row representing a
message in a discussion forum on the primary server, and then send the user
to another web page that runs tx2 to load all messages in the forum on an
arbitrary hot standby server.  If causal_reads = on in both tx1 and tx2
(for example, because it's on globally), then tx2 is guaranteed to see the
new post, or get a (hopefully rare) error telling the client to retry on
another server.

Very briefly, the approach is:

1.  The primary tracks apply lag on each standby (including between
commits).

2.  The primary deems standbys 'available' for causal reads if they are
applying WAL and replying to keepalives fast enough, and periodically sends
the standby an authorization to consider itself available for causal reads
until a time in the near future.

3.  Commit on the primary with "causal_reads = on" waits for all
'available' standbys either to apply the commit record, or to cease to be
'available' and begin raising the error if they are still alive (because
their authorizations have expired).

4.  Standbys can start causal reads transactions only while they have an
authorization with an expiry time in the future; otherwise they raise an
error when an initial snapshot is taken.

In a follow-up email I can write about the design trade-offs considered
(mainly 'writer waits' vs 'reader waits'), comparison with some other
products, method of estimating replay lag, wait and timeout logic and how
it maintains the guarantee in various failure scenarios, logic for standbys
joining and leaving, implications of system clock skew between servers, or
any other questions you may have, depending on feedback/interest (but see
comments in the attached patch for some of those subjects).  For now I
didn't want to clog up the intertubes with too large a wall of text.

=== PROOF-OF-CONCEPT ===

Please see the POC patch attached.  It adds two new GUCs.  After setting up
one or more hot standbys as per usual, simply add "causal_reads_timeout =
4s" to the primary's postgresql.conf and restart.  Now, you can set
"causal_reads = on" in some/all sessions to get guaranteed causal
consistency.  Expected behaviour: the causal reads guarantee is maintained
at all times, even when you overwhelm, kill, crash, disconnect, restart,
pause, add and remove standbys, and the primary drops them from the set it
waits for in a timely fashion.  You can monitor the system with the
replay_lag and causal_reads_status columns in pg_stat_replication and some
state transition LOG messages on the primary.  (The patch also supports
"synchronous_commit = apply", but it's not clear how useful that is in
practice, as already discussed.)

Lastly, a few notes about how this feature relates to some other work:

The current version of this patch has causal_reads as a feature separate
from synchronous_commit, from a user's point of view.  The thinking behind
this is that load balancing and data loss avoidance are separate concerns:
synchronous_commit deals with the latter, and causal_reads with the
former.  That said, existing SyncRep machinery is obviously used
(specifically SyncRep queues, with a small modification, as a way to wait
for apply messages to arrive from standbys).  (An earlier prototype had
causal reads as a new level for synchronous_commit and associated states as
new walsender states above 'streaming'.  When contemplating how to combine
this proposal with the multiple-synchronous-standby patch, some colleagues
and I came around to the view that the concerns are separate.  The reason
for wanting to configure complicated quorum definitions is to control data
loss risks and has nothing to do with load balancing requirements, so we
thought the features should probably be separate.)

The multiple-synchronous-servers patch[3] could be applied or not
independently of this feature as a result of that separation, as it doesn't
use synchronous_standby_names or indeed any kind of statically defined
quorum.

The standby WAL writer patch[4] would significantly improve walreceiver
performance and smoothness, which would work very well with this proposal.

Please let me know what you think!

Thanks,

[1] http://www.postgresql.org/message-id/flat/CAEepm=1fqkivL4V-OTPHwSgw4aF9HcoGiMrCW-yBtjipX9gsag(at)mail(dot)gmail(dot)com

[2] From http://queue.acm.org/detail.cfm?id=1466448

"Causal consistency.  If process A has communicated to process B that it has
updated a data item, a subsequent access by process B will return the
updated value, and a write is guaranteed to supersede the earlier write.
Access by process C that has no causal relationship to process A is subject
to the normal eventual consistency rules.

Read-your-writes consistency.  This is an important model where process A,
after it has updated a data item, always accesses the updated value and
will never see an older value.  This is a special case of the causal
consistency model."

[3] http://www.postgresql.org/message-id/flat/CAOG9ApHYCPmTypAAwfD3_V7sVOkbnECFivmRc1AxhB40ZBSwNQ(at)mail(dot)gmail(dot)com

[4] http://www.postgresql.org/message-id/flat/CA+U5nMJifauXvVbx=v3UbYbHO3Jw2rdT4haL6CCooEDM5=4ASQ(at)mail(dot)gmail(dot)com

--
Thomas Munro
http://www.enterprisedb.com

Attachment: causal-reads-poc.patch (application/octet-stream, 67.5 KB)
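To make the moving parts of the four steps above concrete, here is a toy,
self-contained sketch of the protocol.  It is not code from the attached
patch; every name in it (Standby, Primary, grant_leases, read_snapshot and
the rest) is invented for illustration, and real lag tracking and blocking
commit waits are reduced to simple checks.

```python
LEASE_SECONDS = 4.0  # cf. the proposed causal_reads_timeout GUC

class Standby:
    def __init__(self, name):
        self.name = name
        self.applied_lsn = 0      # how far this standby has applied WAL
        self.lease_expiry = 0.0   # authorization granted by the primary

    def available(self, now):
        # Step 4: causal reads are allowed only under an unexpired lease.
        return now < self.lease_expiry

class Primary:
    def __init__(self, standbys):
        self.standbys = standbys
        self.end_of_wal = 0

    def grant_leases(self, now, keeping_up):
        # Step 2: authorize standbys that are keeping up, until a time
        # in the near future (steps 1's lag estimate feeds keeping_up).
        for s in self.standbys:
            if keeping_up(s):
                s.lease_expiry = now + LEASE_SECONDS

    def commit(self, now):
        # Step 3: a causal_reads commit must wait until every standby that
        # still holds a lease has applied the commit record, or its lease
        # has expired.  Here we just report who we would wait for, rather
        # than blocking.
        self.end_of_wal += 1
        lsn = self.end_of_wal
        waiting_for = [s for s in self.standbys
                       if s.available(now) and s.applied_lsn < lsn]
        return lsn, waiting_for

def read_snapshot(standby, now):
    # A causal reads transaction on a standby whose lease has expired
    # fails instead of silently returning stale data.
    if not standby.available(now):
        raise RuntimeError("standby is not available for causal reads")
    return standby.applied_lsn
```

The point of the lease is the safety interlock: the primary never releases
a causal_reads commit while any standby could still serve a causal reads
snapshot without having applied that commit.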

On 11 November 2015 at 05:37, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
wrote:

> Many sites use hot standby servers to spread read-heavy workloads over more
> hardware, or at least would like to.  This works well today if your
> application can tolerate some time lag on standbys.  The problem is that
> there is no guarantee of when a particular commit will become visible for
> clients connected to standbys.  The existing synchronous commit feature is
> no help here because it guarantees only that the WAL has been flushed on
> another server before commit returns.  It says nothing about whether it has
> been applied or whether it has been applied on the standby that you happen
> to be talking to.

Thanks for working on this issue.

> 3.  Commit on the primary with "causal_reads = on" waits for all
> 'available' standbys either to apply the commit record, or to cease to be
> 'available' and begin raising the error if they are still alive (because
> their authorizations have expired).

This causes every writer to wait.

What we want is to isolate the wait only to people performing a write-read
sequence, so I think it should be readers that wait.  Let's have that debate
up front before we start reviewing the patch.

--
Simon Riggs  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



On 11/11/2015 10:23 AM, Simon Riggs wrote:
> On 11 November 2015 at 05:37, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
> wrote:
>
>> Many sites use hot standby servers to spread read-heavy workloads over more
>> hardware, or at least would like to.  This works well today if your
>> application can tolerate some time lag on standbys.  The problem is that
>> there is no guarantee of when a particular commit will become visible for
>> clients connected to standbys.  The existing synchronous commit feature is
>> no help here because it guarantees only that the WAL has been flushed on
>> another server before commit returns.  It says nothing about whether it has
>> been applied or whether it has been applied on the standby that you happen
>> to be talking to.
>
> Thanks for working on this issue.

+1.

>> 3.  Commit on the primary with "causal_reads = on" waits for all
>> 'available' standbys either to apply the commit record, or to cease to be
>> 'available' and begin raising the error if they are still alive (because
>> their authorizations have expired).
>
> This causes every writer to wait.
>
> What we want is to isolate the wait only to people performing a write-read
> sequence, so I think it should be readers that wait.  Let's have that debate
> up front before we start reviewing the patch.

Agreed.  And in the write-read sequence, you don't need to wait at the
write either, it's enough that you wait just before you start doing the
read.  An application might do a lot of other things between the two, so
that in most cases, there would in fact be no waiting as the record is
already applied when you perform the read.

I'm thinking the client should get some kind of a token back from the
commit, and it could use the token on the standby, to wait for that
commit to be applied.  The token could be just the XID, or the LSN of the
commit record.  Or the application could generate the token and pass it
to the server in the commit, similar to how 2PC works.  So the
interaction would be something like:

In master:
BEGIN;
INSERT INTO FOO ...;
COMMIT;
Server returns: COMMITted with token 1234

Later, in standby:
BEGIN WAIT FOR COMMIT 1234 TO BE VISIBLE;
SELECT * FROM foo;
...

- Heikki
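The interaction sketch above leaves open what the token actually is.
Assuming the token is the commit record's LSN in PostgreSQL's usual textual
form (the "WAIT FOR COMMIT ... TO BE VISIBLE" syntax is only proposed here,
not real), the check the standby would have to make is just an LSN
comparison.  Helper names below are invented for illustration:

```python
# Hypothetical sketch: deciding whether a commit identified by an LSN
# token is visible on a standby, given the standby's replay position.

def parse_lsn(text):
    """Convert PostgreSQL's textual LSN form 'X/Y' to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def commit_visible(token_lsn, replay_lsn):
    # The commit is visible once the standby has replayed up to (at
    # least) the commit record's LSN.
    return parse_lsn(replay_lsn) >= parse_lsn(token_lsn)
```

With an LSN token, the wait on the standby reduces to blocking until this
predicate becomes true.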

> I'm thinking the client should get some kind of a token back from the
> commit, and it could use the token on the standby, to wait for that commit
> to be applied.  The token could be just the XID, or the LSN of the commit
> record.  Or the application could generate the token and pass it to the
> server in the commit, similar to how 2PC works.  So the interaction would be
> something like:
>
> In master:
> BEGIN;
> INSERT INTO FOO ...;
> COMMIT;
> Server returns: COMMITted with token 1234
>
> Later, in standby:
> BEGIN WAIT FOR COMMIT 1234 TO BE VISIBLE;
> SELECT * FROM foo;

+1.  The LSN should be good enough IMO.



On Wed, Nov 11, 2015 at 9:42 PM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> On 11/11/2015 10:23 AM, Simon Riggs wrote:
>> On 11 November 2015 at 05:37, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
>> wrote:
>>
>>> Many sites use hot standby servers to spread read-heavy workloads over
>>> more hardware, or at least would like to.  This works well today if your
>>> application can tolerate some time lag on standbys.  The problem is that
>>> there is no guarantee of when a particular commit will become visible for
>>> clients connected to standbys.  The existing synchronous commit feature is
>>> no help here because it guarantees only that the WAL has been flushed on
>>> another server before commit returns.  It says nothing about whether it
>>> has been applied or whether it has been applied on the standby that you
>>> happen to be talking to.
>>
>> Thanks for working on this issue.
>
> +1.
>
>>> 3.  Commit on the primary with "causal_reads = on" waits for all
>>> 'available' standbys either to apply the commit record, or to cease to be
>>> 'available' and begin raising the error if they are still alive (because
>>> their authorizations have expired).
>>
>> This causes every writer to wait.
>>
>> What we want is to isolate the wait only to people performing a write-read
>> sequence, so I think it should be readers that wait.  Let's have that
>> debate up front before we start reviewing the patch.
>
> Agreed.  And in the write-read sequence, you don't need to wait at the
> write either, it's enough that you wait just before you start doing the
> read.  An application might do a lot of other things between the two, so
> that in most cases, there would in fact be no waiting as the record is
> already applied when you perform the read.
>
> I'm thinking the client should get some kind of a token back from the
> commit, and it could use the token on the standby, to wait for that commit
> to be applied.  The token could be just the XID, or the LSN of the commit
> record.  Or the application could generate the token and pass it to the
> server in the commit, similar to how 2PC works.  So the interaction would be
> something like:
>
> In master:
> BEGIN;
> INSERT INTO FOO ...;
> COMMIT;
> Server returns: COMMITted with token 1234
>
> Later, in standby:
> BEGIN WAIT FOR COMMIT 1234 TO BE VISIBLE;
> SELECT * FROM foo;
> ...

I thought about this question, and considered three different approaches:

1.  Reader waits with exposed LSNs, as Heikki suggests.  This is what
BerkeleyDB does in "read-your-writes" mode.  It means that application
developers have the responsibility for correctly identifying transactions
with causal dependencies and dealing with LSNs (or whatever equivalent
tokens), potentially even passing them to other processes where the
transactions are causally dependent but run by multiple communicating
clients (for example, communicating microservices).  This makes it
difficult to retrofit load balancing to pre-existing applications and (like
anything involving concurrency) difficult to reason about as applications
grow in size and complexity.  It is efficient if done correctly, but it is
a tax on application complexity.

2.  Reader waits for a conservatively chosen LSN.  This is roughly what
MySQL derivatives do in their "causal_reads = on" and "wsrep_sync_wait = 1"
modes.  Read transactions would start off by finding the current end of WAL
on the primary, since that must be later than any commit that already
completed, and then waiting for that to apply locally.  That means every
read transaction waits for a complete replication lag period, potentially
unnecessarily.  This is a tax on readers with unnecessary waiting.

3.  Writer waits, as proposed.  In this model, there is no tax on readers
(they have zero overhead, aside from the added complexity of dealing with
the possibility of transactions being rejected when a standby falls behind
and is dropped from 'available' status; but database clients must already
deal with certain types of rare rejected queries/failures such as
deadlocks, serialization failures, server restarts etc).  This is a tax on
writers.

My thinking was that the reason for wanting to load balance over a set of
hot standbys is because you have a very read-heavy workload, so it makes
sense to tax the writers and leave the many dominant readers unburdened, so
(3) should be better than (2) for the majority of users who want such a
configuration.  (Note also that it's not a requirement to tax every write;
with this proposal you can set causal_reads to off for those transactions
where you know there is no possibility of a causally dependent read.)

As for (1), my thinking was that most application developers would probably
prefer not to have to deal with that type of interface.  For users who do
want to do that, it would be comparatively simple to make that possible,
and would not conflict with this proposal.  This proposal could be used by
people retrofitting load balancing to existing applications with relative
ease, or simply not wanting to have to deal with LSNs and complexity.  (I
have considered proposing pg_wait_for_xlog_replay_location(lsn, timeout)
separately, which could be called on a standby with the lsn obtained from
pg_current_xlog_location() on the primary any time after a COMMIT completes,
but I was thinking of that as a different feature addressing a different
user base: people prepared to do more work to squeeze out some extra
performance.)

--
Thomas Munro
http://www.enterprisedb.com
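The pg_wait_for_xlog_replay_location(lsn, timeout) function mentioned in
the last paragraph is only a suggestion in this thread, not an existing
function, but its effect can be approximated client-side today by polling.
A rough sketch, with get_replay_lsn standing in for whatever query the
client runs against the standby (for example, pg_last_xlog_replay_location()
on 9.x servers), and LSNs abstracted to comparable values:

```python
import time

def wait_for_replay(get_replay_lsn, target_lsn, timeout, poll_interval=0.01):
    """Poll until the standby has replayed up to target_lsn, or time out.

    Returns True if the target was reached within the timeout, else False.
    """
    deadline = time.monotonic() + timeout
    while True:
        if get_replay_lsn() >= target_lsn:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

A server-side function would avoid the polling round trips, which is part
of the appeal of building the wait into the database itself.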



On Wed, Nov 11, 2015 at 11:22 AM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> On Wed, Nov 11, 2015 at 9:42 PM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>> On 11/11/2015 10:23 AM, Simon Riggs wrote:
>>> Thanks for working on this issue.
>>
>> +1.

+1.  I have seen a lot of interest for something along these lines.

>> I'm thinking the client should get some kind of a token back from the
>> commit, and it could use the token on the standby, to wait for that
>> commit to be applied.  The token could be just the XID, or the LSN of the
>> commit record.  Or the application could generate the token and pass it
>> to the server in the commit, similar to how 2PC works.  So the
>> interaction would be something like:
>>
>> In master:
>> BEGIN;
>> INSERT INTO FOO ...;
>> COMMIT;
>> Server returns: COMMITted with token 1234
>>
>> Later, in standby:
>> BEGIN WAIT FOR COMMIT 1234 TO BE VISIBLE;
>> SELECT * FROM foo;
>> ...

To avoid read anomalies (backwards timetravel) it should also be
possible to receive a token from read-only transactions based on the
latest snapshot used.

> My thinking was that the reason for wanting to load balance over a set of
> hot standbys is because you have a very read-heavy workload, so it makes
> sense to tax the writers and leave the many dominant readers unburdened, so
> (3) should be better than (2) for the majority of users who want such a
> configuration.  (Note also that it's not a requirement to tax every write;
> with this proposal you can set causal_reads to off for those transactions
> where you know there is no possibility of a causally dependent read.)
>
> As for (1), my thinking was that most application developers would probably
> prefer not to have to deal with that type of interface.  For users who do
> want to do that, it would be comparatively simple to make that possible,
> and would not conflict with this proposal.  This proposal could be used by
> people retrofitting load balancing to existing applications with relative
> ease, or simply not wanting to have to deal with LSNs and complexity.  (I
> have considered proposing pg_wait_for_xlog_replay_location(lsn, timeout)
> separately, which could be called on a standby with the lsn obtained from
> pg_current_xlog_location() on the primary any time after a COMMIT completes,
> but I was thinking of that as a different feature addressing a different
> user base: people prepared to do more work to squeeze out some extra
> performance.)

Although I still think that 1) is the correct long term solution I
must say that I agree with the reasoning presented.  I think we should
review the API in the light that in the future we might have a mix of
clients, some clients that are able to keep track of causality tokens
and either want to wait when a read request arrives, or pick a host to
use based on the token, and then there are "dumb" clients that want to
use write side waits.

Also, it should be possible to configure which standbys are considered
for waiting on.  Otherwise a reporting slave will occasionally catch up
enough to be considered "available" and then cause a latency peak when
a long query blocks apply again.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

On Wed, Nov 11, 2015 at 3:23 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> This causes every writer to wait.
>
> What we want is to isolate the wait only to people performing a write-read
> sequence, so I think it should be readers that wait.  Let's have that debate
> up front before we start reviewing the patch.

One advantage of having writers wait is that the master and its read
slaves can't ever get too far apart.  Suppose the master is generating
WAL much faster than the read slaves (or one of them) can replay it.
You might say it sucks to slow down the master to the speed the slaves
can keep up with, and that's true.  On the other hand, if the master
is allowed to run ahead, then a process that sends a read query to a
standby which has gotten far behind might have to wait minutes or
hours for it to catch up.  I think a lot of people are enabling
synchronous replication today just for the purpose of avoiding this
problem - keeping the two machines "together in time" makes the
overall system behavior a lot more predictable.

Also, if we made readers wait, wouldn't that require a network
roundtrip to the master every time a query on a reader wanted a new
snapshot?  That seems like it would be unbearably expensive.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

On 11/11/15 4:22 AM, Thomas Munro wrote:
> My thinking was that the reason for wanting to load balance over a set
> of hot standbys is because you have a very read-heavy workload, so it
> makes sense to tax the writers and leave the many dominant readers
> unburdened, so (3) should be better than (2) for the majority of users
> who want such a configuration.

One problem I can see is that even if you have a read-heavy workload,
the writes can still be a bottleneck, since they are necessarily bound
to one node.  And so if the feature proposal is, we can make your reads
more consistent but the writes will become slower, then that's not a
good deal.

More generally, no matter whether you pick the writers or the readers to
wait, if you assume that read-only slaves are an application performance
feature, then it's questionable how much better such applications will
perform overall when network-bound waits are introduced in the system.

I think in practice applications that are busy enough to worry about
this don't really work like that anyway.  For example, the writes should
go to a message queue and are written out whenever, with a copy kept in
a cache for display in the meantime.  Maybe there could be additional
features to make managing this easier.

I think there are a lot of different variations of this in practice, not
only depending on the workload and other measurables, but also
business-dependent decisions on application behavior and degradability.

On Thu, Nov 12, 2015 at 12:10 AM, Ants Aasma <ants(dot)aasma(at)eesti(dot)ee> wrote:
> On Wed, Nov 11, 2015 at 11:22 AM, Thomas Munro
> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>> On Wed, Nov 11, 2015 at 9:42 PM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
>> wrote:
>>> On 11/11/2015 10:23 AM, Simon Riggs wrote:
>>>> Thanks for working on this issue.
>>>
>>> +1.
>
> +1.  I have seen a lot of interest for something along these lines.
>
>>> I'm thinking the client should get some kind of a token back from the
>>> commit, and it could use the token on the standby, to wait for that
>>> commit to be applied.  The token could be just the XID, or the LSN of the
>>> commit record.  Or the application could generate the token and pass it
>>> to the server in the commit, similar to how 2PC works.  So the
>>> interaction would be something like:
>>>
>>> In master:
>>> BEGIN;
>>> INSERT INTO FOO ...;
>>> COMMIT;
>>> Server returns: COMMITted with token 1234
>>>
>>> Later, in standby:
>>> BEGIN WAIT FOR COMMIT 1234 TO BE VISIBLE;
>>> SELECT * FROM foo;
>>> ...
>
> To avoid read anomalies (backwards timetravel) it should also be
> possible to receive a token from read-only transactions based on the
> latest snapshot used.
>
>> My thinking was that the reason for wanting to load balance over a set of
>> hot standbys is because you have a very read-heavy workload, so it makes
>> sense to tax the writers and leave the many dominant readers unburdened,
>> so (3) should be better than (2) for the majority of users who want such a
>> configuration.  (Note also that it's not a requirement to tax every write;
>> with this proposal you can set causal_reads to off for those transactions
>> where you know there is no possibility of a causally dependent read.)
>>
>> As for (1), my thinking was that most application developers would
>> probably prefer not to have to deal with that type of interface.  For
>> users who do want to do that, it would be comparatively simple to make
>> that possible, and would not conflict with this proposal.  This proposal
>> could be used by people retrofitting load balancing to existing
>> applications with relative ease, or simply not wanting to have to deal
>> with LSNs and complexity.  (I have considered proposing
>> pg_wait_for_xlog_replay_location(lsn, timeout) separately, which could be
>> called on a standby with the lsn obtained from pg_current_xlog_location()
>> on the primary any time after a COMMIT completes, but I was thinking of
>> that as a different feature addressing a different user base: people
>> prepared to do more work to squeeze out some extra performance.)
>
> Although I still think that 1) is the correct long term solution I
> must say that I agree with the reasoning presented.  I think we should
> review the API in the light that in the future we might have a mix of
> clients, some clients that are able to keep track of causality tokens
> and either want to wait when a read request arrives, or pick a host to
> use based on the token, and then there are "dumb" clients that want to
> use write side waits.

Exactly!  I see the causality tokens approach (thank you for that
terminology) not so much as a "long term" solution, but rather as an expert
feature likely to interest a small number of sophisticated users willing to
take on more responsibility in exchange for greater control.  We should
definitely add support for that, and I expect the patch would be fairly
simple and short.

But I believe the vast majority of users would like to be able to run new
and existing plain SQL on any node and see the data they just wrote, with
graceful failure modes, and without extra conceptual load or invasive code
changes.  So I think we should cater for that mode of usage too.

> Also, it should be possible to configure which standbys are considered
> for waiting on.  Otherwise a reporting slave will occasionally catch up
> enough to be considered "available" and then cause a latency peak when
> a long query blocks apply again.

Good point.  Here's a new version which adds the GUC
causal_reads_standby_names, defaulting to '*' (but as before, the feature
is not activated until you set causal_reads_timeout).  Now you can list
standby names explicitly if you want a way to exclude certain standbys.
Also, I noticed that cascaded standbys shouldn't be available for causal
reads, so I added a check for that.

--
Thomas Munro
http://www.enterprisedb.com

Attachment: causal-reads-v2.patch (application/octet-stream, 70.4 KB)
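For instance, with the v2 patch's GUCs as described above, excluding a
reporting standby from the causal reads set might look like this in the
primary's postgresql.conf.  The standby names are invented, and the exact
list syntax is an assumption based on the description (by analogy with
synchronous_standby_names):

```ini
# Activate causal reads; standbys lagging more than this are dropped
# from the 'available' set (value from the example in the proposal).
causal_reads_timeout = '4s'

# Only wait for the web-facing standbys; the reporting standby can lag
# behind a long analytics query without stalling causal_reads commits.
causal_reads_standby_names = 'web1, web2'
```

A standby left off the list would simply never be authorized for causal
reads, so causal_reads transactions routed to it would raise the error.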

On 11 November 2015 at 09:22, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
wrote:

> 1.  Reader waits with exposed LSNs, as Heikki suggests.  This is what
> BerkeleyDB does in "read-your-writes" mode.  It means that application
> developers have the responsibility for correctly identifying transactions
> with causal dependencies and dealing with LSNs (or whatever equivalent
> tokens), potentially even passing them to other processes where the
> transactions are causally dependent but run by multiple communicating
> clients (for example, communicating microservices).  This makes it
> difficult to retrofit load balancing to pre-existing applications and (like
> anything involving concurrency) difficult to reason about as applications
> grow in size and complexity.  It is efficient if done correctly, but it is
> a tax on application complexity.

Agreed.  This works if you have a single transaction connected thru a pool
that does statement-level load balancing, so it works in both session and
transaction mode.

I was in favour of a scheme like this myself, earlier, but have more
thoughts now.

We must also consider the need for serialization across sessions or
transactions.

In transaction pooling mode, an application could get assigned a different
session, so a token would be much harder to pass around.

> 2.  Reader waits for a conservatively chosen LSN.  This is roughly what
> MySQL derivatives do in their "causal_reads = on" and "wsrep_sync_wait = 1"
> modes.  Read transactions would start off by finding the current end
> of WAL on the primary, since that must be later than any commit that
> already completed, and then waiting for that to apply locally.  That means
> every read transaction waits for a complete replication lag period,
> potentially unnecessarily.  This is a tax on readers with unnecessary
> waiting.

This tries to make it easier for users by forcing all users to experience a
causality delay.  Given the whole purpose of multi-node load balancing is
performance, referencing the master again simply defeats any performance
gain, so you couldn't ever use it for all sessions.  It could be a USERSET
parameter, so could be turned off in most cases that didn't need it.  But
it's easier to use than (1).  Though this should be implemented in the
pooler.

> 3.  Writer waits, as proposed.  In this model, there is no tax on readers
> (they have zero overhead, aside from the added complexity of dealing with
> the possibility of transactions being rejected when a standby falls behind
> and is dropped from 'available' status; but database clients must already
> deal with certain types of rare rejected queries/failures such as
> deadlocks, serialization failures, server restarts etc).  This is a tax on
> writers.

This would seem to require that all readers must first check with the
master as to which standbys are now considered available, so it looks like
(2).

The alternative is that we simply send readers to any standby and allow the
pool to work out separately whether the standby is still available, which
mostly works, but it doesn't handle sporadic slowdowns on particular
standbys very well (if at all).

I think we need to look at whether this does actually give us anything, or
whether we are missing the underlying Heisenberg reality.  More later.

--
Simon Riggs  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



On Fri, Nov 13, 2015 at 1:16 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote: > On 11 November 2015 at 09:22, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>

> wrote:

>

>

>> 1. Reader waits with exposed LSNs, as Heikki suggests. This is what

>> BerkeleyDB does in "read-your-writes" mode. It means that application

>> developers have the responsibility for correctly identifying transactions

>> with causal dependencies and dealing with LSNs (or whatever equivalent

>> tokens), potentially even passing them to other processes where the

>> transactions are causally dependent but run by multiple communicating

>> clients (for example, communicating microservices). This makes it

>> difficult to retrofit load balancing to pre-existing applications and (like

>> anything involving concurrency) difficult to reason about as applications

>> grow in size and complexity. It is efficient if done correctly, but it is

>> a tax on application complexity.

>>

>

> Agreed. This works if you have a single transaction connected thru a pool

> that does statement-level load balancing, so it works in both session and

> transaction mode.

>

> I was in favour of a scheme like this myself, earlier, but have more

> thoughts now.

>

> We must also consider the need for serialization across sessions or

> transactions.

>

> In transaction pooling mode, an application could get assigned a different

> session, so a token would be much harder to pass around.
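[To make the token hand-off concrete, here is a toy simulation of option (1) -- illustrative Python with made-up names, not code from the patch. The writer captures the commit LSN as a causality token, and the reader must carry it to the standby and wait for replay to pass it; that token is exactly the baggage that is hard to pass around through a transaction pooler.]

```python
import threading
import time

class Replica:
    """A stand-in for a hot standby that applies WAL asynchronously."""
    def __init__(self):
        self.replay_lsn = 0
        self._cond = threading.Condition()

    def apply_up_to(self, lsn):
        with self._cond:
            self.replay_lsn = max(self.replay_lsn, lsn)
            self._cond.notify_all()

    def wait_for_lsn(self, token, timeout):
        """Option (1): the *reader* blocks until its causality token is applied."""
        deadline = time.monotonic() + timeout
        with self._cond:
            while self.replay_lsn < token:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return False
                self._cond.wait(remaining)
            return True

primary_lsn = 0
def commit_on_primary():
    """Commit on the primary and hand back the LSN as a causality token."""
    global primary_lsn
    primary_lsn += 1
    return primary_lsn

replica = Replica()
token = commit_on_primary()  # tx1: the client must now carry this token
threading.Timer(0.05, replica.apply_up_to, args=[token]).start()  # lagging apply
assert replica.wait_for_lsn(token, timeout=2.0)  # tx2 on the standby sees tx1
```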

>

>> 2. Reader waits for a conservatively chosen LSN. This is roughly what

>> MySQL derivatives do in their "causal_reads = on" and "wsrep_sync_wait =

>> 1" modes. Read transactions would start off by finding the current end

>> of WAL on the primary, since that must be later than any commit that

>> already completed, and then waiting for that to apply locally. That means

>> every read transaction waits for a complete replication lag period,

>> potentially unnecessarily. This is a tax on readers with unnecessary waiting.

>>

>

> This tries to make it easier for users by forcing all users to experience

> a causality delay. Given the whole purpose of multi-node load balancing is

> performance, referencing the master again simply defeats any performance

> gain, so you couldn't ever use it for all sessions. It could be a USERSET

> parameter, so could be turned off in most cases that didn't need it. But

> it's easier to use than (1).

>

> Though this should be implemented in the pooler.

>

> 3. Writer waits, as proposed. In this model, there is no tax on readers

>> (they have zero overhead, aside from the added complexity of dealing with

>> the possibility of transactions being rejected when a standby falls behind

>> and is dropped from 'available' status; but database clients must already

>> deal with certain types of rare rejected queries/failures such as

>> deadlocks, serialization failures, server restarts etc). This is a tax on

>> writers.

>>

>

> This would seem to require that all readers must first check with the

> master as to which standbys are now considered available, so it looks like

> (2).

No -- in (3), that is, this proposal, standbys don't check with the primary

when you run a transaction. Instead, the primary sends a constant stream

of authorizations (in the form of keepalives sent every

causal_reads_timeout / 2 in the current patch) to the standby, allowing it

to consider itself available for a short time into the future (currently

now + causal_reads_timeout - max_tolerable_clock_skew to be specific -- I

can elaborate on that logic in a separate email). At the start of a

transaction in causal reads mode (the first call to GetTransaction to be

specific), the standby knows immediately without communicating with the

primary whether it can proceed or must raise the error. In the happy case,

the reader simply compares the most recently received authorization's

expiry time with the system clock and proceeds. In the worst case, when

contact is lost between primary and standby, the primary must stall

causal_reads commits for causal_reads_timeout (see CausalReadsBeginStall).

Doing that makes sure that no causal reads commit can return (see

CausalReadsCommitCanReturn) before the lost standby has definitely started

raising the error for causal_reads queries (because its most recent

authorization has expired), in case it is still alive and handling requests

from clients.

It is not at all like (2), which introduces a conservative wait at the
start of every read transaction, slowing all readers down. In (3), readers
don't wait, they run (or are rejected) as fast as possible, but instead the
primary has to do extra things. Hence my categorization of (2) as a 'tax
on readers', and of (3) as a 'tax on writers'. The idea is that a site
with a high ratio of reads to writes would prefer zero-overhead reads.

> The alternative is that we simply send readers to any standby and allow

> the pool to work out separately whether the standby is still available,

> which mostly works, but it doesn't handle sporadic slow downs on particular

> standbys very well (if at all).
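[As an illustration, "working it out separately" might look something like the following toy router -- a hypothetical sketch, not any real pooler's API. It drops a replica that raises an out-of-sync error and re-admits it once a trivial poll succeeds; note that it only reacts to explicit errors, so without a signal from the server it has no way to notice a merely slow standby.]

```python
class OutOfSync(Exception):
    """Stand-in for the error a lagging causal-reads standby would raise."""

class ToyRouter:
    """Drop replicas that raise the error; poll them back in later."""
    def __init__(self, replicas):
        self.active = list(replicas)
        self.quarantined = []

    def route(self, query):
        for replica in list(self.active):
            try:
                return replica(query)
            except OutOfSync:
                # Stop routing to this replica until the error clears.
                self.active.remove(replica)
                self.quarantined.append(replica)
        raise RuntimeError("no available replica")

    def poll(self):
        # Re-admit quarantined replicas that answer a trivial query again.
        for replica in list(self.quarantined):
            try:
                replica("SELECT 1")
            except OutOfSync:
                continue
            self.quarantined.remove(replica)
            self.active.append(replica)

state = {"in_sync": False}
def laggard(query):
    if not state["in_sync"]:
        raise OutOfSync()
    return "ok"
healthy = lambda query: "ok"

router = ToyRouter([laggard, healthy])
assert router.route("SELECT ...") == "ok"  # laggard gets dropped on the way
assert router.active == [healthy]
state["in_sync"] = True
router.poll()                              # laggard rejoins once it catches up
assert laggard in router.active
```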

This proposal does handle sporadic slowdowns on standbys: it drops them
from the set of available standbys if they don't apply fast enough, all the
while maintaining the guarantee. Though it occurs to me that it probably
needs some kind of defence against too much flapping between available and
unavailable (maybe some kind of back-off on the 'joining' phase that
standbys go through when they transition from unavailable to available in
the current patch, which I realize I haven't described yet -- but I don't
want to get bogged down in details, while we're talking about the 30,000
foot view).

> I think we need to look at whether this does actually give us anything, or

> whether we are missing the underlying Heisenberg reality.

--

Thomas Munro

http://www.enterprisedb.com



On Fri, Nov 13, 2015 at 1:16 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> On 11 November 2015 at 09:22, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>

> wrote:

>

>

>> 1. Reader waits with exposed LSNs, as Heikki suggests. This is what

>> BerkeleyDB does in "read-your-writes" mode. It means that application

>> developers have the responsibility for correctly identifying transactions

>> with causal dependencies and dealing with LSNs (or whatever equivalent

>> tokens), potentially even passing them to other processes where the

>> transactions are causally dependent but run by multiple communicating

>> clients (for example, communicating microservices). This makes it

>> difficult to retrofit load balancing to pre-existing applications and (like

>> anything involving concurrency) difficult to reason about as applications

>> grow in size and complexity. It is efficient if done correctly, but it is

>> a tax on application complexity.

>>

>

> Agreed. This works if you have a single transaction connected thru a pool

> that does statement-level load balancing, so it works in both session and

> transaction mode.

>

> I was in favour of a scheme like this myself, earlier, but have more

> thoughts now.

>

> We must also consider the need for serialization across sessions or

> transactions.

>

> In transaction pooling mode, an application could get assigned a different

> session, so a token would be much harder to pass around.

Sorry for the double reply, I just wanted to add a couple more thoughts.

As discussed elsewhere in the thread, while I think it makes absolute sense to
offer some kind of support for causality tokens, I don't see that on its
own as enough for most users. (At the least, it would be good to have

pg_wait_for_xlog_replay_location(lsn, timeout), but perhaps explicit BEGIN

syntax as suggested by Heikki, or a new field in the libpq protocol which

can be attached to any statement, and likewise for the commit LSN of

results).

It's true that a pooling system/middleware could spy on your sessions and

insert causality token handling imposing a global ordering of visibility

for you, so that naive users don't have to deal with them. Whenever it

sees a COMMIT result (assuming they are taught to return LSNs), it could

update a highest-LSN-seen variable, and transparently insert a wait for

that LSN into every transaction that it sees beginning. But then you would

have to push all your queries through a single point that can see

everything across all Postgres servers, and maintain this global high LSN.

In contrast, my writer-waits proposal makes different trade-offs to provide

causal reads as a built-in feature without an external single point

observer of all transactions.

--

Thomas Munro

http://www.enterprisedb.com



On 12 November 2015 at 18:25, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>

wrote:

> I don't want to get bogged down in details, while we're talking about the
> 30,000 foot view).

Hmm, if that's where we're at, I'll summarize my thoughts.

All of this discussion presupposes we are distributing/load balancing

queries so that reads and writes might occur on different nodes.

We need a good balancer. Any discussion of this that ignores the balancer

component is only talking about half the solution. What we need to do is

decide whether functionality should live in the balancer or the core.

Your option (1) is viable, but only in certain cases. We could add support

for some token/wait mechanism but as you say, this would require

application changes not pooler changes.

Your option (2) is wider but also worse in some ways. It can be implemented

in a pooler.

Your option (3) doesn't excite me much. You've got a load of stuff that

really should happen in a pooler. And at its core we have

synchronous_commit = apply but with a timeout rather than a wait. So

anyway, consider me nudged to finish my patch to provide capability for

that by 1 Jan.

On a related note, any further things like "GUC causal_reads_standby_names"

should be implemented by Node Registry as a named group of nodes. We can

have as many arbitrary groups of nodes as we want. If that sounds strange

look back at exactly why GUCs are called GUCs.

--

Simon Riggs http://www.2ndQuadrant.com/


PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



On Sun, Nov 15, 2015 at 5:41 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> Hmm, if that's where we're at, I'll summarize my thoughts.

>

> All of this discussion presupposes we are distributing/load balancing

> queries so that reads and writes might occur on different nodes.

Agreed. I think that's a pretty common pattern, though certainly not

the only one.

> We need a good balancer. Any discussion of this that ignores the balancer

> component is only talking about half the solution. What we need to do is

> decide whether functionality should live in the balancer or the core.

I'm all in favor of having a load-balancer in core, but that seems

completely unrelated to the patch at hand.

> Your option (1) is viable, but only in certain cases. We could add support

> for some token/wait mechanism but as you say, this would require application

> changes not pooler changes.

Agreed.

> Your option (2) is wider but also worse in some ways. It can be implemented

> in a pooler.

>

> Your option (3) doesn't excite me much. You've got a load of stuff that

> really should happen in a pooler. And at its core we have synchronous_commit

> = apply but with a timeout rather than a wait. So anyway, consider me nudged

> to finish my patch to provide capability for that by 1 Jan.

I don't see how either option (2) or option (3) could be implemented

in a pooler. How would that work?

To be frank, it's starting to seem to me like you are just trying to

block this patch so you can have time to develop your own version

instead. I hope that's not the case, because it would be quite

unfair. When Thomas originally posted the patch, you complained that

"This causes every writer to wait. What we want is to isolate the wait

only to people performing a write-read sequence, so I think it should

be readers that wait. Let's have that debate up front before we start

reviewing the patch." Now, you seem to be saying that's OK, because

you want to post a patch to do exactly the same thing under the name

synchronous_commit=apply, but you want it to be your own patch,

leaving out the other stuff that Thomas has put into this one.

That could be the right thing to do, but how about we discuss it a

bit? The timeout stuff that Thomas has added here is really useful.

Without that, if a machine goes down, we wait forever. That's the

right thing to do if we're replicating to make sure transactions can

never be lost, but it's a bad idea if we're replicating for load

balancing. In the load balancing case, you want to drop sync slaves

quickly to ensure the cluster remains available, and you need them to

know they are out of sync so that the load balancer doesn't get

confused. That's exactly what is implemented here.

If you have an idea for a simpler implementation, great, but I think

we need something. I don't see how it's going to work to make it

entirely the pooler's job to figure out whether the cluster is in sync

- it needs a push from the core server. Here, that push is easy to

find: if a particular replica starts returning the "i'm out of sync"

error when you query it, then stop routing queries to that replica

until the error clears (which the pooler can find out by polling it

with some trivial query). That's a great deal more useful than

synchronous_commit=apply without such a feature.

> On a related note, any further things like "GUC causal_reads_standby_names"

> should be implemented by Node Registry as a named group of nodes. We can

> have as many arbitrary groups of nodes as we want. If that sounds strange

> look back at exactly why GUCs are called GUCs. I think a node registry is a good idea, and my impression from the

session in Vienna is quite a few other hackers do, too. But I also

don't think it's remotely reasonable to make that a precondition for

accepting this patch. --

Robert Haas

EnterpriseDB: http://www.enterprisedb.com

The Enterprise PostgreSQL Company

On Sun, Nov 15, 2015 at 11:41 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> On 12 November 2015 at 18:25, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>

> wrote:

>

>

>> I don't want to get bogged down in details, while we're talking about

>> the 30,000 foot view).

>>

>

> Hmm, if that's where we're at, I'll summarize my thoughts.

>

> All of this discussion presupposes we are distributing/load balancing

> queries so that reads and writes might occur on different nodes.

>

> We need a good balancer. Any discussion of this that ignores the balancer

> component is only talking about half the solution. What we need to do is

> decide whether functionality should live in the balancer or the core.

>

> Your option (1) is viable, but only in certain cases. We could add support

> for some token/wait mechanism but as you say, this would require

> application changes not pooler changes.

>

> Your option (2) is wider but also worse in some ways. It can be

> implemented in a pooler.

>

> Your option (3) doesn't excite me much. You've got a load of stuff that

> really should happen in a pooler. And at its core we have

> synchronous_commit = apply but with a timeout rather than a wait. So

> anyway, consider me nudged to finish my patch to provide capability for

> that by 1 Jan.

> Just to be clear, this patch doesn't use a "timeout rather than a wait".

It always waits for the current set of available causal reads standbys to

apply the commit. It's just that nodes get kicked out of that set pretty

soon if they don't keep up, a bit like a RAID controller dropping a failing

disk. And it does so using a protocol that ensures that the dropped

standby starts raising the error, even if contact has been lost with it, so

the causal reads guarantee is maintained at all times for all clients.

> On a related note, any further things like "GUC causal_reads_standby_names"

> should be implemented by Node Registry as a named group of nodes. We can

> have as many arbitrary groups of nodes as we want. If that sounds strange

> look back at exactly why GUCs are called GUCs.

> Agreed, the application_name whitelist stuff is clunky. I left it out of

the first version I posted, not wanting the focus of this proposal to be

side-tracked. But as Ants Aasma pointed out, some users might need

something like that, so I posted a 2nd version that follows the established

example, again not wanting to distract with anything new in that area. Of

course that would eventually be replaced/improved as part of a future node

topology management project.

--

Thomas Munro

http://www.enterprisedb.com



On 15 November 2015 at 14:50, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Sun, Nov 15, 2015 at 5:41 AM, Simon Riggs <simon(at)2ndquadrant(dot)com>

> wrote:

> > Hmm, if that's where we're at, I'll summarize my thoughts.

> >

> > All of this discussion presupposes we are distributing/load balancing

> > queries so that reads and writes might occur on different nodes.

>

> Agreed. I think that's a pretty common pattern, though certainly not

> the only one.

It looks to me this functionality is only of use in a pooler. Please

explain how else this would be used.

> > Your option (2) is wider but also worse in some ways. It can be

> implemented

> > in a pooler.

> >

> > Your option (3) doesn't excite me much. You've got a load of stuff that

> > really should happen in a pooler. And at its core we have

> synchronous_commit

> > = apply but with a timeout rather than a wait.

>

> I don't see how either option (2) or option (3) could be implemented

> in a pooler. How would that work?

My starting thought was that (1) was the only way forwards. Through
discussion, I now see that it's not the best solution for the general case.

The pooler knows which statements are reads and writes, it also knows about

transaction boundaries, so it is possible for it to perform the waits for

either (2) or (3). The pooler *needs* to know which nodes it can route

queries to, so it looks to me that the pooler is the best place to put

waits and track status of nodes, no matter when we wait. I don't see any

benefit in having other nodes keep track of node status since that will

just replicate work that *must* be performed in the pooler.

I would like to see a load balancing pooler in Postgres.

--

Simon Riggs http://www.2ndQuadrant.com/


PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



On 16 November 2015 at 18:44, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> The pooler knows which statements are reads and writes

I think that's an iffy assumption. It's one we tend to make because
otherwise read/write pooling won't work, but in PostgreSQL there's really
no way to know when calling a function. What does

    SELECT get_user_stats()

do? The pooler has _no_ _idea_ unless manually configured with knowledge
about particular user defined functions. In the absence of such knowledge
it can:

- send the work to a replica and report the ERROR to the user if it fails

due to an attempted write;

- send the work to a replica, capture an ERROR due to attempted write, and

retry on the master;

- send everything it's not sure about to the master

If a pooler had insight into the catalogs and if we had readonly /

readwrite attributes on functions, it could be smarter.

> I would like to see a load balancing pooler in Postgres.

Given the number of times I say "no, no, don't raise max_connections to

2000 to solve your performance problems, lower it to around 100 and put

pgbouncer in front if your application doesn't support connection pooling

internally" .... yes!

--

Craig Ringer http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Training & Services



On 16 November 2015 at 11:01, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:

> On 16 November 2015 at 18:44, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

>

>>

>> The pooler knows which statements are reads and writes

>>

>

> I think that's an iffy assumption.

> It's not an assumption, its a requirement. If it can't do this in some

manner then you can't use a load balancing pooler.

Randomly submitting things works as well, since it leads to a write error

when you try to write data on a read only server, so you do then learn

whether it is a read or a write. Once you know it's a write, you submit to
master. But you still need to be careful of other effects, so that isn't
recommended.

> It's one we tend to make because otherwise read/write pooling won't work,

> but in PostgreSQL there's really no way to know when calling a function.

>

> What does

>

> SELECT get_user_stats()

>

> do? The pooler has _no_ _idea_ unless manually configured with knowledge

> about particular user defined functions.

>

> In the absence of such knowledge it can:

>

> - send the work to a replica and report the ERROR to the user if it fails

> due to an attempted write;

> - send the work to a replica, capture an ERROR due to attempted write, and

> retry on the master;

> - send everything it's not sure about to the master

>

> If a pooler had insight into the catalogs and if we had readonly /

> readwrite attributes on functions, it could be smarter.
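[The fallbacks quoted above can be sketched as a tiny routing policy -- all names hypothetical, and real poolers use explicitly configured function lists rather than inferring anything. The classifier treats a SELECT calling a non-whitelisted function as "unknown", tries the replica, and retries on the master if the replica rejects an attempted write.]

```python
import re

class ReadOnlyViolation(Exception):
    """Stand-in for: ERROR, cannot execute ... in a read-only transaction."""

READONLY_WHITELIST = {"now", "version"}  # functions known not to write

def classify(stmt):
    """Crude read/write classifier of the kind a pooler is forced to rely on."""
    s = stmt.strip().lower()
    if not s.startswith("select"):
        return "write"
    # A SELECT calling a non-whitelisted function might still write.
    for fn in re.findall(r"([a-z_][a-z0-9_]*)\s*\(", s):
        if fn not in READONLY_WHITELIST:
            return "unknown"
    return "read"

def route(stmt, replica, master):
    """Try the replica for anything not clearly a write; on an attempted
    write there, retry on the master (the second fallback above)."""
    if classify(stmt) == "write":
        return master(stmt)
    try:
        return replica(stmt)
    except ReadOnlyViolation:
        return master(stmt)

def master(stmt):
    return "ran on master"
def replica(stmt):
    if stmt == "SELECT log_pageview()":  # secretly writes
        raise ReadOnlyViolation()
    return "ran on replica"

assert classify("UPDATE t SET x = 1") == "write"
assert classify("SELECT now()") == "read"
assert classify("SELECT get_user_stats()") == "unknown"
assert route("SELECT get_user_stats()", replica, master) == "ran on replica"
assert route("SELECT log_pageview()", replica, master) == "ran on master"
```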

pgpool supports white/black function listing for exactly this reason.

--

Simon Riggs http://www.2ndQuadrant.com/


PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



On 15 November 2015 at 10:41, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> So anyway, consider me nudged to finish my patch to provide capability

> for that by 1 Jan.

> My earlier patch aimed to allow WALReceiver to wait on both a latch and a

socket as well as allow WALWriter to be active, so that WALReceiver could

reply more quickly and handle greater workload. As I explained previously

when we discussed that in recent posts, it is necessary infrastructure to

have anybody wait on anything higher than remote-fsync. I aim to complete

that by 1 Jan.

--

Simon Riggs http://www.2ndQuadrant.com/


PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



On Tue, Nov 17, 2015 at 12:44 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> On 15 November 2015 at 10:41, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

>

>

>> So anyway, consider me nudged to finish my patch to provide capability

>> for that by 1 Jan.

>>

>

> My earlier patch aimed to allow WALReceiver to wait on both a latch and a

> socket as well as allow WALWriter to be active, so that WALReceiver could

> reply more quickly and handle greater workload. As I explained previously

> when we discussed that in recent posts, it is necessary infrastructure to

> have anybody wait on anything higher than remote-fsync. I aim to complete

> that by 1 Jan.

> Right, handing write/fsync work off to WALWriter in standbys makes a lot of

sense for any kind of writer-waits system, so that WALReceiver doesn't

spend time in long syscalls which wouldn't play nicely with signals

(whether from 'kill' or SetLatch) and can deal with network IO with the

lowest possible latency. I would like to help test/review that, if that

could be useful.

The SIGUSR1 code in the WalReceiverMain and WalRecvWakeup in this patch

works well enough for now for proof-of-concept purposes until then. --

Thomas Munro

http://www.enterprisedb.com



On Mon, Nov 16, 2015 at 5:44 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> On 15 November 2015 at 14:50, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

>> On Sun, Nov 15, 2015 at 5:41 AM, Simon Riggs <simon(at)2ndquadrant(dot)com>

>> wrote:

>> > Hmm, if that's where we're at, I'll summarize my thoughts.

>> >

>> > All of this discussion presupposes we are distributing/load balancing

>> > queries so that reads and writes might occur on different nodes.

>>

>> Agreed. I think that's a pretty common pattern, though certainly not

>> the only one.

> It looks to me this functionality is only of use in a pooler. Please explain

> how else this would be used.

I think you're right. I mean, you could have the pooling built into

the application, but some place connections have to be being farmed

out to different nodes, or there's no point to using this feature.

Some people may not want to use this feature, but those who do are

using some form of pooling at some level.

>> > Your option (2) is wider but also worse in some ways. It can be

>> > implemented

>> > in a pooler.

>> >

>> > Your option (3) doesn't excite me much. You've got a load of stuff that

>> > really should happen in a pooler. And at its core we have

>> > synchronous_commit

>> > = apply but with a timeout rather than a wait.

>>

>> I don't see how either option (2) or option (3) could be implemented

>> in a pooler. How would that work?

>

> My starting thought was that (1) was the only way forwards. Through

> discussion, I now see that its not the best solution for the general case.

>

> The pooler knows which statements are reads and writes, it also knows about

> transaction boundaries, so it is possible for it to perform the waits for

> either (2) or (3).

As Craig says, it may not: pgbouncer, for example, won't. pgpool

will, except when it's wrong because some apparently read-only

function is actually writing data. But even if the pooler does know,

that isn't enough for it to perform the waits for (2) or (3) without

some support from the server. If it wants to wait for a particular
transaction to be applied on the standby, it needs to know how long to
wait, and without some support from the server, it has no way of

knowing. Now that could be done by doing (1) and then having the

pooler perform the waits, but now every pooler has to be taught how to

do that. pgpool needs to know, pgbouncer needs to know, every

JDBC-based connection pooler needs to know. Uggh. Thomas's solution

is nice because it works with any pooler.

The other point I'd make, which I think may be what you said before

but I think is worth making very explicit, is that (1) supposes that

we know which reads are dependent on which previous writes. In the

real world, that's probably frequently untrue. If someone does SELECT

sum(salary) FROM emp on the standby, there's no particular write to

the emp table that they want to wait for: they want to wait for ALL

such writes previously acknowledged as committed. Now, even when the

dependent writes can be identified, it may be convenient to just wait

for all of them instead of a particular subset that we know are

related. But I bet there will be many cases where identification is

impractical or impossible, and thus I suspect (1) won't be very

useful.

I think (2) and (3) both have good prospects for being useful, but I

suspect that the performance consequences of (3), which is what Thomas

actually implemented, although possibly severe, are still likely to be

only a fraction of the cost of (2). Having to potentially wait every

time a standby takes a snapshot just sounds awful to me.

> I would like to see a load balancing pooler in Postgres.

Me, too, but I don't expect that to be possible in the near future,

and I think this is awfully useful until then.

Robert Haas

EnterpriseDB: http://www.enterprisedb.com

The Enterprise PostgreSQL Company

On 11/12/15 1:11 PM, Thomas Munro wrote:

> It's true that a pooling system/middleware could spy on your sessions

> and insert causality token handling imposing a global ordering of

> visibility for you, so that naive users don't have to deal with them.

> Whenever it sees a COMMIT result (assuming they are taught to return

> LSNs), it could update a highest-LSN-seen variable, and transparently

> insert a wait for that LSN into every transaction that it sees

> beginning. But then you would have to push all your queries through a

> single point that can see everything across all Postgres servers, and

> maintain this global high LSN.

I think that depends on what you're doing. Frequently you don't care

about anyone else's writes, just your own. In that case, there's no need
for a shared connection pooler, you just have to come back to the same one.

There's also a 4th option: until a commit has made it out to some number

of slaves, re-direct all reads from a session back to the master. That

might sound horrible for master performance, but in reality I think it'd

normally be fine. Generally, you only care about this when you're going

to read data that you've just written, which means the data's still in

shared buffers.
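[That 4th option can be sketched as a per-session routing rule -- a toy model with made-up names, not code from the patch: after a commit, the session's reads go back to the master until enough standbys have applied that commit. The only per-session state needed is the LSN of the session's own last commit.]

```python
class Session:
    """After this session commits, its reads go back to the master until
    enough standbys have applied that commit."""
    def __init__(self, cluster, required_applied=1):
        self.cluster = cluster
        self.required = required_applied
        self.last_commit_lsn = 0

    def commit(self):
        self.cluster["commit_lsn"] += 1
        self.last_commit_lsn = self.cluster["commit_lsn"]

    def route_read(self):
        applied = sum(1 for lsn in self.cluster["standby_apply_lsns"]
                      if lsn >= self.last_commit_lsn)
        return "standby" if applied >= self.required else "master"

cluster = {"commit_lsn": 0, "standby_apply_lsns": [0, 0]}
s = Session(cluster)
assert s.route_read() == "standby"  # no writes of our own yet
s.commit()
assert s.route_read() == "master"   # our commit not applied anywhere yet
cluster["standby_apply_lsns"][0] = cluster["commit_lsn"]
assert s.route_read() == "standby"  # one standby has caught up
```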

--

Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX

Experts in Analytics, Data Architecture and PostgreSQL

Data in Trouble? Get it in Treble! http://BlueTreble.com

Hi,

Here is a new version of the patch with a few small improvements:

1. Adopted the term '[read] lease', replacing various hand-wavy language

in the comments and code. That seems to be the established term for this

approach[1].

2. Reduced the stalling time on failure. When things go wrong with a

standby (such as losing contact with it), instead of stalling for a

conservative amount of time longer than any lease that might have been

granted, the primary now stalls only until the expiry of the last lease

that actually was granted to a given dropped standby, which should be

sooner.

3. Fixed a couple of bugs that showed up in testing and review (some bad

flow control in the signal handling, and a bug in a circular buffer), and

changed the recovery->walreceiver wakeup signal handling to block the

signal except while waiting in walrcv_receive (it didn't seem a good idea

to interrupt arbitrary syscalls in walreceiver so I thought that would be a

improvement; but of course that area's going to be reworked by Simon's

patch anyway, as discussed elsewhere). Restating the central idea using the new terminology: So long as they are

replaying fast enough, the primary grants a series of causal reads leases

to standbys allowing them to handle causal reads queries locally without

any inter-node communication for a limited time. Leases are promises that

the primary will wait for the standby to apply commit records OR be dropped

from the set of available causal reads standbys and know that it has been

dropped, before the primary returns from commit, in order to uphold the

causal reads guarantee. In the worst case it can do that by waiting for

the most recently granted lease to expire.

I've also attached a couple of things which might be useful when trying the

patch out: test-causal-reads.c which can be used to test performance and

causality under various conditions, and test-causal-reads.sh which can be

used to bring up a primary and a bunch of local hot standbys to talk to.

(In the hope of encouraging people to take the patch for a spin...)

[1] Originally from a well-known 1989 paper on caching, but in the context

of databases and synchronous replication see for example the recent papers

on "Niobe" and "Paxos Quorum Leases" (especially the reference to Google

Megastore). Of course a *lot* more is going on in those very different

algorithms, but at some level "read leases" are being used to allow

local-node-only reads for a limited time while upholding some kind of

global consistency guarantee, in some of those consensus database systems.
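[For what it's worth, the lease arithmetic can be restated in a few lines. The names follow the GUCs discussed upthread, with hypothetical values; the formulas are my paraphrase of the behaviour described above, not code from the patch.]

```python
# Hypothetical parameter values, in seconds.
causal_reads_timeout = 4.0
max_tolerable_clock_skew = 0.5
keepalive_interval = causal_reads_timeout / 2  # lease-granting keepalives

def lease_expiry(grant_time):
    # How far into the future one keepalive authorizes causal-reads queries.
    return grant_time + causal_reads_timeout - max_tolerable_clock_skew

def standby_available(now, last_grant_time):
    # The standby's local check: no round trip to the primary needed.
    return now < lease_expiry(last_grant_time)

def primary_stall_until(last_grant_time):
    # On lost contact, no causal-reads commit may return before the dropped
    # standby has certainly begun raising the error.
    return last_grant_time + causal_reads_timeout

grant = 100.0
assert standby_available(101.0, grant)      # well inside the lease
assert not standby_available(103.6, grant)  # lease expired at 103.5
# The stall outlasts the lease even if the standby's clock is fast:
assert primary_stall_until(grant) >= lease_expiry(grant) + max_tolerable_clock_skew
```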

I spent a bit of time talking about consistency levels to database guru and

former colleague Alex Scotti who works on a Paxos-based system, and he gave

me the initial idea to try out a lease-based consistency system for

Postgres streaming rep. It seems like a very useful point in the space of

trade-offs to me.

--

Thomas Munro

http://www.enterprisedb.com



Attachment             Content-Type              Size
test-causal-reads.sh   application/x-sh          1.8 KB
test-causal-reads.c    text/x-csrc               4.5 KB
causal-reads-v3.patch  application/octet-stream  72.2 KB

On Wed, Nov 18, 2015 at 11:50 PM, Thomas Munro

<thomas(dot)munro(at)enterprisedb(dot)com> wrote:

> Here is a new version of the patch with a few small improvements:

> ...

> [causal-reads-v3.patch]

That didn't apply after 6e7b3359 (which fixed a typo in a comment that I
moved). Here is a new version that does.

--

Thomas Munro

http://www.enterprisedb.com



Attachment: causal-reads-v4.patch (application/octet-stream, 72.2 KB)

+1 to both the feature and the concept of how it's implemented.
Haven't looked at the code though.

This feature would be very useful for us at Trustly. It would mean we
could get rid of an entire system component in our architecture
(=memcached), which we only use to write data which must be immediately
readable at the sync slave after the master commits. The only such data
we currently have is the backoffice sessionid, which must be readable on
the slave, otherwise the read-only calls which we route to the slave
might fail because it's missing.

On Wed, Nov 11, 2015 at 6:37 AM, Thomas Munro

<thomas(dot)munro(at)enterprisedb(dot)com> wrote:

>

> Hi hackers,

>

> Many sites use hot standby servers to spread read-heavy workloads over more hardware, or at least would like to. This works well today if your application can tolerate some time lag on standbys. The problem is that there is no guarantee of when a particular commit will become visible for clients connected to standbys. The existing synchronous commit feature is no help here because it guarantees only that the WAL has been flushed on another server before commit returns. It says nothing about whether it has been applied or whether it has been applied on the standby that you happen to be talking to.

>

> A while ago I posted a small patch[1] to allow synchronous_commit to wait for remote apply on the current synchronous standby, but (as Simon Riggs rightly pointed out in that thread) that part isn't the main problem. It seems to me that the main problem for a practical 'writer waits' system is how to support a dynamic set of servers, gracefully tolerating failures and timing out laggards, while also providing a strong guarantee during any such transitions. Now I would like to propose something to do that, and share a proof-of-concept patch.

>

>

> === PROPOSAL ===

>

> The working name for the proposed feature is "causal reads", because it provides a form of "causal consistency"[2] (and "read-your-writes" consistency) no matter which server the client is connected to. There is a similar feature by the same name in another product (albeit implemented differently -- 'reader waits'; more about that later). I'm not wedded to the name.

>

> The feature allows arbitrary read-only transactions to be run on any hot standby, with a specific guarantee about the visibility of preceding transactions. The guarantee is that if you set a new GUC "causal_reads = on" in any pair of consecutive transactions (tx1, tx2) where tx2 begins after tx1 successfully returns, then tx2 will either see tx1 or fail with a new error "standby is not available for causal reads", no matter which server it runs on. A discovery mechanism is also provided, giving an instantaneous snapshot of the set of standbys that are currently available for causal reads (ie won't raise the error), in the form of a new column in pg_stat_replication.

>

> For example, a web server might run tx1 to insert a new row representing a message in a discussion forum on the primary server, and then send the user to another web page that runs tx2 to load all messages in the forum on an arbitrary hot standby server. If causal_reads = on in both tx1 and tx2 (for example, because it's on globally), then tx2 is guaranteed to see the new post, or get a (hopefully rare) error telling the client to retry on another server.

>

> Very briefly, the approach is:

> 1. The primary tracks apply lag on each standby (including between commits).

> 2. The primary deems standbys 'available' for causal reads if they are applying WAL and replying to keepalives fast enough, and periodically sends the standby an authorization to consider itself available for causal reads until a time in the near future.

> 3. Commit on the primary with "causal_reads = on" waits for all 'available' standbys either to apply the commit record, or to cease to be 'available' and begin raising the error if they are still alive (because their authorizations have expired).

> 4. Standbys can start causal reads transactions only while they have an authorization with an expiry time in the future; otherwise they raise an error when an initial snapshot is taken.
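Step 4 reduces to a check made when the standby takes a transaction's initial snapshot. A minimal sketch of that decision in illustrative Python (the function and exception names are invented here, not the patch's code):

```python
class CausalReadsUnavailable(Exception):
    """Stands in for the patch's "standby is not available for causal reads" error."""

def check_causal_reads(lease_expiry, now, causal_reads_on=True):
    # A standby may start a causal reads transaction only while it holds an
    # authorization (lease) whose expiry time is still in the future;
    # otherwise it must raise the error rather than risk a stale snapshot.
    if causal_reads_on and now >= lease_expiry:
        raise CausalReadsUnavailable("standby is not available for causal reads")
    return "snapshot taken"

assert check_causal_reads(lease_expiry=10.0, now=5.0) == "snapshot taken"
try:
    check_causal_reads(lease_expiry=10.0, now=12.0)   # lease has lapsed
except CausalReadsUnavailable:
    pass                                              # client should retry elsewhere
```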

>

> In a follow-up email I can write about the design trade-offs considered (mainly 'writer waits' vs 'reader waits'), comparison with some other products, method of estimating replay lag, wait and timeout logic and how it maintains the guarantee in various failure scenarios, logic for standbys joining and leaving, implications of system clock skew between servers, or any other questions you may have, depending on feedback/interest (but see comments in the attached patch for some of those subjects). For now I didn't want to clog up the intertubes with too large a wall of text.

>

>

> === PROOF-OF-CONCEPT ===

>

> Please see the POC patch attached. It adds two new GUCs. After setting up one or more hot standbys as per usual, simply add "causal_reads_timeout = 4s" to the primary's postgresql.conf and restart. Now, you can set "causal_reads = on" in some/all sessions to get guaranteed causal consistency. Expected behaviour: the causal reads guarantee is maintained at all times, even when you overwhelm, kill, crash, disconnect, restart, pause, add and remove standbys, and the primary drops them from the set it waits for in a timely fashion. You can monitor the system with the replay_lag and causal_reads_status in pg_stat_replication and some state transition LOG messages on the primary. (The patch also supports "synchronous_commit = apply", but it's not clear how useful that is in practice, as already discussed.)

>

> Lastly, a few notes about how this feature relates to some other work:

>

> The current version of this patch has causal_reads as a feature separate from synchronous_commit, from a user's point of view. The thinking behind this is that load balancing and data loss avoidance are separate concerns: synchronous_commit deals with the latter, and causal_reads with the former. That said, existing SyncRep machinery is obviously used (specifically SyncRep queues, with a small modification, as a way to wait for apply messages to arrive from standbys). (An earlier prototype had causal reads as a new level for synchronous_commit and associated states as new walsender states above 'streaming'. When contemplating how to combine this proposal with the multiple-synchronous-standby patch, some colleagues and I came around to the view that the concerns are separate. The reason for wanting to configure complicated quorum definitions is to control data loss risks and has nothing to do with load balancing requirements, so we thought the features should probably be separate.)

>

> The multiple-synchronous-servers patch[3] could be applied or not independently of this feature as a result of that separation, as it doesn't use synchronous_standby_names or indeed any kind of statically defined quorum.

>

> The standby WAL writer patch[4] would significantly improve walreceiver performance and smoothness which would work very well with this proposal.

>

> Please let me know what you think!

>

> Thanks,

>

>

> [1] http://www.postgresql.org/message-id/flat/CAEepm=1fqkivL4V-OTPHwSgw4aF9HcoGiMrCW-yBtjipX9gsag(at)mail(dot)gmail(dot)com

>

> [2] From http://queue.acm.org/detail.cfm?id=1466448

>

> "Causal consistency. If process A has communicated to process B that it has updated a data item, a subsequent access by process B will return the updated value, and a write is guaranteed to supersede the earlier write. Access by process C that has no causal relationship to process A is subject to the normal eventual consistency rules.

>

> Read-your-writes consistency. This is an important model where process A, after it has updated a data item, always accesses the updated value and will never see an older value. This is a special case of the causal consistency model."

>

> [3] http://www.postgresql.org/message-id/flat/CAOG9ApHYCPmTypAAwfD3_V7sVOkbnECFivmRc1AxhB40ZBSwNQ(at)mail(dot)gmail(dot)com

>

> [4] http://www.postgresql.org/message-id/flat/CA+U5nMJifauXvVbx=v3UbYbHO3Jw2rdT4haL6CCooEDM5=4ASQ(at)mail(dot)gmail(dot)com

>

> --

> Thomas Munro

> http://www.enterprisedb.com

>

>

> --

> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)

> To make changes to your subscription:

> http://www.postgresql.org/mailpref/pgsql-hackers

> --

Joel Jacobson Mobile: +46703603801

Trustly.com | Newsroom | LinkedIn | Twitter

On Wed, Dec 30, 2015 at 5:15 PM, Thomas Munro

<thomas(dot)munro(at)enterprisedb(dot)com> wrote:

> On Wed, Nov 18, 2015 at 11:50 PM, Thomas Munro

> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:

>> Here is a new version of the patch with a few small improvements:

>> ...

>> [causal-reads-v3.patch]

>

> That didn't apply after 6e7b3359 (which fixed a typo in a comment that I

> moved). Here is a new version that does.

That one conflicts with b1a9bad9e744857291c7d5516080527da8219854, so
here is a new version.

--

Thomas Munro

http://www.enterprisedb.com



Attachment: causal-reads-v5.patch (application/octet-stream, 72.6 KB)

On Wed, Jan 20, 2016 at 1:12 PM, Thomas Munro

<thomas(dot)munro(at)enterprisedb(dot)com> wrote:

> On Wed, Dec 30, 2015 at 5:15 PM, Thomas Munro

> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:

>> On Wed, Nov 18, 2015 at 11:50 PM, Thomas Munro

>> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:

>>> Here is a new version of the patch with a few small improvements:

>>> ...

>>> [causal-reads-v3.patch]

>>

>> That didn't apply after 6e7b3359 (which fixed a typo in a comment that I

>> moved). Here is a new version that does.

>

> That one conflicts with b1a9bad9e744857291c7d5516080527da8219854, so

> here is a new version.

You should try to register it in a CF, though it may be too late for

9.6 if that's rather intrusive.

--

Michael



Hi Thomas,

On 2016/01/20 13:12, Thomas Munro wrote:
> That one conflicts with b1a9bad9e744857291c7d5516080527da8219854, so
> here is a new version.

- if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < MyWalSnd->write)
+ if (is_highest_priority_sync_standby)

[ ... ]

- if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < MyWalSnd->flush)
- {
-     walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = MyWalSnd->flush;
-     numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);

[ ... ]

+ if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < MyWalSnd->write)
+ {
+     walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = MyWalSnd->flush;
+     numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH,
+                                 MyWalSnd->flush);

There seems to be a copy-pasto there - shouldn't that be:

+ if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < MyWalSnd->flush)

Thanks,
Amit

On Wed, Feb 3, 2016 at 10:59 PM, Amit Langote

<Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:

> There seems to be a copy-pasto there - shouldn't that be:

>

> + if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < MyWalSnd->flush)

Indeed, thanks! New patch attached.

--

Thomas Munro

http://www.enterprisedb.com



Attachment: causal-reads-v6.patch (application/octet-stream, 72.6 KB)

From: Thom Brown <thom(at)linux(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Proposal: "Causal reads" mode for load balancing reads without stale data
Date: 2016-02-21 20:40:08
Message-ID: CAA-aLv7sOFOAK4C9FFeQ07gQhEgETkjoc-_WotXVDg3NKC-oKQ@mail.gmail.com

On 3 February 2016 at 10:46, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:

> On Wed, Feb 3, 2016 at 10:59 PM, Amit Langote

> <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:

>> There seems to be a copy-pasto there - shouldn't that be:

>>

>> + if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < MyWalSnd->flush)

>

> Indeed, thanks! New patch attached.

I've given this a test drive, and it works exactly as described.

But one thing which confuses me is when a standby, with causal_reads
enabled, has just finished starting up. I can't connect to it because:

FATAL: standby is not available for causal reads

However, this is the same message when I'm successfully connected, but
it's lagging, and the primary is still waiting for the standby to catch
up:

ERROR: standby is not available for causal reads

What is the difference here? The problem being reported appears to be
identical, yet in the first case I can't connect, while in the second
case I can (although I still can't issue queries).

Thom

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Thom Brown <thom(at)linux(dot)com>
Cc: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Proposal: "Causal reads" mode for load balancing reads without stale data
Date: 2016-02-21 23:18:55
Message-ID: CAEepm=3oHet0DcW+q4ex7G0LLz0-QbBh_7HUFf5aU9tVFk7oxw@mail.gmail.com

On Mon, Feb 22, 2016 at 2:10 AM, Thom Brown <thom(at)linux(dot)com> wrote:

> On 3 February 2016 at 10:46, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:

>> On Wed, Feb 3, 2016 at 10:59 PM, Amit Langote

>> <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:

>>> There seems to be a copy-pasto there - shouldn't that be:

>>>

>>> + if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < MyWalSnd->flush)

>>

>> Indeed, thanks! New patch attached.

>

> I've given this a test drive, and it works exactly as described.

Thanks for trying it out!

> But one thing which confuses me is when a standby, with causal_reads

> enabled, has just finished starting up. I can't connect to it

> because:

>

> FATAL: standby is not available for causal reads

>

> However, this is the same message when I'm successfully connected, but

> it's lagging, and the primary is still waiting for the standby to

> catch up:

>

> ERROR: standby is not available for causal reads

>

> What is the difference here? The problem being reported appears to be

> identical, but in the first case I can't connect, but in the second

> case I can (although I still can't issue queries).

Right, you get the error at login before it has managed to connect to

the primary, and for a short time after while it's in 'joining' state,

or potentially longer if there is a backlog of WAL to apply.

The reason is that when causal_reads = on in postgresql.conf (as

opposed to being set for an individual session or role), causal reads

logic is used for snapshots taken during authentication (in fact the

error is generated when trying to take a snapshot slightly before

authentication proper begins, in InitPostgres). I think that's a

desirable feature: if you have causal reads on and you create/alter a

database/role (for example setting a new password) and commit, and

then you immediately try to connect to that database/role on a standby

where you have causal reads enabled system-wide, then you get the

causal reads guarantee during authentication: you either see the

effects of your earlier transaction or you see the error. As you have

discovered, there is a small window after a standby comes up where it

will show the error because it hasn't got a lease yet so it can't let

you log in yet because it could be seeing a stale catalog (your user

may not exist on the standby yet, or have been altered in some way, or

your database may not exist yet, etc).

Does that make sense?

--

Thomas Munro

http://www.enterprisedb.com

From: Thom Brown <thom(at)linux(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Proposal: "Causal reads" mode for load balancing reads without stale data
Date: 2016-02-22 00:39:10
Message-ID: CAA-aLv764BM0mp5rk0B-Ooq0tB1LiXfB_ckCAw=DEjqVAuJDew@mail.gmail.com

On 21 February 2016 at 23:18, Thomas Munro

<thomas(dot)munro(at)enterprisedb(dot)com> wrote:

> On Mon, Feb 22, 2016 at 2:10 AM, Thom Brown <thom(at)linux(dot)com> wrote:

>> On 3 February 2016 at 10:46, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:

>>> On Wed, Feb 3, 2016 at 10:59 PM, Amit Langote

>>> <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:

>>>> There seems to be a copy-pasto there - shouldn't that be:

>>>>

>>>> + if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < MyWalSnd->flush)

>>>

>>> Indeed, thanks! New patch attached.

>>

>> I've given this a test drive, and it works exactly as described.

>

> Thanks for trying it out!

>

>> But one thing which confuses me is when a standby, with causal_reads

>> enabled, has just finished starting up. I can't connect to it

>> because:

>>

>> FATAL: standby is not available for causal reads

>>

>> However, this is the same message when I'm successfully connected, but

>> it's lagging, and the primary is still waiting for the standby to

>> catch up:

>>

>> ERROR: standby is not available for causal reads

>>

>> What is the difference here? The problem being reported appears to be

>> identical, but in the first case I can't connect, but in the second

>> case I can (although I still can't issue queries).

>

> Right, you get the error at login before it has managed to connect to

> the primary, and for a short time after while it's in 'joining' state,

> or potentially longer if there is a backlog of WAL to apply.

>

> The reason is that when causal_reads = on in postgresql.conf (as

> opposed to being set for an individual session or role), causal reads

> logic is used for snapshots taken during authentication (in fact the

> error is generated when trying to take a snapshot slightly before

> authentication proper begins, in InitPostgres). I think that's a

> desirable feature: if you have causal reads on and you create/alter a

> database/role (for example setting a new password) and commit, and

> then you immediately try to connect to that database/role on a standby

> where you have causal reads enabled system-wide, then you get the

> causal reads guarantee during authentication: you either see the

> effects of your earlier transaction or you see the error. As you have

> discovered, there is a small window after a standby comes up where it

> will show the error because it hasn't got a lease yet so it can't let

> you log in yet because it could be seeing a stale catalog (your user

> may not exist on the standby yet, or have been altered in some way, or

> your database may not exist yet, etc).

>

> Does that make sense?

Ah, alles klar. Yes, that makes sense now. I've been trying to break it
the past few days, and this was the only thing which I wasn't clear on.
The parameters all work as described.

The replay_lag is particularly cool. Didn't think it was possible to
glean this information on the primary, but the timings are correct in my
tests.

+1 for this patch. Looks like this solves the problem that
semi-synchronous replication tries to solve, although arguably in a more
sensible way.

Thom

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Thom Brown <thom(at)linux(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Proposal: "Causal reads" mode for load balancing reads without stale data
Date: 2016-02-27 13:20:42
Message-ID: CAB7nPqR+aZ7o00zA0MV6o9mvNixPWOUvrLKW_Am8QdYp_Wo0qw@mail.gmail.com

On Mon, Feb 22, 2016 at 9:39 AM, Thom Brown <thom(at)linux(dot)com> wrote:

> On 21 February 2016 at 23:18, Thomas Munro

> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:

> The replay_lag is particularly cool. Didn't think it was possible to

> glean this information on the primary, but the timings are correct in

> my tests.

>

> +1 for this patch. Looks like this solves the problem that

> semi-synchronous replication tries to solve, although arguably in a

> more sensible way.

Yeah, having extra logic at the application layer to check if a certain

LSN position has been applied or not is doable, but if we can avoid it

that's a clear plus.

This patch has no documentation. I will try to figure out by myself

how the new parameters interact with the rest of the syncrep code

while looking at it but if we want to move on to get something

committable for 9.6 it would be good to get some documentation soon.

--

Michael
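The application-layer approach Michael alludes to means recording the primary's WAL position after commit (pg_current_xlog_location() in 9.6-era releases) and polling the standby's pg_last_xlog_replay_location() until it catches up. Those functions return LSNs as text in 'hi/lo' hexadecimal form; a sketch of the comparison step (the helper names are mine, not PostgreSQL's):

```python
def parse_lsn(lsn):
    # PostgreSQL prints an LSN as two hexadecimal numbers separated by '/',
    # e.g. '0/3000060': the high and low 32 bits of a 64-bit WAL position.
    hi, lo = lsn.split('/')
    return (int(hi, 16) << 32) | int(lo, 16)

def caught_up(replay_lsn, commit_lsn):
    # The standby can serve the read once its replay position has reached
    # the LSN of the commit we are waiting to see.
    return parse_lsn(replay_lsn) >= parse_lsn(commit_lsn)

assert parse_lsn('0/3000060') == 0x3000060
assert caught_up('1/0', '0/FFFFFFFF')           # high word dominates
assert not caught_up('0/3000000', '0/3000060')  # still lagging
```

A client would loop on the standby, sleeping between polls, until caught_up() returns true; the causal reads patch moves that waiting into commit on the primary instead.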

From: Thom Brown <thom(at)linux(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Proposal: "Causal reads" mode for load balancing reads without stale data
Date: 2016-02-27 22:54:36
Message-ID: CAA-aLv6c2LER+--TXdJuqDDsv2t7uyWFGV3O4WxOuP2suNVCBg@mail.gmail.com

On 27 February 2016 at 13:20, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote:

> On Mon, Feb 22, 2016 at 9:39 AM, Thom Brown <thom(at)linux(dot)com> wrote:

>> On 21 February 2016 at 23:18, Thomas Munro

>> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:

>> The replay_lag is particularly cool. Didn't think it was possible to

>> glean this information on the primary, but the timings are correct in

>> my tests.

>>

>> +1 for this patch. Looks like this solves the problem that

>> semi-synchronous replication tries to solve, although arguably in a

>> more sensible way.

>

> Yeah, having extra logic at application layer to check if a certain

> LSN position has been applied or not is doable, but if we can avoid it

> that's a clear plus.

>

> This patch has no documentation. I will try to figure out by myself

> how the new parameters interact with the rest of the syncrep code

> while looking at it but if we want to move on to get something

> committable for 9.6 it would be good to get some documentation soon.

Could we rename "apply" to "remote_apply"? It seems more consistent

with "remote_write", and matches its own enum entry too.

Thom

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Thom Brown <thom(at)linux(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Proposal: "Causal reads" mode for load balancing reads without stale data
Date: 2016-02-29 06:20:31
Message-ID: CAEepm=03Fs+6eX17AScshgAm4LVkCpwSbqYSaP-LzAuoMSBsgA@mail.gmail.com

On Sun, Feb 28, 2016 at 2:20 AM, Michael Paquier

<michael(dot)paquier(at)gmail(dot)com> wrote:

> On Mon, Feb 22, 2016 at 9:39 AM, Thom Brown <thom(at)linux(dot)com> wrote:

>> On 21 February 2016 at 23:18, Thomas Munro

>> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:

>> The replay_lag is particularly cool. Didn't think it was possible to

>> glean this information on the primary, but the timings are correct in

>> my tests.

>>

>> +1 for this patch. Looks like this solves the problem that

>> semi-synchronous replication tries to solve, although arguably in a

>> more sensible way.

>

> Yeah, having extra logic at application layer to check if a certain

> LSN position has been applied or not is doable, but if we can avoid it

> that's a clear plus.

>

> This patch has no documentation. I will try to figure out by myself

> how the new parameters interact with the rest of the syncrep code

> while looking at it but if we want to move on to get something

> committable for 9.6 it would be good to get some documentation soon.

Thanks for looking at the patch! Here is a new version with the
following changes:

1. Some draft user documentation has been added, as requested.

2. The new synchronous commit level (separate from the causal reads
feature, but implemented for completeness) is now called "remote_apply"
instead of "apply", as suggested by Thom Brown.

3. There is a draft README.causal_reads file which provides some
developer notes about state transitions, leases and clock skew.

4. The 'joining' state management has been improved (it's now based on
xlog positions rather than time; see the README and comments for
details).

5. The ps title is now restored after it is modified during causal
reads-related waiting.

6. I assigned an errcode (40P02) for causal reads failures (useful for
clients/middleware/libraries that might want to handle this error
automatically), as suggested by a couple of people off-list.

--

Thomas Munro

http://www.enterprisedb.com



Attachment: causal-reads-v7.patch (application/octet-stream, 107.7 KB)
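With a dedicated SQLSTATE (40P02) identifying causal reads failures, a client library or pooler could retry transparently on another standby. A hedged sketch of that policy in Python; the driver call is faked and the server names are invented, so this only illustrates the control flow, not any real client API:

```python
class CausalReadsUnavailable(Exception):
    sqlstate = '40P02'  # errcode assigned to causal reads failures in v7

def run_on_any_standby(standbys, query, execute):
    # Try each standby in turn; 40P02 means "this standby cannot uphold the
    # causal reads guarantee right now", so fall through to the next one.
    for server in standbys:
        try:
            return server, execute(server, query)
        except CausalReadsUnavailable:
            continue
    raise RuntimeError("no standby available for causal reads")

def fake_execute(server, query):
    # Illustrative stand-in for a real driver call: standby1 is lagging.
    if server == 'standby1':
        raise CausalReadsUnavailable()
    return ['rows']

assert run_on_any_standby(['standby1', 'standby2'], 'SELECT 1',
                          fake_execute) == ('standby2', ['rows'])
```

The guarantee makes this safe: any standby that answers without the error is known not to be serving stale data for the preceding causal reads transaction.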

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Thom Brown <thom(at)linux(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Proposal: "Causal reads" mode for load balancing reads without stale data
Date: 2016-02-29 08:05:52
Message-ID: 56D3FBE0.5030309@lab.ntt.co.jp



Hi Thomas,

On 2016/02/29 15:20, Thomas Munro wrote:

> Thanks for looking at the patch! Here is a new version with the

> following changes:

>

> 1. Some draft user documentation has been added, as requested.

Just to clarify, in:

+ servers. A transaction that is run with

<varname>causal_reads</> set

+ to <literal>on</> is guaranteed either to see the effects of all

+ completed transactions run on the primary with the setting on, or to

+ receive an error "standby is not available for causal reads".

"A transaction that is run" means "A transaction that is run on a
standby", right?

By the way, is there some discussion in our existing

documentation to refer to about causal consistency in single node case? I

don't know maybe that will help ease into the new feature. Grepping the

existing source tree doesn't reveal the term "causal", so maybe even a

single line in the patch mentioning "single node operation trivially

implies (or does it?) causal consistency" would help. Thoughts?

Thanks,

Amit

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Thom Brown <thom(at)linux(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Proposal: "Causal reads" mode for load balancing reads without stale data
Date: 2016-02-29 09:05:47
Message-ID: CAEepm=2eUNUMgt2D6AkfU5bGGJ1rUNTcHPTOD7oPGvyASpXy0Q@mail.gmail.com

On Mon, Feb 29, 2016 at 9:05 PM, Amit Langote

<Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:

>

> Hi Thomas,

>

> On 2016/02/29 15:20, Thomas Munro wrote:

>> Thanks for looking at the patch! Here is a new version with the

>> following changes:

>>

>> 1. Some draft user documentation has been added, as requested.

>

> Just to clarify, in:

>

> + servers. A transaction that is run with

> <varname>causal_reads</> set

> + to <literal>on</> is guaranteed either to see the effects of all

> + completed transactions run on the primary with the setting on, or to

> + receive an error "standby is not available for causal reads".

>

> "A transaction that is run" means "A transaction that is run on a

> standby", right?

Well, it could be any server, standby or primary. Of course standbys
are the interesting case, since it was already true that if you run
two sequential transactions on the primary, the second can see the
effect of the first, but I like the idea of a general rule that
applies anywhere, allowing you not to care which server it is.

> By the way, is there some discussion in our existing

> documentation to refer to about causal consistency in single node case? I

> don't know maybe that will help ease into the new feature. Grepping the

> existing source tree doesn't reveal the term "causal", so maybe even a

> single line in the patch mentioning "single node operation trivially

> implies (or does it?) causal consistency" would help. Thoughts?

Hmm. Where should such a thing go? I probably haven't introduced the

term well enough. I thought for a moment about putting something

here: http://www.postgresql.org/docs/devel/static/sql-commit.html

"All changes made by the transaction become visible to others ..." --

which others? But I backed out, that succinct account of COMMIT is 20

years old, and in any case visibility is tied to committing, not

specifically to the COMMIT command. But perhaps this patch really

should include something there that refers back to the causal reads

section.

--

Thomas Munro

http://www.enterprisedb.com

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Thom Brown <thom(at)linux(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Proposal: "Causal reads" mode for load balancing reads without stale data
Date: 2016-02-29 11:38:22
Message-ID: CAB7nPqTOgBmetwO5Gt=KFrwYpw7=ri39AO0oXS_p9PJrfye8Sw@mail.gmail.com

On Mon, Feb 29, 2016 at 6:05 PM, Thomas Munro

<thomas(dot)munro(at)enterprisedb(dot)com> wrote:

> "All changes made by the transaction become visible to others ..." --

> which others? But I backed out, that succinct account of COMMIT is 20

> years old, and in any case visibility is tied to committing, not

> specifically to the COMMIT command. But perhaps this patch really

> should include something there that refers back to the causal reads

> section. Luckily enough, read uncommitted behaves like read-committed in PG,

making this true :)
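Michael's point here is documented PostgreSQL behavior: requesting READ UNCOMMITTED actually gives you READ COMMITTED semantics, so even at that level uncommitted changes from other sessions are never visible. A minimal session sketch (the SHOW output reflects the requested name, not the effective behavior):

```sql
BEGIN TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SHOW transaction_isolation;  -- reports the requested level, "read uncommitted"
-- ...but snapshot behavior is identical to READ COMMITTED: rows written by
-- other sessions' uncommitted transactions are never visible here.
COMMIT;
```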

--

Michael

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Thom Brown <thom(at)linux(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Proposal: "Causal reads" mode for load balancing reads without stale data
Date: 2016-03-01 01:46:03
Message-ID: 56D4F45B.9080408@lab.ntt.co.jp
Lists: pgsql-hackers



Hi,

On 2016/02/29 18:05, Thomas Munro wrote:

> On Mon, Feb 29, 2016 at 9:05 PM, Amit Langote wrote:

>> + servers. A transaction that is run with

>> <varname>causal_reads</> set

>> + to <literal>on</> is guaranteed either to see the effects of all

>> + completed transactions run on the primary with the setting on, or to

>> + receive an error "standby is not available for causal reads".

>>

>> "A transaction that is run" means "A transaction that is run on a

>> standby", right?

>

> Well, it could be any server, standby or primary. Of course standbys

> are the interesting case since it was already true that if you run

> two sequential transactions on the primary, the second can see the

> effect of the first, but I like the idea of a general rule that

> applies anywhere, allowing you not to care which server it is.

I actually meant in the context of that sentence only.

>> By the way, is there some discussion in our existing
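The guarantee being discussed can be illustrated with a hypothetical session sketch. (This assumes the proposed causal_reads GUC and error text from this thread; the table and values are made up for illustration.)

```sql
-- On the primary, with the proposed patch applied:
SET causal_reads = on;
BEGIN;
INSERT INTO orders (id, total) VALUES (42, 100.00);
COMMIT;  -- returns only once the guarantee can be upheld

-- Afterwards, on any server -- primary or hot standby:
SET causal_reads = on;
SELECT total FROM orders WHERE id = 42;
-- Either sees the row committed above, or fails with
--   ERROR: standby is not available for causal reads
-- but never silently returns stale data.
```

The point of the "general rule" above is that the client need not know or care whether it reached the primary or a standby: on the primary the second transaction trivially sees the first, and on a standby the proposed mode upgrades that into an explicit see-it-or-error guarantee.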

>> documentation to refer to about causal consistency in single node case? I

>> don't know maybe that will help ease into the new feature. Grepping the

>> existing source tree doesn't reveal the term "causal", so maybe even a

>> single line in the patch mentioning "single node operation trivially

>> implies (or does it?) causal consistency" would help. Thoughts?

>

> Hmm. Where should such a thing go? I probably haven't introduced the

> term well enough. I thought for a moment about putting something

> here:

>

> http://www.postgresql.org/docs/devel/static/sql-commit.html

>

> "All changes made by the transaction become visible to others ..." --

> which others? But I backed out, that succinct account of COMMIT is 20

> years old, and in any case visibility is tied to committing, not

> specifically to the COMMIT command. But perhaps this patch really

> should include something there that refers back to the causal reads

> section. I see. I agree this is not exactly material for the COMMIT page. Perhaps

somewhere under "Chapter 13. Concurrency Control" with cross-reference

to/from "25.5. Hot Standby". Might be interesting to hear from others as

well.

Thanks,

Amit

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Thom Brown <thom(at)linux(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>,