Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

Hi all,

Some time ago I ran into an issue where a user encountered data corruption after a storage error. PostgreSQL played a part in that corruption by allowing a checkpoint to complete after what should've been a fatal error.

TL;DR: Pg should PANIC on fsync() EIO return. Retrying fsync() is not OK, at least on Linux. When fsync() returns success it means "all writes since the last fsync have hit disk", but we assume it means "all writes since the last SUCCESSFUL fsync have hit disk".

Pg wrote some blocks, which went to OS dirty buffers for writeback. Writeback failed due to an underlying storage error. The block I/O layer and XFS marked the writeback page as failed (AS_EIO), but had no way to tell the app about the failure. When Pg called fsync() on the FD during the next checkpoint, fsync() returned EIO because of the flagged page, to tell Pg that a previous async write failed. Pg treated the checkpoint as failed and didn't advance the redo start position in the control file. All good so far.

But then we retried the checkpoint, which retried the fsync(). The retry succeeded, because the prior fsync() *cleared the AS_EIO bad page flag*. The write never made it to disk, but we completed the checkpoint, and merrily carried on our way. Whoops, data loss.

The clear-error-and-continue behaviour of fsync is not documented as far as I can tell. Nor is fsync() returning EIO, unless you have a very new Linux man-pages with the patch I wrote to add it. But from what I can see in the POSIX standard, we are not given any guarantees about what happens on fsync() failure at all, so we're probably wrong to assume that retrying fsync() is safe.

If the server had been using ext3 or ext4 with errors=remount-ro, the problem wouldn't have occurred, because the first I/O error would've remounted the FS and stopped Pg from continuing. But XFS doesn't have that option. There may be other situations where this can occur too, involving LVM and/or multipath, but I haven't comprehensively dug out the details yet.

It proved possible to recover the system by faking up a backup label from before the first incorrectly-successful checkpoint, forcing redo to repeat and write the lost blocks. But ... what a mess.

I posted about the underlying fsync issue here some time ago:

https://stackoverflow.com/q/42434872/398670

but haven't had a chance to follow up about the Pg specifics. I've been looking at the problem on and off and haven't come up with a good answer. I think we should just PANIC and let redo sort it out by repeating the failed write when it repeats work since the last checkpoint.

The API offered by async buffered writes and fsync offers us no way to find out which page failed, so we can't just selectively redo that write. I think we do know the relfilenode associated with the fd that failed to fsync, but not much more. So the alternative seems to be some sort of potentially complex online-redo scheme where we replay WAL only for the relation on which we had the fsync() error, while otherwise servicing queries normally. That's likely to be extremely error-prone and hard to test, and it's trying to solve a case where on other filesystems the whole DB would grind to a halt anyway.

I looked into whether we can solve it with use of the AIO API instead, but the mess is even worse there - from what I can tell you can't even reliably guarantee fsync at all on all Linux kernel versions.

We already PANIC on fsync() failure for WAL segments. We just need to do the same for data forks, at least for EIO. This isn't as bad as it seems, because AFAICS fsync only returns EIO in cases where we should be stopping the world anyway, and many FSes will do that for us.

There are rather a lot of pg_fsync() callers. While we could handle this case-by-case for each one, I'm tempted to just make pg_fsync() itself intercept EIO and PANIC.

Thoughts?

--
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Craig Ringer <craig(at)2ndquadrant(dot)com> writes:
> TL;DR: Pg should PANIC on fsync() EIO return.

Surely you jest.

> Retrying fsync() is not OK at least on Linux. When fsync() returns
> success it means "all writes since the last fsync have hit disk" but we
> assume it means "all writes since the last SUCCESSFUL fsync have hit
> disk".

If that's actually the case, we need to push back on this kernel brain damage, because as you're describing it fsync would be completely useless. Moreover, POSIX is entirely clear that successful fsync means all preceding writes for the file have been completed, full stop, doesn't matter when they were issued.

			regards, tom lane

On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:
> Craig Ringer <craig(at)2ndquadrant(dot)com> writes:
>> TL;DR: Pg should PANIC on fsync() EIO return.
>
> Surely you jest.

Any callers of pg_fsync in the backend code are careful enough to check the returned status, sometimes doing retries like in mdsync, so what is proposed here would be a regression.
--
Michael



From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Craig Ringer <craig(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-03-29 02:48:27
Message-ID: CAEepm=2JnwtkZ1PAuPMx=UtG21VPQFfRrVzVTWjEPQmYR-zyng@mail.gmail.com
Lists: pgsql-hackers

On Thu, Mar 29, 2018 at 3:30 PM, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:
>> Craig Ringer <craig(at)2ndquadrant(dot)com> writes:
>>> TL;DR: Pg should PANIC on fsync() EIO return.
>>
>> Surely you jest.
>
> Any callers of pg_fsync in the backend code are careful enough to check
> the returned status, sometimes doing retries like in mdsync, so what is
> proposed here would be a regression.

Craig, is the phenomenon you described the same as the second issue "Reporting writeback errors" discussed in this article?

https://lwn.net/Articles/724307/

"Current kernels might report a writeback error on an fsync() call, but there are a number of ways in which that can fail to happen."

That's... I'm speechless.

--
Thomas Munro
http://www.enterprisedb.com

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Craig Ringer <craig(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-03-29 05:00:31
Message-ID: 20180329050031.GG28454@telsasoft.com
Lists: pgsql-hackers

On Thu, Mar 29, 2018 at 11:30:59AM +0900, Michael Paquier wrote:
> On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:
> > Craig Ringer <craig(at)2ndquadrant(dot)com> writes:
> > > TL;DR: Pg should PANIC on fsync() EIO return.
> >
> > Surely you jest.
>
> Any callers of pg_fsync in the backend code are careful enough to check
> the returned status, sometimes doing retries like in mdsync, so what is
> proposed here would be a regression.

The retries are the source of the problem; the first fsync() can return EIO, and also *clears the error*, causing a 2nd fsync (of the same data) to return success.

(Note, I can see that it might be useful to PANIC on EIO but retry for ENOSPC.)

On Thu, Mar 29, 2018 at 03:48:27PM +1300, Thomas Munro wrote:
> Craig, is the phenomenon you described the same as the second issue
> "Reporting writeback errors" discussed in this article?
> https://lwn.net/Articles/724307/

Worse, the article acknowledges the behavior without apparently suggesting to change it:

"Storing that value in the file structure has an important benefit: it makes it possible to report a writeback error EXACTLY ONCE TO EVERY PROCESS THAT CALLS FSYNC() .... In current kernels, ONLY THE FIRST CALLER AFTER AN ERROR OCCURS HAS A CHANCE OF SEEING THAT ERROR INFORMATION."

I believe I reproduced the problem behavior using the dmsetup "error" target, see attached. strace looks like this (kernel is Linux 4.10.0-28-generic #32~16.04.2-Ubuntu SMP Thu Jul 20 10:19:48 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux):

 1 open("/dev/mapper/eio", O_RDWR|O_CREAT, 0600) = 3
 2 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
 3 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
 4 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
 5 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
 6 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
 7 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
 8 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 2560
 9 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = -1 ENOSPC (No space left on device)
10 dup(2) = 4
11 fcntl(4, F_GETFL) = 0x8402 (flags O_RDWR|O_APPEND|O_LARGEFILE)
12 brk(NULL) = 0x1299000
13 brk(0x12ba000) = 0x12ba000
14 fstat(4, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0
15 write(4, "write(1): No space left on devic"..., 34write(1): No space left on device
16 ) = 34
17 close(4) = 0
18 fsync(3) = -1 EIO (Input/output error)
19 dup(2) = 4
20 fcntl(4, F_GETFL) = 0x8402 (flags O_RDWR|O_APPEND|O_LARGEFILE)
21 fstat(4, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0
22 write(4, "fsync(1): Input/output error\n", 29fsync(1): Input/output error
23 ) = 29
24 close(4) = 0
25 close(3) = 0
26 open("/dev/mapper/eio", O_RDWR|O_CREAT, 0600) = 3
27 fsync(3) = 0
28 write(3, "\0", 1) = 1
29 fsync(3) = 0
30 exit_group(0) = ?

 2: EIO isn't seen initially due to writeback page cache
 9: ENOSPC due to small device
18: original I/O error reported by fsync, good
25: the original FD is closed
26: ...and the file reopened
27: fsync on the file with still-dirty data + EIO returns success: BAD

10, 19: I'm not sure why there's dup(2); I guess glibc thinks that perror should write to a separate FD (?)

Also note, close() ALSO returned success... which you might think exonerates the 2nd fsync(), but I think it may itself be problematic, no? In any case, the 2nd byte certainly never got written to the DM error device, and the failure status was lost following fsync().

I get the exact same behavior if I break after one write() loop, such as to avoid ENOSPC.

Justin

Attachment: eio.c (text/x-csrc, 1.0 KB)
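The attached eio.c isn't reproduced here, but the sequence it exercises can be sketched roughly as follows (my reconstruction from the strace above; the device path, block size, and write count are illustrative).

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Rough reconstruction of the reproducer's write/fsync/reopen sequence.
 * Run against a dm "error" target (e.g. /dev/mapper/eio), the first
 * fsync() reports EIO; after close() and reopen, a second fsync() of
 * the same never-written data reports success, demonstrating the lost
 * error.  Against an ordinary file every call simply succeeds.
 */
int run_sequence(const char *path)
{
    char buf[8192];
    memset(buf, 0, sizeof buf);

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    /* a handful of writes, stopping early on a short write (ENOSPC) */
    for (int i = 0; i < 8; i++)
        if (write(fd, buf, sizeof buf) < (ssize_t) sizeof buf)
            break;

    if (fsync(fd) != 0)
        perror("fsync(1)");     /* EIO expected on the error target */
    close(fd);                  /* note: close() also reports success */

    /* reopen and fsync again: the kernel has forgotten the error */
    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    int rc = fsync(fd);         /* 0 despite the lost write, on Linux */
    close(fd);
    return rc;
}
```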

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Justin Pryzby <pryzby(at)telsasoft(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Craig Ringer <craig(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-03-29 05:06:22
Message-ID: CAEepm=1YNv1hic3MVRJiB817eofmL-wfiD=zhJnt0RjaHnfwig@mail.gmail.com
Lists: pgsql-hackers

On Thu, Mar 29, 2018 at 6:00 PM, Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> The retries are the source of the problem; the first fsync() can return
> EIO, and also *clears the error*, causing a 2nd fsync (of the same data)
> to return success.

What I'm failing to grok here is how that error flag even matters, whether it's a single bit or a counter as described in that patch. If writeback failed, *the page is still dirty*. So all future calls to fsync() need to try to flush it again, and (presumably) fail again (unless it happens to succeed this time around).

--
Thomas Munro
http://www.enterprisedb.com

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-03-29 05:25:51
Message-ID: CAMsr+YEa4tv1UCBRQHzA1ycfdvryHFYJ1LhaJJNbjStO3=M9Hg@mail.gmail.com
Lists: pgsql-hackers

On 29 March 2018 at 13:06, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> On Thu, Mar 29, 2018 at 6:00 PM, Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> > The retries are the source of the problem; the first fsync() can return
> > EIO, and also *clears the error*, causing a 2nd fsync (of the same data)
> > to return success.
>
> What I'm failing to grok here is how that error flag even matters,
> whether it's a single bit or a counter as described in that patch. If
> writeback failed, *the page is still dirty*. So all future calls to
> fsync() need to try to flush it again, and (presumably) fail
> again (unless it happens to succeed this time around).

You'd think so. But it doesn't appear to work that way. You can see for yourself with the "error" device-mapper target mapped over part of a volume. I wrote a test case here:

https://github.com/ringerc/scrapcode/blob/master/testcases/fsync-error-clear.c

I don't pretend the kernel behaviour is sane. And it's possible I've made an error in my analysis. But since I've observed this in the wild, and seen it in a test case, I strongly suspect that what I've described is just what's happening, brain-dead or no.

Presumably the kernel marks the page clean when it dispatches it to the I/O subsystem and doesn't dirty it again on I/O error? I haven't dug that deep on the kernel side. See the stackoverflow post for details on what I found in kernel code analysis.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-03-29 05:32:43
Message-ID: CAMsr+YFET=hWD7A5ULSpbbiZ6e-N-W7ajga1zZfzy=mko5qfqA@mail.gmail.com
Lists: pgsql-hackers

On 29 March 2018 at 10:48, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> On Thu, Mar 29, 2018 at 3:30 PM, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> > On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:
> > > Craig Ringer <craig(at)2ndquadrant(dot)com> writes:
> > > > TL;DR: Pg should PANIC on fsync() EIO return.
> > >
> > > Surely you jest.
> >
> > Any callers of pg_fsync in the backend code are careful enough to check
> > the returned status, sometimes doing retries like in mdsync, so what is
> > proposed here would be a regression.
>
> Craig, is the phenomenon you described the same as the second issue
> "Reporting writeback errors" discussed in this article?
>
> https://lwn.net/Articles/724307/

A variant of it, by the looks.

The problem in our case is that the kernel only tells us about the error once. It then forgets about it. So yes, that seems like a variant of the statement:

> "Current kernels might report a writeback error on an fsync() call,
> but there are a number of ways in which that can fail to happen."
>
> That's... I'm speechless.

Yeah. It's a bit nuts. I was astonished when I saw the behaviour, and that it appears undocumented.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

On 29 March 2018 at 10:30, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:
> > Craig Ringer <craig(at)2ndquadrant(dot)com> writes:
> > > TL;DR: Pg should PANIC on fsync() EIO return.
> >
> > Surely you jest.
>
> Any callers of pg_fsync in the backend code are careful enough to check
> the returned status, sometimes doing retries like in mdsync, so what is
> proposed here would be a regression.

I covered this in my original post.

Yes, we check the return value. But what do we do about it? For fsyncs of heap files, we ERROR, aborting the checkpoint. We'll retry the checkpoint later, which will retry the fsync(). **Which will now appear to succeed** because the kernel forgot that it lost our writes after telling us the first time. So we check the error code, it now reports success, and we complete the checkpoint and move on.

But we only retried the fsync, not the writes before the fsync.

So we lost data. Or rather, failed to detect that the kernel did so, so our checkpoint was bad and should not have been completed.

The problem is that we keep retrying checkpoints *without* repeating the writes leading up to the checkpoint, and retrying fsync.

I don't pretend the kernel behaviour is sane, but we'd better deal with it anyway.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



On 28 March 2018 at 11:53, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Craig Ringer <craig(at)2ndquadrant(dot)com> writes:
> > TL;DR: Pg should PANIC on fsync() EIO return.
>
> Surely you jest.

No. I'm quite serious. Worse, we quite possibly have to do it for ENOSPC as well to avoid similar lost-page-write issues.

It's not necessary on ext3/ext4 with errors=remount-ro, but that's only because the FS stops us dead in our tracks.

I don't pretend it's sane. The kernel behaviour is IMO crazy. If it's going to lose a write, it should at minimum mark the FD as broken so no further fsync() or anything else can succeed on the FD, and an app that cares about durability must repeat the whole set of work since the prior successful fsync(). Just reporting it once and forgetting it is madness.

But even if we convince the kernel folks of that, how do other platforms behave? And how long before these kernels are out of use? We'd better deal with it, crazy or no.

Please see my StackOverflow post for the kernel-level explanation. Note also the test case link there.

https://stackoverflow.com/a/42436054/398670

> > Retrying fsync() is not OK at least on Linux. When fsync() returns
> > success it means "all writes since the last fsync have hit disk" but we
> > assume it means "all writes since the last SUCCESSFUL fsync have hit
> > disk".
>
> If that's actually the case, we need to push back on this kernel brain
> damage, because as you're describing it fsync would be completely useless.

It's not useless, it's just telling us something other than what we think it means. The promise it seems to give us is that if it reports an error once, everything *after* that is useless, so we should throw our toys, close and reopen everything, and redo from the last known-good state.

Though as Tomas posted below, it provides rather weaker guarantees than I thought in some other areas too. See that lwn.net article he linked.

> Moreover, POSIX is entirely clear that successful fsync means all
> preceding writes for the file have been completed, full stop, doesn't
> matter when they were issued.

I can't find anything that says so to me. Please quote the relevant spec. I'm working from http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html which states that:

"The fsync() function shall request that all data for the open file descriptor named by fildes is to be transferred to the storage device associated with the file described by fildes. The nature of the transfer is implementation-defined. The fsync() function shall not return until the system has completed that action or until an error is detected."

My reading is that POSIX does not specify what happens AFTER an error is detected. It doesn't say the error has to be persistent and that subsequent calls must also report the error. It also says:

"If the fsync() function fails, outstanding I/O operations are not guaranteed to have been completed."

but that doesn't clarify matters much either, because it can be read to mean that once there's been an error reported for some I/O operations there's no guarantee those operations are ever completed, even after a subsequent fsync returns success.

I'm not seeking to defend what the kernel seems to be doing. Rather, I'm saying that we might see similar behaviour on other platforms, crazy or not. I haven't looked past Linux yet, though.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
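The "repeat the whole set of work since the prior successful fsync()" discipline Craig describes can be sketched as follows, under the assumption that the application keeps its own copy of the data (as Pg does, in WAL). This is an illustration of the recovery discipline, not of any real PostgreSQL code path.

```c
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Sketch of the only durable discipline available under these kernel
 * semantics: if fsync() (or any write) fails, throw the fd away and
 * redo ALL work since the last successful fsync() from an
 * application-held copy of the data.
 */
int durable_rewrite(const char *path, const void *data, size_t len,
                    int max_attempts)
{
    for (int attempt = 0; attempt < max_attempts; attempt++)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd < 0)
            return -1;

        if (write(fd, data, len) == (ssize_t) len && fsync(fd) == 0)
        {
            close(fd);
            return 0;           /* everything reached stable storage */
        }

        /*
         * Failure: the kernel may have dropped the dirty pages and
         * cleared the error flag, so nothing short of rewriting the
         * whole range from our own copy is safe.
         */
        close(fd);
    }
    return -1;
}
```

Note that this is exactly the shape of crash recovery from WAL: discard the uncertain state and replay everything from the last known-good point, which is why PANIC-and-redo fits the kernel's actual contract.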



On Thu, Mar 29, 2018 at 6:58 PM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
> On 28 March 2018 at 11:53, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>
>> Craig Ringer <craig(at)2ndquadrant(dot)com> writes:
>> > TL;DR: Pg should PANIC on fsync() EIO return.
>>
>> Surely you jest.
>
> No. I'm quite serious. Worse, we quite possibly have to do it for ENOSPC as
> well to avoid similar lost-page-write issues.

I found your discussion with kernel hacker Jeff Layton at https://lwn.net/Articles/718734/ in which he said: "The stackoverflow writeup seems to want a scheme where pages stay dirty after a writeback failure so that we can try to fsync them again. Note that that has never been the case in Linux after hard writeback failures, AFAIK, so programs should definitely not assume that behavior."

The article above that says the same thing a couple of different ways, ie that writeback failure leaves you with pages that are neither written to disk successfully nor marked dirty.

If I'm reading various articles correctly, the situation was even worse before his errseq_t stuff landed. That fixed cases of completely unreported writeback failures due to sharing of PG_error for both writeback and read errors with certain filesystems, but it doesn't address the clean pages problem.

Yeah, I see why you want to PANIC.

>> Moreover, POSIX is entirely clear that successful fsync means all
>> preceding writes for the file have been completed, full stop, doesn't
>> matter when they were issued.
>
> I can't find anything that says so to me. Please quote relevant spec.
>
> I'm working from
> http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html which
> states that
>
> "The fsync() function shall request that all data for the open file
> descriptor named by fildes is to be transferred to the storage device
> associated with the file described by fildes. The nature of the transfer is
> implementation-defined. The fsync() function shall not return until the
> system has completed that action or until an error is detected."
>
> My reading is that POSIX does not specify what happens AFTER an error is
> detected. It doesn't say that error has to be persistent and that subsequent
> calls must also report the error. It also says:

FWIW my reading is the same as Tom's. It says "all data for the open file descriptor" without qualification or special treatment after errors. Not "some".

> I'm not seeking to defend what the kernel seems to be doing. Rather, saying
> that we might see similar behaviour on other platforms, crazy or not. I
> haven't looked past linux yet, though.

I see no reason to think that any other operating system would behave that way without strong evidence... This is openly acknowledged to be "a mess" and "a surprise" in the Filesystem Summit article. I am not really qualified to comment, but from a cursory glance at FreeBSD's vfs_bio.c I think it's doing what you'd hope for... see the code near the comment "Failed write, redirty."

--
Thomas Munro
http://www.enterprisedb.com

On 29 March 2018 at 20:07, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> On Thu, Mar 29, 2018 at 6:58 PM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
> > On 28 March 2018 at 11:53, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > >
> > > Craig Ringer <craig(at)2ndquadrant(dot)com> writes:
> > > > TL;DR: Pg should PANIC on fsync() EIO return.
> > >
> > > Surely you jest.
> >
> > No. I'm quite serious. Worse, we quite possibly have to do it for ENOSPC as
> > well to avoid similar lost-page-write issues.
>
> I found your discussion with kernel hacker Jeff Layton at
> https://lwn.net/Articles/718734/ in which he said: "The stackoverflow
> writeup seems to want a scheme where pages stay dirty after a
> writeback failure so that we can try to fsync them again. Note that
> that has never been the case in Linux after hard writeback failures,
> AFAIK, so programs should definitely not assume that behavior."
>
> The article above that says the same thing a couple of different ways,
> ie that writeback failure leaves you with pages that are neither
> written to disk successfully nor marked dirty.
>
> If I'm reading various articles correctly, the situation was even
> worse before his errseq_t stuff landed. That fixed cases of
> completely unreported writeback failures due to sharing of PG_error
> for both writeback and read errors with certain filesystems, but it
> doesn't address the clean pages problem.
>
> Yeah, I see why you want to PANIC.

In more ways than one ;)

> > I'm not seeking to defend what the kernel seems to be doing. Rather, saying
> > that we might see similar behaviour on other platforms, crazy or not. I
> > haven't looked past linux yet, though.
>
> I see no reason to think that any other operating system would behave
> that way without strong evidence... This is openly acknowledged to be
> "a mess" and "a surprise" in the Filesystem Summit article. I am not
> really qualified to comment, but from a cursory glance at FreeBSD's
> vfs_bio.c I think it's doing what you'd hope for... see the code near
> the comment "Failed write, redirty."

Ok, that's reassuring, but doesn't help us on the platform the great majority of users deploy on :(

"If on Linux, PANIC"

Hrm.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



From: Catalin Iacob <iacobcatalin(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Craig Ringer <craig(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-03-29 16:20:00
Message-ID: CAHg_5gqXwiun=inh=2QomnvqvRYb_jrYcnfiEekZ=hQKbY2XBA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Mar 29, 2018 at 2:07 PM, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> I found your discussion with kernel hacker Jeff Layton at
> https://lwn.net/Articles/718734/ in which he said: "The stackoverflow
> writeup seems to want a scheme where pages stay dirty after a
> writeback failure so that we can try to fsync them again. Note that
> that has never been the case in Linux after hard writeback failures,
> AFAIK, so programs should definitely not assume that behavior."

And a bit below in the same comments, to this question about PG: "So, what are the options at this point? The assumption was that we can repeat the fsync (which as you point out is not the case), or shut down the database and perform recovery from WAL", the same Jeff Layton seems to agree PANIC is the appropriate response:

"Replaying the WAL synchronously sounds like the simplest approach when you get an error on fsync. These are uncommon occurrences for the most part, so having to fall back to slow, synchronous error recovery modes when this occurs is probably what you want to do."

And right after, he confirms the errseq_t patches are about always detecting this, not more:

"The main thing I'm working on is to better guarantee that you actually get an error when this occurs rather than silently corrupting your data. The circumstances where that can occur require some corner-cases, but I think we need to make sure that it doesn't occur."

Jeff's comments in the pull request that merged errseq_t are worth reading as well:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=088737f44bbf6378745f5b57b035e57ee3dc4750

> The article above that says the same thing a couple of different ways,
> ie that writeback failure leaves you with pages that are neither
> written to disk successfully nor marked dirty.
>
> If I'm reading various articles correctly, the situation was even
> worse before his errseq_t stuff landed. That fixed cases of
> completely unreported writeback failures due to sharing of PG_error
> for both writeback and read errors with certain filesystems, but it
> doesn't address the clean pages problem.

Indeed, that's exactly how I read it as well (opinion formed independently before reading your sentence above). The errseq_t patches landed in v4.13 by the way, so very recently.

> Yeah, I see why you want to PANIC.

Indeed. Even doing that leaves question marks about all the kernel versions before v4.13, which at this point is pretty much everything out there, not even detecting this reliably. This is messy.

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Catalin Iacob <iacobcatalin(at)gmail(dot)com>
Cc: Craig Ringer <craig(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-03-29 21:18:14
Message-ID: CAEepm=1KFaVPdOxYkP6bmtevOZHfdHTNf8bjZWSkJxoxy0X+7A@mail.gmail.com
Lists: pgsql-hackers

On Fri, Mar 30, 2018 at 5:20 AM, Catalin Iacob <iacobcatalin(at)gmail(dot)com> wrote:
> Jeff's comments in the pull request that merged errseq_t are worth
> reading as well:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=088737f44bbf6378745f5b57b035e57ee3dc4750

Wow. It looks like there may be a separate question of when each filesystem adopted this new infrastructure?

>> Yeah, I see why you want to PANIC.
>
> Indeed. Even doing that leaves question marks about all the kernel
> versions before v4.13, which at this point is pretty much everything
> out there, not even detecting this reliably. This is messy.

The pre-errseq_t problems are beyond our control. There's nothing we can do about that in userspace (except perhaps abandon OS-buffered IO, a big project). We just need to be aware that this problem exists in certain kernel versions and be grateful to Layton for fixing it.

The dropped dirty flag problem is something we can and in my view should do something about, whatever we might think about that design choice. As Andrew Gierth pointed out to me in an off-list chat about this, by the time you've reached this state, both PostgreSQL's buffer and the kernel's buffer are clean and might be reused for another block at any time, so your data might be gone from the known universe -- we don't even have the option to rewrite our buffers in general. Recovery is the only option.

Thank you to Craig for chasing this down, and +1 for his proposal, on Linux only.

--
Thomas Munro
http://www.enterprisedb.com

From: Anthony Iliopoulos <ailiop(at)altatus(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Catalin Iacob <iacobcatalin(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, ailiop(at)altatus(dot)com
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-03-31 13:24:28
Message-ID: 20180331132428.GA31905@technoir
Lists: pgsql-hackers

On Fri, Mar 30, 2018 at 10:18:14AM +1300, Thomas Munro wrote:

> >> Yeah, I see why you want to PANIC.

> >

> > Indeed. Even doing that leaves question marks about all the kernel

> > versions before v4.13, which at this point is pretty much everything

> > out there, not even detecting this reliably. This is messy.

There may still be a way to reliably detect this on older kernel
versions from userspace, but it will be messy in any case. On EIO

errors, the kernel will not restore the dirty page flags, but it

will flip the error flags on the failed pages. One could mmap()

the file in question, obtain the PFNs (via /proc/pid/pagemap)

and enumerate those to match the ones with the error flag switched

on (via /proc/kpageflags). This could serve at least as a detection

mechanism, but one could also further use this info to logically

map the pages that failed IO back to the original file offsets,

and potentially retry IO just for those file ranges that cover

the failed pages. Just an idea, not tested.

Best regards,

Anthony
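The pagemap-based detection idea above can be sketched as follows. This is a hypothetical, Linux-only illustration (untested against a real EIO, exactly as Anthony says of the idea itself): map a file and read the matching 64-bit entries from /proc/pid/pagemap. Unprivileged readers see the PFN bits zeroed on modern kernels, and matching PFNs against /proc/kpageflags additionally requires root, so the sketch stops at the pagemap read.

```python
# Hypothetical sketch of the detection idea: read /proc/self/pagemap
# entries for a file mapping. One 64-bit little-endian entry per virtual
# page; bit 63 = page present in RAM, bits 0-54 = PFN (zeroed for
# unprivileged readers on modern kernels). The /proc/kpageflags matching
# step (root-only) is left out.
import ctypes
import mmap
import struct
import sys
import tempfile

def pagemap_entries(m, npages):
    """Return the raw pagemap entry for each page of mapping m ([] off-Linux)."""
    if sys.platform != "linux":
        return []
    # Virtual address of the mapping, needed to index into pagemap.
    buf = ctypes.c_char.from_buffer(m)
    addr = ctypes.addressof(buf)
    del buf  # release the buffer export so the mmap can be closed later
    entries = []
    with open("/proc/self/pagemap", "rb") as pm:
        pm.seek((addr // mmap.PAGESIZE) * 8)  # one 8-byte entry per page
        for _ in range(npages):
            (entry,) = struct.unpack("<Q", pm.read(8))
            entries.append(entry)
    return entries

if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        f.write(b"\0" * mmap.PAGESIZE)
        f.flush()
        m = mmap.mmap(f.fileno(), mmap.PAGESIZE)
        m[0] = 1  # touch the page so it is resident (and dirty)
        for entry in pagemap_entries(m, 1):
            print("present:", bool(entry >> 63 & 1))
```

Mapping the interesting entries back to file offsets would then be simple arithmetic on the entry index, per Anthony's suggestion.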

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-03-31 16:13:09
Message-ID: CAMsr+YHczzQJPGr94Y_Zw34Yzuw8UkzmxEB9eWuFaALRSxY-pA@mail.gmail.com
Lists: pgsql-hackers

On 31 March 2018 at 21:24, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:

> On Fri, Mar 30, 2018 at 10:18:14AM +1300, Thomas Munro wrote:

>

> > >> Yeah, I see why you want to PANIC.

> > >

> > > Indeed. Even doing that leaves question marks about all the kernel

> > > versions before v4.13, which at this point is pretty much everything

> > > out there, not even detecting this reliably. This is messy.

>

> There may still be a way to reliably detect this on older kernel

> versions from userspace, but it will be messy whatsoever. On EIO

> errors, the kernel will not restore the dirty page flags, but it

> will flip the error flags on the failed pages. One could mmap()

> the file in question, obtain the PFNs (via /proc/pid/pagemap)

> and enumerate those to match the ones with the error flag switched

> on (via /proc/kpageflags). This could serve at least as a detection

> mechanism, but one could also further use this info to logically

> map the pages that failed IO back to the original file offsets,

> and potentially retry IO just for those file ranges that cover

> the failed pages. Just an idea, not tested.

That sounds like a huge amount of complexity, with uncertainty as to how
it'll behave kernel-to-kernel, for negligible benefit.

I was exploring the idea of doing selective recovery of one relfilenode,

based on the assumption that we know the filenode related to the fd that

failed to fsync(). We could redo only WAL on that relation. But it fails

the same test: it's too complex for a niche case that shouldn't happen in

the first place, so it'll probably have bugs, or grow bugs in bitrot over

time.

Remember, if you're on ext4 with errors=remount-ro, you get shut down even

harder than a PANIC. So we should just use the big hammer here.

I'll send a patch this week.

--
Craig Ringer
http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Training & Services



From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Anthony Iliopoulos <ailiop(at)altatus(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-03-31 16:38:12
Message-ID: 10852.1522514292@sss.pgh.pa.us
Lists: pgsql-hackers

Craig Ringer <craig(at)2ndquadrant(dot)com> writes:

> So we should just use the big hammer here.

And bitch, loudly and publicly, about how broken this kernel behavior is.

If we make enough of a stink maybe it'll get fixed.

regards, tom lane

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Craig Ringer <craig(at)2ndquadrant(dot)com>, Anthony Iliopoulos <ailiop(at)altatus(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-01 00:20:38
Message-ID: 20180401002038.GA2211@paquier.xyz
Lists: pgsql-hackers

On Sat, Mar 31, 2018 at 12:38:12PM -0400, Tom Lane wrote:

> Craig Ringer <craig(at)2ndquadrant(dot)com> writes:

>> So we should just use the big hammer here.

>

> And bitch, loudly and publicly, about how broken this kernel behavior is.

> If we make enough of a stink maybe it'll get fixed.

That won't fix anything released already, so as per the information

gathered something has to be done anyway. The discussion of this thread

is spreading quite a lot actually.

Handling things at a low level looks like a better plan for the backend.

Tools like pg_basebackup and pg_dump also issue fsync's on the data

created, we should do an equivalent for them, with some exit() calls in

file_utils.c. As of now failures are logged to stderr but not

considered fatal.

--

Michael



From: Anthony Iliopoulos <ailiop(at)altatus(dot)com>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, ailiop(at)altatus(dot)com
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-01 00:58:22
Message-ID: 20180401005822.GJ11627@technoir
Lists: pgsql-hackers

On Sun, Apr 01, 2018 at 12:13:09AM +0800, Craig Ringer wrote:

> On 31 March 2018 at 21:24, Anthony Iliopoulos <ailiop(at)altatus(dot)com>

> wrote:

>

> On Fri, Mar 30, 2018 at 10:18:14AM +1300, Thomas Munro wrote:

>

> > >> Yeah, I see why you want to PANIC.

> > >

> > > Indeed. Even doing that leaves question marks about all the kernel

> > > versions before v4.13, which at this point is pretty much everything

> > > out there, not even detecting this reliably. This is messy.

>

> There may still be a way to reliably detect this on older kernel

> versions from userspace, but it will be messy whatsoever. On EIO

> errors, the kernel will not restore the dirty page flags, but it

> will flip the error flags on the failed pages. One could mmap()

> the file in question, obtain the PFNs (via /proc/pid/pagemap)

> and enumerate those to match the ones with the error flag switched

> on (via /proc/kpageflags). This could serve at least as a detection

> mechanism, but one could also further use this info to logically

> map the pages that failed IO back to the original file offsets,

> and potentially retry IO just for those file ranges that cover

> the failed pages. Just an idea, not tested.

>

> That sounds like a huge amount of complexity, with uncertainty as to how

> it'll behave kernel-to-kernel, for negligible benefit.

Those interfaces have been around since the kernel 2.6 times and are

rather stable, but I was merely responding to your original post comment

regarding having a way of finding out which page(s) failed. I assume

that indeed there would be no benefit, especially since those errors

are usually not transient (typically they come from hard medium faults),

and although a filesystem could theoretically mask the error by allocating

a different logical block, I am not aware of any implementation that

currently does that.

> I was exploring the idea of doing selective recovery of one relfilenode,

> based on the assumption that we know the filenode related to the fd that

> failed to fsync(). We could redo only WAL on that relation. But it fails

> the same test: it's too complex for a niche case that shouldn't happen in

> the first place, so it'll probably have bugs, or grow bugs in bitrot over

> time.

Fully agree, those cases should be sufficiently rare that a complex
and possibly non-maintainable solution is not really warranted.

> Remember, if you're on ext4 with errors=remount-ro, you get shut down even

> harder than a PANIC. So we should just use the big hammer here.

I am not entirely sure what you mean here; does Pg really treat write()

errors as fatal? Also, the kind of errors that ext4 detects with this

option is at the superblock level and govern metadata rather than actual

data writes (recall that those are buffered anyway, no actual device IO

has to take place at the time of write()).

Best regards,

Anthony

From: Anthony Iliopoulos <ailiop(at)altatus(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Craig Ringer <craig(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, ailiop(at)altatus(dot)com
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-01 01:14:46
Message-ID: 20180401011446.GK11627@technoir
Lists: pgsql-hackers

On Sat, Mar 31, 2018 at 12:38:12PM -0400, Tom Lane wrote:

> Craig Ringer <craig(at)2ndquadrant(dot)com> writes:

> > So we should just use the big hammer here.

>

> And bitch, loudly and publicly, about how broken this kernel behavior is.

> If we make enough of a stink maybe it'll get fixed.

It is not likely to be fixed (beyond what has been done already with the

manpage patches and errseq_t fixes on the reporting level). The issue is,

the kernel needs to deal with hard IO errors at that level somehow, and

since those errors typically persist, re-dirtying the pages would not

really solve the problem (unless some filesystem remaps the request to a

different block, assuming the device is alive). Keeping around dirty

pages that cannot possibly be written out is essentially a memory leak,

as those pages would stay around even after the application has exited.

Best regards,

Anthony

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Catalin Iacob <iacobcatalin(at)gmail(dot)com>
Cc: Craig Ringer <craig(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-01 18:24:51
Message-ID: CAEepm=0B9f0O7jLE3ipUTqC3V6NO2LNbwE9Hp=3BxGbZPqEyQg@mail.gmail.com
Lists: pgsql-hackers

On Fri, Mar 30, 2018 at 10:18 AM, Thomas Munro

<thomas(dot)munro(at)enterprisedb(dot)com> wrote:

> ... on Linux only.

Apparently I was too optimistic. I had looked only at FreeBSD, which

keeps the page around and dirties it so we can retry, but the other

BSDs apparently don't (FreeBSD changed that in 1999). From what I can

tell from the sources below, we have:

Linux, OpenBSD, NetBSD: retrying fsync() after EIO lies
FreeBSD, Illumos: retrying fsync() after EIO tells the truth

Maybe my drive-by assessment of those kernel routines is wrong and

someone will correct me, but I'm starting to think you might be better

to assume the worst on all systems. Perhaps a GUC that defaults to

panicking, so that users on those rare OSes could turn that off? Even

then I'm not sure if the failure mode will be that great anyway or if

it's worth having two behaviours. Thoughts?

http://mail-index.netbsd.org/netbsd-users/2018/03/30/msg020576.html

https://github.com/NetBSD/src/blob/trunk/sys/kern/vfs_bio.c#L1059

https://github.com/openbsd/src/blob/master/sys/kern/vfs_bio.c#L867

https://github.com/freebsd/freebsd/blob/master/sys/kern/vfs_bio.c#L2631

https://github.com/freebsd/freebsd/commit/e4e8fec98ae986357cdc208b04557dba55a59266

https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/os/bio.c#L441

--

Thomas Munro

http://www.enterprisedb.com

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Catalin Iacob <iacobcatalin(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-02 15:03:42
Message-ID: CAMsr+YHtosoQKzHh-nAmyG75cAPTzTtwyk871d+1O-sNQRdeyg@mail.gmail.com
Lists: pgsql-hackers

On 2 April 2018 at 02:24, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>

wrote:

>

> Maybe my drive-by assessment of those kernel routines is wrong and

> someone will correct me, but I'm starting to think you might be better

> to assume the worst on all systems. Perhaps a GUC that defaults to

> panicking, so that users on those rare OSes could turn that off? Even

> then I'm not sure if the failure mode will be that great anyway or if

> it's worth having two behaviours. Thoughts?

>

>

I see little benefit to not just PANICing unconditionally on EIO, really.

It shouldn't happen, and if it does, we want to be pretty conservative and

adopt a data-protective approach.

I'm rather more worried by doing it on ENOSPC, which looks like it might be

necessary from what I recall finding in my test case + kernel code reading.

I really don't want to respond to a possibly-transient ENOSPC by PANICing

the whole server unnecessarily.

BTW, the support team at 2ndQ is presently working on two separate issues

where ENOSPC resulted in DB corruption, though neither of them involve logs

of lost page writes. I'm planning on taking some time tomorrow to write a

torture tester for Pg's ENOSPC handling and to verify ENOSPC handling in

the test case I linked to in my original StackOverflow post.

If this is just an EIO issue then I see no point doing anything other than
PANICing unconditionally.

If it's a concern for ENOSPC too, we should try harder to fail more nicely
whenever we possibly can.

--
Craig Ringer
http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Training & Services



From: Andres Freund <andres(at)anarazel(dot)de>
To: Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-02 18:13:46
Message-ID: 20180402181345.h6j4z5agkjccr2vh@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2018-04-01 03:14:46 +0200, Anthony Iliopoulos wrote:

> On Sat, Mar 31, 2018 at 12:38:12PM -0400, Tom Lane wrote:

> > Craig Ringer <craig(at)2ndquadrant(dot)com> writes:

> > > So we should just use the big hammer here.

> >

> > And bitch, loudly and publicly, about how broken this kernel behavior is.

> > If we make enough of a stink maybe it'll get fixed.

>

> It is not likely to be fixed (beyond what has been done already with the

> manpage patches and errseq_t fixes on the reporting level). The issue is,

> the kernel needs to deal with hard IO errors at that level somehow, and

> since those errors typically persist, re-dirtying the pages would not

> really solve the problem (unless some filesystem remaps the request to a

> different block, assuming the device is alive).

Throwing away the dirty pages *and* persisting the error seems a lot
more reasonable. Then provide a fcntl (or whatever) extension that can
clear the error status in the few cases where the application wants
to gracefully deal with the case.

> Keeping around dirty

> pages that cannot possibly be written out is essentially a memory leak,

> as those pages would stay around even after the application has exited.

Why do dirty pages need to be kept around in the case of persistent

errors? I don't think the lack of automatic recovery in that case is

what anybody is complaining about. It's that the error goes away and

there's no reasonable way to separate out such an error from some

potential transient errors.

Greetings,

Andres Freund

From: Anthony Iliopoulos <ailiop(at)altatus(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, ailiop(at)altatus(dot)com
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-02 18:53:20
Message-ID: 20180402185320.GM11627@technoir
Lists: pgsql-hackers

On Mon, Apr 02, 2018 at 11:13:46AM -0700, Andres Freund wrote:

> Hi,

>

> On 2018-04-01 03:14:46 +0200, Anthony Iliopoulos wrote:

> > On Sat, Mar 31, 2018 at 12:38:12PM -0400, Tom Lane wrote:

> > > Craig Ringer <craig(at)2ndquadrant(dot)com> writes:

> > > > So we should just use the big hammer here.

> > >

> > > And bitch, loudly and publicly, about how broken this kernel behavior is.

> > > If we make enough of a stink maybe it'll get fixed.

> >

> > It is not likely to be fixed (beyond what has been done already with the

> > manpage patches and errseq_t fixes on the reporting level). The issue is,

> > the kernel needs to deal with hard IO errors at that level somehow, and

> > since those errors typically persist, re-dirtying the pages would not

> > really solve the problem (unless some filesystem remaps the request to a

> > different block, assuming the device is alive).

>

> Throwing away the dirty pages *and* persisting the error seems a lot

> more reasonable. Then provide a fcntl (or whatever) extension that can

> clear the error status in the few cases that the application that wants

> > > to gracefully deal with the case.

Given precisely that the dirty pages which cannot be written out are

practically thrown away, the semantics of fsync() (after the 4.13 fixes)

are essentially correct: the first call indicates that a writeback error

indeed occurred, while subsequent calls have no reason to indicate an error

(assuming no other errors occurred in the meantime).

The error reporting is thus consistent with the intended semantics (which

are sadly not properly documented). Repeated calls to fsync() simply do not

imply that the kernel will retry writeback of the previously-failed pages,

so the application needs to be aware of that. Persisting the error at the

fsync() level would essentially mean moving application policy into the

kernel.

Best regards,

Anthony

From: Andres Freund <andres(at)anarazel(dot)de>
To: Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-02 19:32:45
Message-ID: 20180402193245.urpdavk3wtaycxfz@alap3.anarazel.de
Lists: pgsql-hackers

On 2018-04-02 20:53:20 +0200, Anthony Iliopoulos wrote:

> On Mon, Apr 02, 2018 at 11:13:46AM -0700, Andres Freund wrote:

> > Throwing away the dirty pages *and* persisting the error seems a lot

> > more reasonable. Then provide a fcntl (or whatever) extension that can

> > clear the error status in the few cases that the application that wants

> > to gracefully deal with the case.

>

> Given precisely that the dirty pages which cannot been written-out are

> practically thrown away, the semantics of fsync() (after the 4.13 fixes)

> are essentially correct: the first call indicates that a writeback error

> indeed occurred, while subsequent calls have no reason to indicate an error

> (assuming no other errors occurred in the meantime).

Meh^2.

"no reason" - except that there's absolutely no way to know what state

the data is in. And that your application needs explicit handling of

such failures. And that one FD might be used in a lots of different

parts of the application, that fsyncs in one part of the application

might be an ok failure, and in another not. Requiring explicit actions

to acknowledge "we've thrown away your data for unknown reason" seems

entirely reasonable.

> The error reporting is thus consistent with the intended semantics (which

> are sadly not properly documented). Repeated calls to fsync() simply do not

> imply that the kernel will retry to writeback the previously-failed pages,

> so the application needs to be aware of that.

Which isn't what I've suggested.

> Persisting the error at the fsync() level would essentially mean
> moving application policy into the kernel.

Meh.

Greetings,

Andres Freund

From: Anthony Iliopoulos <ailiop(at)altatus(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, ailiop(at)altatus(dot)com
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-02 20:38:06
Message-ID: 20180402203806.GN11627@technoir
Lists: pgsql-hackers

On Mon, Apr 02, 2018 at 12:32:45PM -0700, Andres Freund wrote:

> On 2018-04-02 20:53:20 +0200, Anthony Iliopoulos wrote:

> > On Mon, Apr 02, 2018 at 11:13:46AM -0700, Andres Freund wrote:

> > > Throwing away the dirty pages *and* persisting the error seems a lot

> > > more reasonable. Then provide a fcntl (or whatever) extension that can

> > > clear the error status in the few cases that the application that wants

> > > to gracefully deal with the case.

> >

> > Given precisely that the dirty pages which cannot been written-out are

> > practically thrown away, the semantics of fsync() (after the 4.13 fixes)

> > are essentially correct: the first call indicates that a writeback error

> > indeed occurred, while subsequent calls have no reason to indicate an error

> > (assuming no other errors occurred in the meantime).

>

> Meh^2.

>

> "no reason" - except that there's absolutely no way to know what state

> the data is in. And that your application needs explicit handling of

> such failures. And that one FD might be used in a lots of different

> parts of the application, that fsyncs in one part of the application

> might be an ok failure, and in another not. Requiring explicit actions

> to acknowledge "we've thrown away your data for unknown reason" seems

> entirely reasonable.

As long as fsync() indicates an error on first invocation, the application

is fully aware that between this point of time and the last call to fsync()

data has been lost. Persisting this error any further does not change this

or add any new info - on the contrary it adds confusion as subsequent write()s

and fsync()s on other pages can succeed, but will be reported as failures.

The application will need to deal with that first error irrespective of

subsequent return codes from fsync(). Conceptually every fsync() invocation

demarcates an epoch for which it reports potential errors, so the caller

needs to take responsibility for that particular epoch.

Callers that are not affected by the potential outcome of fsync() and

do not react on errors, have no reason for calling it in the first place

(and thus masking failure from subsequent callers that may indeed care).

Best regards,

Anthony
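The "epoch" semantics described above can be modeled in a few lines. This is a pure simulation of the post-4.13 report-and-forget behaviour (no real I/O), showing why a caller that retries fsync() after a failure gets a success report even though the data never reached disk:

```python
# Toy model of report-and-forget fsync semantics: each fsync() reports
# errors from the epoch since the previous fsync(), then clears the flag.
# Pure simulation; no real kernel or disk involved.
class ToyFile:
    def __init__(self):
        self.error_pending = False
        self.lost_pages = 0

    def writeback_fails(self):
        """Simulate an async writeback failure: page dropped, error flagged."""
        self.error_pending = True
        self.lost_pages += 1

    def fsync(self):
        """Report (and then forget) any error from the current epoch."""
        ok = not self.error_pending
        self.error_pending = False  # report-and-forget
        return ok

f = ToyFile()
f.writeback_fails()
first = f.fsync()   # False: the epoch's writeback error is reported once
second = f.fsync()  # True: flag cleared, yet the page is still lost
```

Under these semantics a retry loop that treats the second call's success as "checkpoint complete" is exactly the data-loss scenario from the start of the thread.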

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-02 20:58:08
Message-ID: 20180402205808.GZ24540@tamriel.snowman.net
Lists: pgsql-hackers

Greetings,

* Anthony Iliopoulos (ailiop(at)altatus(dot)com) wrote:

> On Mon, Apr 02, 2018 at 12:32:45PM -0700, Andres Freund wrote:

> > On 2018-04-02 20:53:20 +0200, Anthony Iliopoulos wrote:

> > > On Mon, Apr 02, 2018 at 11:13:46AM -0700, Andres Freund wrote:

> > > > Throwing away the dirty pages *and* persisting the error seems a lot

> > > > more reasonable. Then provide a fcntl (or whatever) extension that can

> > > > clear the error status in the few cases that the application that wants

> > > > to gracefully deal with the case.

> > >

> > > Given precisely that the dirty pages which cannot been written-out are

> > > practically thrown away, the semantics of fsync() (after the 4.13 fixes)

> > > are essentially correct: the first call indicates that a writeback error

> > > indeed occurred, while subsequent calls have no reason to indicate an error

> > > (assuming no other errors occurred in the meantime).

> >

> > Meh^2.

> >

> > "no reason" - except that there's absolutely no way to know what state

> > the data is in. And that your application needs explicit handling of

> > such failures. And that one FD might be used in a lots of different

> > parts of the application, that fsyncs in one part of the application

> > might be an ok failure, and in another not. Requiring explicit actions

> > to acknowledge "we've thrown away your data for unknown reason" seems

> > entirely reasonable.

>

> As long as fsync() indicates error on first invocation, the application

> is fully aware that between this point of time and the last call to fsync()

> data has been lost. Persisting this error any further does not change this

> or add any new info - on the contrary it adds confusion as subsequent write()s

> and fsync()s on other pages can succeed, but will be reported as failures.

fsync() doesn't reflect the status of given pages, however; it reflects

the status of the file descriptor, and as such the file, on which it's

called. This notion that fsync() is actually only responsible for the

changes which were made to a file since the last fsync() call is pure

foolishness. If we were able to pass a list of pages or data ranges to

fsync() for it to verify they're on disk then perhaps things would be

different, but we can't, all we can do is ask to "please flush all the

dirty pages associated with this file descriptor, which represents this

file we opened, to disk, and let us know if you were successful."

Give us a way to ask "are these specific pages written out to persistent

storage?" and we would certainly be happy to use it, and to repeatedly

try to flush out pages which weren't synced to disk due to some

transient error, and to track those cases and make sure that we don't

incorrectly assume that they've been transferred to persistent storage.

> The application will need to deal with that first error irrespective of

> subsequent return codes from fsync(). Conceptually every fsync() invocation

> demarcates an epoch for which it reports potential errors, so the caller

> needs to take responsibility for that particular epoch.

We do deal with that error: by realizing that it failed and later

*retrying* the fsync(), which is when we get back an "all good!

everything with this file descriptor you've opened is sync'd!" and

happily expect that to be truth, when, in reality, it's an unfortunate

lie and there are still pages associated with that file descriptor which

are, in reality, dirty and not sync'd to disk.

Consider two independent programs where the first one writes to a file

and then calls the second one whose job it is to go out and fsync(),

perhaps async from the first, those files. Is the second program

supposed to go write to each page that the first one wrote to, in order

to ensure that all the dirty bits are set so that the fsync() will

actually return if all the dirty pages are written?

> Callers that are not affected by the potential outcome of fsync() and

> do not react on errors, have no reason for calling it in the first place

> (and thus masking failure from subsequent callers that may indeed care).

Reacting to an error from an fsync() call could, based on how it's

documented and actually implemented in other OS's, mean "run another

fsync() to see if the error has resolved itself." Requiring that to

mean "you have to go dirty all of the pages you previously dirtied to

actually get a subsequent fsync() to do anything" is really just not

reasonable- a given program may have no idea what was written to

previously nor any particular reason to need to know, on the expectation

that the fsync() call will flush any dirty pages, as it's documented to

do.

Thanks!

Stephen
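The two-independent-programs scenario above is the pattern used by stand-alone sync tools: the writer never calls fsync(), and a second process later opens the same path and fsyncs it, relying on fsync() flushing all dirty pages of the file rather than only pages that process itself dirtied. A minimal happy-path sketch (paths and helper names are illustrative):

```python
# Sketch of the two independent programs: process one writes without
# syncing; process two opens the same path and calls fsync(), expecting it
# to cover pages it never wrote itself. Happy path only; the thread's point
# is that after a writeback error this second fsync() can falsely succeed.
import os
import subprocess
import sys
import tempfile

def sync_written_by_someone_else(path):
    """Open a file we did not write and fsync whatever dirty pages it has."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)

if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "data")
    # "Program one": a child process writes the file and exits, no fsync.
    subprocess.run(
        [sys.executable, "-c", f"open({path!r}, 'wb').write(b'payload')"],
        check=True)
    # "Program two": we sync it ourselves.
    sync_written_by_someone_else(path)
```

Requiring program two to re-dirty every page first, as the report-and-forget semantics effectively would after an error, is the unreasonable demand Stephen is objecting to.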



From: Anthony Iliopoulos <ailiop(at)altatus(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, ailiop(at)altatus(dot)com
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-02 23:05:44
Message-ID: 20180402230543.GO11627@technoir
Lists: pgsql-hackers

Hi Stephen,

On Mon, Apr 02, 2018 at 04:58:08PM -0400, Stephen Frost wrote:

>

> fsync() doesn't reflect the status of given pages, however, it reflects

> the status of the file descriptor, and as such the file, on which it's

> called. This notion that fsync() is actually only responsible for the

> changes which were made to a file since the last fsync() call is pure

> foolishness. If we were able to pass a list of pages or data ranges to

> fsync() for it to verify they're on disk then perhaps things would be

> different, but we can't, all we can do is ask to "please flush all the

> dirty pages associated with this file descriptor, which represents this

> file we opened, to disk, and let us know if you were successful."

>

> Give us a way to ask "are these specific pages written out to persistent

> storage?" and we would certainly be happy to use it, and to repeatedly

> try to flush out pages which weren't synced to disk due to some

> transient error, and to track those cases and make sure that we don't

> incorrectly assume that they've been transferred to persistent storage.

Indeed fsync() is simply a rather blunt instrument and a narrow legacy

interface but further changing its established semantics (no matter how

unreasonable they may be) is probably not the way to go.

Would using sync_file_range() be helpful? Potential errors would only
apply to pages that cover the requested file ranges. There are a few
caveats though:

(a) it still messes with the top-level error reporting so mixing it
    with callers that use fsync() and do care about errors will produce
    the same issue (clearing the error status).

(b) the error-reporting granularity is coarse (failure reporting applies
    to the entire requested range so you still don't know which particular
    pages/file sub-ranges failed writeback)

(c) the same "report and forget" semantics apply to repeated invocations
    of the sync_file_range() call, so again action will need to be taken
    upon first error encountered for the particular ranges.

> > The application will need to deal with that first error irrespective of

> > subsequent return codes from fsync(). Conceptually every fsync() invocation

> > demarcates an epoch for which it reports potential errors, so the caller

> > needs to take responsibility for that particular epoch.

>

> We do deal with that error- by realizing that it failed and later

> *retrying* the fsync(), which is when we get back an "all good!

> everything with this file descriptor you've opened is sync'd!" and

> happily expect that to be truth, when, in reality, it's an unfortunate

> lie and there are still pages associated with that file descriptor which

> are, in reality, dirty and not sync'd to disk.

It really turns out that this is not how the fsync() semantics work

though, exactly because of the nature of the errors: even if the kernel

retained the dirty bits on the failed pages, retrying persisting them

on the same disk location would simply fail. Instead the kernel opts

for marking those pages clean (since there is no other recovery

strategy), and reporting once to the caller who can potentially deal

with it in some manner. It is sadly a bad and undocumented convention.

> Consider two independent programs where the first one writes to a file

> and then calls the second one whose job it is to go out and fsync(),

> perhaps async from the first, those files. Is the second program

> supposed to go write to each page that the first one wrote to, in order

> to ensure that all the dirty bits are set so that the fsync() will

> actually return if all the dirty pages are written?

I think what you have in mind are the semantics of sync() rather

than fsync(), but as long as an application needs to ensure data

are persisted to storage, it needs to retain those data in its heap

until fsync() is successful instead of discarding them and relying

on the kernel after write(). The pattern should be roughly like:

write() -> fsync() -> free(), rather than write() -> free() -> fsync().
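The write() -> fsync() -> free() ordering described above can be sketched in C (an illustrative sketch only, not PostgreSQL code; `durable_write` is an invented name):

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Report success only after fsync() confirms the data is on stable
 * storage.  The caller must keep `buf` alive until this returns 0:
 * on failure the kernel may already have dropped the dirty pages,
 * so the caller's copy is the only one left and can be written to a
 * different location instead.
 */
int durable_write(const char *path, const char *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    if (write(fd, buf, len) != (ssize_t) len || fsync(fd) != 0) {
        int saved = errno;      /* close() may clobber errno */
        close(fd);
        errno = saved;
        return -1;              /* caller still owns buf: retry elsewhere */
    }
    return close(fd);
}
```

The point of the ordering is the failure path: once write() has returned, the application's own buffer, not the page cache, is the only copy it can rely on.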

For example, if a partition gets full upon fsync(), then the application

has a chance to persist the data in a different location, while

the kernel cannot possibly make this decision and recover.

> > Callers that are not affected by the potential outcome of fsync() and

> > do not react on errors, have no reason for calling it in the first place

> > (and thus masking failure from subsequent callers that may indeed care).

>

> Reacting on an error from an fsync() call could, based on how it's

> documented and actually implemented in other OS's, mean "run another

> fsync() to see if the error has resolved itself." Requiring that to

> mean "you have to go dirty all of the pages you previously dirtied to

> actually get a subsequent fsync() to do anything" is really just not

> reasonable- a given program may have no idea what was written to

> previously nor any particular reason to need to know, on the expectation

> that the fsync() call will flush any dirty pages, as it's documented to

> do.

I think we are conflating a few issues here: having the OS kernel be

responsible for error recovery (so that subsequent fsync() would fix

the problems) is one. This clearly is a design which most kernels have

not really adopted for reasons outlined above (although having the FS

layer recovering from hard errors transparently is open for discussion

from what it seems [1]). Now, there is the issue of granularity of

error reporting: userspace could benefit from a fine-grained indication

of failed pages (or file ranges). Another issue is that of reporting

semantics (report and clear), which is also a design choice made to

avoid having higher-resolution error tracking and the corresponding

memory overheads [1].

Best regards,

Anthony

[1] https://lwn.net/Articles/718734/
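Anthony's "epoch" reading implies that an application must treat the first fsync() error as final for everything written so far; a later fsync() returning 0 proves nothing. A sketch of caller-side bookkeeping under that reading (`fsync_epoch` and its flag are invented names):

```c
#include <unistd.h>

/*
 * Under report-and-clear semantics, only the FIRST fsync() failure for
 * a file is trustworthy: the kernel marks the failed pages clean, so a
 * subsequent fsync() can return 0 even though the data never reached
 * disk.  Once a failure is seen, remember it and never let a later
 * "success" overwrite it.
 */
int fsync_epoch(int fd, int *epoch_failed)
{
    if (*epoch_failed)
        return -1;              /* an earlier failure is authoritative */
    if (fsync(fd) != 0) {
        *epoch_failed = 1;      /* later fsync()==0 must not clear this */
        return -1;
    }
    return 0;
}
```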

From: Andres Freund <andres(at)anarazel(dot)de>
To: Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-02 23:23:24
Message-ID: 20180402232324.hdt2345l6srm2upw@alap3.anarazel.de

On 2018-04-03 01:05:44 +0200, Anthony Iliopoulos wrote:

> Would using sync_file_range() be helpful? Potential errors would only

> apply to pages that cover the requested file ranges. There are a few

> caveats though:

To quote sync_file_range(2):

Warning

This system call is extremely dangerous and should not be used in portable programs. None of these operations writes out the

file's metadata. Therefore, unless the application is strictly performing overwrites of already-instantiated disk blocks, there

are no guarantees that the data will be available after a crash. There is no user interface to know if a write is purely an over‐

write. On filesystems using copy-on-write semantics (e.g., btrfs) an overwrite of existing allocated blocks is impossible. When

writing into preallocated space, many filesystems also require calls into the block allocator, which this system call does not

sync out to disk. This system call does not flush disk write caches and thus does not provide any data integrity on systems with

volatile disk write caches.

Given the lack of metadata safety that seems entirely a no go. We use

sfr(2), but only to force the kernel's hand around writing back earlier

without throwing away cache contents.

> > > The application will need to deal with that first error irrespective of

> > > subsequent return codes from fsync(). Conceptually every fsync() invocation

> > > demarcates an epoch for which it reports potential errors, so the caller

> > > needs to take responsibility for that particular epoch.

> >

> > We do deal with that error- by realizing that it failed and later

> > *retrying* the fsync(), which is when we get back an "all good!

> > everything with this file descriptor you've opened is sync'd!" and

> > happily expect that to be truth, when, in reality, it's an unfortunate

> > lie and there are still pages associated with that file descriptor which

> > are, in reality, dirty and not sync'd to disk.

>

> It really turns out that this is not how the fsync() semantics work

> though

Except on freebsd and solaris, and perhaps others.

>, exactly because the nature of the errors: even if the kernel

> retained the dirty bits on the failed pages, retrying persisting them

> on the same disk location would simply fail.

That's not guaranteed at all, think NFS.

> Instead the kernel opts for marking those pages clean (since there is

> no other recovery strategy), and reporting once to the caller who can

> potentially deal with it in some manner. It is sadly a bad and

> undocumented convention.

It's broken behaviour justified post facto with the only rationale that

was available, which explains why it's so unconvincing. You could just

say "this ship has sailed, and it's too onerous to change because xxx"

and this'd be a done deal. But claiming this is reasonable behaviour is

ridiculous.

Again, you could just continue to error for this fd and still throw away

the data.

> > Consider two independent programs where the first one writes to a file

> > and then calls the second one whose job it is to go out and fsync(),

> > perhaps async from the first, those files. Is the second program

> > supposed to go write to each page that the first one wrote to, in order

> > to ensure that all the dirty bits are set so that the fsync() will

> > actually return if all the dirty pages are written?

>

> I think what you have in mind are the semantics of sync() rather

> than fsync()

If you open the same file with two fds, and write with one, and fsync

with another that's definitely supposed to work. And sync() isn't a

realistic replacement in any sort of way because it's obviously

systemwide, and thus entirely and completely unsuitable. Nor does it

have any sort of better error reporting behaviour, does it?

Greetings,

Andres Freund
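Andres's two-descriptor case is worth spelling out, since it is exactly the pattern a per-descriptor "epoch" reading of fsync() would break (a minimal sketch; the file name is arbitrary):

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Open the same file twice, write through one descriptor, fsync()
 * through the other.  fsync() is specified to synchronize the *file*,
 * not some per-descriptor set of pages, so the fsync() on fd_sync is
 * expected to flush (and report errors for) the write done via fd_wr.
 */
int write_one_fd_sync_other(const char *path)
{
    int fd_wr = open(path, O_WRONLY | O_CREAT, 0600);
    int fd_sync = open(path, O_WRONLY);
    int rc = -1;

    if (fd_wr >= 0 && fd_sync >= 0
        && write(fd_wr, "data", 4) == 4
        && fsync(fd_sync) == 0)     /* covers fd_wr's write too */
        rc = 0;

    if (fd_wr >= 0)
        close(fd_wr);
    if (fd_sync >= 0)
        close(fd_sync);
    return rc;
}
```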

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-02 23:27:35
Message-ID: CAMsr+YEp21emoQG_dEqgUxckGNgZ8oQfab0giK6XLTyFbRY02Q@mail.gmail.com

On 3 April 2018 at 07:05, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:

> Hi Stephen,

>

> On Mon, Apr 02, 2018 at 04:58:08PM -0400, Stephen Frost wrote:

> >

> > fsync() doesn't reflect the status of given pages, however, it reflects

> > the status of the file descriptor, and as such the file, on which it's

> > called. This notion that fsync() is actually only responsible for the

> > changes which were made to a file since the last fsync() call is pure

> > foolishness. If we were able to pass a list of pages or data ranges to

> > fsync() for it to verify they're on disk then perhaps things would be

> > different, but we can't, all we can do is ask to "please flush all the

> > dirty pages associated with this file descriptor, which represents this

> > file we opened, to disk, and let us know if you were successful."

> >

> > Give us a way to ask "are these specific pages written out to persistent

> > storage?" and we would certainly be happy to use it, and to repeatedly

> > try to flush out pages which weren't synced to disk due to some

> > transient error, and to track those cases and make sure that we don't

> > incorrectly assume that they've been transferred to persistent storage.

>

> Indeed fsync() is simply a rather blunt instrument and a narrow legacy

> interface but further changing its established semantics (no matter how

> unreasonable they may be) is probably not the way to go.

They're undocumented and extremely surprising semantics that are arguably a

violation of the POSIX spec for fsync(), or at least a surprising

interpretation of it. So I don't buy this argument.

> It really turns out that this is not how the fsync() semantics work

> though, exactly because the nature of the errors: even if the kernel

> retained the dirty bits on the failed pages, retrying persisting them

> on the same disk location would simply fail.

*might* simply fail. It depends on why the error occurred.

I originally identified this behaviour on a multipath system. Multipath

defaults to "throw the writes away, nobody really cares anyway" on error.

It seems to figure a higher level will retry, or the application will

receive the error and retry.

(See no_path_retry in multipath config. AFAICS the default is insanely

dangerous and only suitable for specialist apps that understand the quirks;

you should use no_path_retry=queue).

> Instead the kernel opts

> for marking those pages clean (since there is no other recovery

> strategy), and reporting once to the caller who can potentially deal

> with it in some manner. It is sadly a bad and undocumented convention.

It could mark the FD.

It's not just undocumented, it's a slightly creative interpretation of the

POSIX spec for fsync.

> > Consider two independent programs where the first one writes to a file

> > and then calls the second one whose job it is to go out and fsync(),

> > perhaps async from the first, those files. Is the second program

> > supposed to go write to each page that the first one wrote to, in order

> > to ensure that all the dirty bits are set so that the fsync() will

> > actually return if all the dirty pages are written?

>

> I think what you have in mind are the semantics of sync() rather

> than fsync(), but as long as an application needs to ensure data

> are persisted to storage, it needs to retain those data in its heap

> until fsync() is successful instead of discarding them and relying

> on the kernel after write().

This is almost exactly what we tell application authors using PostgreSQL:

the data isn't written until you receive a successful commit confirmation,

so you'd better not forget it.

We provide applications with *clear boundaries* so they can know *exactly*

what was, and was not, written. I guess the argument from the kernel is the

same is true: whatever was written since the last *successful* fsync is

potentially lost and must be redone.

But the fsync behaviour is utterly undocumented and dubiously standard.

> I think we are conflating a few issues here: having the OS kernel being

> responsible for error recovery (so that subsequent fsync() would fix

> the problems) is one. This clearly is a design which most kernels have

> not really adopted for reasons outlined above

[citation needed]

What do other major platforms do here? The post above suggests it's a bit

of a mix of behaviours.

> Now, there is the issue of granularity of

> error reporting: userspace could benefit from a fine-grained indication

> of failed pages (or file ranges).

Yep. I looked at AIO in the hopes that, if we used AIO, we'd be able to map

a sync failure back to an individual AIO write.

But it seems AIO just adds more problems and fixes none. Flush behaviour

with AIO from what I can tell is inconsistent version to version and

generally unhelpful. The kernel should really report such sync failures

back to the app on its AIO write mapping, but it seems nothing of the sort

happens.

--

Craig Ringer http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Training & Services





> On Apr 2, 2018, at 16:27, Craig Ringer <craig(at)2ndQuadrant(dot)com> wrote:

>

> They're undocumented and extremely surprising semantics that are arguably a violation of the POSIX spec for fsync(), or at least a surprising interpretation of it.

Even accepting that (I personally go with surprising over violation, as if my vote counted), it is highly unlikely that we will convince every kernel team to declare "What fools we've been!" and push a change... and even if they did, PostgreSQL can look forward to many years of running on kernels with the broken semantics. Given that, I think the PANIC option is the soundest one, as unappetizing as it is.

--
Christophe Pettus

xof(at)thebuild(dot)com

On April 2, 2018 5:03:39 PM PDT, Christophe Pettus <xof(at)thebuild(dot)com> wrote:

>

>> On Apr 2, 2018, at 16:27, Craig Ringer <craig(at)2ndQuadrant(dot)com> wrote:

>>

>> They're undocumented and extremely surprising semantics that are

>arguably a violation of the POSIX spec for fsync(), or at least a

>surprising interpretation of it.

>

>Even accepting that (I personally go with surprising over violation, as

>if my vote counted), it is highly unlikely that we will convince every

>kernel team to declare "What fools we've been!" and push a change...

>and even if they did, PostgreSQL can look forward to many years of

>running on kernels with the broken semantics. Given that, I think the

>PANIC option is the soundest one, as unappetizing as it is.

Don't we pretty much already have agreement in that? And Craig is the main proponent of it?

Andres

--

Sent from my Android device with K-9 Mail. Please excuse my brevity.



> On Apr 2, 2018, at 17:05, Andres Freund <andres(at)anarazel(dot)de> wrote:

>

> Don't we pretty much already have agreement in that? And Craig is the main proponent of it?

For sure on the second sentence; the first was not clear to me.

--
Christophe Pettus

xof(at)thebuild(dot)com

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Christophe Pettus <xof(at)thebuild(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-03 00:48:00
Message-ID: CAH2-WznVL4NVmaw52+mSLz0gFTcatuKKjRq4VPyfCCqj1iP1FQ@mail.gmail.com

On Mon, Apr 2, 2018 at 5:05 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:

>>Even accepting that (I personally go with surprising over violation, as

>>if my vote counted), it is highly unlikely that we will convince every

>>kernel team to declare "What fools we've been!" and push a change...

>>and even if they did, PostgreSQL can look forward to many years of

>>running on kernels with the broken semantics. Given that, I think the

>>PANIC option is the soundest one, as unappetizing as it is.

>

> Don't we pretty much already have agreement in that? And Craig is the main proponent of it?

I wonder how bad it will be in practice if we PANIC. Craig said "This

isn't as bad as it seems because AFAICS fsync only returns EIO in

cases where we should be stopping the world anyway, and many FSes will

do that for us". It would be nice to get more information on that.

--

Peter Geoghegan

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Catalin Iacob <iacobcatalin(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-03 01:29:28
Message-ID: CAEepm=2KSqu-fj8gEbLSE=uNcWWWpZ4bcjFtqYTGSCp0Lr_cSw@mail.gmail.com

On Tue, Apr 3, 2018 at 3:03 AM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:

> I see little benefit to not just PANICing unconditionally on EIO, really. It

> shouldn't happen, and if it does, we want to be pretty conservative and

> adopt a data-protective approach.

>

> I'm rather more worried by doing it on ENOSPC. Which looks like it might be

> necessary from what I recall finding in my test case + kernel code reading.

> I really don't want to respond to a possibly-transient ENOSPC by PANICing

> the whole server unnecessarily.

Yeah, it'd be nice to give an administrator the chance to free up some

disk space after ENOSPC is reported, and stay up. Running out of

space really shouldn't take down the database without warning! The

question is whether the data remains in cache and marked dirty, so

that retrying is a safe option (since it's potentially gone from our

own buffers, so if the OS doesn't have it the only place your

committed data can definitely still be found is the WAL... recovery

time). Who can tell us? Do we need a per-filesystem answer? Delayed

allocation is a somewhat filesystem-specific thing, so maybe.

Interestingly, there don't seem to be many operating systems that can

report ENOSPC from fsync(), based on a quick scan through some

documentation:

  POSIX, AIX, HP-UX, FreeBSD, OpenBSD, NetBSD: no
  Illumos/Solaris, Linux, macOS: yes

I don't know if macOS really means it or not; it just tells you to see

the errors for read(2) and write(2). By the way, speaking of macOS, I

was curious to see if the common BSD heritage would show here. Yeah,

somewhat. It doesn't appear to keep buffers on writeback error, if

this is the right code[1] (though it could be handling it somewhere

else for all I know).

[1] https://github.com/apple/darwin-xnu/blob/master/bsd/vfs/vfs_bio.c#L2695

--

Thomas Munro

http://www.enterprisedb.com

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-03 02:54:26
Message-ID: CA+TgmoYQuM-8F0mweBFVvhjFa04DOaxamd8_g5wfJAuOgwwXuw@mail.gmail.com

On Mon, Apr 2, 2018 at 2:53 PM, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:

> Given precisely that the dirty pages which cannot been written-out are

> practically thrown away, the semantics of fsync() (after the 4.13 fixes)

> are essentially correct: the first call indicates that a writeback error

> indeed occurred, while subsequent calls have no reason to indicate an error

> (assuming no other errors occurred in the meantime).

Like other people here, I think this is 100% unreasonable, starting

with "the dirty pages which cannot been written out are practically

thrown away". Who decided that was OK, and on the basis of what

wording in what specification? I think it's always unreasonable to

throw away the user's data. If the writes are going to fail, then let

them keep on failing every time. *That* wouldn't cause any data loss,

because we'd never be able to checkpoint, and eventually the user

would have to kill the server uncleanly, and that would trigger

recovery.

Also, this really does make it impossible to write reliable programs.

Imagine that, while the server is running, somebody runs a program

which opens a file in the data directory, calls fsync() on it, and

closes it. If the fsync() fails, postgres is now borked and has no

way of being aware of the problem. If we knew, we could PANIC, but

we'll never find out, because the unrelated process ate the error.

This is exactly the sort of ill-considered behavior that makes fcntl()

locking nearly useless.

Even leaving that aside, a PANIC means a prolonged outage on a
production system - it could easily take tens of minutes or longer to

run recovery. So saying "oh, just do that" is not really an answer.

Sure, we can do it, but it's like trying to lose weight by

intentionally eating a tapeworm. Now, it's possible to shorten the

checkpoint_timeout so that recovery runs faster, but then performance

drops because data has to be fsync()'d more often instead of getting

buffered in the OS cache for the maximum possible time. We could also

dodge this issue in another way: suppose that when we write a page

out, we don't consider it really written until fsync() succeeds. Then

we wouldn't need to PANIC if an fsync() fails; we could just re-write

the page. Unfortunately, this would also be terrible for performance,

for pretty much the same reasons: letting the OS cache absorb lots of

dirty blocks and do write-combining is *necessary* for good

performance.

> The error reporting is thus consistent with the intended semantics (which

> are sadly not properly documented). Repeated calls to fsync() simply do not

> imply that the kernel will retry to writeback the previously-failed pages,

> so the application needs to be aware of that. Persisting the error at the

> fsync() level would essentially mean moving application policy into the

> kernel.

I might accept this argument if I accepted that it was OK to decide

that an fsync() failure means you can forget that the write() ever

happened in the first place, but it's hard to imagine an application

that wants that behavior. If the application didn't care about

whether the bytes really got to disk or not, it would not have called

fsync() in the first place. If it does care, reporting the error only

once is never an improvement.

--

Robert Haas

EnterpriseDB: http://www.enterprisedb.com

The Enterprise PostgreSQL Company
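Robert's alternative above (a page is not "really written" until fsync() succeeds) amounts to keeping every buffer on an application-side pending list across the checkpoint. A rough sketch with invented names, not PostgreSQL internals:

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical pending-write entry: the application keeps ownership of
 * the buffer until fsync() has confirmed it is durable. */
struct pending_write {
    struct pending_write *next;
    const void *buf;
    size_t      len;
};

/*
 * Flush a file; only on fsync() success may the pending list be
 * released.  On failure the list is kept so every page can be
 * rewritten, which is why this is terrible for performance: the
 * application must hold (and potentially resubmit) everything the OS
 * cache would otherwise absorb and write-combine.
 */
int checkpoint_fd(int fd, struct pending_write **pending)
{
    if (fsync(fd) != 0)
        return -1;              /* keep *pending: nothing is durable yet */
    *pending = NULL;            /* durable: caller may free its copies */
    return 0;
}
```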

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Anthony Iliopoulos <ailiop(at)altatus(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-03 03:45:30
Message-ID: CAH2-Wz=y_K7t2KtGAm1o31okhdYb5+Jt=hV4V0F2nxSgJysAAw@mail.gmail.com

On Mon, Apr 2, 2018 at 7:54 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> Also, this really does make it impossible to write reliable programs.

> Imagine that, while the server is running, somebody runs a program

> which opens a file in the data directory, calls fsync() on it, and

> closes it. If the fsync() fails, postgres is now borked and has no

> way of being aware of the problem. If we knew, we could PANIC, but

> we'll never find out, because the unrelated process ate the error.

> This is exactly the sort of ill-considered behavior that makes fcntl()

> locking nearly useless.

I fear that the conventional wisdom from the Kernel people is now "you

should be using O_DIRECT for granular control". The LWN article

Thomas linked (https://lwn.net/Articles/718734) cites Ted Ts'o:

"Monakhov asked why a counter was needed; Layton said it was to handle

multiple overlapping writebacks. Effectively, the counter would record

whether a writeback had failed since the file was opened or since the

last fsync(). Ts'o said that should be fine; applications that want

more information should use O_DIRECT. For most applications, knowledge

that an error occurred somewhere in the file is all that is necessary;

applications that require better granularity already use O_DIRECT."

--

Peter Geoghegan
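For reference, the O_DIRECT route Ts'o points to looks roughly like this (a sketch; the 4096-byte alignment is an assumption about the device's logical block size, and some filesystems, e.g. tmpfs, reject O_DIRECT outright):

```c
#define _GNU_SOURCE             /* O_DIRECT is Linux-specific */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * With O_DIRECT the write bypasses the page cache, so an I/O error is
 * reported synchronously for exactly this buffer: the per-write
 * granularity that buffered write()+fsync() cannot provide.  Buffer
 * address and length must be block-aligned (4096 assumed here).
 */
ssize_t direct_write(const char *path, const void *src, size_t len)
{
    void *buf;
    if (posix_memalign(&buf, 4096, len) != 0)
        return -1;
    memcpy(buf, src, len);

    ssize_t n = -1;
    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0600);
    if (fd >= 0) {
        n = write(fd, buf, len);    /* an error refers to this write only */
        close(fd);
    }
    free(buf);
    return n;
}
```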

From: Anthony Iliopoulos <ailiop(at)altatus(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, ailiop(at)altatus(dot)com
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-03 10:35:39
Message-ID: 20180403103538.GP11627@technoir

Hi Robert,

On Mon, Apr 02, 2018 at 10:54:26PM -0400, Robert Haas wrote:

> On Mon, Apr 2, 2018 at 2:53 PM, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:

> > Given precisely that the dirty pages which cannot been written-out are

> > practically thrown away, the semantics of fsync() (after the 4.13 fixes)

> > are essentially correct: the first call indicates that a writeback error

> > indeed occurred, while subsequent calls have no reason to indicate an error

> > (assuming no other errors occurred in the meantime).

>

> Like other people here, I think this is 100% unreasonable, starting

> with "the dirty pages which cannot been written out are practically

> thrown away". Who decided that was OK, and on the basis of what

> wording in what specification? I think it's always unreasonable to

If you insist on strict conformance to POSIX, indeed the linux

glibc configuration and associated manpage are probably wrong in

stating that _POSIX_SYNCHRONIZED_IO is supported. The implementation

matches that of the flexibility allowed by not supporting SIO.

There's a long history of brokenness between linux and posix,

and I think there was never an intention of conforming to the

standard. > throw away the user's data. If the writes are going to fail, then let

> them keep on failing every time. *That* wouldn't cause any data loss,

> because we'd never be able to checkpoint, and eventually the user

> would have to kill the server uncleanly, and that would trigger

> recovery.

I believe (as I tried to explain earlier) there is a certain assumption

being made that the writer and original owner of data is responsible

for dealing with potential errors in order to avoid data loss (which

should be only of interest to the original writer anyway). It would

be very questionable for the interface to persist the error while

subsequent writes and fsyncs to different offsets may as well go through.

Another process may need to write into the file and fsync, while being

unaware of those newly introduced semantics is now faced with EIO

because some unrelated previous process failed some earlier writes

and did not bother to clear the error for those writes. In a similar

scenario where the second process is aware of the new semantics, it would

naturally go ahead and clear the global error in order to proceed

with its own write()+fsync(), which would essentially amount to the

same problematic semantics you have now.

> Also, this really does make it impossible to write reliable programs.

> Imagine that, while the server is running, somebody runs a program

> which opens a file in the data directory, calls fsync() on it, and

> closes it. If the fsync() fails, postgres is now borked and has no

> way of being aware of the problem. If we knew, we could PANIC, but

> we'll never find out, because the unrelated process ate the error.

> This is exactly the sort of ill-considered behavior that makes fcntl()

> locking nearly useless.

Fully agree, and the errseq_t fixes have dealt exactly with the issue

of making sure that the error is reported to all file descriptors that

*happen to be open at the time of error*. But I think one would have a

hard time defending a modification to the kernel where this is further

extended to cover cases where:

process A does write() on some file offset which fails writeback,

fsync() gets EIO and exit()s.

process B does write() on some other offset which succeeds writeback,

but fsync() gets EIO due to (uncleared) failures of earlier process.

This would be a highly user-visible change of semantics from edge-

triggered to level-triggered behavior.

> dodge this issue in another way: suppose that when we write a page

> out, we don't consider it really written until fsync() succeeds. Then

That's the only way to think about fsync() guarantees unless you

are on a kernel that keeps retrying to persist dirty pages. Assuming

such a model, after repeated and unrecoverable hard failures the

process would have to explicitly inform the kernel to drop the dirty

pages. All the process could do at that point is read back to userspace

the dirty/failed pages and attempt to rewrite them at a different place

(which is currently possible too). Most applications would not bother

though to inform the kernel and drop the permanently failed pages;

and thus someone would eventually hit the case where a large number

of failed writeback pages is running their server out of memory,

at which point people will complain that those semantics are completely

unreasonable.

> we wouldn't need to PANIC if an fsync() fails; we could just re-write

> the page. Unfortunately, this would also be terrible for performance,

> for pretty much the same reasons: letting the OS cache absorb lots of

> dirty blocks and do write-combining is *necessary* for good

> performance.

Not sure I understand this case. The application may indeed re-write

a bunch of pages that have failed and proceed with fsync(). The kernel

will deal with combining the writeback of all the re-written pages. But

further, the necessity of write-combining for performance really depends on

the exact storage medium. At the point you start caring about

write-combining, the kernel community will naturally redirect you to

use direct I/O (O_DIRECT).

> > The error reporting is thus consistent with the intended semantics (which

> > are sadly not properly documented). Repeated calls to fsync() simply do not

> > imply that the kernel will retry to writeback the previously-failed pages,

> > so the application needs to be aware of that. Persisting the error at the

> > fsync() level would essentially mean moving application policy into the

> > kernel.

>

> I might accept this argument if I accepted that it was OK to decide

> that an fsync() failure means you can forget that the write() ever

> happened in the first place, but it's hard to imagine an application

> that wants that behavior. If the application didn't care about

> whether the bytes really got to disk or not, it would not have called

> fsync() in the first place. If it does care, reporting the error only

> once is never an improvement.

Again, this conflates two separate issues: that of buffering and retrying

failed pages and that of error reporting. Yes it would be convenient

for applications not to have to care at all about recovery of failed

write-backs, but at some point they would have to face this issue one

way or another (I am assuming we are always talking about hard failures,

other kinds of failures are probably already being dealt with transparently

at the kernel level).

As for the reporting, it is also unreasonable to effectively signal

and persist an error on a file-wide granularity while it pertains

to subsets of that file and other writes can go through, but I am

repeating myself.

I suppose that if the check-and-clear semantics are problematic for

Pg, one could suggest a kernel patch that opts in to level-triggered

reporting of fsync() errors on a per-descriptor basis, which seems to be

non-intrusive and probably sufficient to cover your expected use case.

Best regards,

Anthony

From: Greg Stark <stark(at)mit(dot)edu>
To: Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-03 11:26:05
Message-ID: CAM-