How I averted a disaster and recovered data by deleting it.

As some of you may know, I run a Mastodon instance. It's quite a big one, with about 1,000 weekly active users and 13k local accounts in total.

My current setup is a Hetzner Cloud CX41 VM running LXC containers with LXD. The containers are stored on a dedicated zpool that lives on a separate partition of the disk, and the PostgreSQL server has a container of its own.
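For reference, this kind of setup can be built with LXD's ZFS storage backend. A minimal sketch, assuming an Ubuntu image; the pool name, device, and container name match the outputs below, but these are not necessarily the exact commands I ran:

root@mstdn ~# lxc storage create zpool-lxd zfs source=/dev/sda3
root@mstdn ~# lxc launch ubuntu:18.04 postgresql -s zpool-lxd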

I was quite surprised when I looked at the state of my zpool last week:

root@mstdn ~# zpool status -v
  pool: zpool-lxd
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 0h56m with 7 errors on Sun Mar  3 18:29:34 2019
config:

        NAME        STATE     READ WRITE CKSUM
        zpool-lxd   ONLINE       0     0     0
          sda3      ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0x1f40>:<0x6f840>
        <0x1f40>:<0x1858a>
        <0x1956>:<0x18e13>
        <0x1956>:<0x18bcc>
        <0xc63>:<0x3604c>
        <0x8bbf>:<0x952e>
        zpool-lxd/containers/postgresql:/rootfs/var/lib/postgresql/11/main/base/20415/20819.9

The hexadecimal entries are deleted auto-snapshots of some datasets; the concerning one is the PostgreSQL file. I tried multiple ZFS scrubs, without success.
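For reference, a scrub is started (and its outcome checked) like this:

root@mstdn ~# zpool scrub zpool-lxd
root@mstdn ~# zpool status -v zpool-lxd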

I checked the PostgreSQL logs, and there were indeed a few failing queries:

root@postgresql ~# grep 'Input/output error' /var/log/postgresql/postgresql-11-main.log | tail
2019-03-06 20:54:19.224 CET [11371] ERROR: could not read block 1264349 in file "base/20415/20819.9": Input/output error
2019-03-06 20:54:19.225 CET [11284] mastodon@mstdn ERROR: could not read block 1264349 in file "base/20415/20819.9": Input/output error
2019-03-06 22:05:04.540 CET [11533] mastodon@mstdn ERROR: could not read block 1264346 in file "base/20415/20819.9": Input/output error
2019-03-07 00:02:48.709 CET [12615] mastodon@mstdn ERROR: could not read block 1264346 in file "base/20415/20819.9": Input/output error
2019-03-07 02:25:30.946 CET [13670] mastodon@mstdn ERROR: could not read block 1264346 in file "base/20415/20819.9": Input/output error
2019-03-07 04:07:40.608 CET [14447] mastodon@mstdn ERROR: could not read block 1264346 in file "base/20415/20819.9": Input/output error
2019-03-07 06:09:58.351 CET [15667] mastodon@mstdn ERROR: could not read block 1264346 in file "base/20415/20819.9": Input/output error
2019-03-07 06:56:37.034 CET [16238] mastodon@mstdn ERROR: could not read block 1264341 in file "base/20415/20819.9": Input/output error
2019-03-07 06:56:57.828 CET [16505] ERROR: could not read block 1264341 in file "base/20415/20819.9": Input/output error
2019-03-07 06:56:57.830 CET [15870] mastodon@mstdn ERROR: could not read block 1264341 in file "base/20415/20819.9": Input/output error

For context, the PostgreSQL server handles about 200 queries per second, so only a tiny fraction of queries were actually failing.

I could also confirm the errors in the application's logs:

Mar 06 22:05:04 mstdn bundle[4703]: [7160cec0-24f8-4930-b249-f85f755f900b] method=GET path=/api/v1/timelines/tag/fun format=html controller=Api::V1::Timelines::TagController action=show status=500 error='ActiveRecord::StatementInvalid: PG::IoError: ERROR: could not read block 1264346 in file "base/20415/20819.9": Input/output error
Mar 06 22:05:04 mstdn bundle[4703]: [7160cec0-24f8-4930-b249-f85f755f900b]
Mar 06 22:05:04 mstdn bundle[4703]: [7160cec0-24f8-4930-b249-f85f755f900b] ActiveRecord::StatementInvalid (PG::IoError: ERROR: could not read block 1264346 in file "base/20415/20819.9": Input/output error
Mar 06 22:05:04 mstdn bundle[4703]: [7160cec0-24f8-4930-b249-f85f755f900b]
Mar 06 22:05:04 mstdn bundle[4703]: [7160cec0-24f8-4930-b249-f85f755f900b] app/controllers/application_controller.rb:109:in `cache_collection'
Mar 06 22:05:04 mstdn bundle[4703]: [7160cec0-24f8-4930-b249-f85f755f900b] app/controllers/api/v1/timelines/tag_controller.rb:25:in `cached_tagged_statuses'
Mar 06 22:05:04 mstdn bundle[4703]: [7160cec0-24f8-4930-b249-f85f755f900b] app/controllers/api/v1/timelines/tag_controller.rb:21:in `load_statuses'
Mar 06 22:05:04 mstdn bundle[4703]: [7160cec0-24f8-4930-b249-f85f755f900b] app/controllers/api/v1/timelines/tag_controller.rb:10:in `show'
...
2019-03-04 06:08:21.734 CET [19320] mastodon@mstdn ERROR: could not read block 843725 in file "base/20415/20819.6": Input/output error
2019-03-04 06:08:21.734 CET [19320] mastodon@mstdn CONTEXT: parallel worker
2019-03-04 06:08:21.734 CET [19320] mastodon@mstdn STATEMENT: SELECT DISTINCT "statuses"."id", "statuses"."updated_at" FROM "statuses" INNER JOIN "statuses_tags" ON "statuses_tags"."status_id" = "statuses"."id" LEFT OUTER JOIN "accounts" ON "accounts"."id" = "statuses"."account_id" WHERE "statuses"."visibility" = 0 AND (statuses.reblog_of_id IS NULL) AND "statuses_tags"."tag_id" = 210 AND "statuses"."account_id" NOT IN (245693, 20285) AND "accounts"."silenced" = FALSE ORDER BY "statuses"."id" DESC LIMIT 4
...
2019-03-04 04:20:04.149 CET [18597] mastodon@mstdn ERROR: could not read block 843720 in file "base/20415/20819.6": Input/output error
2019-03-04 04:20:04.149 CET [18597] mastodon@mstdn STATEMENT: SELECT DISTINCT "statuses"."id", "statuses"."updated_at" FROM "statuses" INNER JOIN "statuses_tags" ON "statuses_tags"."status_id" = "statuses"."id" LEFT OUTER JOIN "accounts" ON "accounts"."id" = "statuses"."account_id" WHERE "statuses"."visibility" = 0 AND (statuses.reblog_of_id IS NULL) AND "statuses_tags"."tag_id" = 2721 AND "accounts"."silenced" = FALSE ORDER BY "statuses"."id" DESC LIMIT 20

This is still concerning, because the corrupted file is a full 1 GiB segment: PostgreSQL splits large tables into 1 GiB segment files, so 20819.9 is the tenth segment of relation 20819.

The most important issue was that all my database dumps for the past few days had been failing:

postgres@postgresql:~$ pg_dump -Fc mstdn -f mstdn.dump
pg_dump: Dumping the contents of table "statuses" failed: PQgetResult() failed.
pg_dump: Error message from server: ERROR: could not read block 1264336 in file "base/20415/20819.9": Input/output error
pg_dump: The command was: COPY public.statuses (id, uri, text, created_at, updated_at, in_reply_to_id, reblog_of_id, url, sensitive, visibility, spoiler_text, reply, language, conversation_id, local, account_id, application_id, in_reply_to_account_id) TO stdout;

The public.statuses table is by far the biggest one in the database, since it contains all of the local and foreign toots (43M rows), so I can't just skip it.
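By the way, PostgreSQL can tell you which file a table lives in, which is how base/20415/20819 ties back to statuses. A quick sketch against the mstdn database:

postgres@postgresql:~$ psql mstdn -c "SELECT pg_relation_filepath('statuses');"
 pg_relation_filepath
----------------------
 base/20415/20819
(1 row)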

It turned out I couldn't do anything that required reading this file.

I couldn't:

dump the database

copy the database raw files (with either cp or dd)

export the LXD container or a snapshot

migrate the LXD container or a snapshot to another host

export the container's ZFS dataset

Each of these would fail as soon as it hit one of the corrupted blocks of the file.

Restoring a backup was also out of the question, since it would mean losing multiple days of data, which is far worse than a few failing requests per day. But I couldn't leave things like this.

My zpool seems to be corrupted one way or another: it reports some corrupted files, then a few hours later those files are fine again and other files appear corrupted instead. These symptoms would normally point to a failing disk, but this is a virtual disk sitting on top of the hypervisor's SSD RAID... I really don't know what's wrong.

Anyway, out of all the "corrupted" files, /var/lib/postgresql/11/main/base/20415/20819.9 never recovered, which meant it really was corrupted.

Since the data was as good as lost anyway, I decided to do something that would really lose it forever: zero out the corrupted blocks of the file.

FYI, I launched a new VM from an older snapshot in order to test this before doing it on the production VM.
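With Hetzner Cloud that's a one-liner; a sketch using the hcloud CLI, where the server name is a placeholder and <snapshot-id> is the snapshot's ID (the web console works just as well):

$ hcloud server create --name recovery-test --type cx41 --image <snapshot-id>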

As shown above, dd would fail when it encountered the first corrupted block.

The solution is dd's noerror conversion option, which tells it to ignore read errors and continue; combined with sync, every unreadable block is padded with zeros, so the new file keeps the exact size and layout of the original and is simply zero-filled where the corrupted blocks were.

The result is, somewhat, a success. dd's default block size is 512 bytes, and it failed 256 times around the 662 MiB mark.

root@postgresql ~# dd if=/var/lib/postgresql/11/main/base/20415/20819.9 of=20819.9.dd conv=sync,noerror
dd: error reading '/var/lib/postgresql/11/main/base/20415/20819.9': Input/output error
1355008+0 records in
1355008+0 records out
693764096 bytes (694 MB, 662 MiB) copied, 14.0675 s, 49.3 MB/s
dd: error reading '/var/lib/postgresql/11/main/base/20415/20819.9': Input/output error
1355008+1 records in
1355009+0 records out
693764608 bytes (694 MB, 662 MiB) copied, 14.0689 s, 49.3 MB/s
dd: error reading '/var/lib/postgresql/11/main/base/20415/20819.9': Input/output error
1355008+2 records in
1355010+0 records out
693765120 bytes (694 MB, 662 MiB) copied, 14.0696 s, 49.3 MB/s
...
dd: error reading '/var/lib/postgresql/11/main/base/20415/20819.9': Input/output error
1355008+254 records in
1355262+0 records out
693894144 bytes (694 MB, 662 MiB) copied, 14.3103 s, 48.5 MB/s
dd: error reading '/var/lib/postgresql/11/main/base/20415/20819.9': Input/output error
1355008+255 records in
1355263+0 records out
693894656 bytes (694 MB, 662 MiB) copied, 14.3117 s, 48.5 MB/s
2096896+256 records in
2097152+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 22.1875 s, 48.4 MB/s

Some simple calculations: the errors start at byte 693764096 and end at byte 693894656, which means 130560 B = 127.5 KiB of corrupted/lost data. That's about 0.000003% of the database size, so I guess I can live with it. It's also infuriating (but logical) that such a tiny amount of data can make such a mess.
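Double-checking that span in the shell:

$ awk 'BEGIN { b = 693894656 - 693764096; print b " B = " b / 1024 " KiB" }'
130560 B = 127.5 KiB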

Let's replace the file:
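Here is a sketch of the swap, run inside the PostgreSQL container. I'm assuming the dd copy landed in /root and the stock Debian service name; adapt the paths to your own layout:

root@postgresql ~# systemctl stop postgresql
root@postgresql ~# cd /var/lib/postgresql/11/main/base/20415
root@postgresql ~# mv 20819.9 20819.9.old
root@postgresql ~# mv /root/20819.9.dd 20819.9
root@postgresql ~# chown postgres:postgres 20819.9
root@postgresql ~# systemctl start postgresql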

Of course I stopped the PostgreSQL server during the operation.

Since then, I have not had a single failing query related to I/O:
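Re-running the earlier grep against the current log comes back empty; a sketch (the rotated logs from before the swap of course still contain the old errors):

root@postgresql ~# grep 'Input/output error' /var/log/postgresql/postgresql-11-main.log
root@postgresql ~#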

Success! I can now dump the database too.

As expected, the new 20819.9 file was not reported as corrupted by zpool status -v, but 20819.9.old was, so I deleted the old file since I didn't need it anymore.
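That is, simply:

root@postgresql ~# rm /var/lib/postgresql/11/main/base/20415/20819.9.old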

I'm not an expert in file systems or databases, but I thought it was an experience worth sharing. I definitely lost a bit of non-critical data, but it was the best outcome: I had very little downtime, and I didn't have to restore dumps, which would have made me lose far more data. The same issue in the accounts table could have been disastrous.

I still have errors that come and go on my zpool, and I'm a bit concerned because I don't know their source. Is the underlying storage corrupted or unreliable? Did I corrupt something myself? Is it a bug in ZFS on Linux? In LXD? I don't know.

This VM is not even a month old, but I guess I have to start a new one from scratch and migrate my containers. Will I use ZFS, though? I'm not sure.