On Saturday (July 20) we moved over a terabyte of data from one storage system to another. We made the move because the amount of data we have to store simply no longer fit on our servers, and our preliminary tests showed that the new system would use only about ⅓ of the disk space.

The migration itself went fine, but once it finished we started seeing higher I/O load, and suddenly an SSD drive in one of our database servers stopped working. Not a big deal, we thought — we obviously store data in mirrored mode on several servers — so we asked our hosting provider to swap the drive. We had to re-sync the data to the new disk, so you might have noticed the site being flaky.

However, while that data was still syncing, another SSD drive went down in a different server. We got it replaced and started re-syncing two database shards at the same time. At this point we lost two more drives, one of which, unfortunately, was in a server belonging to one of the shards that was still repairing.

While you can fly a plane with one engine off, when all your engines stop, your flight is over. Right now we have to restore our last pre-migration full database backup and apply incremental updates to bring the database to a fresh and (hopefully) consistent state.

Here comes the worst news: this will probably take a day or two.

Sorry about that.

This is a tough and incredibly stressful situation, but it looks like we have no other choice. We understand your frustration (actually, we are in the same boat: we are RSS junkies and built The Old Reader for ourselves and our friends), and we are doing everything we can to make this as fast and painless as possible and live happily ever after.

After that, we will deploy bug fixes along with the new features and improvements we have already developed. Over the last year we successfully scaled first from 2,000 to 5,000 users practically overnight, then from 10,000 to 160,000 in several weeks, and from 200,000 to 400,000 in four months, so we consider this a new level-up for the project (although a bumpy and painful one).

It’s 5 AM right now and the backup restoration has already begun. We are monitoring and working on The Old Reader nearly 24 hours a day. We will post frequent (but not annoying) updates on Twitter and will answer all your questions.

We deeply apologize for what has happened, and we intend to come back in much better shape.

Thank you very much for your patience, support and understanding,

The Old Reader team.

UPD:

July 25, 21:07 UTC

Back online!

July 25 19:12 UTC

If everything goes as planned, we should be back in 4-5 hours.

July 25 15:50 UTC

Import — check;

Indexes — check;

Balancing data between shards and configuring replicasets — in progress.

July 25 08:45 UTC

It looks like we have managed to upload the data. If indexes get generated correctly, we might be back online later today.

July 25 2:00 UTC

Continuing the upload, hoping it goes as planned, counting hours.

July 24 14:00 UTC

Proceeding with restore. More details hopefully in the evening.

July 23 18:00 UTC

We have managed to create a consistent dump of our database and started uploading it to the database servers.