TL;DR

We were already prepared for a scenario like the Aberdeen Cloud breakdown, owing to our disaster recovery plan. Fortunately we didn't have to set it in motion. Each night we have a simple script which takes off-site backups of all of our hosted sites. We've made the source code available on github, so hopefully this will help others prepare for the likes of the Aberdeen Cloud implosion, and perhaps we can share ideas on how to improve each other's disaster recovery plans.

Our experience with the Aberdeen Cloud incident

We have a large number of sites hosted by various cloud services. Since autumn 2013, we'd mainly used Aberdeen Cloud, and in autumn 2015 we started to explore other options to see what else the market had to offer. Platform.sh was the one that we decided to give a serious test for new clients.

Soon after that, Aberdeen Cloud began to seem a bit flaky. Longer response times on support tickets. Solr services started to fail, along with random outages of various other services. We accepted that for a while, but after having lost and regained SSH access to all of our sites (including git and rsync), we eventually decided that enough was enough and we couldn't put our trust in them anymore.

We had to migrate everything to alternative hosting platforms. Given the similar price points and the fact that they seem well funded and offer excellent support, we decided on Platform.sh. And so began Project Exodus.

Over the next three weeks we migrated over 20 sites to Platform.sh in a staged approach. I wouldn't say that it was a straight-forward process. Lots of clients had specific quirks to their setup. For example, some needed a PHPBB forum, others had FTP access for uploading files, some integrated with external systems that required firewall changes, etc, while others had custom .htaccess redirection rules that needed to be rewritten for Nginx. However, we were very lucky and had completed Project Exodus nearly a month before Aberdeen Cloud finally came tumbling down.

So what if you were not as fortunate as us, and still have sites whose assets are no longer accessible? Stuck in cyberspace, or maybe just plain deleted?

Well, I'm not sure there is a lot you can do. Maybe read Code Engima's blog post about the different things they've tried. However, to put it bluntly, it sucks! Enough said about the matter.

Now it's time to ensure you're never caught out again.

But what can we do to prevent this from happening again?

All companies probably have their own way to deal with this kind of scenario, but I'll tell you about how Annertech deals with backups and recoveries.

We have two sets of backups. A backup from each day, for each production/live/master environment, which is hosted by the cloud service. Then there is an off-site backup (again, daily) used for disaster recovery. The latter one is the important one in this scenario.

The idea is that even if Amazon (which hosts Aberdeen Cloud services) pulled the plug, we would still have access to our clients' data.

We have a server, hosted by a different company, that:

Pulls down a copy of the database, from every cloud-hosted site, every night, and saves it for four days, before it deletes it again. Runs `rsync` on every cloud-hosted site, to get an up-to-date version of the files folder, every night.

Sounds simple? It is. All it requires is that your hosting partner supports running Drush on your remote sites and you're good. If you run Drupal sites, and your hosting partner doesn't support running Drush on the remote sites, find somebody else who does. It's that important!

Regarding the code for the sites; we keep our source code repositories on dedicated git services. And, more than likely, we'll also have a copy or two on developers machines.

I'd like to show you the two backup scripts that I made, one for Aberdeen Cloud and one for Platform.sh.

The code is meant to work in our setup only, and is not (yet) generic enough to just work out of the box elsewhere. The release of the scripts is meant to give you a leg up and some inspiration. This is by no means the final end point for these scripts - we are continually evaluating and improving our system, and I look forward to hearing what ideas you have on where we could take it from here too.

The entire repository of code can be "stolen" from github.

When you have a disaster recovery plan you also need to make sure that it actually works. You can do this by downloading the latest backup from each of your sites once a month, installing and then testing them. I've tested a site, where the DB file was corrupt, but only for that site, so make sure that you test all of them. The setup of test sites can also be automated by a script so you don't have to setup 10, 50, or 300 sites and test each manually. Scripts are your friends. Make good use of them and have them do all the hard work.

Now, if you really want to push this further, you should implement a "Smoke test" in all of your installation profiles, so that you can trigger that to see if the site is alive; or perhaps tie it in with a Jenkins server.

If something is unclear, feel free to put a comment below. If you feel like this could be improved, feel free to contribute with a pull request. We are all ears.