I wanted to write this long article to relate the problems I had installing, configuring and using a tool that has potential but demands caution, and to show that what is sold is not always what is delivered.

Some time ago I was tasked with helping my team choose a new storage system to replace the very aged EMC/Dell CX4-120 we had kept for many years, which was losing disks at the same rate I was losing my hair.

After talking with many resellers and looking into the possible options, we decided to buy a new, pure EMC VNX 5200, for several reasons: company preference, integration with the Dell blade servers we had bought shortly before, a new backup system (we got a new DataDomain as well) and the big promise: RecoverPoint.

We performed a DRP test for the company every year, taking all the major applications to the secondary site and running everything there, so we could keep the company going in case of an emergency (although if the emergency were something like what happened to the Twin Towers and we were there, very little could be done, but anyway…). The way we used to do it was by cloning the machines between the data centers manually, and at night, because network bandwidth was a huge concern at the time: we had only 10 Mbps at the secondary site, so we could not copy VMs between sites during the day or we would keep the remote users from using the systems connected to the network.

When RecoverPoint was sold to us, the promises were huge: on-the-fly replication, data security, bandwidth control, a Recovery Point Objective (RPO) of zero or a few seconds, WAN compression and deduplication. What we saw in the end was very different, unfortunately.

The promise sold…

Initial configurations

First of all, we connected the VNX to the Dell blade enclosure over Fibre Channel, using the new switches acquired with the enclosure. Performance was good and we were able to run the whole environment with better response times and higher throughput. The VMs rebooted in seconds and the flash cache really did improve overall performance. Time to take the next step.

We then connected the storage to the blade enclosure using iSCSI, which is required to install RecoverPoint and let the vRPAs talk to the storage. According to the documentation, all copy operations would be performed over iSCSI. I found this interesting because it would not cause any kind of interference on the Fibre Channel ports.

After the physical connections were done, it was time for the first installation. Having downloaded the vRPA appliances from the EMC site, I started deploying the OVAs. So far, all very simple. Deploying an OVA is an easy task, right? No mystery here: you choose the file, perform an initial configuration of the NICs and you're done.

Possible sizes of vRPAs you may have.
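If you prefer to script the deployment instead of clicking through the vSphere wizard, VMware's ovftool can do it. Below is a minimal sketch, wrapped in Python; the vCenter address, datastore, port group, VM names and the OVA file name are all hypothetical placeholders, and the OVA's own network names may differ in your download.

```python
# Sketch of scripting the vRPA OVA deployment with VMware's ovftool.
# All names below (vCenter path, datastore, port group, OVA file) are
# made-up examples -- adjust them to your environment.
import subprocess

OVA = "RecoverPoint_vRPA.ova"  # file downloaded from the EMC site
TARGET = "vi://admin@vcenter.example.local/MyDatacenter/host/BladeCluster"

for name in ("vRPA1-siteA", "vRPA2-siteA"):
    subprocess.run([
        "ovftool",
        "--acceptAllEulas",
        f"--name={name}",
        "--datastore=DS_Infra",            # where the vRPA disks will live
        "--net:VM Network=vRPA_iSCSI_A",   # map the OVA network to your port group
        OVA,
        TARGET,
    ], check=True)
```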

After the VMs are deployed and up, you must use a tool to configure the environment: RecoverPoint Deployment Manager (DM). And here is where you may start having trouble.

Version 2.1 was very unstable. Many times the tool simply hung in the middle of the process (which takes a while; be patient), sometimes it just couldn't connect to the vRPAs (even though it was connected directly to the host!), the vRPAs could not talk to the storage (which they were physically connected to), plus some other minor issues. When the newer 2.2 version came out, along with vRPA versions above 4.1, things started to get a little better.

Another thing I could not understand: you have to go through the tool at least five times:

You set up the vRPAs (once for each site where you will install them, so at least twice);

You set up the clusters (again once per site, so at least twice);

You connect the clusters.

Every time you perform one of these tasks, you have to finish it, reopen the tool and begin the next step.

First configure them…

During the setup you configure many important things, like addresses and connections, but what is not written anywhere is that you have to do a lot of things manually, not only in the wizard!

For example, there’s a part where you set the IPs for the NICs on the vRPAs. Like this:

But if you don’t set the gateways and routes on the tool (and sometimes it didn’t stick…), the clusters do not talk to each other. Even being redundant, since the Linux kernel on the vRPAs know how to set up the routes and gateways, you still have to connect directly to them later and set it up, or it won’t work.

The black screen. Where I say things get done!
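What you end up doing on that black screen amounts to ordinary Linux routing configuration. Here is a rough sketch of the manual fix, assuming you can get to a shell on the appliance and that standard iproute2 tools are available; the interface name, addresses and subnets are invented examples and will differ in your setup.

```python
# Rough sketch of the manual route/gateway fix on a vRPA, assuming shell
# access and iproute2. Interface names and addresses are made-up examples.
import subprocess

GATEWAY = "10.10.1.1"            # WAN gateway at the local site
REMOTE_WAN_NET = "10.20.1.0/24"  # WAN subnet of the remote cluster
WAN_NIC = "eth1"                 # whichever NIC carries replication traffic

cmds = [
    # default gateway for the WAN-facing interface
    ["ip", "route", "replace", "default", "via", GATEWAY, "dev", WAN_NIC],
    # explicit route to the remote cluster's WAN subnet
    ["ip", "route", "replace", REMOTE_WAN_NET, "via", GATEWAY, "dev", WAN_NIC],
]
for cmd in cmds:
    subprocess.run(cmd, check=True)

# sanity check: a vRPA at the other site should now answer
subprocess.run(["ping", "-c", "3", "10.20.1.10"], check=True)
```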

The iSCSI connections are easier to set up, but be sure to use completely different address ranges and short subnet masks.

iSCSI configurations
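A quick way to sanity-check your addressing plan before typing it into Deployment Manager is to verify that none of the ranges overlap. The subnets below are just illustrative values, not a recommendation.

```python
# Check that the planned iSCSI subnets do not overlap between sites.
# The subnets are illustrative examples only.
from ipaddress import ip_network
from itertools import combinations

plan = {
    "siteA_iscsi_1": ip_network("192.168.10.0/24"),
    "siteA_iscsi_2": ip_network("192.168.11.0/24"),
    "siteB_iscsi_1": ip_network("192.168.20.0/24"),
    "siteB_iscsi_2": ip_network("192.168.21.0/24"),
}

for (name_a, net_a), (name_b, net_b) in combinations(plan.items(), 2):
    if net_a.overlaps(net_b):
        raise SystemExit(f"{name_a} and {name_b} overlap -- pick separate ranges")
print("No overlapping ranges; good to go.")
```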

If everything goes well, you may finally get these screens:

Be grateful at this time. But don’t be relieved yet. It’s not finished.

OK. If you can see this, you're finally good to go. See how many steps? Take a deep breath.

Now you must reopen the tool and connect the clusters.

If everything goes right with your routes and connections, you will get these screens:

OK. You managed to connect your clusters and everything looks fine. Let's take a look at what happens next.

Storage configurations

You have to set up a few things on your storage before you can start:

RecoverPoint will only replicate an entire VMware datastore; you cannot select specific machines to be replicated. So you'll have to arrange the machines across the volumes to be replicated in a way that minimizes the performance impact (more on that later).

You must create journal volumes to be used by every Consistency Group.
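Sizing those journal volumes is something you have to think about per consistency group. The sketch below is my own back-of-envelope arithmetic, not an official EMC formula, and the change rate, protection window and overhead factor are invented example numbers.

```python
# Back-of-envelope journal sizing per consistency group.
# My own rough arithmetic, not an EMC formula; inputs are examples.
change_rate_mbps = 5       # average write rate into the replicated LUNs (MB/s)
protection_window_h = 24   # how far back you want to be able to roll
overhead = 1.2             # ~20% headroom for metadata and bursts

journal_gb = change_rate_mbps * protection_window_h * 3600 * overhead / 1024
print(f"Suggested journal size: ~{journal_gb:.0f} GB")  # ~506 GB in this example
```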

The infrastructure is pictured like this:

What happens next?

Time to configure a datastore to be replicated. What happens to the machines?

First of all, be prepared to have a strong link between the sites to handle the transfers. If you can, get a dedicated link just for this traffic. Not kidding: whatever you have, it will eat. And if your machines have a high rate of updates (like database servers), it will be even worse.
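To get a feel for what "a strong link" means, here is a rough calculation of how long one day's worth of changes would take to push over a link like our old 10 Mbps one. The daily change volume is an invented example, not a measurement from our environment.

```python
# Rough feel for the bandwidth problem: how long does a day's worth of
# changed data take to cross the replication link? Numbers are examples.
daily_changes_gb = 200   # data rewritten per day across replicated LUNs
link_mbps = 10           # WAN link, megabits per second
usable = 0.7             # assume ~70% of the link is actually usable

hours = daily_changes_gb * 8 * 1024 / (link_mbps * usable) / 3600
print(f"~{hours:.0f} hours to ship one day of changes")  # ~65 hours: it never catches up
```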

The VNX 5200 took a major performance hit when replication was enabled, even with its flash cache and everything. Here is the performance for some operations on replicated and non-replicated datastores:

Copying a 1.5 GB file to a server on a non-replicated volume - ~1700 MB/s

Copying a 1.5 GB file to a server on a replicated volume (asynchronous) - ~350 MB/s

Copying a 1.5 GB file to a server on a replicated volume (synchronous) - ~103 MB/s

Migrating a VM (~60 GB) to a non-replicated volume - 8m35s

Migrating a VM (~60 GB) to a replicated volume - 123m31s

Can you imagine this performance drop on a heavy database server?

Looking at this data you may think, "Wait, that makes no sense! If it took 123 minutes to migrate 60 GB, that's around 8 MB/s!" But that's right. When migrating a live machine to a replicated volume, that is the performance you get for creating the snapshot, moving the machine, consolidating and removing the snapshot, all of it under surveillance from the vRPA. You'll have to deal with that.
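Just to double-check the arithmetic behind that figure:

```python
# Checking the ~8 MB/s figure from the migration above.
size_gb = 60
minutes, seconds = 123, 31

throughput = size_gb * 1024 / (minutes * 60 + seconds)  # MB/s
print(f"{throughput:.1f} MB/s")  # ~8.3 MB/s
```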

What else can happen?

Upgrades!

When it was first installed, the vRPAs were on version 4.1. EMC then released an upgrade which, given the size of the files, was close to an entire redeploy. It practically was one. How does the process work?

Download the .ISO file from EMC

Put it on an FTP server on your own network (better than downloading it for every vRPA you’re upgrading)

Log into the vRPAs and select the Upgrade option (it's in there somewhere), point it to your FTP server and let it go.

Just don’t do it all at the same time!!! Yes. I did it. What happens after the vRPAs get the new code? They reboot. And lose connection! You must log into it again and reconfigure the interfaces. But someone must be there alive to welcome the new-brained vRPA! Or else you’ll lose your entire environment, like I did.

From what I could see, when the machines restarted after the upgrade they lost access to something like the "quorum volume" of a Microsoft cluster (bad example, I know): the Repository. The machines could not be reconnected to each other, I lost the entire environment, and I spent a whole weekend reinstalling and reconfiguring it.

Synchronous vs. Asynchronous modes

Another very bad experience was when we tried to use synchronous mode. Synchronous replication means that changes are only committed to the storage volume after the replicator gets an "OK" from the other side. If you have a huge link and an all-flash array, it must be beautiful. At first it looked like a good thing: the volumes were synchronizing, the full resyncs decreased (they happen a lot with asynchronous replication), everything looked fine. Then what? The ENTIRE VNX STORAGE HUNG! Unbelievable. In the middle of a regular day, the storage simply stopped responding. I checked vSphere and the datastores started to gray out, one by one. All the VMs were crashing. It was mayhem!

The only way we could reestablish access to everything was pulling the plug. On everything. We had to shut down the storage, the hypervisors, the blade enclosure. Neither EMC nor VMware could help us find out what had happened.

After a thorough investigation of the wreckage, what we found pointed to the queue. When you use synchronous replication, the other side must confirm it has received the changes before they can be committed locally. The storage built up a queue so big it got completely lost, and it didn't even raise an alert before or after the whole thing happened.
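My understanding of the mechanism, in toy form: in synchronous mode every write has to wait for the remote acknowledgement, so if writes arrive faster than the link can confirm them, the backlog grows without bound. The numbers below are invented to illustrate the idea, not measurements from our array.

```python
# Toy model of the backlog we believe sank the array: when the incoming write
# rate exceeds what the link can acknowledge, the sync-replication queue only
# grows. All numbers are invented for illustration.
incoming_mbps = 50   # MB/s of writes hitting the replicated LUNs
link_ack_mbps = 12   # MB/s the WAN link can actually confirm

backlog = 0.0
for minute in range(1, 11):
    backlog += (incoming_mbps - link_ack_mbps) * 60  # MB added per minute
    print(f"after {minute:2d} min: {backlog/1024:5.1f} GB waiting for remote ACK")
# After 10 minutes the array is sitting on ~22 GB of uncommitted writes,
# and every host write queued behind them is stalled.
```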

Flash interface

Yes, it's that bad. I already don't much like the Java console for administering the arrays, but the whole RecoverPoint console runs on Flash! Adobe Flash! The riddled-with-vulnerabilities, patched-every-22-hours Flash. Come on, EMC… you can do better than that.

So what’s the conclusion?

Be very careful about what you put under the RecoverPoint vRPAs. The performance drop is really significant and the bandwidth consumption may cause you a lot of trouble.

Only use synchronous mode if you have plenty of bandwidth to hold the traffic. Ours was the only known case of it taking down the whole array (and they still don't agree that it did). It could have been something else, I know, but the storage had never failed before RecoverPoint was enabled and never failed again after it was shut down. Yes, we had to shut it down once we realized we did not have enough infrastructure and were not willing to put up with the performance issues.

I wanted a pair of physical RPAs to run a test between our locations and see the results, since the DataDomain, a physical appliance, had spectacular performance deduplicating, compressing and sending data over the same link. But the project was discontinued before we could do that.

The product looked very interesting at first, but seeing how it behaved "in the field" left us disappointed with what we got. Maybe more time was needed to perform more tests; it certainly was not made to run on links of less than 100 Mbps. But it's kind of hard to convince your bosses that you can't use the new and expensive equipment yet because you still want to run some more tests. They see it as "playing around", which is not usually healthy for your career!

Thanks for reading! And thanks to Mariusz at www.settlersoman.com for letting me use some of your pictures!

Joao Felipe Moradei