Distributed storage options for containers and VMs

When you move to running VMs or containers beyond a single host, a number of things become important: networking, storage, orchestration, and with them a bunch of other sticky problems like HA, load balancing, failover, sessions and persistence management. This would be a good time to circle back to the Stateless and Swiss engineer post where we talk about scale out and scale up scenarios. We have already covered networking for containers in some depth, along with failover and load balancing. We have touched on storage in the article on using LXC containers with Gluster. That article is particularly important in the context of this post, and those looking for distributed storage solutions for containers should read it first. We will cover orchestration in a series of upcoming posts with Salt, but let's take a quick overview of storage today and why you may need the above tools once you scale beyond a single host.

The need for shared storage

'Scale up' is when you have, say, a mail server and upgrade the server's hardware to handle more users. But there is a technology limit to upgrading a single box: it's not scalable and it's also a single point of failure. 'Scale out' is having multiple mail servers, which is scalable and takes you away from a single point of failure but gives you new problems to address. When you scale beyond a single host you need to think about replication or shared storage so state is maintained across servers. Replication is when you copy data across hosts. For instance, if you are running a mail server or any app and you need to scale to 2 instances, you will need to replicate data across the 2 mail servers, or have the 2 servers share the storage, so users see the same inbox. This is maintaining state. You will also need a way to distribute load to the 2 servers, which is what a load balancer does. Now take this to n servers and you have a bit of a problem on your hands. Replicating across n servers in real time is a challenge. A lot of databases operate in clustered mode. MySQL, for instance, has clustering built in, and MySQL instances can be interconnected across hosts. But when you scale there are other factors beyond the database: you have user data, session data and app data that need to be kept in sync. Some apps are designed from the ground up to be distributed and address these problems inside the app, but most aren't. This is a persistent problem in scale out scenarios and there are a number of solutions.

SAN and NAS

For a lot of enterprises the solution is shared storage, a SAN or NAS of some kind. SAN stands for storage area network: a unit containing tons of dedicated storage that is offered over the network to storage consumers. A NAS (network attached storage) is much the same thing. The difference is that a SAN exposes the storage as a block device over the iSCSI or Fibre Channel protocol, and a NAS exposes it over a network file system like NFS. Now most people are used to storage operating at around 100 MB/s for local disks and 500 MB/s for SSDs. The idea of having it on the network may seem like going backwards, but with enterprise SANs connecting over 10Gbit ethernet networks or 40Gbit plus Infiniband networks, you can get those sorts of speeds over the network, so you don't need local storage. Typically these are also deployed on networks dedicated just to storage use. So with network storage your app instances can store their data in a single place that can be accessed by all nodes. Scaling out becomes that much easier. A SAN or NAS can be engineered to be highly available and redundant, with backups. From a management perspective it is much easier to manage storage centrally rather than at the individual host level. With even 5 hosts, managing local storage in a predictable way can be a handful from a distribution, availability, redundancy and backup perspective. With centralized storage over the network the individual hosts don't matter that much and can be added and removed at will, providing a layer of flexibility. All the data is in the SAN or NAS, including the VMs, which are deployed on the hosts and launched over network storage. This is where storage companies like EMC, NetApp etc make their zillions, which newer entrants like Scale.io, Simplivity and Nutanix eye greedily. VMware also plays here with their VSAN solution.
In a lot of cases even this is not enough. For web scale companies like Google and Facebook, operating tens of data centres with millions of users, a SAN or NAS is a bottleneck, and they need to find more scale out, distributed solutions. This is also the approach that solutions like Gluster take: distributed file systems. It's important to remember at this point that most workloads are deployed in VMs, so most hosts have tons of VMs in them managed by a hypervisor like ESX or a local OS. And with the fantastic flexibility and performance of containers we will see more of them deployed for the same use cases. VMs themselves are deployed in shared, network accessible storage. This underpins capabilities like VM mobility and live migration across hosts. With shared storage for containers you can of course launch containers from multiple hosts, but uninterrupted live migration needs a few additional bits, like copying state and maintaining network connections, which LXD, the new project by the LXC team, should shortly bring. This is why the underlying networking and storage layers and their design become important for scale out scenarios. Getting them right gives you a lot of flexibility and redundancy.
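To make the SAN versus NAS distinction above a little more concrete, here is a rough sketch of how a Linux host would typically attach each kind of storage. The functions only print the commands rather than running them. The portal IP, target IQN, export path and mount point are all made-up placeholders, and the iSCSI commands assume the open-iscsi tools are installed.

```shell
# Print (do not run) the commands to attach a SAN LUN over iSCSI.
# The portal address and target IQN are illustrative placeholders.
san_attach_cmds() {
    portal=$1; iqn=$2
    echo "iscsiadm -m discovery -t sendtargets -p $portal"
    echo "iscsiadm -m node -T $iqn -p $portal --login"
}

# Print (do not run) the command to mount a NAS export over NFS.
# Server, export path and mount point are illustrative placeholders.
nas_attach_cmds() {
    server=$1; export_path=$2; mountpoint=$3
    echo "mount -t nfs $server:$export_path $mountpoint"
}

san_attach_cmds 10.0.0.10 iqn.2014-01.com.example:storage.lun1
nas_attach_cmds 10.0.0.10 /export/vmstore /mnt/vmstore
```

After the iSCSI login the LUN shows up on the host as an ordinary block device that you format and mount yourself, whereas the NFS export arrives as a ready made file system that several hosts can mount at once.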

More down to earth options

There are more down to earth options for distributed storage. If you are running your own infrastructure there are plenty of NAS solutions on the market, and Gluster remains an excellent solution to explore. Of course, having a dedicated network for storage is important for performance. There are vendors like Synology, Qnap and Thecus who provide reasonably priced NAS units. However, 10Gbit performance is inaccessible to all but those with the deepest pockets. Infiniband and 10Gbit ethernet remain pricey, so you are limited to 1Gbit networks, which in practice means roughly 100 MB/s, much slower than your average SSD. You can mitigate this somewhat by bonding, say, 4 1Gbit NICs to take that 100 MB/s to around 300 MB/s with the right equipment, but you have to remember this is bandwidth shared by all your hosts, even if you have a dedicated storage network. You can also build your own NAS unit with open source NAS oriented distros like FreeNAS, Nas4free, OMV, napp-it, Rockstor etc, or explore options like DRBD with Heartbeat for redundancy and failover. Of course you don't even need a NAS oriented distro. Simply using Btrfs with its inbuilt RAID and other advanced capabilities, together with standard tools like DRBD, Keepalived, Heartbeat etc, can give you a high level of availability, failover and redundancy.
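The bandwidth figures above follow from simple arithmetic: divide the link speed in bits by 8 to get bytes. The sketch below computes the theoretical ceilings only. Real throughput lands lower once protocol overhead is taken into account, and how much a bond actually delivers depends on the bonding mode and the switch, which is why the practical numbers quoted above are below these ceilings.

```shell
# Theoretical ceiling in MB/s for a link of the given speed in Gbit/s
# (1 Gbit/s = 1000 Mbit/s, 8 bits per byte).
gbit_to_mbytes() {
    echo $(( $1 * 1000 / 8 ))
}

# Theoretical aggregate ceiling in MB/s for N bonded links of the same speed.
bonded_mbytes() {
    nics=$1; gbit=$2
    echo $(( nics * gbit * 1000 / 8 ))
}

echo "1 Gbit link:     $(gbit_to_mbytes 1) MB/s theoretical"
echo "4 x 1 Gbit bond: $(bonded_mbytes 4 1) MB/s theoretical"
echo "10 Gbit link:    $(gbit_to_mbytes 10) MB/s theoretical"
```

Whatever the real figure turns out to be on your hardware, it is shared between every host on the storage network, so divide again by the number of busy hosts for a rough per host estimate.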

Distributed or shared storage in the cloud?

If you are on the cloud or renting dedicated servers or VMs from providers, your options are limited unless you are also renting a private network, so rather than shared storage, replication becomes the only practical option. Some providers like Amazon do give you shared storage and other components to build distributed architectures. Options for replication include longstanding Linux tools like DRBD and Lsyncd, and distributed file systems like Gluster and Ceph. Gluster is great for replicating storage across a large number of nodes, with performance improving for every extra node added. What Gluster conceptually does is take local storage, replicate it across hosts and make it available over the network. It's a fantastic solution. The caveat is that performance for small files is not great, but it's improving rapidly.
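As a rough illustration of what "take local storage and replicate it across hosts" looks like in practice, the sketch below prints the Gluster CLI commands for a simple replicated volume. It is a dry run helper, not a full setup guide. The volume name gv0, the brick path and the host IPs are assumptions made up for the example.

```shell
# Print (do not run) the Gluster commands to build a replicated volume
# with one brick per host. Volume name, brick path and hosts are placeholders.
gluster_replica_cmds() {
    volume=$1; brick=$2; shift 2
    count=$#
    bricks=""
    for host in "$@"; do
        # Probing the node you are already logged into is not needed,
        # it is listed here only to keep the loop simple.
        echo "gluster peer probe $host"
        bricks="$bricks $host:$brick"
    done
    echo "gluster volume create $volume replica $count$bricks"
    echo "gluster volume start $volume"
}

gluster_replica_cmds gv0 /data/brick1/gv0 10.0.0.2 10.0.0.3
```

Once the volume is started, clients would typically mount it with something like mount -t glusterfs 10.0.0.2:/gv0 /mnt, at which point writes land on every replica.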

Gluster

If you are running your own infrastructure, Gluster now offers a new, hugely improved libgfapi driver that is integrated with Qemu. This offers enhanced performance, letting you deploy your VMs on Gluster nodes. You couldn't do this over the Gluster FUSE interface due to the poor throughput. Ceph is also an option for distributed storage, and Inktank, the company behind it, has recently been acquired by Red Hat, who also own Gluster. Most benchmarks I have seen suggest that for the time being Gluster is significantly faster than Ceph, though I would be glad to see benchmarks showing otherwise. If you have a number of VMs you can use Gluster to make data available across your network and perhaps even deploy your containers on the distributed layer, but the performance is not going to be great until Gluster makes the libgfapi driver accessible beyond Qemu to NFS. Gluster does seem to support the userland NFS implementation NFS Ganesha with libgfapi, so this may be worth exploring and benchmarking. You can also use some iSCSI workarounds to use the libgfapi driver. The performance benchmarks for libgfapi are excellent compared to FUSE. We are not going to go into too much detail on using Gluster in this post as we already have an in-depth guide on how to use Gluster with LXC containers, which would apply to your VMs or hosts too.

Let's also briefly look at using Lsyncd and DRBD. Lsyncd is a great little tool that uses inotify to detect changes in the file system in real time and triggers rsync to do a sync. It's near real time, overhead is low and performance is quite good, so it may be useful for some use cases. DRBD is normally used over a network to connect 2 disks at the block level to provide some redundancy. You can even use DRBD over a loopback device, but it's generally not recommended. We will cover it briefly to show how simple it is to set up DRBD for redundancy.
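Returning to the libgfapi integration mentioned above: Qemu addresses images on a Gluster volume with a gluster:// URI instead of a path on a FUSE mount. The sketch below just builds such a URI and prints a qemu-img command that uses it. The server IP, volume name gv0 and image name are placeholders, and it assumes a Qemu build with GlusterFS support.

```shell
# Build a gluster:// URI of the form gluster://server/volume/image.
# All three arguments are illustrative placeholders.
gluster_image_uri() {
    server=$1; volume=$2; image=$3
    echo "gluster://$server/$volume/$image"
}

uri=$(gluster_image_uri 10.0.0.2 gv0 vm1.qcow2)

# Print (do not run) the command that would create a 10G image on the volume.
echo "qemu-img create -f qcow2 $uri 10G"
```

The same URI can then be given to Qemu as the drive for a VM, so disk I/O goes through libgfapi rather than through a FUSE mount.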

Lsyncd

Lsyncd is pretty simple 'one to many' replication, however it's not bi-directional. It's useful to sync from a master to many slaves. There are some folks who have got it working bi-directionally, but the workarounds do not look like they will scale beyond 2-3 nodes without extreme complexity and risk.

Let's quickly test out Lsyncd. We are using an Ubuntu Trusty host for this exercise and this should work on Debian Wheezy too. For other distros the specifics may vary. We are presuming the master IP is 10.0.0.2 and the slave is 10.0.0.3.

First let's install Lsyncd on the host you want to be the master

    apt-get install lsyncd

Let's create the config and log files for lsyncd

    mkdir /etc/lsyncd/
    touch /etc/lsyncd/lsyncd.conf.lua
    mkdir /var/log/lsyncd
    touch /var/log/lsyncd/lsyncd.log /var/log/lsyncd/lsyncd-status.log

Now let's configure the master to sync the folder /var/www across slaves. Copy the config below to the /etc/lsyncd/lsyncd.conf.lua file, changing the target IP to that of the slave where the data should be synced

    nano /etc/lsyncd/lsyncd.conf.lua

    settings {
        logfile = "/var/log/lsyncd/lsyncd.log",
        statusFile = "/var/log/lsyncd/lsyncd-status.log",
        statusInterval = 20
    }

    sync {
        default.rsync,
        source="/var/www/",
        target="10.0.0.3:/var/www/",
        rsync = {
            compress = true,
            acls = true,
            verbose = true,
            rsh = "/usr/bin/ssh -p 22 -o StrictHostKeyChecking=no"
        }
    }

For multiple slaves you can repeat the 'sync' section of the config multiple times, changing the target IP for each slave. In our case we are using one slave as the target.

Now the last step. Before this works we need to generate SSH keys so the master can log in to the slaves automatically. On the master

    ssh-keygen -t rsa

The ssh-keygen utility will ask some questions. Go with the defaults and do not set a passphrase. The key pair will be created in the /root/.ssh folder. We are using the root account.
To run Lsyncd as a regular user, make sure the directories you are syncing are owned by the user running the lsyncd process, or else the process will not be able to write to the directories.

Now create a .ssh directory in root's home folder on the slaves if it doesn't exist and set its permissions. On the slaves

    mkdir /root/.ssh
    nano /root/.ssh/authorized_keys
    chmod 700 /root/.ssh
    chmod 600 /root/.ssh/authorized_keys

Copy the contents of the /root/.ssh/id_rsa.pub file from the master into the /root/.ssh/authorized_keys file of all the slaves you intend to sync with. Now log into each of the slaves with ssh from the master to confirm it works without a password. In my case the slave is on 10.0.0.3.

    ssh root@10.0.0.3

You are all set. Start the lsyncd process on the master

    service lsyncd start

Now head to the /var/www folder on the master and test the replication. Let's start by creating some files on the master and see if they replicate to the slave.

    touch file{1..100}

They should within a couple of seconds. Now let's try a larger file.

    dd if=/dev/zero of=testfile bs=1024k count=256

This too should replicate pretty quickly. As you can see, you have near real time sync, you can extend it to multiple slaves, and it's not difficult to set up.

The shortcoming of this setup is that it's one way, master to slave, and writes to the slave will not be replicated to the master. One workaround is to run Lsyncd on both hosts in a master-master configuration. The caveat is that this will probably not scale well beyond 2-3 hosts and there is plenty of potential for a split brain situation where the masters are not in sync. You would need to install lsyncd on both hosts, set up SSH keys on both sides so they can log into each other, and this is how the config would look.

    settings {
        logfile = "/var/log/lsyncd/lsyncd.log",
        statusFile = "/var/log/lsyncd/lsyncd-status.log",
        statusInterval = 20
    }

    sync {
        default.rsync,
        source="/var/www/",
        target="10.0.0.2:/var/www/",
        delete="running",
        rsync = {
            compress = true,
            acls = true,
            verbose = true,
            rsh = "/usr/bin/ssh -p 22 -o StrictHostKeyChecking=no"
        }
    }

Notice the additional 'delete=running' setting. The target shown is what you would use on 10.0.0.3. On 10.0.0.2 the target would point at 10.0.0.3 instead, so each host syncs to the other. This should work reasonably well across 2 hosts and you can probably extend it to 3, with the 2nd master syncing to the 3rd in a daisy chain. Beyond that there are a number of scenarios that would leave the shared folders in an inconsistent state or cause a race condition, so one to many remains the recommended configuration for a robust setup.

The Lsyncd config below is also reported to work for bi-directional sync

    settings {
        logfile = "/var/log/lsyncd/lsyncd.log",
        statusFile = "/var/log/lsyncd/lsyncd-status.log",
        statusInterval = 20
    }

    sync {
        default.rsync,
        source="/var/www/",
        target="10.0.0.2:/var/www/",
        delay=10,
        rsync = {
            _extra = {"-ausS","--temp-dir=/tmp","-e","ssh -p 22" }
        },
    }

Lsyncd has a number of options to explore to fine tune performance.
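For the one to many case, repeating the 'sync' section by hand for every slave gets tedious. As a rough sketch, a small shell function can generate the config from a list of slave IPs. It reuses the settings and rsync options from the one way config earlier in this section. The second slave, 10.0.0.4, is a made-up address for illustration.

```shell
# Emit an lsyncd config that syncs one source directory to every slave given.
# Usage: lsyncd_config <source-dir> <slave-ip>...
lsyncd_config() {
    source_dir=$1; shift
    cat <<'EOF'
settings {
    logfile = "/var/log/lsyncd/lsyncd.log",
    statusFile = "/var/log/lsyncd/lsyncd-status.log",
    statusInterval = 20
}
EOF
    # One sync block per slave, each pointing at that slave's copy of the directory.
    for slave in "$@"; do
        cat <<EOF

sync {
    default.rsync,
    source = "$source_dir",
    target = "$slave:$source_dir",
    rsync = {
        compress = true,
        acls = true,
        verbose = true,
        rsh = "/usr/bin/ssh -p 22 -o StrictHostKeyChecking=no"
    }
}
EOF
    done
}

# Print a config for two slaves; 10.0.0.4 is a hypothetical second slave.
lsyncd_config /var/www/ 10.0.0.3 10.0.0.4
```

Redirect the output to /etc/lsyncd/lsyncd.conf.lua on the master and restart lsyncd to pick it up. Each slave still needs the master's public key in its authorized_keys file.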

DRBD

DRBD is a block level replication technology that provides redundancy and can provide high availability when configured with Heartbeat. DRBD works over 2 nodes in primary-secondary mode. You can also add a 3rd node which acts as a sort of backup node with delayed synchronization. DRBD is tried and tested but limited to 2 nodes. The 9.0 release, expected soon, will allow users to configure DRBD over multiple nodes. DRBD is supposed to be fast and has a number of config options to tweak its performance. Since it works at the block level, the use case is limited to when you are running your own network and have access to disks at the block level. With the primary-secondary config only the primary block device of the pair is available for use, which can then be exported over NFS for use by various hosts. The other is kept in sync and can be mounted on manual failover, or by using Heartbeat for automatic failover and HA.

We are going to use a loopback device to show how easy it is to set up DRBD. Using DRBD over a loopback device is not recommended for real use. We will do this exercise on 2 Ubuntu Trusty VMs in primary-secondary mode with a pretty simple config. Let's call VM1 'drbd01' and VM2 'drbd02'. We are presuming a storage network with drbd01 on 10.0.0.2 and drbd02 on 10.0.0.3.

Let's prepare both hosts

    apt-get install drbd8-utils
    mkdir -p /srv/test /srv/fs
    cd /srv/fs
    dd if=/dev/zero of=drbd.img bs=1024K count=1000
    losetup /dev/loop0 /srv/fs/drbd.img

Now edit /etc/hosts on both hosts and associate the IPs with the drbd01 and drbd02 hostnames

    nano /etc/hosts

It should contain the lines below on both DRBD hosts

    10.0.0.2 drbd01
    10.0.0.3 drbd02

Also edit /etc/hostname on each host and set it to its respective drbd name. The names used in the DRBD config must match each host's hostname.
Now on both drbd01 and drbd02 create a config file for drbd

    nano /etc/drbd.d/r0.res

It should look like this

    resource r0 {
        net {
            shared-secret "secret";
        }
        on drbd01 {
            device /dev/drbd0;
            disk /dev/loop0;
            address 10.0.0.2:7788;
            meta-disk internal;
        }
        on drbd02 {
            device /dev/drbd0;
            disk /dev/loop0;
            address 10.0.0.3:7788;
            meta-disk internal;
        }
    }

The DRBD config operates on the concept of resources. Our resource, as you can see in the config file, is called 'r0' and the config file is also named 'r0.res'.

Now let's initialize the storage. On both the drbd01 and drbd02 nodes create the metadata and start the drbd service

    drbdadm create-md r0
    /etc/init.d/drbd start

You should get no errors. Now on the host you would like to make the primary node run the command below

    drbdadm -- --overwrite-data-of-peer primary all

This should work without error. In case you get an error run this

    drbdadm adjust r0

Now you are all set, and the 2 block devices are connected to each other and replicating over the network.

Let's format and mount the drbd device on the primary node, drbd01

    mkfs.ext4 /dev/drbd0
    mount /dev/drbd0 /srv/test

Now let's write a test file into the /srv/test folder we just mounted

    dd if=/dev/zero of=/srv/test/testfile bs=1M count=256

We just created a 256MB file. Now let's see if it's replicated. To do this we will manually switch over to the secondary node and check. First on drbd01 unmount the /srv/test folder and use drbdadm to make the primary node secondary, then on drbd02 make it primary and mount the drbd device.

On drbd01

    umount /srv/test
    drbdadm secondary r0

On drbd02

    drbdadm primary r0
    mount /dev/drbd0 /srv/test

Now you can go to the /srv/test folder on drbd02 and you should see your testfile replicated. Now let's make a new set of files here and manually switch back to the original primary

    cd /srv/test
    touch file{1..100}
    cd /
    umount /srv/test
    drbdadm secondary r0

Now back on the drbd01 node

    drbdadm primary r0
    mount /dev/drbd0 /srv/test

You should see the files you created on drbd02 now replicated in the folder.
Use drbd-overview to get the status

    drbd-overview

We used a very basic config to show you how easy it is. To make this config persistent you will need to use an init script to set up the loop devices at boot. Here is a pretty good guide on using DRBD with Heartbeat for HA and failover. DRBD has a number of options you can use, from replication protocols that select async or sync modes to disabling barriers for more performance, and more.
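If you want to check replication health from a script rather than by eye, DRBD 8.x also exposes its state in /proc/drbd. The sketch below pulls the connection state (cs), roles (ro) and disk states (ds) for device 0 out of that text. It assumes the 8.x line format shown in the sample, and the sample line itself is illustrative rather than captured from the setup above.

```shell
# Extract one field (cs, ro or ds) for minor device 0 from /proc/drbd-style
# text on stdin. Assumes the DRBD 8.x format "0: cs:... ro:... ds:...".
drbd_field() {
    sed -n "s/^ *0: .*$1:\([^ ]*\).*/\1/p"
}

# Succeed only if device 0 is connected and both sides are up to date.
drbd_healthy() {
    status=$(cat)
    [ "$(printf '%s\n' "$status" | drbd_field cs)" = "Connected" ] &&
    [ "$(printf '%s\n' "$status" | drbd_field ds)" = "UpToDate/UpToDate" ]
}

# Illustrative sample line; on a real node read /proc/drbd instead.
sample=' 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----'
printf '%s\n' "$sample" | drbd_field ro
```

On a live node you would run drbd_healthy < /proc/drbd and act on the exit status, for instance from a cron job or a monitoring check.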

Myriad options

As we can see there are a number of ways you can scale your workloads across use cases. For local use a NAS with its own storage network remains a good choice for ease of management and reasonable performance over a 1Gbit dedicated storage network, with bonding if required. For extra redundancy you can also configure DRBD across 2 NAS units. Short of a NAS you can use a simple DRBD setup over 2 nodes as a network attached storage device for some level of redundancy. Lsyncd can be handy for smaller one off replication use cases, both locally and on the cloud. Of course, when using replication you have to remember to manage DB and session persistence too, with MySQL clustering for instance and memcached for a typical PHP based app, sitting behind an Nginx or HAProxy load balancer configured with IP hashing. But Gluster is really coming into its own. For both local and cloud configurations Gluster can deliver decent levels of performance, tons of flexibility and availability. With libgfapi the performance is now significantly enhanced, letting you run a VM store in Gluster. You can use Gluster to build your VSAN or just a distributed, resilient storage layer. The best thing is it's relatively easy to set up and scale.

Build distributed storage nodes with LXC and Gluster