Some time back, I looked at what it would take to run a container-based Minio S3 object store on top of vSAN. This involved using our vSphere Docker Volume Service (aka Project Hatchway), and the details can be found here. However, I wanted to evaluate what it would take to scale out the Minio S3 object store on top of vSAN, paying particular attention to features like distribution and availability, and to examine the various data services that can be provided by both vSAN and Minio.

I also wanted to take advantage of the new host-pinning feature in vSAN 6.7 (available via RPQ – special request). The main reason for this is to control both compute and data of Minio servers on a per vSAN node basis. Using the host-pinning feature implies a NumberOfFailuresToTolerate = 0 policy setting, in other words a single copy of the data. This is because I want to save on capacity usage on the vSAN datastore when I am leveraging Minio availability features.

I do want to give a shout out to Rob Girard of Minio who answered many of my questions during this exercise.

vSAN/vSphere Setup

Let me describe my vSAN/vSphere environment first of all. This is a 4-node all-flash vSAN cluster. To make use of Minio distribution/availability/erasure coding techniques, I deployed 16 CentOS 7 VMs, each of which will be a Minio server. Each VM had an extra 100GB disk, formatted as EXT2 and mounted on /s3, and this is dedicated to the S3 store. Visually, it looks something like this:

I am using a vSAN Policy of FTT=0 for each VM since I will have a distributed Minio deployment with Erasure Coding, which will take care of host failure protection. The vSAN Policy is also leveraging the new vSAN 6.7 Host Pinning feature (available via RPQ only), so that both the compute and storage are placed on the same host. You can use the H5 client to verify the placement of components by navigating to Monitor > vSAN > Virtual Objects > Physical Placement and checking the ‘Group components by host placement’ option. Below we can see that all the objects (home, disks, swap) from one virtual machine exist on the same physical ESXi host.

From a cluster configuration perspective (HA/DRS), I created 4 VM Groups (with 4 VMs in each group) and 4 Host Groups (with a single ESXi host in each group). I then had 4 VM/Host groups which tied 4 VMs to a single host. The Host to VM Affinity has to be configured via MUST rules – 4 VMs MUST run on a single ESXi host. This also means that the VMs are not restarted anywhere else after a failure of the host to which they are pinned. We can let Minio Erasure Coding handle availability of the S3 store. This is where you set the MUST rule on the H5 client, stating that a group of VMs must run on a particular host.

Lastly, I disabled DRS for all VMs running Minio server via VM Overrides to prevent any attempt to move the VMs to other hosts to balance the cluster from a compute perspective.

Minio Server Setup

Remember that this needs to be done on all 16 VMs. The minio binaries can be pulled down directly as follows. The only thing needed is to make it an executable and place it somewhere in the $PATH. I moved it to /usr/local/bin.

[root@minio3 ~]# wget https://dl.minio.io/server/minio/release/linux-amd64/minio

[root@minio3 ~]# chmod +x minio
[root@minio3 ~]# mv minio /usr/local/bin/

[root@minio1 ~]# minio version
Version: 2018-06-09T03:43:35Z
Release-Tag: RELEASE.2018-06-09T03-43-35Z
Commit-ID: 371349787f0e324f072b03ed5a7842229d6b3174
[root@minio1 ~]#

The next step is to prepare a disk for use by minio server. Use fdisk to create a new partition to use all of the disk, and make sure to write the partition table.

[root@minio3 ~]# fdisk /dev/sdb
Welcome to fdisk (util-linux 2.23.2).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.
Device does not contain a recognized partition table
Building a new DOS disklabel with disk identifier 0x2fde013e.
Command (m for help): n
Partition type:
   p   primary (0 primary, 0 extended, 4 free)
   e   extended
Select (default p): p
Partition number (1-4, default 1): 1
First sector (2048-209715199, default 2048):
Using default value 2048
Last sector, +sectors or +size{K,M,G} (2048-209715199, default 209715199):
Using default value 209715199
Partition 1 of type Linux and of size 100 GiB is set
Command (m for help): w
The partition table has been altered!
Calling ioctl() to re-read partition table.
Syncing disks.

Now we format the disk. I used the default, EXT2, although other filesystems could be used. I have since learnt that Minio’s engineers recommend XFS as a best practice, with claims that it handles i-node management better, especially if there are many small objects [thanks Rob]. Next, I make a directory (I used /s3), and mount the new filesystem on it.

[root@minio3 ~]# mkfs /dev/sdb1
mke2fs 1.42.9 (28-Dec-2013)
Discarding device blocks: done
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
6553600 inodes, 26214144 blocks
1310707 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
800 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424, 20480000, 23887872
Allocating group tables: done
Writing inode tables: done
Writing superblocks and filesystem accounting information: done

[root@minio3 ~]# mkdir /s3
[root@minio3 ~]# mount /dev/sdb1 /s3

To make this persist through reboots, /etc/fstab can be edited and the following entry added.

/dev/sdb1 /s3 ext2 defaults 0 0
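Since the disk preparation has to be repeated on all 16 VMs, it may be worth scripting. The following is only a sketch of the steps above, not what I actually ran: the helper names (run, prepare_s3_disk), the DRY_RUN switch and the use of parted in place of interactive fdisk are all my own assumptions. By default it only prints the commands it would run.

```shell
# Sketch: prepare the S3 data disk without interactive prompts.
# Assumptions (not from the steps above): the helper names, the DRY_RUN
# switch, and parted replacing interactive fdisk.
# With DRY_RUN=1 (the default) commands are printed, not executed.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

prepare_s3_disk() {
    disk="$1"            # whole disk, e.g. /dev/sdb
    mountpoint="$2"      # e.g. /s3
    fstype="${3:-xfs}"   # xfs is the Minio recommendation; ext2 is what I used

    # One primary partition spanning the whole disk
    run parted -s "$disk" mklabel msdos
    run parted -s "$disk" mkpart primary 1MiB 100%
    # Assumes sdX-style naming, where partition 1 is "<disk>1"
    run "mkfs.$fstype" "${disk}1"
    run mkdir -p "$mountpoint"
    run mount "${disk}1" "$mountpoint"
    # Print the fstab entry for review instead of appending it blindly
    echo "${disk}1 $mountpoint $fstype defaults 0 0"
}

prepare_s3_disk /dev/sdb /s3 xfs
```

Setting DRY_RUN=0 would execute the commands for real, which destroys any existing data on the disk, so check the printed commands first.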

Starting Minio Server in Distributed Mode

If Minio is deployed in distributed mode, multiple drives, and in our case multiple drives on different machines, can be pooled into a single object storage server. As drives are distributed across several nodes, distributed Minio can withstand multiple node failures and continue to provide an S3 store. In this setup, I will be introducing a full physical ESXi host failure which will impact 4 Minio server VMs, and we will see how the distributed S3 store is still accessible.

To leverage this distributed mode, Minio server is started by referencing multiple http or https instances, as shown in the start-up steps below.

First step is to set the following in the .bash_profile of every VM for root (or wherever you plan to run minio server from). This will set the same Access Key and Secret Key for every server that is started.

export MINIO_ACCESS_KEY=admin
export MINIO_SECRET_KEY=password

The following command must be run on all 16 VMs. Once enough nodes are up to meet the “erasure coding” requirement (more on this later), the first node’s messages change from “Waiting for a minimum” to “Waiting for all other servers”. For example, in a 16 node cluster, the default Erasure Coding is N/2, meaning 8 disks are data and 8 disks are parity. Once the first 9 servers are up and running, the remaining nodes will state that they are waiting for the first node to format the disks. This will happen when all servers are running. Below is the output from the first node. Note that although the messages state that it is waiting for 8 disks to come online, it is not until the 9th server is started that the messages change.

Long startup command format (must be run on all 16 minio servers):

[root@minio1 ~]# minio server http://minio1/s3 http://minio2/s3 http://minio3/s3 \
http://minio4/s3 http://minio5/s3 http://minio6/s3 http://minio7/s3 http://minio8/s3 \
http://minio9/s3 http://minio10/s3 http://minio11/s3 http://minio12/s3 \
http://minio13/s3 http://minio14/s3 http://minio15/s3 http://minio16/s3
Waiting for a minimum of 8 disks to come online (elapsed 0s)
Waiting for a minimum of 8 disks to come online (elapsed 1s)
Waiting for a minimum of 8 disks to come online (elapsed 4s)
Waiting for all other servers to be online to format the disks.
Waiting for all other servers to be online to format the disks.
Waiting for all other servers to be online to format the disks.
Waiting for all other servers to be online to format the disks.
Waiting for all other servers to be online to format the disks.
Status: 16 Online, 0 Offline.
Endpoint: http://10.27.51.32:9000 http://172.17.0.1:9000 http://192.168.122.1:9000 http://127.0.0.1:9000
AccessKey: admin
SecretKey: password
Browser Access:
   http://10.27.51.32:9000 http://172.17.0.1:9000 http://192.168.122.1:9000 http://127.0.0.1:9000
Command-line Access: https://docs.minio.io/docs/minio-client-quickstart-guide
   $ mc config host add myminio http://10.27.51.32:9000 admin password
Object API (Amazon S3 compatible):
   Go: https://docs.minio.io/docs/golang-client-quickstart-guide
   Java: https://docs.minio.io/docs/java-client-quickstart-guide
   Python: https://docs.minio.io/docs/python-client-quickstart-guide
   JavaScript: https://docs.minio.io/docs/javascript-client-quickstart-guide
   .NET: https://docs.minio.io/docs/dotnet-client-quickstart-guide

Short startup command format (must be run on all 16 minio servers):

The start-up command could also be run as “minio server http://minio{1...16}/s3”, where {1...16} (three dots) is a range that Minio itself expands into the 16 server names.

[root@minio12 ~]# minio server http://minio{1...16}/s3
Status: 13 Online, 3 Offline.
Endpoint: http://10.27.51.51:9000 http://192.168.122.1:9000 http://127.0.0.1:9000
AccessKey: admin
SecretKey: password
Browser Access:
   http://10.27.51.51:9000 http://192.168.122.1:9000 http://127.0.0.1:9000
Command-line Access: https://docs.minio.io/docs/minio-client-quickstart-guide
   $ mc config host add myminio http://10.27.51.51:9000 admin password
Object API (Amazon S3 compatible):
   Go: https://docs.minio.io/docs/golang-client-quickstart-guide
   Java: https://docs.minio.io/docs/java-client-quickstart-guide
   Python: https://docs.minio.io/docs/python-client-quickstart-guide
   JavaScript: https://docs.minio.io/docs/javascript-client-quickstart-guide
   .NET: https://docs.minio.io/docs/dotnet-client-quickstart-guide
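To show that the long and short forms describe the same set of endpoints, here is a small sketch that builds the explicit list. The function name minio_endpoints is hypothetical, and it assumes the minio1..minioN host naming and the /s3 path used in this post. As far as I can tell, the shell passes the three-dot {1...16} form through untouched (shell brace expansion uses two dots), so it is Minio that does the expansion in the short form.

```shell
# Sketch: build the explicit endpoint list for an N-node deployment.
# Assumptions: hosts are named minio1..minioN and export the same path;
# minio_endpoints is an illustrative helper, not part of Minio.

minio_endpoints() {
    count="$1"          # number of Minio servers
    path="${2:-/s3}"    # export path on every server
    i=1
    list=""
    while [ "$i" -le "$count" ]; do
        # Append with a separating space after the first entry
        list="$list${list:+ }http://minio$i$path"
        i=$((i + 1))
    done
    echo "$list"
}

# Print the list for a 4-node example
minio_endpoints 4
```

With 16 nodes, `minio server $(minio_endpoints 16)` should be equivalent to the long start-up command shown earlier.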

Minio S3 Browser

Once all minio servers are deployed, you can now use a browser to create buckets on the S3 store, and upload/download files. This is available on port 9000, and you may need to open this port on the firewall on all the servers. On Centos, you would do that using the following commands, assuming public is the active zone:

firewall-cmd --get-active-zones
firewall-cmd --zone=public --add-port=9000/tcp --permanent
firewall-cmd --reload

You can now go ahead and point the browser against the minio servers, port 9000. In the example below, I have already created a bucket and uploaded a folder called Software.

The red circle with the + sign allows you to do various actions. Click on the button, and it will reveal options to create buckets and upload files.

Minio Client (mc)

Minio Client (mc) allows you to interface with the S3 server via the Centos command line. The first step is to download mc, then you need to add your minio server to the list (as shown below). Now you can begin to use the client to do lots of other stuff, such as query the status of the deployment.

[root@minio1 ~]# wget https://dl.minio.io/client/mc/release/linux-amd64/mc
[root@minio1 ~]# chmod +x mc
[root@minio1 ~]# mv mc /usr/local/bin/
[root@minio1 ~]# mc config host add minio1 http://10.27.51.32:9000 admin password
Added `minio1` successfully.
[root@minio1 ~]# mc config host list
gcs   : https://storage.googleapis.com YOUR-ACCESS-KEY-HERE YOUR-SECRET-KEY-HERE S3v2 dns
local : http://localhost:9000 auto
minio1: http://10.27.51.32:9000 admin password s3v4 auto
play  : https://play.minio.io:9000 Q3AM3UQ867SPQQA43P2F zuf+tfteSlswRu7BJ86wekitnifILbZam1KYY3TG S3v4 auto
s3    : https://s3.amazonaws.com YOUR-ACCESS-KEY-HERE YOUR-SECRET-KEY-HERE S3v4 dns
[root@minio1 ~]# mc admin info minio1
● minio1:9000
  Uptime : online since 40 minutes ago
  Version : 2018-06-09T03:43:35Z
  Region :
  SQS ARNs : <none>
  Stats : Incoming 38MiB, Outgoing 545KiB
  Storage : Used 333KiB
  Disks : 16 , 0
● minio10:9000
  Uptime : online since 40 minutes ago
  Version : 2018-06-09T03:43:35Z
  Region :
  SQS ARNs : <none>
  Stats : Incoming 4.0MiB, Outgoing 511KiB
  Storage : Used 333KiB
  Disks : 16 , 0
<...output repeats for all servers...>

I also leveraged a third-party freeware S3 browser from NetSDK LLC for creating buckets and transferring data from my desktop to the Minio S3 object store. It seems to work well for my purposes.

Minio Erasure Coding

When minio server is started, a configuration file (/root/.minio/config.json) is created. In this configuration file is a section called storageclass. For more information on the entries that go into the storageclass{} structure, see https://github.com/minio/minio/tree/master/docs/erasure/storage-class. [Update – I had completely misunderstood how erasure coding worked on Minio. Rob set me straight though, and I’ve updated the post]. By default, erasure coding is implemented as N/2, meaning that in a 16 disk system, 8 disks would be used for data and 8 disks used for parity. If we go with a RAID-5 equivalent, which is 3+1 or N/4, then we would need 12 data drives and 4 parity drives. That gives a capacity overhead of approximately 1.33x (16/12, or about 33% extra), i.e. it would require about 133MB of space on the S3 store to hold 100MB of actual data.

Environment variables/exports can also be used to create erasure coding settings. To get this level of erasure coding, the following environment variable can be used.

export MINIO_STORAGE_CLASS_STANDARD=EC:4

An EC:4 implies that 4 disks will be used for parity and the remaining will be data disks. EC:4 means there will be 4 parity blocks, regardless of the number of servers. But to have EC:4, there’s an implication that you can’t have more parity than data, which means your cluster has to be a minimum of 8 disks to support it. In our case, EC:4 happens to equal N/4 when N = 16 (12 data disks, 4 parity disks). With 12 disks, it is 8 data and 4 parity. With 8 disks, it is 4 disks assigned to data and 4 assigned to parity, and so on. The maximum amount of redundancy that can be configured is when the number of parity disks equals the number of data disks. The disks themselves are completely dedicated to parity or data blocks. Every incoming object starts at a different disk. First Minio writes all of the data, followed by the required parity blocks.

In my setup, I have 16 VMs in the distributed Minio deployment, each contributing a single disk. Erasure Coding is set to EC:4, implying I now have 12 data and 4 parity.
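The data/parity split and the capacity overhead for the layouts mentioned above can be worked out with a few lines of shell. This is just arithmetic based on the rules described in this section (parity may not exceed half of the disks, and overhead is total disks divided by data disks); the function name ec_layout is my own.

```shell
# Sketch: data/parity split and capacity overhead for an EC:<parity> setting.
# ec_layout is an illustrative helper, not a Minio command.

ec_layout() {
    disks="$1"     # total disks in the deployment
    parity="$2"    # parity count, as in EC:<parity>
    # Parity can be at most equal to data, i.e. at most half of the disks
    if [ "$parity" -gt $((disks / 2)) ]; then
        echo "parity ($parity) cannot exceed half of the disks ($disks)" >&2
        return 1
    fi
    data=$((disks - parity))
    # Raw capacity consumed per unit of user data
    overhead=$(awk -v n="$disks" -v d="$data" 'BEGIN { printf "%.2f", n / d }')
    echo "disks=$disks data=$data parity=$parity overhead=${overhead}x"
}

ec_layout 16 8   # default N/2
ec_layout 16 4   # EC:4 on 16 disks
ec_layout 12 4
ec_layout 8 4
```

For the EC:4 case on 16 disks this reports 12 data disks and an overhead of 1.33x, which is where the figure of roughly 133MB of raw space per 100MB of objects comes from.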

Note that if the config.json exists, it is the source of truth. It overrides any environment variable exports. Instead of editing it, you can always delete this file, set new environment variable exports, and then launch minio server again. This will behave as if it’s the first launch and recreate a new config.json.

As a test, I stopped the minio server on 4 out of the 16 VMs, and the S3 object store was still available. When I stopped a 5th minio server, I could no longer access the files/objects, which is the expected behaviour. On bringing the 12th minio server back online, I had access once more, indicating that erasure coding was working as expected.
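The behaviour I observed can be expressed as a simple rule: the store stays accessible while the number of online disks is at least the number of data disks. The sketch below encodes only that simplified rule; it is my reading of the test result rather than Minio's actual quorum logic (which may be stricter for writes), and store_status is a made-up name.

```shell
# Sketch: is the object store still accessible with some servers offline?
# Simplified assumption: access needs at least (disks - parity) disks online.
# This mirrors the test above; it is not Minio's real quorum code.

store_status() {
    disks="$1"      # total disks (one per Minio server in my setup)
    parity="$2"     # parity count, as in EC:<parity>
    offline="$3"    # number of servers currently down
    online=$((disks - offline))
    needed=$((disks - parity))
    if [ "$online" -ge "$needed" ]; then
        echo "available (online=$online, needed=$needed)"
    else
        echo "unavailable (online=$online, needed=$needed)"
    fi
}

store_status 16 4 4   # four servers stopped
store_status 16 4 5   # a fifth server stopped
```

With EC:4 on 16 disks, four servers down leaves 12 online against 12 needed, while a fifth takes it below the threshold.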

Minio Encryption

I wanted to evaluate the minio encryption feature, but was informed that I would need to start minio using the https format rather than the http used earlier. This would involve (in my case) generating self-signed certs for every minio server, then copying the public cert to every server. Minio has a document here – https://docs.minio.io/docs/how-to-secure-access-to-minio-server-with-tls – describing different ways to create a public and private key. The generate_cert.go script automatically creates files with a .PEM extension. I used a modified version of generate_cert.go, which created the public and private filenames expected by Minio → /root/.minio/certs/public.crt and /root/.minio/certs/private.key. However, every server needs to have every other server’s public.crt, placed in /root/.minio/certs/CAs. So for a 16 node deployment, each node needs the public.crt of the other 15 servers saved in its CAs folder. Once that is done, the servers can be started with the https reference, and now you can use the encryption feature to copy/mirror between buckets.
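Copying every server's public.crt to every other server is tedious by hand (16 x 15 = 240 copies), so here is a sketch of how it could be scripted. Everything beyond the two paths named above is an assumption on my part: the NODES list, the helper names, root SSH access between hosts, the use of scp -3 to relay the copy through the machine running the script, and naming each copied cert after its source host so that they do not overwrite each other. It only prints the commands unless DRY_RUN=0 is set.

```shell
# Sketch: distribute each server's public.crt to every other server's CAs dir.
# Assumptions: root SSH access, the NODES list, the <host>.crt naming
# scheme, and the helper names are all illustrative.

NODES="${NODES:-minio1 minio2 minio3}"   # use all 16 hostnames in practice

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

distribute_certs() {
    for src in $NODES; do
        for dst in $NODES; do
            # A server does not need its own cert in its CAs folder
            if [ "$src" != "$dst" ]; then
                # -3 relays the copy through the host running this script
                run scp -3 "root@$src:/root/.minio/certs/public.crt" \
                    "root@$dst:/root/.minio/certs/CAs/$src.crt"
            fi
        done
    done
}

distribute_certs
```

The CAs directory must already exist on each destination server before the copies are run for real.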

Some encryption tests

The final step is to use mc to initiate a mirror between two on-premises buckets, and replicate the content at the same time. The name of the server you call, and the bucket name, are string matches that invoke the key. You can copy (or mirror) objects from an un-encrypted state to an encrypted state in another bucket by specifying different keys.

I used the following mc command, as the contents of the source bucket are not encrypted, but I did want to encrypt the content of the destination bucket.

[root@minio1 ~]# mc mirror --encrypt-key "minio1/new-bucket-1b94196f=32byteslongsecretkeymustbegiven2" \
    minio1/new-bucket-adfe8a34 minio1/new-bucket-1b94196f
...ware-esx-perccli-1.05.08.zip: 37.15 MB / 37.15 MB ┃▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓┃ 6.67 MB/s 5s
[root@minio1 ~]#

Instead of using mirroring, you can also copy between buckets with encrypted keys.

[root@minio1 ~]# mc cp --recursive --encrypt-key \
    "minio1/new-bucket-1b94196f=32byteslongsecretkeymustbegiven2,minio1/new-bucket-e4a748ab=32byteslongsecretkeymustbegiven3" \
    minio1/new-bucket-1b94196f minio1/new-bucket-e4a748ab
...ware-esx-perccli-1.05.08.zip: 37.15 MB / 37.15 MB ┃▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓┃ 100.00% 6.91 MB/s 5s
[root@minio1 ~]#
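The --encrypt-key value is a comma-separated list of "alias/bucket=key" pairs, and the keys in my examples are exactly 32 characters long. The small helper below builds one such pair and refuses keys of any other length. The 32-byte check is my assumption, based on the key strings above and on SSE-C using 256-bit keys, and the function name encrypt_key_arg is made up for illustration.

```shell
# Sketch: build one "target=key" pair for mc's --encrypt-key option.
# Assumption: keys must be exactly 32 bytes (256 bits).
# encrypt_key_arg is an illustrative helper, not part of mc.

encrypt_key_arg() {
    target="$1"   # alias/bucket prefix the key applies to
    key="$2"      # secret key for that prefix
    len=$(printf '%s' "$key" | wc -c | tr -d ' ')
    if [ "$len" -ne 32 ]; then
        echo "key for $target is $len bytes, expected 32" >&2
        return 1
    fi
    echo "$target=$key"
}

encrypt_key_arg minio1/new-bucket-1b94196f 32byteslongsecretkeymustbegiven2
```

Two pairs can then be joined with a comma to give the form used in the mc cp example.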

Failure Scenario – Single ESXi/vSAN node, including 4 minio server VMs

This final test is to be able to suffer a complete ESXi host failure in my 4 node vSAN cluster, and still verify that minio is able to present the S3 object store. I also want to ensure that there is no attempt to restart the minio VMs that are on the failing host elsewhere in the cluster. The first thing we observe after the PSOD of the ESXi host are the vSphere alarms related to host connection and vSphere HA host status:

We also see the Minio VMs enter a state of disconnected. These are the only VMs left on the host, since we put in place various DRS and HA rules to prevent them being restarted anywhere else (they wouldn’t have their VMDK anyway, as it is pinned to the failed host). All other VMs have been restarted / moved elsewhere in the cluster.

We also start to see errors appear on the remaining Minio servers complaining about the missing disks. We also get errors when running minio client (mc) commands:

API: SYSTEM()
Time: 14:23:20 IST 06/21/2018
Error: disk not found
       1: cmd/xl-v1-utils.go:301:cmd.readXLMeta()
       2: cmd/xl-v1-utils.go:331:cmd.readAllXLMetadata.func1()
API: SYSTEM()
Time: 14:23:20 IST 06/21/2018
Error: disk not found
       1: cmd/xl-v1-utils.go:301:cmd.readXLMeta()
       2: cmd/xl-v1-utils.go:331:cmd.readAllXLMetadata.func1()
API: SYSTEM()
Time: 14:23:20 IST 06/21/2018
Error: disk not found
       1: cmd/xl-v1-utils.go:301:cmd.readXLMeta()
       2: cmd/xl-v1-utils.go:331:cmd.readAllXLMetadata.func1()
API: SYSTEM()
Time: 14:23:30 IST 06/21/2018
Error: disk not found
       endpoint=minio7:9000
       1: cmd/prepare-storage.go:39:cmd.glob..func4.1()
       2: cmd/xl-sets.go:180:cmd.(*xlSets).connectDisks()
       3: cmd/xl-sets.go:211:cmd.(*xlSets).monitorAndConnectEndpoints()
API: SYSTEM()
Time: 14:23:33 IST 06/21/2018
Error: disk not found
       endpoint=minio11:9000
       1: cmd/prepare-storage.go:39:cmd.glob..func4.1()
       2: cmd/xl-sets.go:180:cmd.(*xlSets).connectDisks()
       3: cmd/xl-sets.go:211:cmd.(*xlSets).monitorAndConnectEndpoints()
API: SYSTEM()
Time: 14:23:36 IST 06/21/2018
Error: disk not found
       endpoint=minio15:9000
       1: cmd/prepare-storage.go:39:cmd.glob..func4.1()
       2: cmd/xl-sets.go:180:cmd.(*xlSets).connectDisks()
       3: cmd/xl-sets.go:211:cmd.(*xlSets).monitorAndConnectEndpoints()

When we query the system via the minio client, we see 4 disks with problems and 4 servers offline (the 4 which were on the failed host).

[root@minio1 ~]# mc admin info minio1
● minio1:9000
  Uptime : online since 4 hours ago
  Version : 2018-06-09T03:43:35Z
  Region :
  SQS ARNs : <none>
  Stats : Incoming 85MiB, Outgoing 11MiB
  Storage : Used 253KiB
  Disks : 12 , 4
● minio10:9000
  Uptime : online since 4 hours ago
  Version : 2018-06-09T03:43:35Z
  Region :
  SQS ARNs : <none>
  Stats : Incoming 58MiB, Outgoing 25MiB
  Storage : Used 253KiB
  Disks : 12 , 4
● minio11:9000
  Uptime : Server is offline
  Error : Post https://minio11:9000/minio/admin: dial tcp 10.27.51.53:9000: connect: no route to host
<<—output truncated -→>>
● minio9:9000
  Uptime : online since 4 hours ago
  Version : 2018-06-09T03:43:35Z
  Region :
  SQS ARNs : <none>
  Stats : Incoming 58MiB, Outgoing 25MiB
  Storage : Used 253KiB
  Disks : 12 , 4
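Rather than reading through 16 blocks of output, the number of offline servers can be counted with grep. This sketch matches on the "Server is offline" text that appears in the output above, so it assumes the output format of this particular release; the function name count_offline and the sample text are for illustration only.

```shell
# Sketch: count offline servers in "mc admin info" output.
# Assumption: offline servers are reported with the text "Server is offline",
# as in the release used in this post.

count_offline() {
    # grep -c exits non-zero when the count is 0, so mask that status
    grep -c 'Server is offline' || true
}

# Abbreviated sample in the same shape as the real output
sample='● minio10:9000
  Uptime : online since 4 hours ago
● minio11:9000
  Uptime : Server is offline'

printf '%s\n' "$sample" | count_offline
```

Against the live system this would be run as `mc admin info minio1 | count_offline`, which should report 4 during the host failure described here.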

However I can still use an S3 client to connect and navigate the various S3 buckets, as well as create new ones. Everything continues to work as expected.

Recovery after Failure Event

In the previous failure scenario, the following were the recovery events:

1. Reset ESXi host and let it boot
2. Power on Minio VMs which were in a powered off state after the host failure since they could not be restarted anywhere else due to DRS/HA rules
3. Start the minio servers as before

No other actions were needed. Everything resolved and came back online.

Conclusion

That concludes my overview of running Minio S3 as a set of VMs on vSAN. You might ask why go to the trouble of deploying VMs with host-pinning+FTT=0 and use Minio erasure coding? Why not use vSAN erasure coding? Don’t both more or less give you the same thing? The answer is yes, at this scale, they both do pretty much the same.

[Update based on new information] But let’s say that you want to scale out your vSAN to a larger number of nodes, and use it for a much larger S3 object store deployment. Now you have far more erasure coding options. For example, if you had a 32 node vSAN cluster, and the sole purpose of it was to provide an S3 object store, you could expand on my setup and deploy 1 minio server VM on each of the 32 nodes (32 VMs). However, with 32 nodes, you may want to tolerate more than 4 physical ESXi host failures. In this case, you could use a minio erasure coding setting greater than EC:4. This means that you could have, for example, 8 parity disks and 24 data disks. By using host-pinning and FTT=0 as the vSAN policy, and assigning 1 VM per host, you now have the ability to tolerate as many as 8 physical host failures in the 32 node cluster.

I initially thought that we could extrapolate the same calculation to a 64 node vSAN cluster, and be able to tolerate >8 failures, using the same approach described previously. However to scale Minio beyond an existing 16-node cluster, you would have to use federated mode and add an additional cluster. At this time you can’t grow Minio to more than a 32-node cluster. Also note that you cannot grow a Minio cluster after deployment (the number of disks must remain the same) – so you cannot scale from a 16 to 32 node minio cluster for example. Personally, I think this would be a nice feature if they could offer it. I guess the next question then is, as you scale your vSAN cluster from say 4 nodes to 8 nodes in the example above, can you redistribute the Minio VMs to the new nodes? The answer is no, at this point in time, since there is no way to migrate a host-pinned VM. So the advice here, I guess, is to plan your capacity and sizing correctly upfront.

From a vSAN perspective, the added benefits are that you will also get all of the management and functionality of vSAN HCI (health checks, alerting, degraded disk handling, etc). And at the same time, if an application has the ability to protect itself, then you can ensure that the VMs do not consume any unnecessary additional vSAN datastore capacity with FTT=0.

I will state that there are a number of things that I have not yet looked at, including performance, reverse-proxy for Minio, logging, monitoring, on-premises replication and whether vSAN needs check-summing if Minio is also providing it. But at a base level, this could be a neat solution for someone looking to do on-premises S3 object store on top of vSAN. I have also not looked at whether this could be feasible in vSAN 2-node or stretched cluster deployments. But hopefully this post provides some food-for-thought on some other possibilities around vSAN use-cases. Speak to your local VMware representative if host-pinning is something you think you could use with your own self-protecting applications on vSAN.