Wow does time fly! I posted the initial build of this DIY SAN/NAS solution over a year ago and I sincerely apologize for not following up with the details on my solution sooner! I am providing a link to the original article below.

DIY SAN/NAS – quest for fast, reliable, shared storage with a twist of ZFS! (Part 1)

I have two excuses for such a delay. The first is the typical “work and life have been busy,” but the other is a bit more genuine – I didn’t want to publish this article too quickly in case the solution didn’t prove reliable. Granted, 3 – 6 months would probably have established reliability, but refer back to the first excuse.

The hardware

As you may recall from Part 1, I chose a Dell R510 II 12-Bay server for my storage node. At the time, E5-2670-based machines were still too expensive and acquiring a 24-bay Supermicro SC846 (like I have in my other lab) would have been much more expensive than it is currently. I wanted something with remote access, like iDRAC, because I’d be experimenting with different storage configurations. The specifications of my Dell R510 12-bay storage node are as follows:

Dell PowerEdge R510 II 2U Server

2 x Intel(R) Xeon(R) CPU E5620 @ 2.40GHz – 4 cores with 8 threads per socket

2 x Dell 80 Plus Gold 750W Switching Power Supply FN1VT

64GB of DDR3L ECC Memory

Dell K869T J675T Remote Access Card iDRAC6 Enterprise

Chelsio T420-CR Dual Port 10GbE PCI-E Unified Wire Adapter

Dell Intel PRO/1000 VT Quad-Port PCI-e Gigabit Card YT674

Dell Perc H200 047MCV PCIe SAS/SATA 6GB/s Storage Controller

12 x Western Digital Re 1TB 7200 RPM 3.5″ WD1003FBYZ Enterprise Drive

2 x Samsung 850 Evo 250GB SSDs

2 x SanDisk Cruzer Fit CZ33 32GB USB 2.0 Low-Profile Flash Drive

Whew, I think that about does it. I’ll explain later why I chose certain pieces of hardware in the list above.

Picking the OS/filesystem

With that out of the way, my pursuit of fast, reliable, shared storage landed me in a somewhat unexpected position. I imagined using Nexenta, NetApp ONTAP Select, or, heck, even ZFS on Linux for my storage solution. I even flirted with the idea of presenting the storage via iSCSI out of Windows Server 2012 R2. For about 20 seconds I even considered using StarWind on top of Windows.

I tested Nexenta and wasn’t impressed – especially since the community edition was limited in raw capacity and offered no plugins or additional features. That, and the performance was very, very average. It was further complicated by the fact that I could not deduce whether or not it would accommodate my somewhat unconventional 10 Gbps network configuration.

NetApp ONTAP Select was still in its infancy. In fact, I don’t know that it was even available then. I love NetApp and the way its WAFL (Write Anywhere File Layout) system resembles ZFS in how it implements snapshots, etc., and I know that NetApp Clustered ONTAP 9 supports some pretty funky/unusual networking configurations. NetApp ONTAP Select was just too new and unavailable to really lean on.

The Windows 2012 R2 solutions were eliminated because I really didn’t want to have the overhead of a full Windows operating system running my storage, and I know that without special considerations I wouldn’t get very good performance. I’d also be limited to block storage (iSCSI) if dealing with Windows.

All of these considerations left me wanting the features of NetApp ONTAP with the convenience of something pre-built, but it had to be flexible. While ZFS on Linux was pretty stable, there were (and still are) some limitations. Where did all this rambling end up? FreeNAS. I know I am stating that in a somewhat negative fashion – that’s because I had, for so long, disregarded FreeNAS as a viable solution, seeing it only as a sort of DIY Synology.

The reality is that FreeNAS is built on FreeBSD so it’s secure and reliable. It utilizes ZFS which will provide redundancy, snapshot capability, performance (using ARC and L2ARC cache tiers), and can provide storage via NFS, iSCSI, CIFS, etc. Further, because FreeBSD is a full-fledged enterprise operating system, there is no real limit on the network configuration underneath.

10 Gbps of convenience

I picked the Chelsio T420-CR Dual Port 10GbE PCI-E Unified Wire adapter for my 10 Gbps connectivity because it was supported in FreeBSD, but also because it is actually used in NetApp systems, so I know it’s a reliable, enterprise-grade part. The T420-CR has two SFP+ ports that can take transceivers with fiber cable, but since I am doing this on the cheap, I used 2 x 2M DACs (Direct-Attach Cables) from Dell that I got new on eBay for about $15 each. In each of my ESXi hosts, I installed a Mellanox ConnectX-T2, which is also a 10 Gbps adapter with SFP+ ports. I went with the Mellanox cards in the ESXi hosts because I know they’re supported by VMware. I believe the Chelsio T420-CR is as well, but the single-port Mellanox ConnectX-T2 cards are extremely cheap.

The beauty of this FreeNAS/ESXi setup is that I have no 10 Gbps switch. While this would be an issue ordinarily, I only need 10 Gbps connectivity between the ESXi hosts and the storage node, so there’s really no need for more than 3 ports in the whole configuration. Essentially, the FreeNAS node (R510 II) will be my “switch”. By putting the two ports on the Chelsio T420-CR inside the FreeNAS node in bridge mode, port 1 will forward all frames to port 2, and vice versa. So I’ll have one IP address on the Chelsio T420-CR that is both listening for packets addressed to it and passing along packets that are not addressed to it. What does this mean? Instant 10 Gbps storage connectivity as well as inter-ESXi 10 Gbps connectivity. Take a look below to understand better:

In the diagram above you’ll see that I am using two ports on each Intel Pro/1000 NIC for iSCSI connectivity. This is more for the sake of compatibility and flexibility, allowing me to test VAAI and block storage if need be. The iSCSI configuration here allows for a total of 2 Gbps of throughput. In practice, I am using NFS for storage since it’s thin-provisioned and allows for compression, etc. You can see that each iSCSI vmkernel on either ESXi host is configured for a different subnet (everything in the diagram is masked with a /24). This provides multi-pathing between the FreeNAS node and the ESXi hosts. One thing I discovered in this configuration is that you cannot put IPs on two interfaces within the same subnet in FreeNAS/FreeBSD due to the way the TCP/IP stack is designed. It’s actually improper from a standards perspective to even allow multiple NICs on the same host to have IPs in the same subnet… I learned a lot here! But this is all boring stuff. Read on.

The more important, convenient aspect of this setup is that not only do I have 10 Gbps connectivity from each ESXi host to the FreeNAS box for storage, but because of how the bridge acts, I have 10 Gbps connectivity between hosts as well! Granted, for this to work, you need the FreeNAS node to be up and available. If I reboot my FreeNAS node (which would be an issue anyway since it’s not HA and all my VMs run from it), I will get “Network Connectivity Lost” alarms within vCenter because the link goes down, since there is no switch between the hosts. However, by utilizing the same vmkernel for vMotion as I already do for NFS connectivity, I gain vMotion over 10 Gbps. This performs extremely well and is so simple because there is no switch involved!

Further, jumbo frames between the hosts and FreeNAS are fully supported so long as everything is configured properly end-to-end. It’s pretty much the most convenient setup you can accomplish without introducing expensive 10 Gbps switches. There’s not much CPU overhead on the FreeNAS server during vMotion events since the NIC is really just forwarding anyway, and even if there were, I have found that the E5620 2.4 GHz CPUs are total overkill for this storage device as-is. Obviously this switchless 10 Gbps solution will not work as-is if you have a third ESXi host. I do think you’d be able to add an additional Chelsio T420-CR dual-port card and bridge all four SFP+ ports, allowing for a total of four ESXi hosts in the setup with single-port SFP+ NICs.
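For reference, the bridge described above can be sketched with standard FreeBSD commands. This is a minimal sketch: the cxgb0/cxgb1 interface names and the 10.0.0.1/24 address are placeholders for your own environment, and on FreeNAS you’d persist this through the WebUI (tunables/init scripts) rather than running ifconfig by hand.

```shell
# Create a software bridge and add both Chelsio 10GbE ports as members,
# so frames arriving on one port are forwarded out the other.
ifconfig bridge0 create
ifconfig bridge0 addm cxgb0 addm cxgb1 up

# Bring the member ports up with no addresses of their own.
ifconfig cxgb0 up
ifconfig cxgb1 up

# One common approach: put the storage IP on the bridge itself, so the
# node answers traffic addressed to it and forwards everything else.
ifconfig bridge0 inet 10.0.0.1/24
```

With this in place, each ESXi host cabled to one of the SFP+ ports sees the other host and the storage node on the same L2 segment.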

You can see that when I vMotion VMs from one host to the other (which has to move the memory of the VM along with the NVRAM file, etc.) I am getting good throughput on the FreeNAS node:

I’ve even created a video to demonstrate what vMotion/Maintenance Mode might look like for you with this network/storage configuration:

(If you find the video above useful, be sure to like and subscribe…)

The storage layout

I hate divulging this portion of any storage configuration to people because it’s extremely subjective. There is always going to be a “better” and a “worse” configuration for a specific workload. For instance, if you look at ZFS storage pool configurations with 12 x 1TB drives, you’ll likely find people recommending a single RAIDZ2 vdev in a pool. If you find someone with 12 x 4TB drives, people will recommend multiple RAIDZ2 vdevs or even a RAIDZ3 vdev. The problem is that while these configurations are conservative in terms of data preservation, they also offer the slowest write performance. Naturally, you could do many mirror vdevs and create, essentially, a RAID10, but your usable space will suffer.

I decided that I needed moderate redundancy and good throughput. As a result, I ended up with 4 RAIDZ1 vdevs. The result is roughly equivalent to a “RAID50” in the non-ZFS world. It looks like this:

Pretty simple overall. Each 3-disk vdev has a usable capacity of two disks because one disk goes to parity. So, with 12 x 1TB disks, I have a usable capacity of 8TB (before formatting). This layout should give me better write performance than a RAIDZ2 would because of write penalties, etc. It also allows me to lose up to 4 disks, so long as no two of them are in the same vdev.
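The capacity and fault-tolerance math above can be sanity-checked in a few lines of shell (the numbers are just this pool’s: 12 x 1TB disks split into 4 RAIDZ1 vdevs of 3 disks each):

```shell
# Pool layout: 12 x 1TB disks as 4 RAIDZ1 vdevs of 3 disks each.
disks=12
per_vdev=3
size_tb=1

vdevs=$(( disks / per_vdev ))                     # 4 vdevs
data_per_vdev=$(( per_vdev - 1 ))                 # RAIDZ1 loses 1 disk to parity
usable_tb=$(( vdevs * data_per_vdev * size_tb ))  # 4 x 2 x 1TB = 8TB
max_failures=$vdevs                               # best case: one failure per vdev

echo "Usable: ${usable_tb}TB, survivable failures (best case): ${max_failures}"
```

Swap in your own disk counts and sizes to compare layouts before committing to one.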

Because of how ZFS works, performance is oftentimes much better than the spindle layout would suggest. This is thanks to what ZFS calls ARC and L2ARC. The ARC (adaptive replacement cache) lives in system RAM, so in my case I have 64GB (less some overhead) of very fast cache for the most frequently read data. Because the R510 II has only 8 DIMM slots, I could only add 8 x 8GB DIMMs in order to remain affordable. While 64GB of ARC isn’t bad, more would be better. That’s where L2ARC comes in. I am sort of breaking the rules by using cheap SSDs for this – you really want MLC SSDs (like the Intel S3710) for L2ARC. However, the Samsung 850 Evo 250GB drives I am using are better than nothing. When frequently read data doesn’t fit in ARC, it gets pushed down to L2ARC. You don’t need mirrored L2ARC drives because the data still resides on spinning disk should an L2ARC SSD fail.
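If you’d rather check the ARC from the shell than the WebUI, FreeBSD exposes the ZFS ARC statistics via sysctl. This is just an inspection command (not part of the build itself), and the output will obviously vary per system:

```shell
# Current ARC size in bytes, from the ZFS kernel statistics.
sysctl -n kstat.zfs.misc.arcstats.size

# Or dump all ARC counters (hits, misses, target size, etc.)
# and eyeball the hit ratio yourself.
sysctl kstat.zfs.misc.arcstats
```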

There’s another concept in ZFS which can significantly increase performance, and that’s the ZIL (ZFS Intent Log) drive – essentially a write-log device. Since ZFS uses HBAs instead of hardware RAID controllers, there is no controller write cache. Conventional hardware RAID configurations use ~1-2 GB of DDR3 memory on the controller itself to buffer data while the controller places it on spinning disk when possible. If you want to significantly improve the write performance of your pool and you’re not using SSDs as primary storage, add a mirrored pair of ZIL drives. You want the ZIL mirrored because it is the write intent log – if data falls out of the ZIL before it’s committed to the pool, it’s lost forever, which can lead to corruption.

Sizing L2ARC and ZIL

Since ZFS allows you to assign separate SSDs for L2ARC and ZIL, you can pick any size for each! With hardware RAID controllers you can spend a real pretty penny upgrading to the model that has 2GB of write cache instead of 512MB, etc. But how much space do you need? This is kind of easy, maybe.

For ZIL, consider the maximum write speed of your network and SSD. ZFS issues a new transaction group (and updates the pool) every 5 seconds (or sooner). As one transaction group is written to the ZIL, the previous one is likely being written to disk. So, in the worst case, you have two full transaction groups that need to fit in the ZIL while the pool finishes. Simple math says that if 5 seconds is the longest interval between transaction groups, then we need the capacity of a full write stream for 10 seconds. So, if you have an SSD that can write at 250 MB/s and your network is capable of more than 250 MB/s, then you need a total of 2.5GB of ZIL capacity. Obviously a 2.5GB SSD doesn’t exist today, so almost anything will work.
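That back-of-the-envelope calculation can be sketched like so (250 MB/s is just the example write speed from above; plug in your own numbers):

```shell
# Worst case: two full transaction groups in flight at once -
# one being written to the ZIL while the previous one flushes to disk.
write_mb_s=250      # sustained write speed of network/SSD, MB/s (example)
txg_interval_s=5    # ZFS commits a transaction group at least every ~5s
txg_in_flight=2     # two intervals' worth of writes to hold

zil_mb=$(( write_mb_s * txg_interval_s * txg_in_flight ))
echo "ZIL capacity needed: ${zil_mb} MB (~2.5 GB)"
```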

L2ARC is a little more random. L2ARC, as mentioned earlier, is your read cache. There are some clever tools you can use to measure this. FreeNAS provides an RRD graph you can reference: just log into your WebUI, choose Reporting, then ZFS:

As you can see, my L2ARC is ~267GB in total and my ARC is about 56GB in total. Ideally you’d see a high hit ratio on the L2ARC as well, but because I am using my FreeNAS solution exclusively for VM storage, there are not a TON of repeat requests for read-cached pieces of data.

For your L2ARC sizing, I’d recommend about 5 – 10% of your total usable capacity as L2ARC. Only 11.5% of my IO hits my L2ARC, so I am fine. In a situation where you are storing more data or have frequent requests for the same files over and over (think of an HR or Finance share on a corporate CIFS server around open enrollment or tax season), you would likely need closer to 10-20% if you want to keep from hitting the spindles.
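Here is what that 5 – 10% rule of thumb works out to for a pool like mine (~8TB usable); adjust the percentages toward 10-20% for heavy repeat-read workloads:

```shell
# L2ARC sizing rule of thumb: 5-10% of usable pool capacity.
usable_gb=8000   # ~8TB usable, in GB (this pool's numbers)
low_pct=5
high_pct=10

l2arc_min_gb=$(( usable_gb * low_pct / 100 ))
l2arc_max_gb=$(( usable_gb * high_pct / 100 ))
echo "Suggested L2ARC: ${l2arc_min_gb}-${l2arc_max_gb} GB"
```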

Keeping an eye on things

Naturally you’ll want to set up email/SMTP support, NTP, etc. in FreeNAS. One thing I found a little lacking from the start was reporting. Because this node now holds important data, I want to make sure I know when a spindle fails. I am not local to this storage node, so I will not see the amber lights, etc. on the tray.

With the help of the FreeNAS forums, a co-worker, and some patience, I came up with a modified version of a SMART reporting script.

I set up a cron job that runs the following script:

[root@krcsan1] /mnt/StoragePool/scripts# cat esmart.sh
#!/bin/bash
#
# Place this in /mnt/pool/scripts
# Call with bash (the script uses bash features like arrays):
#   bash esmart.sh

# Build the email headers and open the HTML body.
(
echo "To: me@myaddress.com"
echo "Subject: SMART Drive Results for all drives"
echo "Content-Type: text/html"
echo "MIME-Version: 1.0"
echo " "
echo "<html>"
) > /var/cover.html

c=0
for i in /dev/da1 /dev/da2 /dev/da3 /dev/da4 /dev/da5 /dev/da6 /dev/da7 /dev/da8 /dev/da9 /dev/da10 /dev/da11 /dev/da12 /dev/da13; do
  results=$(smartctl -i -H -A -n standby -l error $i | grep 'test result')
  badsectors=$(smartctl -i -H -A -n standby -l error $i | awk '/Reallocated_Sector_Ct/ {print $10}')
  temperature=$(smartctl -i -H -A -n standby -l error $i | awk '/Temperature_Cel/ {print $10}')
  ((c=c+1))

  if [[ $results == *"PASSED"* ]]; then
    status[$c]="Passed"
    color="green"
  else
    status[$c]="Failed"
    color="red"
  fi

  echo "$i status is ${status[$c]} with $badsectors bad sectors. Disk temperature is $temperature."
  echo "<div style='color: $color'> $i status is ${status[$c]} with $badsectors bad sectors. Disk temperature is $temperature.</div>" >> /var/cover.html
done

echo "</html>" >> /var/cover.html
sendmail -t < /var/cover.html
exit 0
[root@krcsan1] /mnt/StoragePool/scripts#

The result is a very, very basic HTML email that looks like this:

Clearly you can see /dev/da11 doesn’t show a sector count – that’s because da11 is an SSD. You can beautify the output if you want, but this works for me. In fact, this very SMART script saved me from data loss when it alerted me to a failed disk in my storage node, which I made a blog post about some time ago.

One more useful tip is to use the following command to identify which slot has which serial number disk:

[root@krcsan1] ~# sas2ircu 0 DISPLAY | grep -vwE "(SAS|State|Manufacturer|Model|Firmware|Size|GUID|Protocol|Drive)"

Once you have correlated all of your slots to the serial numbers, go ahead and edit your disk info by opening the FreeNAS WebUI and going to Storage, then View Disks. Select each disk, click Edit, and populate the “description” field with the serial number:

This will make identifying the slot number much easier when you get a report that says /dev/da6 has failed – you can look up your table and see that da6 was slot #10 and you can be sure to pull the correct disk. I’ve even gone and made a very, very quick video replacing a failed disk on this node:

I will be following up, yet again, with some actual storage performance metrics. Don’t worry, it won’t be another year or more for that post! Look for it soon – most likely later this week. Thanks for reading!
