Let me regale you with the story of a young sysadmin who wanted to prove that he didn't need expensive hardware and software to have an effective backup solution. More importantly, I hope you can learn from my mistakes so that you don't repeat them.

Backstory

I started my current position 6 months ago; the current sysadmin had been at the company for 6 months and was completely overwhelmed. I was brought on as young blood to help provide the extra support needed to get the system into shape. There were many issues to be fixed, one of which was a flaky backup system that never seemed to work right. After tackling many great tasks, I was finally given the responsibility of fixing the extremely fragile backup system.

The Backup System

The backup system already in place was a Seagate BlackArmor that our Veeam backup server wrote its backups to. It had a whole host of problems, but the short version is that it did not have the performance we needed. If you tried to run more than one backup at a time, the system would start dropping packets or stop responding, which led Veeam to think that the backup target was offline and fail our backups.

The Solution

The solution I eventually picked was a slightly modified Backblaze Storage Pod 3.0. Here is my parts list.

Total price without disks comes out to a cool $1070.67. Add in 7 x Western Digital Red 2TB hard drives and our total cost for 14TB of raw storage is $1609.20. This is substantially less than any prebuilt NAS or SAN offering comparable features. We also pulled all four 2TB drives from our previous "NAS" and added them to this system, which gave us a total of 22TB of raw storage.
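For anyone who wants to check my math, here is a quick sketch of the numbers above. The dollar amounts and drive counts come straight from this post; the per-drive and per-terabyte figures are derived from them rather than quoted from an invoice, so treat them as approximations.

```python
# Figures quoted in the post.
base_cost = 1070.67      # all parts, no disks (USD)
total_cost = 1609.20     # parts plus the 7 new drives (USD)
new_drives = 7
reused_drives = 4        # pulled from the old "NAS"
drive_tb = 2             # every drive is 2TB

# Derived figures (not quoted anywhere, just arithmetic).
drives_cost = round(total_cost - base_cost, 2)
per_drive = drives_cost / new_drives
new_raw_tb = new_drives * drive_tb
total_raw_tb = new_raw_tb + reused_drives * drive_tb
cost_per_new_tb = total_cost / new_raw_tb

print(f"New drives cost:     ${drives_cost:.2f} (${per_drive:.2f} each)")
print(f"Raw storage, new:    {new_raw_tb}TB")
print(f"Raw storage, total:  {total_raw_tb}TB")
print(f"Cost per new raw TB: ${cost_per_new_tb:.2f}")
```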

The Problems

Here is where things got interesting. The parts arrived (except the chassis, which got lost in transit), I put the system together, and it booted, but only one set of disks was showing up. What gives, I thought? After some searching around I found out that by default only one PCI-E port runs in PCI-E 2.0 mode; the rest run in 3.0. It appears that this particular RAID card would be detected by the system, and would even boot up and let a user into its BIOS, but if it was running on a PCI-E 3.0 port it would not detect any disks. So I went into the motherboard's BIOS and forced the system to run each port in PCI-E 2.0 mode.

Problem fixed, so the system should be good to go, right? If only I were so lucky.

I set up my two ZFS arrays: our 4 older disks in a RAIDZ array, and 6 of our new disks in a RAIDZ2 array with one hot spare drive. I set up our iSCSI initiators, connected our Veeam server to the RAIDZ2 array, and started copying over our data. We were getting respectable speeds, and the migration to the new NAS was going swimmingly.
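If you are wondering how much of that raw storage is actually usable, here is a rough sketch. It assumes the usual rule of thumb that RAIDZ gives up one disk's worth of space to parity and RAIDZ2 gives up two. It ignores ZFS metadata, padding, and the TB versus TiB difference, so real numbers will be somewhat lower. The function name is my own, not anything from FreeNAS.

```python
def raidz_usable_tb(disks, disk_tb, parity):
    """Rough usable capacity: data disks times disk size.

    parity is 1 for RAIDZ and 2 for RAIDZ2. This is an estimate only;
    it does not account for filesystem overhead.
    """
    if disks <= parity:
        raise ValueError("need more disks than parity devices")
    return (disks - parity) * disk_tb

# The two arrays described above, all built from 2TB drives.
old_pool_tb = raidz_usable_tb(disks=4, disk_tb=2, parity=1)  # RAIDZ
new_pool_tb = raidz_usable_tb(disks=6, disk_tb=2, parity=2)  # RAIDZ2

# The hot spare sits idle until a disk fails, so it adds no capacity.
print(f"Old disks (RAIDZ):  ~{old_pool_tb}TB usable")
print(f"New disks (RAIDZ2): ~{new_pool_tb}TB usable")
```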

Two days later, after our backups had started running to the new system, I received a very strange error from Veeam in my email.

"Storage file 'F:\Backups\Veeam\Veeam2013-06-17T120050.vbk' is missing from host 'This server'."

This concerned me, so I logged in and found that one of my folders appeared to have become corrupt. One disk check later, most of my files were back in place, but enough were missing that it required a new full backup. After kicking off the full backup, I logged into the FreeNAS web interface. It showed no errors, and the RAID array seemed to be in great shape. I figured it was a fluke and went to grab some food.

The next day I came into the office and checked the console of our newest server. I was greeted with a wonderful screen. Much Googling revealed very little about this error, so I assumed it was just a result of the system running. I went on my merry way, assuming my previous issue was a one-time fluke, and moved on to fix other problems. I was wrong, very wrong.

That evening I received quite a few more emails about failed backups. I logged in to find my data drive empty. I did a disk scan and most of my files were found, but they were completely mangled. At this point I stopped my backups and started looking for errors, as it was clear there was something seriously wrong with my system.

A few posts on the forums and a couple of days later, we had discovered my problem, and let me tell you, I felt stupid when we discovered it. You see, I had mounted and formatted my share as an NTFS partition, which is fine and dandy as long as you follow one very important rule. This rule, which I did not know, is: never connect multiple Windows machines to the same iSCSI share formatted for NTFS. Why is this, you might ask?

The Answer

Well, as a user on the forums was kind enough to point out, NTFS is not cluster aware, and this causes exactly this issue. What is a cluster-aware file system, you might ask, and why does NTFS not being cluster aware cause my data to miraculously disappear? I set out to find out, and found a very helpful blog post with answers to these important questions.

The really nice answer given is,

"For example, lets assume you have a SAN with a single volume, and you connect it to two servers running Windows 2003 Standard. Both systems see the volume and try to use it. When the first server writes files to the system, everything works fine. Then the other server modifies the files in some way, perhaps it just reads the files and updates the date accessed attribute on the file. NTFS is looking at the blocks of the file system and see’s changes it did not expect. NTFS at this point may think there is something wrong with that Block and take some corrective action. At the same time the other server will see something strange happening and take the same action. In the end, you have a corrupt file."
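To make that a little more concrete, here is a toy sketch of the failure. To be clear, this is not how NTFS works internally. It is a deliberately simplified model I made up, in which each host caches its own view of which blocks are free and never checks with the other host. The class and host names are hypothetical.

```python
class SharedVolume:
    """Stands in for the iSCSI LUN: just a list of blocks."""
    def __init__(self, n_blocks):
        self.blocks = [None] * n_blocks


class Host:
    """A machine that mounts the volume and trusts its own cached state."""
    def __init__(self, name, volume):
        self.name = name
        self.volume = volume
        # Free-block list is read once at mount time and never refreshed,
        # because the host assumes it is the only writer.
        self.free = [i for i, b in enumerate(volume.blocks) if b is None]
        self.files = {}

    def write_file(self, filename, data):
        block = self.free.pop(0)          # allocate from the cached view
        self.volume.blocks[block] = data  # write straight to shared disk
        self.files[filename] = block

    def read_file(self, filename):
        return self.volume.blocks[self.files[filename]]


volume = SharedVolume(n_blocks=8)
veeam = Host("veeam-server", volume)
workstation = Host("workstation", volume)  # the forgotten test connection

veeam.write_file("backup.vbk", "backup data")
workstation.write_file("notes.txt", "workstation data")

# Both hosts believed block 0 was free, so the second write clobbered the first.
corrupted = veeam.read_file("backup.vbk") != "backup data"
print("backup corrupted:", corrupted)
```

A cluster-aware file system adds the coordination this sketch is missing, so that one host's allocations and metadata changes are visible to the others before they write.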

So where was the extra connection to the iSCSI share coming from? Directly from my workstation. You see, when I first made the iSCSI share I used my own machine as the test box for formatting the disk and so on, and I left this connection running after I connected our Veeam server to the exact same iSCSI share. After seeing this, all I could think was "Well, there's your problem". I removed this connection, and a week and a half later the system is running fantastically, providing the performance we need at a price point the administration is happy with.
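In hindsight, a simple sanity check would have caught this. Here is a minimal sketch, assuming you can get a list of (initiator, target) pairs for the active iSCSI sessions from somewhere. How you export that list from your NAS is not covered here, and the session data and IQN-style names below are made up for illustration.

```python
from collections import defaultdict


def targets_with_multiple_initiators(sessions):
    """Return targets that more than one initiator is connected to.

    sessions is an iterable of (initiator_name, target_name) pairs.
    """
    by_target = defaultdict(set)
    for initiator, target in sessions:
        by_target[target].add(initiator)
    return {
        target: sorted(initiators)
        for target, initiators in by_target.items()
        if len(initiators) > 1
    }


# Hypothetical session list resembling my situation.
sessions = [
    ("iqn.example:veeam-server", "iqn.example:backup-target"),
    ("iqn.example:workstation", "iqn.example:backup-target"),
    ("iqn.example:veeam-server", "iqn.example:other-target"),
]

shared = targets_with_multiple_initiators(sessions)
for target, initiators in shared.items():
    print(f"WARNING: {target} is shared by {', '.join(initiators)}")
```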

What are some things you learned the hard way that you think no other person should have to go through?

**Chassis has been added**