and I’m all out of bubblegum.

I have an interesting problem. In My God, It’s Full of Files…, I discussed some of the things I had to deal with on our production application server stack, and I used the following picture to explain things:

In that article, I briefly outlined my plan to reduce wasted space by eliminating roughly half of the data (eliminating data is always the easiest way to optimize). That plan is still in development, but it only addresses half the issue. The other half is…”wow, a 1TB filesystem? Is that a good idea?”

Looking at the diagram above, let’s pretend that my “staging file storage” has already been switched over to consisting entirely of symlinks, and that I’m only dealing with the production file storage (sitting right now at ~800GB). If it were only an 800GB LUN, I would be worried, but as it stands, things are much worse than they seem at first glance. (If you’re not familiar with LVM or virtual disks, you can skim over my Introduction to LVM in Linux column before going to the next section.)

I originally started with what I thought was a decently-sized chunk of storage: a 500GB LUN. I mean, when I started, the data set was around 200GB, so I thought “I’m going to more than double the size, that should buy me several years”. Fortunately for my paycheck, but unfortunately for my data set, business has been good, so my growth rate was…somewhat higher.

As it stands now, my production file storage looks sort of like this:

This is much worse than a single 900GB LUN. As it stands right now, any one of those 5 LUNs becoming unavailable is enough to wreck the filesystem. And even if it WERE a single LUN, how long do you suppose it’ll take to run ‘fsck’ on that? A long damn time. And that’s only right now.

Everyone reading this knows that you should graph your data usage, right? It’s sleep prevention, though, because it gives you things like this:

The blue is the total size, and the red is the used size. This one graph really shows a lot of things…most obviously, you can see that I’ve added additional storage frequently — that’s the stair-step pattern. As I’ve grown the filesystem, the amount of available storage grew as well.
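Each of those stair-steps corresponds to a new LUN being folded into the volume group and the filesystem being grown over it. Here’s a minimal sketch of that process; the device, volume group, and LV names are my guesses rather than the real ones, and the commands are only echoed (they’re destructive and need root), so swap out run() when doing it for real:

```shell
#!/bin/sh
# Sketch of one "stair-step": fold a freshly presented LUN into the
# volume group, then grow the logical volume and the filesystem on it.
# vg_deploy / lv_deploy are hypothetical names for illustration.
NEW_LUN="${1:-/dev/sdf}"

run() { echo "+ $*"; }    # dry-run wrapper; remove 'echo' to execute

run pvcreate "$NEW_LUN"                             # label the LUN for LVM
run vgextend vg_deploy "$NEW_LUN"                   # add it to the volume group
run lvextend -l +100%FREE /dev/vg_deploy/lv_deploy  # grow the LV into the new space
run resize2fs /dev/vg_deploy/lv_deploy              # grow the filesystem online
```

That’s four commands per step, which is exactly why it’s so tempting to keep doing it instead of rethinking the layout.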

At the end of August, I finally got everyone to agree on a massive (250GB) purge of old useless data. You can see, though, at this point, I’m adding nearly 100GB a month. My current method of adding more storage to the existing filesystem just isn’t going to work. (As an aside, this graph really brings to light the amazing job our sales staff has been doing. Take a look at the growth rate back in February versus December. Business has been good.)

The way I’m planning to attack this is two-fold. First, I’m going to try to reduce the amount of data I have to deal with. Only the previous X months of daily reports will be available online (where X is defined by the client services staff). This will cut down the amount of data we need to keep, but the growth rate we’re experiencing (look at the degree of the curve in that graph) is such that even if we only keep 6 months live, that’s still 600GB. And management is planning on doubling our revenue this coming year, which will likely double our report production, too. Exponential growth can’t continue indefinitely, but it can be a pain in my ass for the next year or two.

If we double to gaining 200GB a month and I have to retain 6 months, that’s still 1.2TB on a single filesystem. And you KNOW that there are exceptions to that 6 months (for instance, monthly reports as well as end-of-month reports will be kept indefinitely, apparently).
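To make the retention policy concrete, here’s a sketch of what the sweep could look like. The daily-*/monthly-* naming convention is purely my invention; the real report layout would dictate the find pattern:

```shell
# Sketch of the retention sweep. Daily reports older than the cutoff get
# deleted; monthly reports (the "kept indefinitely" exception) are never
# matched by the pattern, so they survive.
purge_daily() {
    root="$1"; days="$2"
    find "$root" -type f -name 'daily-*' -mtime +"$days" -delete
}

# Self-contained demo against a throwaway directory
# (touch -d with a relative date string is GNU-specific):
demo=$(mktemp -d)
touch -d '400 days ago' "$demo/daily-2009-01-15.pdf" "$demo/monthly-2009-01.pdf"
touch "$demo/daily-today.pdf"
purge_daily "$demo" 183    # roughly 6 months
ls "$demo"                 # only daily-2009-01-15.pdf is gone
```

The nice thing about driving it off mtime and a name pattern is that the exceptions cost nothing: anything that doesn’t match the daily pattern is simply never considered.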

Now that I’ve made my case for something needing to be done, here’s my plan: I’m going to shard my dataset.

If you’re unfamiliar with the term, sharding typically refers to databases, where you have a single mammoth database and you break it up into manageable chunks.

You can look at a filesystem as a database, and there are many similarities, so if you can shard a database, why can’t you shard a filesystem? Let’s look at this logically:

I have a single mountpoint right now: /mnt/deploy (as you can see from the above graph). The directory structure looks a lot like this:

/mnt/deploy/Client1
/mnt/deploy/Client2
/mnt/deploy/Client…
/mnt/deploy/ClientN

That’s a single FS on top of several LUNs. It’s a tower that’s waiting to be toppled over by a single missing-or-misconfigured LUN. Instead of continuing to expand my dataset into that one filesystem, what I want to do is to break it apart:

/mnt/deployFS/1/Client1
/mnt/deployFS/1/Client2
/mnt/deployFS/1/Client…
/mnt/deployFS/1/Client(N/X)
/mnt/deployFS/2/Client(N/X)+1
/mnt/deployFS/2/Client(N/X)+2
/mnt/deployFS/…/…
/mnt/deployFS/X/ClientN

Such that each directory under /mnt/deployFS/ (1, 2, … X) is its own 500GB filesystem.
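That layout implies a simple arithmetic mapping: with N clients spread evenly over X shards, client number n lands on shard ((n − 1) / (N/X)) + 1, using integer division. A quick sketch (the client and shard counts here are made up):

```shell
# Which shard does client n live on, given N clients and X shards?
# Integer division gives the range-based layout shown above:
# shard 1 holds Client1..Client(N/X), shard 2 the next block, and so on.
shard_for() {
    n="$1"; clients="$2"; shards="$3"
    per_shard=$(( clients / shards ))
    echo $(( (n - 1) / per_shard + 1 ))
}

shard_for 1 100 4     # -> 1   (Client1 on the first shard)
shard_for 26 100 4    # -> 2   (first client past N/X rolls to shard 2)
shard_for 100 100 4   # -> 4   (ClientN on the last shard)
```

A hash of the client name would work just as well and wouldn’t depend on clients being numbered, but range-based assignment keeps each shard’s contents human-predictable.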

Because the application expects to see everything in /mnt/deploy, my plan is to fill /mnt/deploy with symlinks pointing at /mnt/deployFS/X/ClientN. This should be transparent to the application itself, and it also gives me a TON of flexibility. Actually, the more I thought about this, the more appealing it became, mostly because of all of the unintended benefits:

- Filesystems are locked to a single size
- Increased reliability
- Flexible growth
- Storage Tiering

There are a lot of advantages, and really only a couple of drawbacks: primarily that the application wasn’t developed with this in mind, so it doesn’t natively know about the sharding. This will have to be solved with symlinks until a “real” solution can be engineered.
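The symlink layer itself is simple enough. Here’s a sketch, run against a scratch directory instead of the real /mnt paths:

```shell
# Build the symlink farm: every client directory on every shard gets a
# symlink in the deploy tree, so the application still sees one flat
# namespace. Using a throwaway tree here; the real roots would be
# /mnt/deployFS and /mnt/deploy.
root=$(mktemp -d)
mkdir -p "$root/deployFS/1/Client1" "$root/deployFS/1/Client2" "$root/deployFS/2/Client3"
mkdir -p "$root/deploy"

for dir in "$root"/deployFS/*/Client*; do
    ln -s "$dir" "$root/deploy/$(basename "$dir")"
done

ls "$root/deploy"    # Client1 Client2 Client3, each a symlink into a shard
```

Moving a client to a different shard then becomes a copy plus re-pointing one symlink, which can be done with almost no visible downtime for that client.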

I’m not sure that many people have done this before, honestly. Google searches for “shard a filesystem” have 0 results. A search for “shard file system” consisted entirely of typos of “shared file system”. It might be that I’m doing something new and novel, but it’s more likely that I’m doing something that I should be looking for under a different name (or, alternately, I could be doing something so dumb that no one else would even consider it).

This is where you come into play. Please let me know what you think of my idea. I asked Twitter about the ability to have multiple mountpoints into one directory (to eliminate the need for symlinks), and one third of the people responding said “use UnionFS”. Another third said “use Gluster” (and the remaining third said “Dear God no! Don’t use Gluster!”). But that wasn’t quite the question I was asking (mostly because it took 1500 words to explain what I wanted to do).

I should also say that Bash Cures Cancer thinks this is a terrible idea ;-)

So what do you think? Please let me know in the comments!