I was talking with one of Qumulo’s animation customers about capacity utilization, and they mentioned that they have a particularly difficult time getting people to delete data: they have roughly a thousand artists all working in the same file system tree with hundreds of millions of files! When they start to run low on capacity, they have to apply backpressure. But, in order to do so effectively, they have to tell people what capacity they’re consuming: it’s not practical for a thousand users to crawl hundreds of millions of files. So, this customer had built custom code that ran for many hours accumulating data about the footprint of every user. This was slow and cumbersome, resource intensive, hard to maintain, and delivered out-of-date information. The customer challenged us to deliver something better.

How do you get users to delete data?

A painful part of data administration is that people create data, and they never free it. And it doesn’t really hurt anything, until everyone is out of space, and everyone has to go look around for things to delete. By which point their application is falling over and everyone has forgotten what they created.

The timeline kind of looks like this:

Best practices

Storage administrators know there are a few ways to get people to delete data.

You can cut your users off. You could let your whole application fall apart as the storage system finally grinds to a halt. As it happens a lot of storage systems also fall apart as they hit 100% full… make sure to ask your storage vendor about how performance is, and whether you’re protected against drive failures when you fill up your storage system.

You could set quotas, maybe on users, or maybe on directories. This works great, but it still breaks applications (files can become corrupt when an application encounters out-of-space conditions, for example, and it doesn’t solve the problem of helping the user figure out what to delete). The second way to get users to delete data is to charge them ($ cost) or shame them (reputational cost). According to the hundreds of administrators Qumulo has talked to, shame actually works better than charging. Maybe it’s related to the phenomenon whereby fining parents for picking their kids up late from preschool [actually increases lateness.]https://www.theguardian.com/commentisfree/2014/jan/23/fining-parents-unpunctual-school-pupils) People might be happy spending $100 of their company’s money, but unhappy having their coworker know they’re spending $100 The third way is to help users overcome the “I forget it” stage and help them see what data they’ve created. The idea here is that if you quickly show a user what data they’ve created and where, they can make a quick decision about whether it’s still needed.

Solutions

Unfortunately, when dealing with hundreds of millions of files, looking at every single one to create a report simply isn’t an option… it’s a ton of performance for a storage system to deliver, it takes a lot of time to process that much data, and the data you get is out-of-date by the time it’s produced. What is needed is a way of rapidly, and cheaply, computing an answer. 1,000 artists running du’s on 1 billion files would be a trillion stats… not going to happen. We need something more efficient.

Enter Sampling

It’d be really convenient to sample the file system to find out where the capacity ends up… to choose blocks at random, see where they live, and get a sense of where capacity is going. Sampling works well when you’re looking for the worst offenders in the consumption of some resource, but poorly when looking for small contributors. We’re looking for the worst offenders, so maybe sampling can help.

But one of the maddening things about traditional file systems is that they’re impossible to sample. For most file systems we can’t, for example, take a hundred blocks at random and find out where they live.

It helps to consider trying to implement sampling from the top of a tree down, in diagram 2. In this traditional file system we can’t look at the root and figure out where to look next to take a sample, because there’s nothing to tell us at the root whether we should descend to /home or to /tmp. Naively, we might flip a coin and choose a child with equal probability. It’s pretty easy to see, however, that this would sample something in /home 50% of the time, vs. the expected 0.01%. There simply isn’t enough information in the file system to sample effectively without processing the entire tree first, which would defeat the object.

Sampling would help, but traditional file systems can’t do it.

Sampling in Qumulo Core

Qumulo Core maintains the invariant that every directory tracks the total number of blocks, files, and various other things that live anywhere below it. Those totals are called aggregates (with apologies to NetApp), and those aggregates are updated in an eventually-consistent but timely (think 15 seconds) way.

Those aggregates have some useful applications. For example, it’s trivially easy to find out how many blocks are consumed under /home/pete in the example above. We can see at a glance that /home/pete has 10 blocks in it.

Our problem is more complicated, however, because we want to find out how many files are owned by a specific user in a given area of a tree. As we discussed previously, we can use sampling to solve this problem.

It’s easy to see that in a sampling application, we can look at the root of this tree, and decide to sample /home .01% of the time (sum_blocks for /home divided by sum_blocks for / == .01%). We can apply this rule recursively at every level, and eventually sample a single block at random.

Qumulo Core makes this kind of sampling easy with a pre-baked API, /v1/files/:ref:/sample, which takes parameters by-value and limit, which control how to weight, and the number of samples to take, respectively. This API allows us to sample in a way that is weighted by sum_blocks (i.e. /tmp is sampled 99.99% of the time, vs. /home at 0.01%), which is exactly what we need.

What Next?

Sampling is great, but not enough on its own. It gives us a file list. For example:

File Owner /home/pete/test pete /tmp/bigfile root /tmp/bigfile root /tmp/bigfile pete /tmp/bigfile root /tmp/bigfile root

A first approximation gives the proportion of space owned by root as 83%, and by pete as 17%. As a next step, we want to break down samples by owners and then display a tree for each owner in turn.

Refining the Tree

The example above is too simplistic: it shows the same files being sampled repeatedly, because our data set is small. The real world isn’t generally as accommodating: in 100,000 samples of a hundreds-of-millions-of-files data set, we might get a multiple sampling of the same file only a few times. In other words, if we try to show data to users, it’s going to be all leaf and no tree:

Our goal wasn’t to tell users a bunch of anecdotes about files where they have used capacity. Instead it’s to tell users about area of the tree where they may have a lot of data.

To solve this problem, we observe that we can prune leaves in the tree, moving their sample counts up. A sample of a file in a directory is also a sample of the file’s parent directory. In other words, we can repeatedly make the following transformation:

Choose the smallest-sampled leaf with the tree Prune it out of the tree and move its samples to its parent

Visually, that transformation looks like this:

The basic algorithm we use to refine samples into something usable goes like this:

while leaves > max_leaves or lightest_leaf.samples < min_samples:

prune_lightest_leaf()

In other words, we keep removing low-weight leaves until our tree has been reduced to a smaller number of heavier-weight leaves.

Our goal is to limit the number of results so users can focus on the most relevant files and directories, and also to provide reasonable statistical significance for the amount of capacity displayed. To achieve the first goal, we allow the user to set a limit on the number of leaves. To achieve the second, we allow the user to specify the minimum number of samples allowed at a leaf.

If we set the desired number of leaves to be no more than 5, and the minimum leaf sampling to be 3, eventually we’ll end up with this:



Note that we end up with a smaller number of leaves (3) than the limit, in order to accomplish the goal of minimum leaf weight.

This does a pretty good job of pointing out to a user where space is consumed. We’d love to know if you know of a better way.

Computing confidence intervals

While recent elections have cast the legitimacy of sampling into doubt in the minds of many, file systems have some advantages over polls. First, blocks don’t lie about what file they’re a part of. Second, blocks don’t refuse to tell you what file they’re a part of. The combination of those two facts allows us to use confidence intervals to accurately describe the range of capacity consumed by a user in a given area of a file system tree. For this to work we observe that each node in the tree we’re displaying a weight for is based on a certain number of hits in a certain number of samples. If you’re math-challenged like me you may find this for dummies guide to confidence intervals helpful. The main point here is that you can’t get anything 100% with sampling, but you can get a range you’re confident in.

Putting it all together

We built a script that:

Takes samples weighted by block count Maps samples into file owners Projects the samples into trees, and refines those trees until they have constrained scope and significance Computes the confidence intervals for the returned values Dumps the results

Running this script against my home directory on our shared build server, I see I have a lot of Windows VM and not a lot else:

bash-3.2$ python capacity_by_user.py -A -Uadmin -s50000 -x10 -Cgravytrain.eng.qumulo.com -i /home/pete Total: [17.50G-17.50G] Owner NFS_UID:pete (~100.0%/[17.50G-17.50G]) \--- \--- \---home \---pete \---pete([17.50G-17.50G]) +---Windows 7 x64.vmwarevm([12.41G-12.54G]) | +---Windows 7 x64-s001.vmdk([1.86G-1.95G]) | +---Windows 7 x64-s002.vmdk([1.92G-2.01G]) | +---Windows 7 x64-s003.vmdk([1.98G-2.08G]) | +---Windows 7 x64-s004.vmdk([1.93G-2.02G]) | +---Windows 7 x64-s005.vmdk([1.95G-2.05G]) | +---Windows 7 x64-s006.vmdk([1.33G-1.41G]) | \---Windows 7 x64.vmem([1002.22M-1.05G]) +---qfs-657.vmwarevm([2.87G-2.99G]) | +---qkiosk-3143-disk1.vmdk([1.74G-1.84G]) | \---qkiosk-3143-f195d503.vmem([978.60M-1.03G]) \---src.old \---product([1.52G-1.61G]) \---bin([1.28G-1.36G]) \---user_linux64([549.10M-604.55M])

One more thing…

I did say that we were going to talk about shameback. No one is naturally ashamed of having a terabyte of data, right? On the other hand, would you be ashamed of costing your company $40/month so that you maintain a library of every windows release from 1991 to 2016? Maybe? So, we added an option (-D) to print out usage in terms of dollars per month.

An example output is below:

Owner NFS_UID:pete (~100.0%/[$0.75-$0.75]/month) \---home \---pete \---pete([$0.75-$0.75]/month) +---Windows 7 x64.vmwarevm([$0.53-$0.54]/month) | +---Windows 7 x64-s001.vmdk([$0.08-$0.08]/month) | +---Windows 7 x64-s002.vmdk([$0.08-$0.09]/month) | +---Windows 7 x64-s003.vmdk([$0.08-$0.09]/month) | +---Windows 7 x64-s004.vmdk([$0.08-$0.09]/month) | +---Windows 7 x64-s005.vmdk([$0.08-$0.09]/month) | +---Windows 7 x64-s006.vmdk([$0.06-$0.06]/month) | \---Windows 7 x64.vmem([$0.04-$0.04]/month) +---qfs-657.vmwarevm([$0.12-$0.13]/month) | +---qkiosk-3143-disk1.vmdk([$0.08-$0.08]/month) | \---qkiosk-3143-f195d503.vmem([$0.04-$0.04]/month) \---src.old \---product([$0.07-$0.07]/month) \---bin([$0.06-$0.06]/month) \---user_linux64([$0.02-$0.03]/month)

The good news is that my windows fetish is at most costing Qumulo 75 cents a month. And probably not even that, because we get our storage pretty much for free!

You can get your copy of capacity_by_user.py here!