Introduction

I think it's a given that the amount of data is increasing at a fairly fast rate. We now have lots of multimedia on our desktops, and lots of files on our servers at work, and we're starting to put lots of data into the cloud (e.g. Facebook). One question that affects storage design and performance is if these files are large or small and how many of them are there?

At this year's FAST (USENIX Conference on File System and Storage Technologies) the best paper went to "A Study of Practical Deduplication" by William Bolosky from Microsoft Research, and Dutch Meyer from the University of British Columbia. While the paper didn't really cover Linux (it covered Windows) and it was more focused on desktops, and it was focused on deduplication, it did present some very enlightening insights on file systems from 2000 to 2010. Some of the highlights from the paper are:

The median file size isn't changing The average file size is larger The average file system capacity has tripled from 2000 to 2010

To fully understand the difference between the first point and the third point you need to remember some basic statistics. The average file size is computed by summing the size of every file and dividing by the number of files. But the median file size is found by ordering the list from the smallest to largest of the file size of every file. The median file size is the one in the middle of the ordered list. So, with these working definitions, the three observations previously mentioned indicate that perhaps desktops have a few really large files that drive up the average file size but at the same time there are a number of small files that makes the median file size about the same despite the increase in the number of files and the increase in large files.

The combination of the observations previously mentioned mean that we have many more files on our desktops and we are adding some really large files and about the same number of small files.

Yes, it's Windows. Yes, it's desktops. But these observations are another good data point that tell us something about our data. That is, the number of files is getting larger while we are adding some very large files and a large number of small files. What does this mean for us? One thing that it means to me is that we need to pay much more attention to managing our data.

Data Management - Who's on First?

One of the keys to data management is being able to monitor the state of your data which usually means monitoring the metadata. Fortunately, POSIX gives us some standard metadata for our files such as the following:

File ownership (User ID and Group ID)

File permissions (world, group, user)

File times (atime, ctime, mtime)

File size

File name

Is it a true file or a directory?



There are several others (e.g. links) which I didn't mention here.

With this information we can monitor the basic state of our data. We can compute how quickly our data is changing (how many files have been modified, created, deleted in a certain period of time). We can also determine how our data is "aging" - that is how old is the average file, the median file, and we can do this for the entire file system tree or certain parts of it. In essence we can get a good statistical overview of the "state of our data".

All of this capability is just great and goes far beyond anything that is available today. However, with the file system capacity increasing so rapidly and the median file size staying about the same, we have a lot more files to monitor. Plus we keep data around for longer than we ever have. Perhaps over time it is easy to forget what a file name means or what is contained in a cryptic file name. Since POSIX is good enough to give some basic metadata wouldn't it be nice to have the ability to add our own metadata? Something that we control that would allow is to add information about the data?

Extended File Attributes

What many people don't realize is that there actually is a mechanism for adding your own metadata to files that is supported by most Linux file systems. This is called Extended File Attributes. In Linux, many file systems support it such as the following: ext2, ext3, ext4, jfs, xfs, reiserfs, btrfs, ocfs2 (2.1 and greater), and squashfs (kernel 2.6.35 and greater or a backport to an older kernel). Some of the file systems have restrictions on extended file attributes, such as the amount of data that can be added, but they do allow for the addition of user controlled metadata.

Any regular file that uses one of the previously mentioned extended file attributes may have a list of extended file attributes. The attributes have a name and some associated data (the actual attribute). The name starts with what is called a namespace identifier (more on that later), followed by a dot ".", and then followed by a null-terminated string. You can add as many names separated by dots as you like to create "classes" of attributes.

Currently on Linux there are four namespaces for extended file attributes:

user trusted security system



This article will focus on the "user" namespace since it has no restrictions with regard to naming or contents. However, the "system" namespace could be used for adding metadata controlled by root.

The system namespace is used primarily by the kernel for access control lists (ACLs) and can only be set by root. For example, it will use names such as "system.posix_acl_access" and "system.posix_acl_default" for extended file attributes. The general wisdom is that unless you are using ACLs to store additional metadata, which you can do, you should not use the system namespace. However, I believe that the system namespace is a place for metadata controlled by root or metadata that is immutable with respect to the users.

The security namespace is used by SELinux. An example of a name in this namespace would be something such as "security.selinux".

The user attributes are meant to be used by the user and any application run by the user. The user namespace attributes are protected by the normal Unix user permission settings on the file. If you have write permission on the file then you can set an extended attribute. To give you an idea of what you can do for "names" for the extended file attributes for this namespace, here are some examples:

user.checksum.md5

user.checksum.sha1

user.checksum.sha256

user.original_author

user.application

user.project

user.comment



The first three example names are used for storing checksums about the file using three different checksum methods. The fourth example lists the originating author which can be useful in case multiple people have write access to the file or the original author leaves and the file is assigned to another user. The fifth name example can list the application that was used to generate the data such as output from an application. The sixth example lists the project that the data with which the data is associated. And the seventh example is the all-purpose general comment. From these few examples, you see that you can create some very useful metadata.

Tools for Extended File Attributes

There are several very useful tools for manipulating (setting, getting) extended attributes. These are usually included in the attr package that comes with most distributions. So be sure that this package is installed on the system.

The second thing you should check is that the kernel has attribute support. This should be turned on for almost every distribution that you might use, although there may be some very specialized ones that might not have it turned on. But if you build your own kernels (as yours truly does), be sure it is turned on. You can just grep the kernel's ".config" file for any "ATTR" attributes.

The third thing is to make sure that the libattr package is installed. If you installed the attrpackage then this package should have been installed as well. But I like to be thorough and check that it was installed.

Then finally, you need to make sure the file system you are going to use with extended attributes is mounted with the user_xattr option.

Assuming that you have satisfied all of these criteria (they aren't too hard), you can now use extended attributes! Let's do some testing to show the tools and what we can do with them. Let's begin by creating a simple file that has some dummy data in it.



$ echo "The quick brown fox" > ./test.txt $ more test.txt The quick brown fox

Now let's add some extended attributes to this file.



$ setfattr -n user.comment -v "this is a comment" test.txt



This command sets the extended file attribute to the name "user.comment". The option "-v" is the value of the attribute followed by that value. The final option for the command is the name of the file.

You can determine the extended attributes on a file with a simple command, getfattr as in the following example,



$ getfattr test.txt # file: test.txt user.comment



Notice that this only lists what extended attributes are defined for a particular file not the values of the attributes. Also notice that it only listed the "user" attributes since the command was done as a regular user. If you ran the command as root and there were system or security attributes assigned you would see those listed.

To see the values of the attributes you have to use the following command:



$ getfattr -n user.comment test.txt # file: test.txt user.comment="this is a comment"



With the "-n" option it will list the value of the extended attribute name that you specify.

If you want to remove an extended attribute you use the setfattr command but use the "-x" option such as the following:



$ setfattr -x user.comment test.txt $ getfattr -n user.comment test.txt test.txt: user.comment: No such attribute



You can tell that the extended attribute no longer exists because of the return from the setfattr command.

Summary

Without belaboring the point, the amount of data is growing at a very rapid rate even on our desktops. A recent study also pointed out that the number of files is also growing rapidly and that we are adding some very large files but also a large number of small files so that the average file size is growing while the median file size is pretty much staying the same. All of this data will result in a huge data management nightmare that we need to be ready to address.

One way to help address the deluge of data is to enable a rich set of metadata that we can use in our data management plan (whatever that is). An easy way to do this is to use extended file attributes. Most of the popular Linux file systems allow you to add to metadata to files, and in the case of xfs, you can pretty much add as much metadata as you want to the file.

There are four "namespaces" of extended file attributes that we can access. The one we are interested as users is the user namespace because if you have normal write permissions on the file, you can add attributes. If you have read permission on the file you can also read the attributes. But we could use the system namespace as administrators (just be careful) for attributes that we want to assign as root (i.e. users can't change or query the attributes).

The tools to set and get extended file attributes come with virtually every Linux distribution. You just need to be sure they are installed with your distribution. Then you can set, retrieve, or erase as many extended file attributes as you wish.

Extended file attributes can be used to great effect to add metadata to files. It is really up to the user to do this since they understand the data and have the ability to add/change attributes. Extended attributes give a huge amount of flexibility to the user and creating simple scripts to query or search the metadata is fairly easy (an exercise left to the user). We can even create extended attributes as root so that the user can't change or see them. This allows administrators to add really meaningful attributes for monitoring the state of the data on the file system. Extended file attributes rock!