Object-oriented design patterns in the kernel, part 1

Did you know...? LWN.net is a subscriber-supported publication; we rely on subscribers to keep the entire operation going. Please help out by buying a subscription and keeping LWN on the net.

Despite the fact that the Linux Kernel is mostly written in C, it makes broad use of some techniques from the field of object-oriented programming. Developers wanting to use these object-oriented techniques receive little support or guidance from the language and so are left to fend for themselves. As is often the case, this is a double-edged sword. The developer has enough flexibility to do really cool things, and equally the flexibility to do really stupid things, and it isn't always clear at first glance which is which, or more accurately: where on the spectrum a particular approach sits.

Instead of looking to the language to provide guidance, a software engineer must look to established practice to find out what works well and what is best avoided. Interpreting established practice is not always as easy as one might like and the effort, once made, is worth preserving. To preserve that effort on your author's part, this article brings another installment in an occasional series on Linux Kernel Design Patterns and attempts to set out - with examples - the design patterns in the Linux Kernel which effect an object-oriented style of programming.

Rather than providing a brief introduction to the object-oriented style, tempting though that is, we will assume the reader has a basic knowledge of objects, classes, methods, inheritance, and similar terms. For those as yet unfamiliar with these, there are plenty of resources to be found elsewhere on the web.

Over two weeks we will look for patterns in just two areas: method dispatch and data inheritance. Despite their apparent simplicity they lead to some rich veins for investigation. This first article will focus on method dispatch.

Method Dispatch

The large variety of styles of inheritance and rules for its usage in languages today seems to suggest that there is no uniform understanding of what "object-oriented" really means. The term is a bit like "love": everyone thinks they know what it means but when you get down to details people can find they have very different ideas. While what it means to be "oriented" might not be clear, what we mean by an "object" does seem to be uniformly agreed upon. It is simply an abstraction comprising both state and behavior. An object is like a record (Pascal) or struct (C), except that some of the names of members refer to functions which act on the other fields in the object. These function members are sometimes referred to a "methods".

The most obvious way to implement objects in C is to declare a "struct" where some fields are pointers to functions which take a pointer to the struct itself as their first argument. The calling convention for method "foo" in object "bar" would simply be: bar->foo(bar, ...args); While this pattern is used in the Linux kernel it is not the dominant pattern so we will leave discussion of it until a little later.

As methods (unlike state) are not normally changed on a per-object basis, a more common and only slightly less obvious approach is to collect all the methods for a particular class of objects into a separate structure, sometimes known as a "virtual function table" or vtable. The object then has a single pointer to this table rather than a separate pointer for each method, and consequently uses less memory.

This then leads to our first pattern - a pure vtable being a structure which contains only function pointers where the first argument of each is a pointer to some other structure (the object type) which itself contains a pointer to this vtable. Some simple examples of this in the Linux kernel are the file_lock_operations structure which contains two function pointers each of which take a pointer to a struct file_lock , and the seq_operations vtable which contains four function pointers which each operate on a struct seq_file . These two examples display an obvious naming pattern - the structure holding a vtable is named for the structure holding the object (possibly abbreviated) followed by "_operations". While this pattern is common it is by no means universal. Around the time of 2.6.39 there are approximately 30 "*_operations" structures along with well over 100 "*_ops" structures, most if not all of which are vtables of some sort. There are also several structs such as struct mdk_personality which are essentially vtables but do not have particularly helpful names.

Among these nearly 200 vtable structures there is plenty of variability and so plenty of scope to look for interesting patterns. In particular we can look for common variations from the "pure vtable" pattern described above and determine how these variations contribute to our understanding of object use in Linux.

NULL function pointers

The first observation is that some function pointers in some vtables are allowed to be NULL. Clearly trying to call such a function would be futile, so the code that calls into these methods generally contains an explicit test for the pointer being NULL. There are a few different reasons for these NULL pointers. Probably easiest to justify is the incremental development reason. Because of the way vtable structures are initialized, adding a new function pointer to the structure definition causes all existing table declarations to initialise that pointer to NULL. Thus it is possible to add a caller of the new method before any instance supports that method, and have it check for NULL and perform a default behavior. Then as incremental development continues those vtable instances which need it can get non-default methods.

A recent example is commit 77af1b2641faf4 adding set_voltage_time_sel() to struct regulator_ops which acts on struct regulator_dev . Subsequent commit 42ab616afe8844 defines that method for a particular device. This is simply the most recent example of a very common theme.

Another common reason is that certain methods are not particularly meaningful in certain cases so the calling code simply tests for NULL and returns an appropriate error when found. There are multiple examples of this in the virtual filesystem (VFS) layer. For instance, the create() function in inode_operations is only meaningful if the inode in question is a directory. So inode_operations structures for non-directories typically have NULL for the create() function (and many others) and the calling code in vfs_create() checks for NULL and returns -EACCES .

A final reason that vtables sometimes contain NULL is that an element of functionality might be being transitioned from one interface to another. A good example of this is the ioctl() operation in file_operations . In 2.6.11, a new method, unlocked_ioctl() was added which was called without the big kernel lock held. In 2.6.36, when all drivers and filesystems had been converted to use unlocked_ioctl() , the original ioctl() was finally removed. During this transition a file system would typically define only one of two, leaving the other defaulting to NULL.

A slightly more subtle example of this is read() and aio_read() , also in file_operations , and the corresponding write() and aio_write() . aio_read() was introduced to support asynchronous IO, and if it is provided the regular synchronous read() is not needed (it is effected using do_sync_read() which calls the aio_read() method). In this case there appears to be no intention of ever removing read() - it will remain for cases where async IO is not relevant such as special filesystems like procfs and sysfs. So it is still the case that only one of each pair need be defined by a filesystem, but it is not simply a transition, it is a long-term state.

Though there seem to be several different reasons for a NULL function pointer, almost every case is an example of one simple pattern - that of providing a default implementation for the method. In the "incremental development" examples and the non-meaningful method case, this is fairly straightforward. e.g. the default for inode->create() is simply to return an error. In the interface transition case it is only slightly less obvious. The default for unlocked_ioctl() would be to take the kernel lock and then call the ioctl() method. The default for read() is exactly do_sync_read() and some filesystems such as ext3 actually provide this value explicitly rather than using "NULL" to indicate a default.

With that in mind, a little reflection suggests that if the real goal is to provide a default, then maybe the best approach would be to explicitly give a default rather than using the circuitous route of using a default of NULL and interpreting it specially.

While NULL is certainly the easiest value to provide as a default - as the C standard assures us that uninitialized members of a structure do get set to NULL - it is not very much harder to set a more meaningful default. I am indebted to LWN reader wahern for the observation that C99 allows fields in a structure to be initialized multiple times with only the final value taking effect and that this allows easy setting of default values such as by following the simple model:

#define FOO_DEFAULTS .bar = default_bar, .baz = default_baz struct foo_operations my_foo = { FOO_DEFAULTS, .bar = my_bar, };

This will declare my_foo with a predefined default value for baz and a localized value for bar . Thus for the small cost of defining a few "default" functions and including a " _DEFAULTS " entry to each declaration, the default value for any field can easily be chosen when the field is first created, and automatically included in every use of the structure.

Not only are meaningful defaults easy to implement, they can lead to a more efficient implementation. In those cases where the function pointer actually is NULL it is probably faster to test and branch rather than to make an indirect function call. However the NULL case is very often the exception rather than the rule, and optimizing for an exception is not normal practice. In the more common case when the function pointer is not NULL, the test for NULL is simply a waste of code space and a waste of execution time. If we disallow NULLs we can make all call sites a little bit smaller and simpler.

In general, any testing performed by the caller before calling a method can be seen as an instance of the "mid-layer mistake" discussed in a previous article. It shows that the mid-layer is making assumptions about the behavior of the lower level driver rather than simply giving the driver freedom to behave in whatever way is most suitable. This may not always be an expensive mistake, but it is still best avoided where possible. Nevertheless there is a clear pattern in the Linux kernel that pointers in vtables can sometimes be NULLable, typically though not always to enable a transition, and the call sites should in these cases test for NULL before proceeding with the call.

The observant reader will have noticed a hole in the above logic denouncing the use NULL pointers for defaults. In the case where the default is the common case and where performance is paramount, the reasoning does not hold and a NULL pointer could well be justified. Naturally the Linux kernel provides an example of such a case for our examination.

One of the data structures used by the VFS for caching filesystem information is the "dentry". A "dentry" represents a name in the filesystem, and so each "dentry" has a parent, being the directory containing it, and an "inode" representing the named file. The dentry is separate from the inode because a single file can have multiple names (so an "inode" can have multiple "dentry"s). There is a dentry_operations vtable with a number of operations including, for example, "d_compare" which will compare two names and "d_hash" which will generate a hash for the name to guide the storage of the "dentry" in a hash table. Most filesystems do not need this flexibility. They treat names as uninterpreted strings of bytes so the default compare and hash functions are the common case. A few filesystems define these to handle case-insensitive names but that is not the norm.

Further, filename lookup is a common operation in Linux and so optimizing it is a priority. Thus these two operations appear to be good candidates where a test for NULL and an inlined default operation might be appropriate. What we find though is that when such an optimization is warranted it is not by itself enough. The code that calls d_compare() and d_hash() (and a couple of other dentry operations) does not test these functions for NULL directly. Rather they require that a few flag bits (DCACHE_OP_HASH, DCACHE_OP_COMPARE) in the "dentry" are set up to indicate whether the common default should be used, or whether the function should be called. As the flag field is likely to be in cache anyway, and the dentry_operations structure will often be not needed at all, this avoids a memory fetch in a hot path.

So we find that the one case where using a NULL function pointer to indicate a default could be justified, it is not actually used; instead, a different, more efficient, mechanism is used to indicate that the default method is requested.

Members other than function pointers

While most vtable-like structures in the kernel contain exclusively function pointers, there are a significant minority that have non-function-pointer fields. Many of these appear on the surface quite arbitrary and a few closer inspections suggest that some of them result of poor design or bit-rot and their removal would only improve the code.

There is one exception to the "functions only" pattern that occurs repeatedly and provides real value, and so is worth exploring. This pattern is seen in its most general form in struct mdk_personality which provides operations for a particular software RAID level. In particular this structure contains an "owner", a "name", and a "list". The "owner" is the module that provides the implementation. The "name" is a simple identifier: some vtables have string names, some have numeric names, and it is often called something different like "version", "family", "drvname", or "level". But conceptually it is still a name. In the present example there are two names, a string and a numeric "level".

The "list", while part of the same functionality, is less common. The mdk_personality structure has a struct list_head , as does struct ts_ops . struct file_system_type has a simple pointer to the next struct file_system_type . The underlying idea here is that for any particular implementation of an interface (or "final" definition of a class) to be usable, it must be registered in some way so that it can be found. Further, once it has been found it must be possible to ensure that the module holding the implementation is not removed while it is in use.

There seem to be nearly as many styles of registration against an interface in Linux as there are interfaces to register against, so finding strong patterns there would be a difficult task. However it is fairly common for a "vtable" to be treated as the primary handle on a particular implementation of an interface and to have an "owner" pointer which can be used to get a reference on the module which provides the implementation.

So the pattern we find here is that a structure of function pointers used as a "vtable" for object method dispatch should normally contain only function pointers. Exceptions require clear justification. A common exception allows a module pointer and possible other fields such as a name and a list pointer. These fields are used to support the registration protocol for the particular interface. When there is no list pointer it is very likely that the entire vtable will be treated as read-only. In this case the vtable will often be declared as a const structure and so could even be stored in read-only memory.

Combining Methods for different objects

A final common deviation from the "pure vtable" pattern that we see in the Linux kernel occurs when the first argument to the function is not always the same object type. In a pure vtable which is referenced by a pointer in a particular data structure, the first argument of each function is exactly that data structure. What reason could there be for deviating from that pattern? It turns out that there are few, some more interesting than others.

The simplest and least interesting explanation is that, for no apparent reason, the target data structure is listed elsewhere in the argument list. For example all functions in struct fb_ops take a struct fb_info . While in 18 cases that structure is the first argument, in five cases it is the last. There is nothing obviously wrong with this choice and it is unlikely to confuse developers. It is only a problem for data miners like your author who need to filter it out as an irrelevant pattern.

A slight deviation on this pattern is seen in struct rfkill_ops where two functions take a struct rkfill but the third - set_block() - takes a void *data . Further investigation shows that this opaque data is exactly that which is stored in rfkill->data , so set_block() could easily be defined to take a struct rfkill and simply to follow the ->data link itself. This deviation is sufficiently non-obvious that it could conceivably confuse developers as well as data miners and so should be avoided.

The next deviation in seen for example in platform_suspend_ops , oprofile_operations , security_operations and a few others. These take an odd assortment of arguments with no obvious pattern. However these are really very different sorts of vtable structures in that the object they belong to are singletons. There is only one active platform, only one profiler, only one security policy. Thus the "object" on which these operations act is part of the global state and so does not need to be included in the arguments of any functions.

Having filtered these two patterns out as not being very interesting we are left with two that do serve to tell us something about object use in the kernel.

quota_format_ops and export_operations are two different operations structures that operate on a variety of different data structures. In each case the apparent primary object (e.g. a struct super_block or a struct dentry ) already has a vtable structure dedicated to it (such as super_operations or dentry_operations ) and these new structures add new operations. In each case the new operations form a cohesive unit providing a related set of functionality - whether supporting disk quotas or NFS export. They don't all act on the same object simply because the functionality in question depends on a variety of objects.

The best term from the language of object-oriented programming for this is probably the "mixin". Though the fit may not be perfect - depending on what your exact understanding of mixin is - the idea of bringing in a collection of functionality without using strict hierarchical inheritance is very close to the purpose of quota_format_ops and export_operations .

Once we know to be on the lookout for mixins like these we can find quite a few more examples. The pattern to be alert for is not the one that led us here - an operations structure that operates on a variety of different objects - but rather the one we found where the functions in an "operations" structure operate on objects that already have their own "operations" structure. When an object has a large number of operations that are relevant and these operations naturally group into subsets, it makes a lot of sense to divide them into separate vtable-like structures. There are several examples of this in the networking code where for instance both tcp_congestion_ops and inet_connection_sock_af_ops operate (primarily) on a struct sock , which itself has already got a small set of dedicated operations.

So the pattern of a "mixin" - at least as defined as a set of operations which apply to one or more objects without being the primary operations for those objects - is a pattern that is often found in the kernel and appears to be quite valuable in allowing better modularization of code.

The last pattern which explains non-uniform function targets is probably the most interesting, particularly in its contrast to the obvious application of object-oriented programming style. Examples of this pattern abound with ata_port_operations , tty_operations , nfs_rpc_ops and atmdev_ops all appearing as useful examples. However we will focus primarily on some examples from the filesystem layer, particularly super_operations and inode_operations .

There is a strong hierarchy of objects in the implementation of a filesystem where the filesystem - represented by a "super_block" - has a number of files ( struct inode ) which may have a number of names or links ( struct dentry ). Further each file might store data in the page cache ( struct address_space ) which comprises a number of individual pages ( struct page ). There is a sense in which all of these different objects belong to the filesystem as a whole. If a page needs to be loaded with data from a file, the filesystem knows how to do that, and it is probably the same mechanism for every page in every file. Where it isn't always the same, the filesystem knows that too. So we could conceivably store every operation on every one of these objects in the struct super_block , as it represents the filesystem and could know what to do in each case.

In practice that extreme is not really helpful. It is quite likely that while there are similarities between the storage of a regular file and a directory, there are also important differences and being able to encode those differences in separate vtables can be helpful. Sometimes small symbolic links are stored directly in the inode while larger links are stored like the contents of a regular file. Having different readlink() operations for the two cases can make the code a lot more readable.

While the extreme of every operation attached to the one central structure is not ideal, it is equally true that the opposite extreme is not ideal either. The struct page in Linux does not have a vtable pointer at all - in part because we want to keep the structure as small as possible because it is so populous. Rather the address_space_operations structure contains the operations that act on a page. Similarly the super_operations structure contains some operations that apply to inodes, and inode_operations contains some operations that apply to dentries.

It is clearly possible to have operations structures attached to a parent of the target object - providing the target holds a reference to the parent, which it normally does - though it is not quite so clear that it is always beneficial. In the case of struct page which avoids having a vtable pointer altogether the benefit is clear. In the case of struct inode which has its own vtable pointer, the benefit of having some operations (such as destroy_inode() or write_inode() ) attached to the super_block is less clear.

As there are several vtable structures where any given function pointer could be stored, the actual choice is in many cases little more than historical accident. Certainly the proliferation of struct dentry operations in inode_operations seems to be largely due to the fact that some of them used to act directly on the inode, but changes in the VFS eventually required this to change. For example in 2.1.78-pre1, each of link() , readlink() , followlink() (and some others which are now defunct) were changed from taking a struct inode to take a struct dentry instead. This set the scene for "dentry" operations to be in inode_operations , so when setattr and getattr were added for 2.3.48, it probably seemed completely natural to include them in inode_operations despite the fact that they acted primarily on a dentry.

Possibly we could simplify things by getting rid of dentry_operations altogether. Some operations that act on dentries are already in inode_operations and super_operations - why not move them all there? While dentries are not as populous as struct page there are still a lot of them and removing the "d_op" field could save 5% of the memory used by that structure (on x86-64).

With two exceptions, every active filesystem only has a single dentry operations structure in effect. Some filesystem implementations like "vfat" define two - e.g. one with case-sensitive matching and one with case-insensitive matching - but there is only one active per super-block. So it would seem that the operations in dentry_operations could be moved to super_operations , or at least accessed through "s_d_op". The two exceptions are ceph and procfs. These filesystems use different d_revalidate() operations in different parts of the filesystem and - in the case of procfs - different d_release() operations. The necessary distinctions could easily be made in per-superblock versions of these operations. Do these cases justify the 5% space cost? Arguably not.

Directly embedded function pointers

Finally it is appropriate to reflect on the alternate pattern mentioned at the start, where function pointers are stored directly in the object rather than in a separate vtable structure. This pattern can be seen in struct request_queue which has nine function pointers, struct efi which has ten function pointers, and struct sock which has six function pointers.

The cost of embedded pointers is obviously space. When vtables are used, there is only one copy of the vtable and multiple copies of an object (in most cases) so if more than one function pointer is needed, a vtable would save space. The cost of a vtable is an extra memory reference, though cache might reduce much of this cost in some cases. A vtable also has a cost of flexibility. When each object needs exactly the same set of operations a vtable is good, but if there is a need to individually tailor some of the operations for each object, then embedded function pointer can provide that flexibility. This is illustrated quite nicely by the comment with "zoom_video" in struct pcmcia_socket

/* Zoom video behaviour is so chip specific its not worth adding this to _ops */

So where objects are not very populous, where the list of function pointers is small, and where multiple mixins are needed, embedded function pointers are used instead of a separate vtable.

Method Dispatch Summary

If we combine all the pattern elements that we have found in Linux we find that:

Method pointers that operate on a particular type of object are normally collected in a vtable associated directly with that object, though they can also appear: In a mixin vtable that collects related functionality which may be selectable independently of the base type of the object.

In the vtable for a "parent" object when doing so avoids the need for a vtable pointer in a populous object

Directly in the object when there are few method pointers, or they need to be individually tailored to the particular object. These vtables rarely contain anything other than function pointers, though fields needed to register the object class can be appropriate. Allowing these function pointers to be NULL is a common but not necessarily ideal technique for handling defaults.

So in exploring the Linux Kernel code we have found that even though it is not written in an object-oriented language, it certainly contains objects, classes (represented as vtables), and even mixins. It also contains concepts not normally found in object-oriented languages such as delegating object methods to a "parent" object.

Hopefully understanding these different patterns and the reasons for choosing between them can lead to more uniform application of the patterns across the kernel, and hence make it easier for a newcomer to understand which pattern is being followed. In the second part of our examination of object oriented patterns we will explore the various ways that data inheritance is achieved in the Linux kernel and discuss the strengths and weaknesses of each approach so as to see where each is most appropriate.