Meet the Lockers

Please consider subscribing to LWN Subscriptions are the lifeblood of LWN.net. If you appreciate this content and would like to see more of it, your subscription will help to ensure that LWN continues to thrive. Please visit this page to join up and keep LWN on the net.

Our story begins with an observation made by Jon Corbet in a recent article about delays in the release of Linux-3.0:

[I]f we reach a point where almost nobody can understand, review, or fix some of our core code, we may be headed for long-term trouble.

While undeniably true, it is not clear that this note of gloom is entirely justified by the immediate circumstances - after all the bug was found and fixed to everyone's satisfaction in a little over 24 hours. However it is clear that it wasn't just this event that triggered the observation. Earlier we read:

Our once approachable and hackable kernel has, over time, become more complex and difficult to understand.

What was once "approachable" is now "difficult to understand". If these words were used to describe a life-partner, we might be looking to choose between a divorce settlement or counseling, though often it is a healthy dose of understanding that is really needed.

The human touch

The allusion to human relationships is not as far fetched as it might seem. In his excellent book "The Maths Gene", Keith Devlin writes about Wim Klein, "a famous calculating wizard who in the days before electronic computers once held a professional position with the title 'computer'".

Apparently Klein was known to say: "Numbers are friends to me." and referring to 3844 he says:

For you it's just a three and an eight and a four and a four, but I say, "Hi 62 squared!"

The point Devlin is leading to is summed up thus:

To put it simply, mathematicians think about mathematical objects and the mathematical relationships between them using the same mental facilities that the majority of people use to think about other people.

The idea is that we can treat abstract ideas more like friends than tools and understand them holistically instead of just functionally. If we get to know their preferences and peculiarities we can side step the fact that they are sometimes difficult to understand and once again find them approachable.

One more quote from that recent LWN article will start us on the path to understanding what it means to be friends with abstractions. In a parenthetical comment concerning an assignment of NULL to d_inode , Jon writes:

(It's worth noting that this behavior goes against normal RCU practice, which calls for the structure to be preserved unmodified until the last reference is known to be gone)

This is just the sort of thing that suggests a holistic "friendly" relationship rather than a more distant formal one. Something just doesn't seem right, "He doesn't usually behave this way". Being able to make that sort of observation helps you focus in and ask incisive questions. You should then either be able to help your friend, or else understand them at a deeper level. We'll come back to this particular observation later, but first I want to demonstrate the effectiveness of this sort of friendship in a context that all experienced C programmers should find very familiar.

" For ": he's a jolly good fellow

A simple and common construct in C programs is the " for " loop. It is extremely flexible and can be used like a " while " loop, can iterate over a set of numbers, can walk a linked list, and can perform many more tasks. A common form that you might expect to see quite often is:

for (i = 0; i < count; i++)

This will execute the body of the loop " count " times assigning " i " to each value from " 0 " to " count - 1 " in turn. There are similar constructs that are almost as common and familiar such as

for (i = 0; i <= final; i++)

or

for (i = 1; i <= count; i++)

which are really just minor variations on the theme. But if you see

for (i = 0; i <= count; i++)

a seasoned C programmer, familiar with the ways of the " for ", will immediately think that something is odd. Not necessarily wrong, but certainly odd. This might be an error, but it could also be that " count " has been misnamed and is really a " final " rather than a " count " sort of value (an ordinal rather than a cardinal). Or maybe the loop really has a good reason for executing " count + 1 " times. At this point the friendship has served a valuable purpose by identifying a possible issue, but must give way to a more detailed analysis to determine if there really is a problem.

We can make this very concrete by examining the Linux kernel source code (v3.0) which is often a good source of interesting code fragments. A search for the above pattern can be achieved with:

git grep 'for.*(.*= *0.*;.*<=.*c[ou]*nt.*;.*)'

which results in:

arch/ia64/kernel/cpufreq/acpi-cpufreq.c: for (i = 0; i <= data->acpi_data.state_count; i++) drivers/gpu/drm/radeon/evergreen_cs.c: for (i = 0; i <= pkt->count; i++, idx++, reg += 4) { drivers/gpu/drm/radeon/r100.c: for (i = 0; i <= pkt->count; i++, idx++) { drivers/gpu/drm/radeon/r100.c: for (i = 0; i <= (pkt->count + 1); i++, idx++) { drivers/gpu/drm/radeon/r100.c: for (j = 0; j <= count; j++) { drivers/gpu/drm/radeon/r600.c: for (j = 0; j <= count; j++) { drivers/gpu/drm/radeon/r600_cs.c: for (i = 0; i <= pkt->count; i++, idx++, reg += 4) { drivers/i2c/algos/i2c-algo-pcf.c: for (i = 0; i <= count; i++) { drivers/staging/rts_pstor/rtsx_transport.c: for (i = 0; i <= buf_cnt / (HOST_SG_TBL_BUF_LEN / 8); i++) { fs/ufs/balloc.c: for (pos = 0; pos <= uspi->s_fpb - count; pos++) { scripts/docproc.c: for (i=0; i <= symfilecnt; i++)

A closer examination of each of these is left as an exercise for the reader. The chance of finding a genuine bug is very high, though it is likely that the bugs are not very serious or they would have already been fixed.

The success in finding a bug even when we weren't really looking for one is evidence of the value of building strong relationships with important abstractions as we can learn to recognize early warning signs. Unfortunately the building of strong relationships can be time-consuming and so is often neglected, being left to those with a passion rather than those who simply have a task to do.

In human relationships a friendship can blossom more quickly if a mutual friend acts to introduce two parties and start them out on a sound footing. The same can work for abstractions. While much has been written about RCU ("Read-Copy-Update" which - from the name "rcu-walk" seems to be the center-point of the aforementioned bug), it has largely been of a formal style. The documentation in linux/Documentation/RCU/ and on Wikipedia is, not unlike a résumé, suitable for an introduction to a potential employer, but not so much as an introduction to a future friend. What follows is an attempt to present RCU together with other related technologies in a somewhat different style aimed at building friendships - what you might call the human-interest perspective.

A family affair

While it is wrong to judge someone based on their parents, knowing their family background can certainly lead to a deeper understanding. So we will start by introducing RCU's parents who are, for the purpose of this tale, Sir Spinlock and Dame Refcount - both recently honored for services to computer engineering.

Sir Spinlock is somewhat "old school" and a stickler for doing everything "just right". When he protects a data structure he does it "properly" (as he says) and ensures everything is consistent, stable and completely up-to-date. His wife, Dame Refcount, feels that her husband takes a rather short-term view of things and, like many mothers, just wants to be sure that her charges are safe and can be found when needed (comments about apron strings are generally not appreciated).

The gentleman and his lady had another child much earlier (too early some say) who was named "Rwlock". When write access is needed, he is much like his father becoming a short term exclusive lock. When read access is wanted he is more like his mother, keeping count of all the different readers. However on the whole he seems to have received the worst of both worlds, lacking the long term connections of his mother, or the simple fairness available to his father. Master Rwlock doesn't get out much these days.

In some ways Rwlock is of the same generation as his parents as none of them really seem to understand the problem of cache line bouncing. Sir Spinlock likes to say that "Cache lines belong in banks" and seems to think he is being awfully funny, but it just makes the younger generation feel awkward, smile weakly, and try to change the subject.

That younger generation are the twins - RCU and his sister Seqlock. They both grew up in the shadow of Sir Spinlock's dominance of multiprocessing and each reacted in their own way. They have both learned the importance of stability but were culturally more focused on the appearance than the reality, and technologically more inventive in how to provide it.

RCU explains that he agrees with his dad's mantra of "Always consistent, always stable" but finds that the "always up-to-date" idea is over-rated. He himself finds that he is usually running late for something or other and doesn't see that it has hurt him all that much. RCU is a big fan of the photo-copiers and realized early that the best way to preserve the appearance of stability was not to change the original at all but to make a copy and work on that. So while RCU always provides stable data, there might be something newer that he is working on that he won't let you see just yet. RCU's real skill is understanding use-by dates. What you get might not be perfectly fresh, but it will never become stale.

His sister, Seqlock, takes a very different approach to the same problem learning from her favorite piece of technology - the digital camera. Like many of us she just takes lots and lots of photos and picks the ones she likes. "All of them are stable", she says, "and it isn't too hard to tell when one isn't consistent - just throw those away". You might not think it is so easy, but then this is her special skill - seeing the blur in the photo and taking another. She is happy to help and tell you when a new photo is needed.

Another notable contrast between the twins is reflected in their mother's pet names for them: "replace" and "in-place" (she sometimes calls them "The place twins" which they find very irritating). RCU insists on updating things by replacing them so he is happiest with relatively small objects without too much linkage which would need to be re-threaded. He "doesn't work on books", he says, "just collections of pages, so don't try looking at two pages at once!". When he does have these small objects he can replace them with lightning speed, so others hardly even notice anything has changed.

Seqlock expects people to take a photo if they are interested so she just updates things when ever she likes, where ever they are - just being sure to add enough "blur" so that if she is caught "in the act" another photo will be taken. This means she can handle much bigger containers than her brother, and containers with lots of connections. Even a wide-angle lens can notice the blur and take a second photo.

A chip off the old lock

When we look more closely at RCU to see how he achieves his flashy performance we can see echoes of both his parents at work. When he was younger he found this mildly embarrassing but since they were both knighted he has found a new respect for his heritage.

From his father, RCU appreciates the value of a quick change when no-one else can see and has refined this down to a single pointer update which modern hardware guarantees to be atomic. When he is really trying to impress he uses the " xchg() " atomic operation to replace an old structure with a new structure without even needing any locking for an update (though there are some costs to this and it is mostly just a party trick - but see net/ipv4/cipso_ipv4.c:cipso_v4_req_setattr()).

This observation about the centrality of the pointer is crucial to understanding RCU's sleight-of-hand. It is easy to get distracted by the " rcu_read_lock() / rcu_read_unlock() " calls and think that they provide some sort of atomicity, but they don't. Rather they come from his mother's side of the family as we will see shortly. As the atomicity is constrained to a single pointer, two separate pointer accesses - even in the one rcu_read_lock() ed section - will not be coherent. This means it is very important not to dereference the same pointer twice (you might get two different values) and if you need to read two separate pointers, be very careful about assuming any relationship between the two values - there might be one but you need to understand the rest of the code to be sure. The most common case of dereferencing multiple pointers in the one rcu_read_lock() ed region is when following a linked list. Each pointer found leads directly to the next. In this case you can be sure that each consecutive pair will be ordered if there is ordering in the list, but you cannot be sure you will see all entries that are in the list when you started, or that you have seen all the entries that are in the list when you finish.

Now that we know that friends don't let RCU-friends assume coherence outside of a single protected structure, it should be easy to see that something looks wrong in kernel/cgroup.c:cgroup_attach_proc(). The relevant code snippet - very well commented which helps a lot - is:

rcu_read_lock(); if (!thread_group_leader(leader)) { /* * a race with de_thread from another thread's exec() may strip * us of our leadership, making while_each_thread unsafe to use * on this task. if this happens, there is no choice but to * throw this task away and try again (from cgroup_procs_write); * this is "double-double-toil-and-trouble-check locking". */ rcu_read_unlock(); retval = -EAGAIN; goto out_free_group_list; }

where:

#define thread_group_leader(p) (p == p->group_leader)

is defined elsewhere.

This code is assuming that rcu_read_lock() will provide coherence between this access of " leader->group_leader " here and other accesses to something in the "leader" structure later on in " while_each_thread() ". While this is not totally impossible, as a well placed " synchronize_rcu() " between updates could lead to some coherence, it is not the usual way RCU is used and questions need to be asked. A discussion with the author of this code suggests that what is really needed is a read_lock() on " tasklist_lock ".

So just like with the case of the " for " loop the fact that something "doesn't look right" isn't a guarantee that it is wrong but it is certainly an indication that the code is worth checking.

From his mother, RCU learned the value of counting to keep track of things. While his mum has a counter in each individual object, RCU tracks the whole system state (or at least that part entrusted to him) with a single counter (though admittedly it is spread across all CPUs). The details of getting this just right are are rather tricky, but as mentioned earlier, that is his real skill.

Sometimes he thinks this one skill is over-rated and finds a real kinship with the artists and musicians he knows who can only find work in copying the ancient masters or teaching unwilling children to play. In RCU's case the skill that most people pay for is not the "quick change, always stable" trick that he is so proud of, but the "never stale" promise he has to make to achieve this. Among the several hundred times that RCU is employed by the Linux kernel, a substantial majority simply use RCU as an extra, cheap, reference count to stop things from going stale until they are really not needed. In a number of cases this is very explicit in a "put" function. For example kernel/audit_tree.c has:

static inline void put_tree(struct audit_tree *tree) { if (atomic_dec_and_test(&tree->count)) call_rcu(&tree->head, __put_tree); }

which makes it very clear where RCU stands - one step below his mother (we note that she isn't complaining, only him).

Even when RCU does get to use his quick-change trick it is usually to simply replace an old value with a new value - no real copying is done. Like most of us, he hardly ever uses his middle name.

Some reasonably simple examples of true RCU where the old value is copied and updated are in fs/afs/security.c:afs_cache_permit() and drivers/md/linear.c:linear_add(). In both these cases an array needs to be made bigger, so a new structure is allocated and RCU is used to slip it into place without anyone noticing.

Others, possibly more genuine (as they aren't just for a size change), can be found where " list_replace_rcu() " or " hlist_replace_rcu() " are used (about a dozen places in Linux-3.0). For example in net/ipv4/fib_trie.c:fib_table_insert(), RCU is used to update several fields in a " fib_alias " atomically.

Seeking Seqlock's secret

Returning to Miss Seqlock and examining more closely her relationship with her parents we find that she is very close to her father and depends on him to provide exclusion from other writers when writing to a protected data structure (though in the Linux kernel she has a more independent alter ego named "Seqcount" who leaves the write-side exclusion to be taken on by whoever is available).

On the other hand she shows her rebellious side by not requiring a strict bracketing of get/put calls. Sir Spinlock's calls must always pair a spin_lock() with spin_unlock() and RCU must always follow rcu_read_lock() by rcu_read_unlock() . Dame Refcount almost always pair "increment" with "decrement" though sometimes when there is only one reference left she won't bother decrementing but just throws the object away. Seqcount - possibly inspired by her mother, only pretends to need balance, on the read side at least. Her standard pattern is to balance " read_seqbegin() " with " read_seqretry() " as is shown quite eloquently in, for example, fs/nfs/nfs4state.c:nfs4_copy_stateid():

do { seq = read_seqbegin(&state->seqlock); memcpy(dst, &state->stateid, sizeof(*dst)); } while (read_seqretry(&state->seqlock, seq));

We have begin, copy, and retry. But it doesn't have to be like that. A simple variation can be seen in fs/cifs/dir.c:build_path_from_dentry() where three different error cases can return from the middle of the loop. In any other locking scheme the error path would need to unlock or release whatever was being held, but not with Miss Seqlock. When she is in charge you can check-out any time you like.

While this example copies the entire protected structure that is not necessary and would be prohibitive when Seqlock is protecting a large or complex structure. Rather it is only necessary to copy the pieces of interest. If they prove interesting you might copy some more bits and some more, each time only checking " read_seqretry() " to make sure the entire picture you are forming is valid (and not blurred). This is not yet common in Linux with the new scalable dcache being the only advanced user. This first exposure may yet gain more friends for Seqlock, though, and we can expect to see more of her as time passes.

It's all about teamwork

We have already noted that Seqlock usually depends on her dad for help with write-side exclusion and that RCU often helps out his mum. The co-operation doesn't end there. The parents have long worked together (it was in the middle of an " atomic_dec_and_lock() " that he proposed) and the twins carry on that tradition. Some times the family members take turns in protecting a structure and sometimes they share out responsibility for different fields, so Sir Spinlock might protect some fields in a structure while RCU guarantees that others won't change except by copying the whole.

As we now turn back to look at the context of the bug that started this discussion which was in the new path name lookup code we find that all four (main) members of the family are making significant contributions and working together.

As the earlier article on this topic outlines, Sir Spinlock and Dame Refcount are used for the "always works but sometimes slow" ref-walk option, while RCU and Seqlock help out for the "often works, always fast" rcu-walk option. This allows us to compare and contrast them in a real-life context.

Where ref-walk uses reference counts on one dentry after another to keep each dentry active, rcu-walk uses RCU to keep an effective reference count on all dentrys. Where ref-walk uses spinlocks to get a stable view of each dentry, rcu-walk uses seqlock to avoid using an unstable view.

To really understand this bug some degree of friendship with VFS is needed too, but fortunately not much. Central to the issue is the " d_inode " field of a dentry . It is fairly special because it doesn't change much. The only valid changes are from NULL to a valid inode pointer, and from a valid inode pointer back to NULL . And that second one is special too - it is only allowed to happen if there are no other references to the dentry.

The result of this in the old dcache (before rcu-walk) is that you don't need any locking for read access to d_inode , only a counted reference. If it is NULL, then you know the file doesn't exist. If it is not NULL, then the value is the file's inode and it won't change - the reference you have on the dentry becomes an implied reference on the inode.

As rcu-walk largely mapped spinlocks to seqlocks and refcounts to RCU you might assume that an RCU reference is sufficient to make accessing d_inode safe but it isn't quite that simple. To get a new reference on a dentry you need to be holding a spinlock (hence dget_locked() is the function to use). So while you only need a reference to access d_inode , you need to have previously taken a spinlock. This doesn't map directly into RCU usage as getting an RCU reference (with rcu_read_lock() ) can never require a spinlock. So in rcu-walk, both Seqlock and RCU are used to protect d_inode . RCU holds a global reference so you can safely copy the d_inode value, and then later Seqlock will tell you if the copy you made is still valid.

It seems that there is room for some confusion on whether just a reference, or also a lock, is need to access d_inode and we see some of that in the patches proposed. The first patch proposed by Linus seems to treat d_inode like an RCU-protected field which should not be changed (as observed in our earlier parenthetical comment), and so removed the code that cleared it before the last RCU-reference was dropped. While it may have fixed the problem it didn't fully explain it as the field is meant to be Seqlock protected.

Linus's second patch added some Seqcount handling which seemed to acknowledge that the field was meant to be Seqlock protected, but also moved the access to d_inode towards the end of the loop. That loop shows off Seqlock's unbalanced nature quite nicely. It has a read_seqcount_begin() but had no read_seqretry() . Linus added a read_seqretry() which resulted in balance but that didn't really address the issue.

As mentioned earlier a "Seqlock protected region" is somewhat vague as it can have several read_seqretry() calls with each being used to validate what has just been copied. While a Seqlock protected region is, in some sense, still active when we enter this loop, it is no longer meant to protect d_inode . d_inode was copied out and validated in an earlier call to _d_lookup_rcu() (from do_lookup() ). The current Seqlock protection will only be used if this inode is a directory. So in this loop there is no Seqlock protection for d_inode . The only reason it is still safe to access it at the end of the loop is because we know we have a mount point, and we know a mount point is pinned in such a way that d_inode cannot change (as the comment explains). At the top of the loop we don't have that certainty.

So it seems that the problem child here is Seqlock, but in fairness it isn't really her fault. She looks like a lock, but doesn't bracket things at all the same way that more traditional locks do and so those familiar with the older generation can get confused about exactly what she is protecting.

If we back-track a little and find out how the reference to d_inode came to not be properly protected we find commit 62a7375e5d77d654695 which moves the access to d_inode from the end of that loop to near the top. This looks like it should be safe as a Seqlock seems to be active, but as we now know, that can be deceiving. The reason for that move is probably worth exploring, but not here. This article is about friendship and friendship will rarely tell us how to fix a problem - we usually need to visit a specialist for that. But a friendship with data structures and locking mechanisms can help us identify which code is worthy of a closer inspection just as we have seen in this exploration.

In this case a friend would need to have known that d_inode requires Seqlock protection, and to have fully understood the rather independent approach that Seqlock takes to protected regions. Maybe Seqlock just needs some more friends.

Invitation

It has been said that given enough eyeballs, all bugs are shallow. For this to be true the eyes must also be able to focus and recognize (and even be open). Hopefully by gaining more friends who are able to recognize its faults, the Linux kernel can avoid the stigma of being difficult to understand and can once more be seen as approachable. Would you like to be friends?

Exercises:

Examine each unusual "for" loop found above and determine if each is a bug, an interesting usage, or an uninteresting false-positive. " struct super " uses a secondary reference count " s_count " which provides similar functionality to that provided by RCU. Determine if any flavor of RCU can actually be used to achieve the same result. Outline the costs and benefits of such a change. Understand why commit 62a7375e5d77d654695 moved the d_inode access as it did and so determine if commit 59430262401bec02 has undone some good work. Examine the usage of RCU in jbd2_dev_to_name and list any features that "don't seem right". A deeper analysis is not required as this code was recently removed from the kernel.

[The author would like to thank Paul McKenney for providing valuable insights into the character of RCU.]