There’s a lot of hubbub recently about removing the big kernel lock completely in Linux 2.6.37. That is it’s now possible to compile a configuration without any BKL use. There is still some code depending on the BKL,

but it can be compiled out. But is that a big deal?

First some background: Linux 2.0 was originally ported to SMP the complete kernel ran under a single lock, to preserve the same semantics for kernel code as on uni-processor kernels. This was known as the big kernel lock.

Then over time more and more code was moved out of the lock (see chapter 6 in LK09 Scalability paper for more details)

In Linux 2.6 kernels very few subsystems actually still rely on the BKL. And most code that uses it is not time critical.

The biggest (moderately critical) users were a few file systems like reiserfs and NFS. The kernel lockf()/F_SETFL file locking subsystem was also still using the BKL. And the biggest user (in terms of amount of code) were the ioctl callbacks in drivers. While there are a lot of them (most drivers have ioctls) the number of cycles spent in them tends

to be rather minimal. Usually ioctls are just used for initialization and other comparatively rare jobs.

In most cases these do not really suffer from global locking.

This is not to say that the BKL does not matter at all. If you’re a heavy parallel user of a subsystem or a function that still needs it then it may have been a bottleneck. In some situations — like hard rt kernels — where locking is more expensive it may also hurt more. But I suspect for most workloads this wasn’t already the case for a long time,

because BKL is already rare and has been eliminated from most hot paths a long time ago.

There’s an old rule of thumb that 90% of all program run time is spent in 10% of code (I suspect the numbers are actually even more extreme for typical kernel loads). And that should only tune the 10% that matter and leave the rest of the code alone. The BKL is likely already not in your 10% (unless you’re unlucky). The goal of kernel scalability is to make those 10% run in parallel on multiple cores.

Now what happens when the BKL is removed from a subsystem? Normally the subsystem gains a code lock as the first stage: that is a private lock that serializes the code paths of the subsystem. Now using code locks is usually considered

bad practice (“lock data, not code”). Next come object locks (“lock data”) and then sometimes read-copy-update and other fancy lock-less tricks. A lot of code gets stuck in the first or second stages of code or coarse-grained data locking.

Consider a workload that tasks a particular subsystem in a highly parallel manner. The subsystem is most

of the 90% of kernel runtime. The subsystem was using the BKL and gets converted to a subsystem code lock.

Previously the parallelism was limited by the BKL. Now it’s limited by the subsystem code lock. If you don’t have

other hot subsystems that also use the BKL you gain absolutely nothing: the serialization of your 90% runtime is still there, just shifted to another lock. You would only gain if there’s another heavy BKL user too in your workload, which is unlikely.

Essentially with code locks you don’t have a single big kernel lock anymore, but instead lots of little BKLs.

As a next step the subsystem may converted to more finegrained data locking. For examples if the subsystem has a hash table to look up objects and do something with them (that’s a common kernel code pattern) you may have locks per hash bucket and another lock for each object. Objects can be lots of things: devices, mount points, inodes,

directories, etc.

But here’s a common case: your workload only uses one object (for example all writes to the same SCSI disk) and that code is the 90% runtime code. Now again your workload will hit that object lock all the time and it will limit parallelism.

The lock holding time may be slightly smaller, but the fundamental scaling costs of locking are still there. Essentially you will have another small BKL for that workload that limits parallelism.

The good news is that in this case you can at least do somethingabout it as kernel user: you could change the workloads to spread the accesses out to different objects: for example use different directories or different devices. In some cases that’s practical, in others not. If it’s not you still have the same problem. But it’s already a large step in the right

direction.

Now a lot of the BKL conversions recently were simply code locks. About those I must admit I’m not really excited.

Essentially a code lock is not that much better as the BKL. What’s good is data locking and other techniques. There

are a few projects that do this like Nick Piggin’s work on VFS layer scalability, Jens Axboe’s work on block layer scalability, the work on speculative page fault and lots of others. I think those will have much more impact on real kernel scalability than the BKL removal.

So what’s left: The final BKL removal isn’t really a big step forward for Linux. It’s more a symbolic gesture, but I prefer to leave those to politicians and priests.