Debugging SmartOS #1

Finding Lock Inversions with DTrace

The problem was simple, a classic lock inversion bug. Two functions grabbing the same locks in opposite order. It’s an easy fix, change the bad function to use the correct order. But what if you don’t know which one is the bad function because you don’t know what the correct order is? This is the problem I faced. To find my way out of this predicament I needed to assess all uses of these locks and determine the correct order. But when you spread the locks across tens of thousands of lines of C that gets a little tricky. Tracing this type of code by hand is the work of a masochist. No human is going to get this right, and they shouldn’t have to. In this post I’m going to explain how I used DTrace to analyze lock usage in the DLS and MAC layers of the SmartOS networking stack.

When two or more threads grab two or more locks in opposite order you have lock inversion. This is a classic deadlock scenario and most often portrayed in the famous dining philosophers allegory. A bunch of hungry philosophers can’t agree on which order to grab their limited utensil set and end up starving to death. Well, they don’t always starve to death. Deadlock is not guaranteed when you have lock inversion. It requires concurrency and unfortunate timing as well (like a stool, it only works with all three legs). But you should remove lock inversion wherever you find it.

My issue dealt with the acquisition order of the global DLS and MAC perimeter locks across a large collection of functions. But what I’m going to show you works for any pair of locks. The first step to determining lock acquisition order is knowing when individual locks are acquired. To discover that you’ll want to reach for the lockstat and fbt providers.

lockstat The lockstat provider covers lock acquisition and release. It fires the lockstat:::{adaptive,rw}-acquire probes after lock acquisition and the lockstat:::{adaptive,rw}-release probes after lock release. The adaptive (mutex) probes point to the kmutex_t in arg0 and the rw probes point to the krwlock_t in arg0 . You can use these pointers in clause predicates to filter only the locks you are interested in. fbt The fbt (Function Boundary Tracing) provider traces entry and exit of kernel functions. It fires fbt:: :entry on entry and fbt:: :return on exit. The entry probe includes the typed arguments in the args[] array allowing you to inspect the function arguments as you would in C code.

You might think that lockstat is enough but it doesn’t cover what I call composite locks: a plain mutex or rw lock paired with additional state. Lockstat only covers plain locks. The MAC perimeter is a recursive mutex made up of a plain mutex, an owner field, and a counter. To track this lock I need to watch the kernel functions used to enter and exit it. That’s where the fbt provider comes in.

These two providers aren’t going to tell me when lock inversion happens. They can only tell me when a thread acquires or releases a lock. But with that information in hand it’s a simple matter of tracking state to find inversion. The key idea is this: keep thread-local state to track lock acquisition order and print the order upon acquisition of the last lock. For example, upon grabbing lock A check if the thread-local lock B flag is set, if so then I know the order was B->A. I can then print that order along with the stacktrace leading to that order. The code snippet below shows how I tracked acquisition and release of the DLS and MAC locks with thread-local variables.

lockstat:::rw-acquire /IS_DEVNET_LOCK(arg0)/ { self->dls = 1; } lockstat:::rw-release /IS_DEVNET_LOCK(arg0)/ { self->dls = 0; } fbt::i_mac_perim_enter:entry { self->mac = 1; } /* * Letting go of the last reference to the perim. */ fbt::i_mac_perim_exit:entry /IS_MAC_PERIM_CNT(args[0], 1)/ { self->mac = 0; }

Ignoring the predicate macros for a moment, this should look pretty simple. If a lock is held I set its thread-local flag to 1; otherwise it’s set to 0. The MAC perimeter is a recursive mutex so it’s not enough to just check lock release. The consumer may grab this perimeter five times and it’s not truly released until the owner releases it as many times as it acquired it. Thus I only set self->mac to 0 when the current reference count is 1, indicating the final release.

Those four clauses cover all cases of acquisition and release. Now I need to detect when both locks are held and print the stacktrace.

lockstat:::rw-acquire /IS_DEVNET_LOCK(arg0) && self->mac/ { printf("MAC -> DLS %s (%d)

", execname, pid); stack(); } /* * If MAC perim count is 0 then this is the first entry into the * perim. */ fbt::i_mac_perim_enter:entry /IS_MAC_PERIM_CNT(args[0], 0) && self->dls/ { printf("DLS -> MAC %s (%d)

", execname, pid); stack(); }

These two clauses are like the previous two entry clauses but with a slight twist: when acquiring one of the locks check if the other lock is already held, if it is then I know the order and can print the stacktrace. Once again, the MAC perimeter is more involved. I only want to print the stacktrace upon first entry to the perimeter, not on recursive entries.

And that’s all there is to it. The final thing to understand are the predicate macros I used.

#define MIS_IS_VNIC 0x0002 #define IS_VNIC(mip) (mip->mi_state_flags & MIS_IS_VNIC) #define MAC_LOWER(upper) \ ((mac_client_impl_t *)((vnic_t *)upper->mi_driver)->vn_mc_handles[0])->mci_mip #define MAC_PERIM_CNT(mip) (IS_VNIC(mip) ? \ MAC_LOWER(mip)->mi_perim_ocnt : mip->mi_perim_ocnt) #define IS_MAC_PERIM_CNT(mip, cnt) (MAC_PERIM_CNT((mip)) == (cnt)) #define IS_DEVNET_LOCK(l) ((uintptr_t)(l) == (uintptr_t)&`i_dls_devnet_lock)

I used three nice features here: typed arguments à la CTF, embedded C preprocessor, and external variables. The SmartOS kernel is compiled with Compact Type Format (CTF) support. When CTF is in play DTrace probes support an alternate argument array, args[] , containing typed arguments that you can manipulate just as you would in C. Combining this with DTrace’s CPP support I was able to lift C code from the kernel into my script. These macros are complex only because MAC is complex. Without CPP and DTrace’s type support I’d have to know the memory offsets of struct members and make binary comparisons, which would break if the struct changes. Embedded C makes this script easier to work with and less prone to breakage. The devnet lock predicate compares the kmutext_t pointer to the address of the global kernel symbol i_dls_devnet_lock . The backtick is how you reference external variables in DTrace. Allowing you to reference the value or address of kernel symbols.

Finally, here’s the output during a zone start and stop.

MAC -> DLS dladm (125046) dls`dls_devnet_set+0x6b dls`dls_devnet_create+0x4c 0xfffffffff81de5ca 0xfffffffff81dd8cd dld`drv_ioctl+0x1e4 genunix`cdev_ioctl+0x39 specfs`spec_ioctl+0x60 genunix`fop_ioctl+0x55 genunix`ioctl+0x9b unix`_sys_sysenter_post_swapgs+0x241 MAC -> DLS dladm (125046) dls`dls_devnet_hold_common+0x53 dls`dls_devnet_hold+0x17 dls`dls_devnet_setzid+0x8e dld`drv_ioc_prop_common+0x44c dld`drv_ioc_setprop+0x29 dld`drv_ioctl+0x1e4 genunix`cdev_ioctl+0x39 specfs`spec_ioctl+0x60 genunix`fop_ioctl+0x55 genunix`ioctl+0x9b unix`_sys_sysenter_post_swapgs+0x241 MAC -> DLS dladm (125047) dls`dls_devnet_rename+0xc0 dld`drv_ioc_rename+0xbb dld`drv_ioctl+0x1e4 genunix`cdev_ioctl+0x39 specfs`spec_ioctl+0x60 genunix`fop_ioctl+0x55 genunix`ioctl+0x9b unix`_sys_sysenter_post_swapgs+0x241 MAC -> DLS flowadm (124904) dls`dls_devnet_hold_common+0x53 dls`dls_devnet_hold_tmp+0x1a dls`dls_devnet_hold_link+0x2f unix`stubs_common_code+0x59 mac`mac_link_flow_walk+0x57 dld`dld_walk_flow+0xca dld`drv_ioc_walkflow+0x24 dld`drv_ioctl+0x1e4 genunix`cdev_ioctl+0x39 specfs`spec_ioctl+0x60 genunix`fop_ioctl+0x55 genunix`ioctl+0x9b unix`_sys_sysenter_post_swapgs+0x241 DLS -> MAC dlmgmtd (100016) mac`mac_perim_enter_by_mh+0x23 mac`mac_perim_enter_by_macname+0x33 dls`i_dls_devnet_setzid+0x6b dls`dls_devnet_unset+0x2b2 dls`dls_devnet_destroy+0x46 vnic`vnic_dev_delete+0x96 vnic`vnic_ioc_delete+0x28 dld`drv_ioctl+0x1e4 genunix`cdev_ioctl+0x39 specfs`spec_ioctl+0x60 genunix`fop_ioctl+0x55 genunix`ioctl+0x9b unix`sys_syscall32+0x227 DLS -> MAC dlmgmtd (100016) mac`mac_close+0x18 mac`mac_perim_exit+0x33 dls`i_dls_devnet_setzid+0xfc dls`dls_devnet_unset+0x2b2 dls`dls_devnet_destroy+0x46 vnic`vnic_dev_delete+0x96 vnic`vnic_ioc_delete+0x28 dld`drv_ioctl+0x1e4 genunix`cdev_ioctl+0x39 specfs`spec_ioctl+0x60 genunix`fop_ioctl+0x55 genunix`ioctl+0x9b unix`sys_syscall32+0x227

From this output I could see that functions called during zone start were grabbing the locks in a different order than functions called during zone shutdown. This coincided with reports in the field that mass zone restarts would sometimes cause deadlock. From this output I also noticed that most functions used MAC->DLS ordering. With this information, and some further code spelunking, I convinced myself that this is the only correct order (even though some code comments said otherwise). The real benefit of using DTrace is knowing I can trust the output (especially over my ability to trace the code by hand). I can share this output with others and they will trust it as well. That is, assuming they trust my script, which they can also verify. Finally, this script will far outlast my memory and gives future maintainers a tool to diagnose similar issues.

I can apply this same approach to three or more locks as well. To do that I’d change the thread-local variables from boolean flags to acquisition order. Here’s what such a script might look like.

#define IS_A(ptr) ((uintptr_t)(ptr) == (uintptr_t)&`lock_a) #define IS_B(ptr) ((uintptr_t)(ptr) == (uintptr_t)&`lock_b) #define IS_C(ptr) ((uintptr_t)(ptr) == (uintptr_t)&`lock_c) lockstat:::adaptive-acquire /IS_A(args[0])/ { self->cnt++; self->a = self->cnt; } lockstat:::adaptive-acquire /IS_B(args[0])/ { self->cnt++; self->b = self->cnt; } lockstat:::adaptive-acquire /IS_C(args[0])/ { self->cnt++; self->c = self->cnt; } lockstat:::adaptive-release /IS_A(args[0])/ { self->cnt--; self->a = 0; } lockstat:::adaptive-release /IS_B(args[0])/ { self->cnt--; self->b = 0; } lockstat:::adaptive-release /IS_C(args[0])/ { self->cnt--; self->c = 0; } lockstat:::adaptive-acquire /IS_A(args[0]) && self->a == 3 && self->b == 2 && self->c == 1/ { printf("C -> B -> A

"); } lockstat:::adaptive-acquire /IS_A(args[0]) && self->a == 3 && self->b == 1 && self->c == 2/ { printf("B -> C -> A

"); } lockstat:::adaptive-acquire /IS_B(args[0]) && self->b == 3 && self->a == 1 && self->c == 2/ { printf("A -> C -> B

"); } lockstat:::adaptive-acquire /IS_B(args[0]) && self->b == 3 && self->a == 2 && self->c == 1/ { printf("C -> A -> B

"); } lockstat:::adaptive-acquire /IS_C(args[0]) && self->c == 3 && self->a == 2 && self->b == 1/ { printf("B -> A -> C

"); } lockstat:::adaptive-acquire /IS_C(args[0]) && self->c == 3 && self->a == 1 && self->b == 2/ { printf("A -> B -> C

"); }

This gets ugly in a hurry. I’m using the same method as my original script which uses only predicates for flow control. This causes the number of print clauses to grow factorial to the number of locks. What if, instead, I created a thread-local variable for each slot and placed a string representation of the lock in the next open slot when the lock is acquired?

#define IS_A(ptr) ((uintptr_t)(ptr) == (uintptr_t)&`lock_a) #define IS_B(ptr) ((uintptr_t)(ptr) == (uintptr_t)&`lock_b) #define IS_C(ptr) ((uintptr_t)(ptr) == (uintptr_t)&`lock_c) #define IS_A_B_C(ptr) IS_A(ptr) || IS_B(ptr) || IS_C(ptr) #define LOCK_STR(ptr) \ IS_A(ptr) ? "A" : IS_B(ptr) ? "B" : IS_C(ptr) ? "C" : "?" lockstat:::adaptive-acquire /IS_A_B_C(args[0]) && !self->third/ { self->third = LOCK_STR(args[0]); } lockstat:::adaptive-acquire /IS_A_B_C(args[0]) && !self->second/ { self->second = LOCK_STR(args[0]); } lockstat:::adaptive-acquire /IS_A_B_C(args[0]) && !self->first/ { self->first = LOCK_STR(args[0]); } lockstat:::adaptive-release /IS_A_B_C(args[0]) && self->first && !self->second && !self->third/ { self->first = 0; } lockstat:::adaptive-release /IS_A_B_C(args[0]) && self->first && self->second && !self->third/ { self->second = 0; } lockstat:::adaptive-release /IS_A_B_C(args[0]) && self->first && self->second && self->third/ { self->third = 0; } lockstat:::adaptive-acquire /IS_A_B_C(args[0]) && self->third/ { printf("%s -> %s -> %s

", self->first, self->second, self->third); stack(); }

It’s an improvement, but a delicate one. These clauses only work if you get the order exactly right. With a newer DTrace (late 2016), making use of Matt Ahrens’ if/else support, I can make this tighter and more robust. Here’s the script as it probably should be written (but it’s still good to know the old tricks).

#define IS_A(ptr) ((uintptr_t)(ptr) == (uintptr_t)&`lock_a) #define IS_B(ptr) ((uintptr_t)(ptr) == (uintptr_t)&`lock_b) #define IS_C(ptr) ((uintptr_t)(ptr) == (uintptr_t)&`lock_c) #define IS_A_B_C(ptr) IS_A(ptr) || IS_B(ptr) || IS_C(ptr) #define LOCK_STR(ptr) \ IS_A(ptr) ? "A" : IS_B(ptr) ? "B" : IS_C(ptr) ? "C" : "?" lockstat::adaptive-acquire /IS_A_B_C(args[0])/ { if (!self->first) self->first = LOCK_STR(args[0]); else if (!self->second) self->second = LOCK_STR(args[0]); else if (!self->third) self->third = LOCK_STR(args[0]); if (self->third) { printf("%s -> %s -> %s

", self->first, self->second, self->third); stack(); } } lockstat::adaptive-release /IS_A_B_C(args[0])/ { if (self->third) self->third = 0; else if (self->second) self->second = 0; else if (self->first) self->first = 0; }

Now that’s worth taking home to mom.

Remember, whenever you find yourself debugging some code—especially legacy code with twists and turns—don’t do yourself what the computer can do for you.