Why?

I’ve had experience developing both user-mode and kernel-mode software. User-mode environments have a great advantage when it comes to ease of development: the development community is huge, there are tons of examples online, and many dev tools and scripting languages are ready at hand to help with prototyping. Even when your codebase is written in a low-level language like C, you can experiment with user-mode ideas in a faster prototyping environment, like Python.

You don’t get that in kernel development. There are tons of APIs, much less documentation, and no feasible way of prototyping besides recompiling your code, reloading it and trying to measure (let’s face it, we all just printk ) the differences. Needless to say, you pay dearly for mistakes, since many of them will crash your kernel.

I got the idea that being able to easily prototype API calls, access variables and monitor the behavior of kernel functions might be useful. A more dynamic language would do well, and a REPL would be very nice.

There are tons¹ of dynamic-patching kernel tools, but I didn’t know of one providing a dynamic REPL in a convenient manner. And what would be more convenient than Python?

Basically, what I had in mind was a fusion of Frida (providing hooks, because I realized hooks would also be useful) and a Python REPL (providing easy inspection of and interaction with objects and functions).

Here’s a real-life example from this week, showing the need for a REPL: I wrote something based on the kernel’s rw_semaphore. In one function I had the read lock taken, and I needed to write. The concept of “upgradable RW locks” (atomically converting a read lock to a write lock) is well known, so I thought a quick search online would tell me whether it’s possible here. I couldn’t find anything, not even in the docs (the .h file).

A quick check in the REPL would be enough! So I headed to the terminal tab I keep open with a REPL on some QEMU VM, and typed the following:

>>> from kernel_ffi import kmalloc
>>> from struct_access import sizeof
>>>
>>> p = kmalloc(sizeof("rw_semaphore"))
>>> __init_rwsem(p)
>>>
>>> down_read(p)
>>> # yeah, can lock twice
>>> down_read(p)
>>>
>>> down_write_trylock(p)
0 # makes sense, i guess
>>> down_write(p)
# hangs forever!

Back to the story. I’ve had some experience with MicroPython, a complete Python 3 interpreter intended for microcontrollers. I’ve been using it on various chips like the ESP32 and ESP8266 for a while, and I decided it wouldn’t be too troublesome to port it to the Linux kernel: unlike CPython, MicroPython wasn’t designed to run (only) in usermode. It wasn’t designed to run in the Linux kernel either, but it makes far fewer assumptions about its environment, and that’s why it’s easier to get it running on a new “platform”.

Some challenges

Some issues and design decisions I encountered during development.

Struct Access

The kernel code defines and uses thousands of complex structures. I wanted this tool to provide human-friendly struct access, not address-based access like “read an integer at address XX” or “write a byte at address YY”.

So, for a struct like struct net_device *dev, what’s required to provide accessors like dev->stats.rx_dropped?

The first thing is some Python syntax magic for the dereferences, array accesses, etc. It’s quite cool behind the scenes, but I won’t talk about that.
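For a taste, though, here’s a toy sketch of the general trick (not the actual implementation): a proxy object resolves field names at runtime, via __getattr__, against a layout table. The layout, the field offsets and the buffer standing in for kernel memory are all made up for this sketch.

```python
import struct

# Toy layout table: struct name -> {field -> (offset, struct format)}.
# In the real tool this information comes from the compiler; here it's invented.
LAYOUTS = {
    "net_device_stats": {"rx_packets": (0, "<Q"), "rx_bytes": (8, "<Q")},
}

class StructProxy:
    """Wraps a raw buffer + base offset and resolves field names lazily."""
    def __init__(self, name, mem, base=0):
        self._layout = LAYOUTS[name]
        self._mem, self._base = mem, base

    def __getattr__(self, field):
        # Only called for names that aren't real attributes, so "rx_bytes"
        # ends up here; KeyError means "no such field in the struct".
        off, fmt = self._layout[field]
        return struct.unpack_from(fmt, self._mem, self._base + off)[0]

# Fake "kernel memory" holding rx_packets=3, rx_bytes=15158
mem = struct.pack("<QQ", 3, 15158)
stats = StructProxy("net_device_stats", mem)
print(stats.rx_bytes)  # 15158
```

Nesting (dev.stats.rx_bytes) falls out of the same idea: a field whose type is itself a struct returns another proxy at the right offset.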

The second thing, and the more interesting one, is: how does Python get to know the structure layout?

When compiling a kernel module (that uses structures), you are required to have the kernel headers and configuration. The compiler reads the definitions from them. Can we parse those headers and extract structure definitions as well?

There are a few C structure parsing libraries in Python, for example cstruct and dissect.cstruct. But if you have ever had a look at a complex kernel structure, you’d know these approaches won’t “just work”: the struct definitions make very extensive use of #ifdefs based on configurations, specific alignment requirements like __cacheline_aligned_in_smp ,² not to mention __randomize_layout …³ My point is, it would be very hard for anything that isn’t the compiler actually building the code to parse them correctly.

Who else is doing it, though? Debuggers. GDB allows you to print structs and access struct fields. How do debuggers do it?

The DWARF debugging format (embedded in ELF files when you compile with -g ) can encode struct definitions. That’s what debuggers use, and it’s also how the useful pahole tool does its tricks. There’s even a Python implementation that does more or less the same.

At that point, I assumed that relying on DWARF would require the target code to be recompiled with debug info. Now I can say it’s true. Anyway, I went ahead with the solution I had in mind…

I decided I had to follow the way GCC extracts structs, and what’s the best way to do so if not from inside GCC itself? Time to write a GCC plugin.

So I found a simple example of a GCC plugin and did what developers do best: take existing code they don’t fully understand, run it, modify it a little, run it again and learn from what’s changed. Shortly after, I had a fully functioning plugin that you can load during compilation, and it’ll dump all defined structs as nice Pythonic objects. I named it struct_layout (just like the DWARF-parsing Python project I mentioned earlier) and you can find it on GitHub.

So all you have to do is include the headers you want in an empty C file, and compile it with the plugin.

In hindsight, I could have done something similar with DWARF… ;) Just compile an empty C file including the relevant headers, and read the DWARF output from the .o file.

But I’m happy with my choice. Writing a GCC plugin was a great experience I encourage all developers to take on. It was certainly more fun than parsing the DWARF section.

The rest was simple. As I said earlier, with a bit of Python magic you can easily wrap those definitions with objects that allow nice syntactic sugar mimicking how it looks in C code.

>>> d = net_device(dev_get_by_name(init_net, "eth0"))
>>> # so this accesses the net_device's
>>> # "struct net_device_stats stats" field, and
>>> # then accesses "unsigned long rx_bytes"
>>> d.stats.rx_bytes
15158

And since it’s Python and everything’s dynamic, it also comes with a few cool features, like tab completion for fields, basic protection from NULL dereferences and array out-of-bounds accesses, unsigned/signed overflow detection, and more. Neat!
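As a toy illustration of what such a check might look like (invented names and API, not the real code), a pointer wrapper can validate every access before touching memory:

```python
class SafePointer:
    """Toy model of the safety checks mentioned above (names are made up)."""
    def __init__(self, addr, length):
        self.addr, self.length = addr, length

    def __getitem__(self, idx):
        if self.addr == 0:
            raise ValueError("NULL dereference")     # NULL protection
        if not 0 <= idx < self.length:
            raise IndexError("index out of bounds")  # array bounds check
        return self.addr + idx  # real code would read kernel memory here

null = SafePointer(0, 4)
try:
    null[0]
except ValueError as e:
    print(e)  # NULL dereference
```

Raising a Python exception instead of touching a bad address is the whole point: in the REPL you get a traceback, not a kernel oops.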

Multi-threading

MicroPython supports multi-threading out of the box; you just have to give it some basic primitives (like locks) and a TLS (thread-local storage) implementation, and you need to hack up a GC that takes all threads into account.

TLS implementations in usermode vary. On x86 , for example, you can clone a new thread with CLONE_SETTLS and the kernel will make sure that whenever your thread runs, the fs register points to your specified data structure. Another basic implementation is to use the bottom/top of the stack for storage. That’s how the kernel implements current_thread_info on many architectures; for example, on ARM:

static inline struct thread_info *current_thread_info(void)
{
	return (struct thread_info *)(current_stack_pointer & ~(THREAD_SIZE - 1));
}

These methods are useful when you control the thread: you create it, you control the entry point, and so on. But what if you’re a callback in another thread, or just hooking some code that will be run by threads you don’t control? You can’t use these tricks; you might interfere with the owner of the thread, who’s using the same method.

So, you could just pick one of the unused methods ( x86 doesn’t use the top of the stack, or the bottom, I can’t recall, so it’s free for your use). But since I needed to maintain the list of threads currently executing Python anyway (we’ll see later why), I went with a more naive implementation: making use of the existing per-thread struct, current , I keep a mapping from task_struct s to their TLS info. Then you can always look up the TLS for current .
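Here’s a usermode analog of that idea (a sketch with invented names; threading.get_ident() plays the role of current):

```python
import threading

# Usermode analog of the approach above: no register tricks, no stack tricks,
# just an explicit, lock-protected map from "current thread" to its TLS
# record. In the kernel the key would be `current` (a task_struct pointer);
# here the thread id plays that role.
_tls_lock = threading.Lock()
_tls_map = {}  # thread id -> per-thread state

def get_tls():
    with _tls_lock:
        return _tls_map.setdefault(threading.get_ident(), {})

def worker(i, out):
    get_tls()["i"] = i         # each thread only ever sees its own record
    out[i] = get_tls()["i"]

out = {}
threads = [threading.Thread(target=worker, args=(i, out)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(out == {0: 0, 1: 1, 2: 2})  # True
```

A real implementation also has to reap entries when threads die, which is the same bookkeeping as maintaining the list of Python-executing threads.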

The next major component that needs special treatment when multi-threading is the garbage collector. MicroPython uses a mark & sweep garbage collector:

In mark & sweep, no reference counts are maintained; when the GC runs, it needs to scan the entire program memory (or at least the relevant areas) for pointers into the heap. Heap blocks that are referenced are marked and added to the scan queue, recursively. All blocks still unmarked when the scan queue is empty are considered garbage and are freed.
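The algorithm above, sketched over a toy heap (addresses and the block representation are made up; the roots list stands in for the scanned stacks):

```python
# Toy heap: address -> list of addresses this block points to.
heap = {
    0x10: [0x20],
    0x20: [],
    0x30: [0x30],   # unreachable self-cycle: still collected
}

def collect(roots):
    # Mark phase: everything reachable from the roots, via a scan queue.
    marked, queue = set(), [a for a in roots if a in heap]
    while queue:
        addr = queue.pop()
        if addr in marked:
            continue
        marked.add(addr)
        queue.extend(p for p in heap[addr] if p in heap)
    # Sweep phase: free every block the mark phase never reached.
    for addr in list(heap):
        if addr not in marked:
            del heap[addr]

collect(roots=[0x10])
print(sorted(hex(a) for a in heap))  # ['0x10', '0x20']
```

Note the self-referencing block at 0x30 is freed even though it is "referenced": nothing reachable from a root points at it.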

Mark & sweep has the advantage of simpler code (no reference counting, no freeing: you just malloc stuff and everything works). But the requirement to “scan the entire program memory” is… impossible in the kernel. For example, the Python heap is carved out of the kernel heap. Do we need to scan the kernel heap as well? It might be huge!

But we just need to make sure all the pointers into the heap are scanned… If we never place root pointers (pointers that no other object refers to) globally, but only on the stacks of threads executing Python, then the scan can be narrowed down to the stacks of those threads, which is simple enough to do.

Using the list of threads we kept for TLS, we can get the stack start & end pointers of the relevant threads. The stack start can be narrowed to the “stack start when Python started”: if a kernel stack contains 10 frames and then a Python hook is called, we don’t need to scan those 10 earlier frames. Care must be taken to use the right stack start, even for a thread with nested Python calls (that is, a Python hook calling kernel code that’s again hooked by Python). And since there’s no feasible way to get the current stack pointer of a thread running on another CPU, we’ll just use the real stack end.
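A sketch of that narrowed, conservative scan (the heap bounds are hypothetical, and a packed buffer stands in for a thread’s stack image):

```python
import struct

HEAP_START, HEAP_END = 0x4000_0000, 0x4010_0000  # hypothetical heap bounds

def find_heap_pointers(stack, scan_start=0):
    """Conservatively scan a raw stack image for word-aligned 64-bit values
    that fall inside the heap, starting from the narrowed stack start."""
    hits = []
    for off in range(scan_start, len(stack) - 7, 8):
        word = struct.unpack_from("<Q", stack, off)[0]
        if HEAP_START <= word < HEAP_END:
            hits.append((off, word))  # looks like a pointer: mark its block
    return hits

# Fake stack: one heap pointer buried among two non-pointer words
stack = struct.pack("<QQQ", 0xdead, 0x4000_0040, 0x99)
print(find_heap_pointers(stack))  # [(8, 1073741888)]
```

“Conservative” means false positives are fine (an integer that happens to look like a heap pointer just keeps a block alive a bit longer), but false negatives are not.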

Combining these, the GC is deterministic enough and safe enough to run, even from kernel hooks, which is quite cool.

Threading support is not complete and I still get crashes from time to time, but it’s not something that can’t be debugged and eliminated.

And that’s enough talking for this post. Onward to the real thing!