Contributed by pitrh on 2015-09-27 from the better-mp-safe-than-sorry dept.

David Gwynne (dlg@) writes in with our next report from the l2k15 hackathon, detailing all the networking dragons he and the crew faced.

Like mpi@, I was half prepared to spend the week discussing grand plans and ideas instead of working on code. My plan for the other half was to make the code involved in handling Ethernet packets safe to run outside the big kernel lock, right up until it runs the IP network stack. More specifically, this meant making the following bits MP safe: the if_get() API, the interface input handler list, the carp(4), vlan(4), and trunk(4) interface input handlers, and the generic Ethernet input protocol handler.

I turned up to the hackathon with a prototype for the if_get() changes for us to chew on, and most of the vlan(4) and interface input handler list changes already written. Fortunately it turned out we spent more time on code than discussion, and things moved pretty quickly.

if_get() is used to turn an interface index (which is an int that identifies an interface in the kernel, and what gets stored on a packet as part of the mbuf structure) into a pointer to an interface, but it didn't provide any guarantees that this reference would remain valid if it was held outside the big kernel lock. Another CPU could be detaching and therefore freeing the interface while the current CPU was still using it. The solution I proposed was to add reference counters to the interface structure, and have if_get() increment this refcount before returning the pointer to the caller. When it comes time to actually detach the interface, the detach code sleeps until all the references have been released by the other CPUs; only then is it safe to free and destroy the interface.

So we had if_get(), which increments the refcount, but nothing that would decrement it. Adding refcounts to interfaces meant we had to introduce an if_put() function that would decrement the refcount. None of those if_put() calls existed, and understanding some parts of the stack to figure out how to retrofit refcounting is non-trivial, so I hadn't pushed this diff hard before the hackathon.
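
What follows is a minimal userland sketch of the if_get()/if_put() idea described above, not the actual kernel code. Every `_sketch` name, the fixed-size index table, and the non-blocking if_detach_try_sketch() are assumptions made for illustration. In particular, the real detach path sleeps until the count reaches zero rather than returning, and a real MP safe lookup also has to protect the index table itself, which this sketch does not attempt.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_IFS_SKETCH 8	/* hypothetical table size */

/* Hypothetical stand-in for the kernel's interface structure. */
struct ifnet_sketch {
	int		if_index;
	atomic_int	if_refcnt;	/* references taken and not yet put */
};

/* Interface index -> interface pointer. Not itself MP safe in this sketch. */
static struct ifnet_sketch *if_table_sketch[MAX_IFS_SKETCH];

static int
if_index_valid_sketch(int idx)
{
	return (idx >= 0 && idx < MAX_IFS_SKETCH);
}

/* Register an interface under the given index; returns 0 on success. */
int
if_attach_sketch(struct ifnet_sketch *ifp, int idx)
{
	if (!if_index_valid_sketch(idx) || if_table_sketch[idx] != NULL)
		return (-1);
	ifp->if_index = idx;
	atomic_init(&ifp->if_refcnt, 0);
	if_table_sketch[idx] = ifp;
	return (0);
}

/* Look up an interface by index and take a reference on it. */
struct ifnet_sketch *
if_get_sketch(int idx)
{
	struct ifnet_sketch *ifp;

	if (!if_index_valid_sketch(idx))
		return (NULL);
	ifp = if_table_sketch[idx];
	if (ifp != NULL)
		atomic_fetch_add(&ifp->if_refcnt, 1);
	return (ifp);
}

/* Release a reference taken by if_get_sketch(); NULL is a no-op. */
void
if_put_sketch(struct ifnet_sketch *ifp)
{
	if (ifp != NULL)
		atomic_fetch_sub(&ifp->if_refcnt, 1);
}

/*
 * Unpublish the interface so no new references can be taken, then
 * report whether it is safe to free: 1 if no references remain, 0 if
 * callers still hold some. The kernel would sleep here instead of
 * returning 0.
 */
int
if_detach_try_sketch(struct ifnet_sketch *ifp)
{
	int idx = ifp->if_index;

	if (if_index_valid_sketch(idx) && if_table_sketch[idx] == ifp)
		if_table_sketch[idx] = NULL;
	return (atomic_load(&ifp->if_refcnt) == 0);
}
```

In this sketch a caller brackets every use of the pointer with a get/put pair, and the detach side frees the interface only once if_detach_try_sketch() reports 1. Finding every place that needs the matching put is the retrofit work described above.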
However, we couldn't come up with any better ideas. So mpi@, claudio@, and I knuckled down and added if_put() into the tree, and I got to push my MP safe backend for if_get() in. mpi@ and claudio@ focused on the hairiest code using if_get(), and in several cases refactored code to make the interface lifetimes easier to reason about before adding the if_put() calls. With further help from jsg@ and some static analysis tools, by the end of the week we basically had this task closed off.

In between if_get() changes I worked on getting my MP safe vlan(4) input handler changes in and hacking on carp(4) to do the same. The vlan(4) input handler needs to find which vlan interface in the system a packet on a physical interface is for. It does that by looking up the packet's vlan tag in a hash and traversing the list in the matching bucket looking for the right interface. I made that MP safe by turning the lists into SRP lists, and serialising modifications to all the hash buckets with a lock.

To tweak carp(4) I had to spend a day or two simply reading the code to understand the relationships between all the data structures and figure out what gets used in the input path. It turns out that all the carp interfaces on a parent interface are placed on a list, and each carp interface can have a list of virtual host ids to support some of the load balancing algorithms. I used mikeb@'s change to the interface input handlers to associate the list of carp interfaces with the physical interface, and then replaced both the carp interface list and vhost list with SRP lists.

The interface input handler list was a vanilla conversion of a singly linked list to an SRP list, and went in pretty early. SRP lists allow lock free traversal, so for bpf, the interface input handlers, and both carp and vlan this means we will be able to move toward processing packets for the same interface on multiple CPUs at some point and actually scale. One day...

Fixing the generic Ethernet input handler was easy.
Since it parses the packet on the stack, it was mostly already safe. The exception was revarp packet processing. Previously revarps were processed directly from ether_input, which goes and touches a lot of stuff in the ARP stack. Rather than make all of ARP MP safe, the packets are now queued for processing in softnet under the big lock like the rest of the IP and ARP stack.

The if_get()+if_put(), vlan, and ether input changes are in the tree, and I should be able to commit the carp changes soon. mikeb@ took on changing trunk(4)'s packet input processing to be MP safe and spent a good chunk of the hackathon working on it. His diffs should end up in the tree soon too. Once carp(4) and trunk(4) are committed we should be able to remove the big lock around Ethernet packet input processing.

Along the way I factored out some common code I'd been introducing in lots of places to deal with reference counting, particularly around sleeping until a refcount had dropped to zero. This is now in the tree as the refcnt API, which is loosely inspired by the FreeBSD refcount API.

Also during the hackathon there was some talk about making more drivers MP safe, mostly around how annoying it is that we need to interlock between packet processing and bringing an interface down. As a result of that discussion kettenis@ implemented intr_barrier(9), which is modelled on synchronize_irq in Linux. It basically lets you disable interrupts on the hardware, and then guarantees that the specified interrupt handler is no longer running on any other CPU by the time it returns. Using this I was able to make vmx(4) MP safe without mutexes. It was also fun to watch kettenis@ make progress on MP safety in the memory subsystems.

l2k15 was more productive than I expected. It seems we have enough infrastructure in place now to make faster progress on an MP safe network stack, and it was good to go over how to use these pieces with the other developers at the table.
There were some really good ideas thrown around too, particularly with mikeb@ and bluhm@, which I hope to try and hack up soon. I'm looking forward to what mpi@ has been cooking for the IP stack next.

Thank you to mpi@ and claudio@ for dealing with the gross bits of the if_put() conversion, and thank you to Tonimir and the other organisers of this hackathon.

Thanks for the very detailed report, David. We look forward to some more SMP goodness!