Storing unboxed trait objects in Rust

This blog post will outline the creation of dynstack, a stack data structure that stores trait objects unboxed to minimize the number of heap allocations necessary.

Part 1: Implementing polymorphism

Rust, not being an object-oriented language, doesn't quite do inheritance like the others. In C++ or Java, subclasses extend superclasses. In Rust, structs implement traits.

This section is designed to provide background about how C++ and Rust implement polymorphism. If you're already familiar with vtables and fat pointers, jump ahead to Part 2.

Virtual calls in C++

Consider a typical example of inheritance in C++:

class Animal {
public:
    virtual void MakeSound() = 0;
};

class Dog : public Animal {
public:
    void MakeSound() { printf("Ruff\n"); }
};

With C++ inheritance, we can think of Dog as having an implicit private member, DogVtable *vtable, with the following definition:

struct DogVtable {
    void (*MakeSound)(Dog*);
    // Some miscellaneous stuff.
};

If we upcast Dog to Animal and call the MakeSound method, we are in essence doing the following:

struct AnimalVtable {
    void (*MakeSound)(Animal*);
    // etc.
};

Dog d;
Dog *dog = &d;
dog->vtable->MakeSound(dog);       // prints "Ruff"
Animal *animal = &d;
animal->vtable->MakeSound(animal); // prints "Ruff"

Note why the second invocation works. animal is dog ; they both point to the same spot in memory. Assuming both Dog and Animal have their vtable members at the same offset, and assuming both vtables have the same layout (spoiler: they do), both lines are identical.

If you can make sense of x86 assembly, I encourage you to play around with this assembly explorer for C++ virtual calls.

Virtual calls in Rust

In Rust, things are a bit different. Traits have no data members, and any pre-implemented trait functions are duplicated among implementers. This means that unless you're using a trait object, Rust doesn't use vtables.

Trait objects complicate things, though. Imagine you want to have a vector of items that are all Animal s, but may otherwise be of different types.

// Consider the following definitions:
trait Animal {
    fn make_sound(&self);
}

struct Dog;
impl Animal for Dog {
    fn make_sound(&self) { println!("Ruff"); }
}

struct Cat;
impl Animal for Cat {
    fn make_sound(&self) { println!("Meow"); }
}

struct Chinchilla;
impl Animal for Chinchilla {
    fn make_sound(&self) { println!("..."); }
}

// Accompanied by the following code:
let mut animals = Vec::<dyn Animal>::new();
animals.push(Dog);
animals.push(Cat);
animals.push(Chinchilla);
for animal in animals {
    animal.make_sound();
}

Sadly, this example doesn't work. And it doesn't work because dyn Animal isn't a real object, it's a trait object.

You can think of a trait object as a sort of upcasted representation of a struct. Obviously, the only methods you can call on a trait object are the ones defined in its respective trait.

And there are two other things that are important to keep in mind with trait objects:

Note 1: trait objects are Dynamically Sized Types (DSTs).

This is pretty easy to reason about. Clearly, all of Dog, Cat, and Chinchilla have no members, so they have size 0. But maybe Kangaroo has a &Joey member, which makes it size 8 on a 64-bit system. What size does dyn Animal have, then?

Well, it doesn't have a (static) size. It's dynamically sized, so we can only actually know at runtime for some specific instantiation. This also means that we can't store trait objects on the stack, because Rust doesn't permit variable stack usage (recursion aside).
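We can check this in stable Rust. The sketch below (with a hypothetical Kangaroo stand-in, since the original Joey type isn't defined in the post) shows that while size_of::<dyn Animal>() won't compile, size_of_val can measure a specific instantiation at runtime:

```rust
use std::mem::size_of_val;

trait Animal { fn make_sound(&self); }

struct Chinchilla;
impl Animal for Chinchilla {
    fn make_sound(&self) { println!("..."); }
}

// Hypothetical stand-in for a Kangaroo holding a &Joey reference.
struct Kangaroo<'a> { joey: &'a Chinchilla }
impl<'a> Animal for Kangaroo<'a> {
    fn make_sound(&self) { self.joey.make_sound(); }
}

fn main() {
    let c = Chinchilla;
    let k = Kangaroo { joey: &c };
    // size_of::<dyn Animal>() would not compile: dyn Animal has no static size.
    // size_of_val inspects one specific instantiation at runtime instead:
    assert_eq!(size_of_val::<dyn Animal>(&c), 0);
    assert_eq!(size_of_val::<dyn Animal>(&k), 8); // one reference, on 64-bit
}
```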

Note 2: a pointer to a trait object encodes both its data address and its vtable address.

In other words, trait object pointers are fat pointers. On a 64-bit system, this fat pointer occupies 128 bits for its two component addresses.

Aside: The representation for a fat pointer, as of Rust 1.32, is a packed (*mut Struct, *const Vtable) tuple. Unfortunately, there's no guarantee that fat pointers will continue to be represented this way in the future. Still, there's no foreseeable reason it would change.
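The 128-bit claim is easy to verify with a quick sketch: a thin reference is one machine word, while a trait object reference is two (data pointer plus vtable pointer).

```rust
use std::mem::size_of;

trait Animal { fn make_sound(&self); }

fn main() {
    // A thin reference is one machine word:
    assert_eq!(size_of::<&u64>(), size_of::<usize>());
    // A trait object reference is two: data pointer + vtable pointer.
    // On a 64-bit target, that's 16 bytes, i.e. 128 bits.
    assert_eq!(size_of::<&dyn Animal>(), 2 * size_of::<usize>());
}
```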

To illustrate how these come together, let's write some Rust pseudocode that parallels the C++ example above:

let d: Dog = Dog;
let dog: &Dog = &d;
Dog::make_sound(dog); // prints "Ruff"

let animal: &dyn Animal = &d;
let (animal_data, animal_vtable) = decompose_fat_pointer(animal);
((*animal_vtable).make_sound)(animal_data); // prints "Ruff"

Again, animal_data is dog . They're both pointers to the exact same data. Also note that no vtable gets used unless we upcast to Animal .
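On current rustc we can even make the decomposition concrete. This sketch leans on the unstable representation described in the aside (the fat pointer laid out as a data/vtable pointer pair), so treat it as a demonstration, not something to rely on:

```rust
trait Animal { fn make_sound(&self); }
struct Dog;
impl Animal for Dog {
    fn make_sound(&self) { println!("Ruff"); }
}

// UNSTABLE: assumes the fat pointer is laid out as
// (data pointer, vtable pointer). Demonstration only.
fn decompose(animal: &dyn Animal) -> (*const (), *const ()) {
    unsafe { std::mem::transmute(animal) }
}

fn main() {
    let d = Dog;
    let animal: &dyn Animal = &d;
    let (data, _vtable) = decompose(animal);
    // The data half of the fat pointer is just the address of `d`:
    assert_eq!(data, &d as *const Dog as *const ());
}
```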

Again, if you can make sense of x86 assembly, I encourage you to play around with this assembly explorer for Rust virtual calls.

Part 2: Storing trait objects

Generally, if we want to store trait objects, we have to allocate them on the heap. The Box smart pointer handles this nicely for us.

let mut animals = Vec::<Box<dyn Animal>>::new();
animals.push(Box::new(Dog));
animals.push(Box::new(Cat));
animals.push(Box::new(Chinchilla));
for animal in animals {
    animal.make_sound();
}

And it works!

The only problem here is that we need a separate heap allocation every single time we push to the Vec. This is bad for performance!

We can do better

A Vec of size-N unboxed objects is laid out the following way in memory:

Offset | Data
=======|=======
+0     | item0
+N     | item1
+2N    | item2
...    | ...

While a Vec of boxed trait objects is laid out like so:

Offset | Data
=======|=======================
+0     | &item0, &item0_vtable
+16    | &item1, &item1_vtable
+32    | &item2, &item2_vtable
...    | ...

I propose a data structure called DynStack, which takes elements from both of the above. DynStack has two interior data structures:

1. An item descriptor table
2. Contiguous item memory

The item memory is conceptually simple. Consider three trait objects of sizes I, J, and K. Pushing them to the stack would result in the following:

Offset             | Data
===================|=======
+0                 | item0
+I (+ padding)     | item1
+I + J (+ padding) | item2

The item descriptor table is laid out similarly to the Vec of boxed trait objects, with one important caveat. In place of pointers to individual items, we store offsets.

Offset | Data
=======|================================
+0     | 0, &item0_vtable
+16    | I + padding, &item1_vtable
+32    | I + J + padding, &item2_vtable

Indexing an item still runs in O(1) time. We just look up the descriptor entry at i:

let (offset, vtable_ptr) = self.descriptors[i];
return create_fat_pointer(&self.data[offset], vtable_ptr);

More importantly, the number of memory allocations is now O(log N), as both the item descriptor table and contiguous item memory double in size upon reaching capacity.
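To see why doubling gives O(log N) allocations, here is a rough model (not dynstack's actual code): one initial allocation, plus one more each time the buffer doubles.

```rust
// Rough model of a doubling buffer's allocation count: one initial
// allocation, plus one reallocation per doubling. Hypothetical helper,
// for illustration only.
fn allocations_for(pushes: usize) -> u32 {
    let mut capacity = 1;
    let mut allocations = 1; // the initial allocation
    for len in 1..=pushes {
        if len > capacity {
            capacity *= 2;
            allocations += 1;
        }
    }
    allocations
}

fn main() {
    // ~log2(N) allocations, versus one allocation per item for Vec<Box<_>>:
    assert_eq!(allocations_for(1024), 11);      // 10 doublings + 1 initial
    assert_eq!(allocations_for(1_000_000), 21); // 20 doublings + 1 initial
}
```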

Part 3: Stable Rust DynStack

How can we implement DynStack::push ? Note that we need to move the trait object into our data structure.

Our structure definition:

struct DynStack<T: ?Sized> {
    // ...
}

And recall that we can't take trait object parameters by value.

First attempt: basic generics/impl Trait

impl<T: ?Sized> DynStack<T> {
    fn push<I>(&mut self, item: I) where I: T {}
    // error[E0404]: expected trait, found type parameter `T`
}

impl T has the same issue. Turns out, there's no way to parametrize your type over a trait :(

Second attempt: CoerceUnsized

Turns out, Rust has a trait available for this exact purpose.

impl<T: ?Sized> DynStack<T> {
    fn push<I>(&mut self, mut item: I)
    where
        *mut I: CoerceUnsized<*mut T>,
    {
        // ...
    }
}

And it works! Unfortunately, CoerceUnsized is unstable. And it doesn't look like it'll be stabilized any time soon. Damn.

Third attempt: macros

First, I define an unsafe push function that just takes a mutable reference to T.

impl<T: ?Sized> DynStack<T> {
    unsafe fn push(&mut self, item: &mut T) {
        // ...
    }
}

Then, a safe macro that calls push while moving the value.

macro_rules! dyn_push {
    { $stack:expr, $item:expr } => {{
        let mut t = $item;
        unsafe { $stack.push(&mut t) };
        ::std::mem::forget(t);
    }}
}

We need to make sure to forget the item because we don't want to call its destructor twice: the DynStack will drop all its items later.

We just have one problem here: Vec. It turns out that &mut Vec<K> will coerce into &mut [K], so push will receive the slice rather than the vector if we do the following:

let mut stack: DynStack<[u8]> = DynStack::new();
let item = vec![0u8; 10000000];
dyn_push!(stack, item);

The call to forget will then leak the Vec's memory.

Fourth attempt: macros redux

All we have to do is change push's &mut T argument to *mut T :

impl<T: ?Sized> DynStack<T> {
    unsafe fn push(&mut self, item: *mut T) {
        // ...
    }
}

This allows Rust to correctly error on the vector pushing code above, because &mut Vec<u8> won't coerce to *mut [u8] .
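A small sketch of the difference, with hypothetical helper functions standing in for push: deref coercion fires for references but not when coercing a reference to a raw pointer.

```rust
// Stand-ins for the two push signatures discussed above.
fn takes_mut_ref(_item: &mut [u8]) {}
fn takes_mut_ptr(_item: *mut [u8]) {}

fn main() {
    let mut v = vec![1u8, 2, 3];

    // Deref coercion silently turns &mut Vec<u8> into &mut [u8]:
    takes_mut_ref(&mut v);

    // takes_mut_ptr(&mut v); // does NOT compile: deref coercion doesn't
    //                        // kick in when coercing to a raw pointer.

    // To pass the slice as a raw pointer, you must reborrow it explicitly:
    takes_mut_ptr(&mut v[..]);
}
```

So with the *mut T signature, accidentally pushing a whole Vec becomes a compile error rather than a silent leak.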

Actually implementing push

This part is pretty simple. I'll spare you the details, but here's the pseudocode:

fn push(item_ptr) {
    next_offset = align_to(self.data.end, item_ptr)
    self.data.grow_to_accommodate(next_offset + sizeof *item_ptr)
    self.descriptors.push((next_offset, vtable_of(item_ptr)))
    self.data.copy_into(item_ptr, next_offset)
}

Part 4: Benchmarks

Benchmarks done on a 12-core Ryzen CPU with 16GB RAM, running Linux.

TL;DR: dynstack beats an equivalent Vec<Box<_>> across the board, but the size of the performance boost is heavily task- and allocator-dependent.

This benchmark pushes 4x to a Vec<Box<dyn Trait>> / DynStack<dyn Trait> as quickly as possible, without popping.

Jemalloc (as of Rust 1.32):

push_speed_naive    time: [77.472 ns 79.771 ns 82.055 ns]
push_speed_dynstack time: [72.303 ns 74.035 ns 75.572 ns]

Linux system allocator (circa Rust 1.34):

push_speed_naive    time: [62.828 ns 63.456 ns 64.181 ns]
push_speed_dynstack time: [37.408 ns 37.696 ns 37.989 ns]

This benchmark pushes 100x to a Vec<Box<Fn>> / DynStack<Fn> and then runs the whole list sequentially.

Jemalloc (as of Rust 1.32):

push_and_run_naive    time: [1.9763 us 1.9779 us 1.9796 us]
push_and_run_dynstack time: [1.5346 us 1.5366 us 1.5387 us]

Linux system allocator (circa Rust 1.34):

push_and_run_naive    time: [3.5431 us 3.5534 us 3.5684 us]
push_and_run_dynstack time: [1.8091 us 1.8118 us 1.8148 us]

This benchmark pushes once to a Vec<Box<Fn>> / DynStack<Fn> , then calls the closure, then pops it from the list. It does this 100x per iteration.

Jemalloc (as of Rust 1.32):

pseudorecursive2_naive    time: [1.5299 us 1.5307 us 1.5316 us]
pseudorecursive2_dynstack time: [1.0138 us 1.0159 us 1.0188 us]

Linux system allocator (circa Rust 1.34):