A Quick Look at Trait Objects in Rust

Rust is a genuinely interesting programming language: it has a number of features which are without precedent in mainstream languages, and those features combine in surprising and interesting ways. In many cases, it's a plausible replacement for C: it leads to fairly fast code and, because it doesn't have a garbage collector, its memory use is more predictable, which can be useful for some programs. I've been doing an increasing amount of programming in Rust for the last 3 or 4 years, the biggest thing being the grmtools set of libraries (a good starting place for those is the quickstart guide).

However, there isn't any doubt in my mind that Rust isn't the easiest language to learn. That's partly because it's a fairly big language: in my opinion, language complexity grows exponentially with language size. But it's also partly because a lot of Rust is unfamiliar. Indeed, my first couple of years in Rust reminded me of my early fumblings with programming: a lot of activity rewarded with only limited progress. However, it's been worth it: Rust has proven to be a good match for many of the things I want to do.

One of the things that baffled me for quite a long time was Rust's "trait objects": they felt like an odd part of the language and I was never quite sure whether I was using them or not, even when I wanted to be. Since I've recently had cause to look into them in more detail, I thought it might be helpful to write a few things down, in case anyone else finds my explanation useful. The first part of this blog post covers the basics and the second part takes a look at the performance implications of the way trait objects are implemented in Rust.

The basics

In general, Rust has a very strong preference for static dispatch of function calls, which is where the function matching a call is determined at compile-time.
In other words, when I write a function call f(), the compiler statically works out which function f I'm referring to and makes my function call point directly to that function f. This stands in contrast to dynamic dispatch, where the function matching a call is only determined at run-time. Dynamic dispatch is common in languages such as Java. Consider a class C which defines a method m which one or more subclasses override. If we have an object o of type C and a function call o.m(), then we have to wait until run-time to determine whether we should call C's implementation of m or one of its subclasses'. Simplifying slightly, static dispatch leads to faster performance while dynamic dispatch gives one more flexibility when structuring a program. These features can combine in a variety of ways: some languages only offer static dispatch; some only offer dynamic dispatch; and some require users to opt-in to dynamic dispatch.

It can be easy to think Rust's traits alone imply dynamic dispatch, but that is often not the case. Consider this code:

```rust
trait T {
    fn m(&self) -> u64;
}

struct S {
    i: u64,
}

impl T for S {
    fn m(&self) -> u64 {
        self.i
    }
}

fn main() {
    let s = S { i: 100 };
    println!("{}", s.m());
}
```

Here the trait T looks a bit like it's a Java interface, requiring any class/struct which implements it to have a method m which returns an integer: indeed, the calling syntax s.m() in main looks like a method call on an object which we might well expect to be dynamically dispatched. However, the Rust compiler statically resolves the call m() to the function S::m (i.e. the implementation of m in impl T for S). This is possible because the variable s is statically determined to have type S and, since S implements the T trait which has an m method, that function is the only possible match. As this suggests, Rust's traits aren't really like Java interfaces, and structs aren't really like classes. However, Rust does allow dynamic dispatch, although at first it can feel like magic to make it happen.
Let's assume we want to make a function which can take in any struct which implements the trait T and call its m function. We might try and write this:

```rust
fn f(x: T) {
    println!("{}", x.m())
}
```

However, compiling that leads to the following scary error:

```
error[E0277]: the size for values of type `(dyn T + 'static)` cannot be known at compilation time
  --> src/main.rs:21:6
   |
21 | fn f(x: T) {
   |      ^ doesn't have a size known at compile-time
   |
   = help: the trait `std::marker::Sized` is not implemented for `(dyn T + 'static)`
   = note: to learn more, visit <https://doc.rust-lang.org/book/second-edition/ch19-04-advanced-types.html#dynamically-sized-types-and-the-sized-trait>
   = note: all local variables must have a statically known size
   = help: unsized locals are gated as an unstable feature
```

The length of that error is intimidating, but it actually says the same thing in three different ways. By specifying that we'll take in an argument of type T, we're moving the run-time value into f. However (I prefer to think of this in terms of running code) we don't know how big any given struct that implements T will be, so we can't generate a single chunk of machine code that handles it (should the machine code expect a struct of 8 bytes or 128 bytes or ...? how big should the stack frame the function requests be?).

One sort-of solution to the problem is to change f to the following:

```rust
fn f<X: T>(x: X) {
    println!("{}", x.m())
}
```

This now compiles, but does so by duplicating code such that static dispatch is still possible. The code between angle brackets <X: T> defines a type parameter X: the type passed (possibly implicitly) for that parameter must implement the trait T. This looks a bit like a long-winded way of saying fn f(x: T), but the type parameter means that monomorphisation kicks in: a specialised version of f is generated for every distinct type we pass to X, allowing Rust to make everything statically dispatched.
Sometimes that's what we want, but it's not always sensible: monomorphisation means that code bloat can be a real concern (we can end up with many separate machine code versions of f); and it can limit our ability to structure our program in a sensible fashion. Fortunately, there is an alternative solution which does enable dynamic dispatch, as shown in the following code:

```rust
trait T {
    fn m(&self) -> u64;
}

struct S1 {
    i: u64,
}

impl T for S1 {
    fn m(&self) -> u64 {
        self.i * 2
    }
}

struct S2 {
    j: u64,
}

impl T for S2 {
    fn m(&self) -> u64 {
        self.j * 4
    }
}

fn f(x: &T) {
    println!("{}", x.m())
}

fn main() {
    let s1 = S1 { i: 100 };
    f(&s1);
    let s2 = S2 { j: 100 };
    f(&s2);
}
```

Running this program prints out 200 and then 400, and that was achieved by dynamic dispatch! Hooray! But why did it work? The only real difference from our previous version is that we changed f to take a reference to an object which implements the trait T. Although we don't know how big a struct that implements T might be, every reference to an object that implements T will be the same size, so we only need a single machine code version of f to be generated.

It's at this point that the first bit of magic happens: in main we pass a reference to an object of type S1 to f, but f expects a reference to an object of type T. Why is that valid? In such cases, the compiler implicitly coerces &S1 to &T, because it knows that the struct S1 implements the trait T. Importantly, this coercion magically attaches some extra information (I'll explain more below) so that the run-time system knows it needs to call S1's methods on that object (and not, for example, S2's methods). That such an implicit coercion is possible is, in my experience, very surprising to those new to Rust. If it's any consolation, even experienced Rust programmers can fail to spot these coercions: nothing in f's signature tells you that such a coercion will happen, unless you happen to know that T is a trait, and not a struct.
To that end, recent versions of Rust let you add a syntactic signpost to make it clear:

```rust
fn f(x: &dyn T) {
    println!("{}", x.m())
}
```

The extra dyn keyword has no semantic effect, but you might feel it makes it a little more obvious that a coercion to a trait object is about to happen. Unfortunately, because the use of that keyword is currently a bit haphazard, one can never take its absence as a guarantee that dynamic dispatch won't occur.

Interestingly, there is another way to perform this same coercion, but without using references: we can put trait objects in boxes (i.e. put them on the heap). That way, no matter how big the struct stored inside the box, the size of the box we see is always the same. Here's a simple example:

```rust
fn f2(x: Box<T>) {
    println!("{}", x.m())
}

fn main() {
    let b: Box<S1> = Box::new(S1 { i: 100 });
    f2(b);
}
```

Here we have a variable of type Box<S1>, but passing it to f2 automatically coerces it to Box<T>. In a sense, this is just a variant of the reference coercion: in both cases we've turned an unsized thing (a trait object) into a sized thing (a reference or a box).

Fat pointers vs. inner vpointers

I deliberately omitted something in my earlier explanation: while it's true that all references to an object of our trait type T are of the same size, it's not necessarily true that references to objects of different types are the same size. An easy way to see that is in this code, which executes without errors:

```rust
use std::mem::size_of;

trait T { }

fn main() {
    assert_eq!(size_of::<&bool>(), size_of::<&u128>());
    assert_eq!(size_of::<&bool>(), size_of::<usize>());
    assert_eq!(size_of::<&dyn T>(), size_of::<usize>() * 2);
}
```

What this says is that the size of a reference to a bool is the same as the size of a reference to a u128 (the first assert), and both of those are a machine word big (the second assert). This isn't surprising: references are encoded as pointers. What might be surprising is that the size of a reference to a trait object is two machine words big (the third assert).
What's going on? Rust uses fat pointers extensively, including when trait objects are used. A fat pointer is simply a pointer-plus-some-other-stuff, so it is at least two machine words big. In the case of a reference to a trait object, the first machine word is a pointer to the object in memory, and the second machine word is a pointer to that object's vtable (which, itself, is a list of pointers to a struct's dynamically dispatched functions). Although it's not universally used terminology, let's call a pointer to a vtable a vpointer.

We can now make sense of the 'magic' part of the coercion from a struct to a trait object I mentioned earlier: the coercion from a pointer to a fat pointer adds the struct's vpointer to the resulting fat pointer. In other words, any trait object coerced from an S1 struct will have a fat pointer with a vpointer to v1, and any trait object coerced from an S2 struct will have a vpointer to v2. v1 will, conceptually, have a single entry pointing to S1::m and v2 a single entry pointing to S2::m. If you want, using unsafe code, you can easily tease the object pointer and vpointer apart.

If you're a Haskell or Go programmer, this use of fat pointers is probably what you expect. Personally I'm used to vpointers living alongside objects, not alongside object pointers: as far as I know the former technique doesn't have a name, so let's call it inner vpointers. For example, in a typical object orientated language, every object is dynamically dispatched, so every object carries around its own vpointer. In other words, the choices are between adding an extra machine word to pointers (fat pointers) or an extra machine word to objects (inner vpointers).

Why might Rust have plumped for fat pointers instead of inner vpointers? Considering only performance as a criterion, the downside of inner vpointers is that every object grows in size. If every function call uses an object's vpointer, this doesn't matter.
However, as I showed earlier, Rust goes out of its way to encourage you to use static dispatch: if it used inner vpointers, the vpointers would probably go unused most of the time. Fat pointers thus have the virtue of only imposing extra costs in the particular program locations where you want to use dynamic dispatch.

What are the performance trade-offs of fat pointers vs. inner vpointers? Most Rust programs will use dynamic dispatch sparingly, so any performance differences between fat pointers and inner vpointers are likely to be irrelevant. However, I want to write programs (language VMs) which use it extensively (for every user-language-level object). It's therefore of some interest to me whether there's any difference in performance between the two schemes. Unfortunately I haven't been able to turn up any relevant performance comparisons: the nearest I've seen are papers in the security world, where fat pointers are used to encode certain security properties (e.g. the second part of a fat pointer might carry around a memory block's length, so that all array accesses can be checked for safety), and thus clearly make performance worse. Our setting is quite different.

In order to make a compelling comparison, I'd need to take real programs of note and measure them rigorously under both schemes, but that's a bit difficult because I don't have any such programs yet, and won't for some time. So I'm going to have to make do with a few micro-benchmarks and the inevitable caveat: one should never assume that anything these micro-benchmarks tell us will translate to larger programs. I'm also not going to go to quite the extremes I have in the past to measure performance: I'm looking for a rough indication rather than perfection.

In order to keep things tractable, I made three assumptions about the sorts of program I'm interested in. Such programs will create trait objects occasionally but call methods on them frequently.
I therefore care a lot more about calling costs than I do about creation costs. I care more about methods which read from the self object than those that don't: are there any performance differences between the two? In general – and unlike most Rust programs – I'm likely to have a reasonable amount of aliasing of objects: does that cause any performance differences when calling functions?

In order to model this, I created a trait GetVal which contains a single method which returns an integer. I then created two structs which implement that trait: SNoRead returns a fixed integer (i.e. it doesn't read from self); and SWithRead returns an integer stored in the struct (i.e. it reads from self). Both structs are a machine word big, so they should have the same effects on the memory allocator (even though SNoRead doesn't ever read the integer stored within it). Eliding a few details, the code looks like this:

```rust
trait GetVal {
    fn val(&self) -> usize;
}

struct SNoRead {
    i: usize,
}

impl GetVal for SNoRead {
    fn val(&self) -> usize {
        0
    }
}

struct SWithRead {
    i: usize,
}

impl GetVal for SWithRead {
    fn val(&self) -> usize {
        self.i
    }
}
```

To keep things simple, our first two benchmarks will stick with fat pointers, and we'll just measure the difference between calling SNoRead::val and SWithRead::val. Eliding a number of details, our first benchmark looks like this:

```rust
fn time<F>(mut f: F)
where
    F: FnMut(),
{
    let before = Instant::now();
    for _ in 0..ITERS {
        f();
    }
    let d = Instant::now() - before;
    println!("{:?}", d.as_secs() as f64 + d.subsec_nanos() as f64 * 1e-9);
}

fn bench_fat_no_read() {
    let mut v: Vec<Box<dyn GetVal>> = Vec::with_capacity(VEC_SIZE);
    for _ in 0..VEC_SIZE {
        v.push(Box::new(SNoRead { i: 0 }));
    }
    time(|| {
        for e in &v {
            assert_eq!(e.val(), 0);
        }
    });
}
```

In essence, we create a vector with VEC_SIZE elements, each of which contains a boxed trait object.
We then time how long it takes to iterate over the vector, calling the method val on each element. Notice that we don't measure the time it takes to create the vector, and that we repeat the iteration over the vector ITERS times to make the benchmark run long enough. Although I don't show it above, in order to make the resulting measurements more reliable, each benchmark is compiled into its own binary, and each binary is rerun 30 times. Thus the numbers I'm reporting are the mean of the benchmark run across 30 process executions, and I also report 99% confidence intervals calculated from the standard deviation.

Only having one benchmark makes comparisons a little hard, so let's do the easiest variant first: a version of the above benchmark which uses SWithRead instead of SNoRead. This simply requires duplicating bench_fat_no_read, renaming it to bench_fat_with_read, and replacing SNoRead with SWithRead inside the function. We then have to decide how big to make the vector, and how many times we repeat iterating over it. I like to make benchmarks run for at least one second if possible, because that tends to lessen the effects of temporary jitter. As an initial measurement, I set VEC_SIZE to 1000000 and ITERS to 1000. Here are the results for the two benchmarks we've created thus far:

```
bench_fat_no_read: 1.708 +/- 0.0138
bench_fat_with_read: 2.152 +/- 0.0103
```

This isn't too surprising: there's an inevitable fixed cost to iterating over the vector and jumping to another function which is shared by both benchmarks. However, bench_fat_with_read also measures the cost of reading self.i in the SWithRead::val function, which slows things down by over 25%.

Now we can move onto the hard bit: creating inner vpointer variants of both benchmarks. This is a little bit fiddly, in part because we need to use unsafe Rust code.
The basic technique we need to know is that a fat pointer can be split into its constituent parts as follows:

```rust
let x: &dyn T = ...;
let (ptr, vtable) = unsafe { mem::transmute::<_, (usize, usize)>(x) };
```

You can look at transmute in several different ways, but I tend to think of it as a way of copying bits in memory and giving the copied bits an arbitrary type: in other words, it's a way of completely and utterly bypassing Rust's static type system. In this case, we take in a fat pointer which is two machine words big, and split it into two machine-word-sized pointers (which I've encoded as usizes, because it slightly reduces the amount of code I have to enter later).

What we need to do first is create a vector of thin (i.e. one machine word big) pointers to "vpointer + object" blocks on the heap. Eliding a few annoying details, here's code which does just that:

```rust
fn vec_vtable() -> Vec<*mut ()> {
    let mut v = Vec::with_capacity(VEC_SIZE);
    for _ in 0..VEC_SIZE {
        let s = SNoRead { i: 0 };
        let b = unsafe {
            let (_, vtable) = transmute::<&dyn GetVal, (usize, usize)>(&s);
            let b: *mut usize = alloc(...) as *mut usize;
            b.copy_from(&vtable, 1);
            (b.add(1) as *mut SNoRead).copy_from(&s, 1);
            b as *mut ()
        };
        v.push(b);
    }
    v
}
```

The type *mut () is Rust's rough equivalent of C's void * pointer. Every time we make a new SNoRead object, we create a trait object reference to it and pull out its vpointer (note that this will be the same value for every element in the vector). We then allocate memory sufficient to store the vpointer and the object itself: on a 64-bit machine, the vpointer will be stored at offset 0, and the object will be stored starting at offset 8. Notice that a significant portion of this function is unsafe code: that's inevitable when fiddling around with low-level pointers like this in Rust.
With that done, we can then create a benchmark:

```rust
pub fn bench_innervtable_no_read() {
    let v = vec_vtable();
    time(|| {
        for &e in &v {
            let vtable = unsafe { *(e as *const usize) };
            let obj = unsafe { (e as *const usize).add(1) };
            let b: *const dyn GetVal = unsafe { transmute((obj, vtable)) };
            assert_eq!(unsafe { (&*b).val() }, 0);
        }
    });
}
```

There are a couple of subtleties here. Notice how we recreate a fat pointer by recombining the object pointer and vpointer: the *const dyn GetVal type annotation is vital here, as otherwise Rust won't know which trait we're trying to make a fat pointer for. In order to turn a raw fat pointer into a normal reference, we have to use the &* operator. With that done, we can then call the method val. Using the same settings as for our earlier run, our new benchmark (and its with_read variant) performs as follows:

```
bench_innervtable_no_read: 2.111 +/- 0.0054
bench_innervtable_with_read: 2.128 +/- 0.0090
```

Unsurprisingly, bench_innervtable_no_read is much slower than its fat pointer cousin bench_fat_no_read: the former has to do a memory read on each iteration of the loop to read the vpointer, whereas the latter has that available in its fat pointer. bench_fat_no_read is thus very cache friendly, because all it's doing is iterating linearly through memory (the vector) and then calling the same function (SNoRead::val) repeatedly.

However, bench_innervtable_with_read is only just (taking into account the confidence intervals) slower than bench_innervtable_no_read. Why might this be? Well, reading the vpointer from memory will nearly always bring the object into the processor's cache too, making the read in SWithRead::val much cheaper afterwards.
Put another way, the first read of a random point in memory is often quite slow, as the processor waits for the value to be fetched from RAM; but reading another point almost immediately afterwards is quick, because an entire cache line (a chunk of memory: 64 bytes on x86) is brought into the processor's cache in one go. If you look carefully, bench_innervtable_with_read is ever so slightly faster than bench_fat_with_read, although not by enough for me to consider it particularly significant.

Is any of this interesting? Depending on your use case, yes, it might be. Imagine you're implementing a GUI framework, which is a classic use case for dynamic dispatch. A lot of the methods that are called will be empty, because the user doesn't need to handle the respective actions. The numbers above show that inner vpointers would slow you down in such a case: fat pointers are clearly the right choice.

Let's make our final class of benchmark: what happens if, instead of our vector pointing to VEC_SIZE distinct objects, each element points to the same underlying object? In other words, what are the performance implications of aliasing? First let's create our vector:

```rust
fn vec_multialias_vtable() -> Vec<*mut ()> {
    let ptr = {
        let s = SNoRead { i: 0 };
        unsafe {
            let (_, vtable) = transmute::<&dyn GetVal, (usize, usize)>(&s);
            let b = alloc(...) as *mut usize;
            b.copy_from(&vtable, 1);
            (b.add(1) as *mut SNoRead).copy_from(&s, 1);
            b as *mut ()
        }
    };
    vec![ptr; VEC_SIZE]
}
```

Note that we only create a single pointer to an object, duplicating that pointer (but not the object!) VEC_SIZE times in the final vec!. By now, I suspect you've got the hang of the main benchmarks, so I won't repeat those.
Let's go straight to the numbers:

```
bench_fat_multialias_no_read: 1.709 +/- 0.0104
bench_fat_multialias_with_read: 1.709 +/- 0.0099
bench_innervtable_multialias_no_read: 1.641 +/- 0.0007
bench_innervtable_multialias_with_read: 1.644 +/- 0.0115
```

Interestingly, there is now a small but noticeable difference: the inner vpointer approach is definitely faster. Why is this? Well, it's probably because the fat pointer version of this benchmark consumes more memory. Each fat pointer consumes 2 machine words so, at an abstract level, the total memory consumption of the program is (roughly) VEC_SIZE * 2 machine words (I think we can ignore the single machine word needed for the one object). In contrast, the inner vpointer version consumes only VEC_SIZE machine words. It's thus a little more cache friendly, which probably explains the 4% performance difference.

An important question is whether, even for these very limited benchmarks, anything changes if the vector size changes. Yes, it does slightly: it seems that the smaller the vector, the smaller the difference between the two approaches (as you can see in the performance numbers for a number of variants).

Conclusions

To my simple mind, trait objects in Rust are confusing because of the way they magically appear through implicit coercions. Maybe my explanation will help someone get to grips with them a little sooner than I managed to. With regards to the performance comparison, what does it mean for most people? I suggest it shows that Rust's choice of fat pointers is probably the right one: if you don't use trait objects, you don't pay any costs; and if you do use trait objects, the performance is often the best anyway.

What does this mean for me? Well, for VMs, the situation is a little different. In particular, in many VMs, the first thing a method on a VM-level object does is read from self (e.g. due to biased locking).
In such cases, and assuming a sensible object layout, the costs of fat pointers and inner vpointers are probably roughly comparable. Because most languages that aren't Rust allow aliasing, it's also likely that the run-time will see some aliasing, at which point inner vpointers might become a touch quicker; and, even if there's no performance difference, they will use slightly less memory. Of course, all of this is highly speculative, based on tiny benchmarks, and I'll probably try and keep my options open when implementing my first VM or two in Rust, to see if one approach is better in practice than the other. Don't hold your breath for new results any time soon though! If you want to run, or fiddle with, the benchmarks shown in this blog post, they're available as vtable_bench.

Acknowledgements: My thanks to Edd Barrett and Jacob Hughes for comments on this blog post.

Footnotes

[1] And, presumably, C++, though I don't know that language well enough to say for sure.

[2] Static dispatch makes inlining trivial. In my opinion, inlining is the mother of all optimisations: the more of it you can do, the faster your program will run. Note that inlining doesn't have to be done statically (though it's easier that way), but doing it dynamically tends to cause noticeable warmup.

[3] Although this error is unnecessarily verbose, in general modern rustc gives very good quality error messages: older versions of rustc were quite a lot worse in this regard. Indeed, I've noticed that the better error messages make it a lot easier for those learning Rust today compared to 2015 or so.

[4] It's tempting to call these things "generics", making them sound similar to the feature of that name found in languages such as Java. However, when you get a bit further into Rust, you start to realise that the "parameter" part of the "type parameter" name is crucial, because you can pass type parameters around statically in a way that resembles "normal" parameters.
[5] There's another possible solution which I'm not going to talk about here, involving the CoerceUnsized trait. First, this has been unstable for years, and the lack of activity on the relevant issue suggests it will be unstable for a while to come. Second, it's really hard to explain what it does.

[6] There are some restrictions on what traits can contain and still be converted into trait objects. I won't worry about those here.

[7] To some extent, one could statically optimise away some of the costs of inner vpointers by noticing that some structs are never used with dynamic dispatch. That will still allow some wastage, because some structs are only sometimes used with dynamic dispatch. It's hard to imagine an automatic analysis doing a great job of optimising such cases effectively.

[8] Although it's not quite the same thing, Gui Andrade's description of dynstack is well worth a read.

[9] Unfortunately, what unsafe code is allowed to do and what it's not allowed to do is largely unknown in Rust. This is a big problem for anyone who needs to drop out of safe Rust. There is some progress towards addressing this, but it's not going to happen quickly.

[10] I must admit that I found this operator's syntax confusing at first. &*x looks like it's saying "dereference x and get a reference to where x is stored", but it doesn't dereference x: it just reinterprets a raw pointer as a Rust reference, which has no run-time effect.