Like many good stories, this one started with a simple question. Should small Rust structs be passed by-copy or by-borrow? For example:

struct Vector3 {
    x: f32,
    y: f32,
    z: f32,
}

fn dot_product_by_copy(a: Vector3, b: Vector3) -> f32 {
    a.x * b.x + a.y * b.y + a.z * b.z
}

fn dot_product_by_borrow(a: &Vector3, b: &Vector3) -> f32 {
    a.x * b.x + a.y * b.y + a.z * b.z
}

This simple question sent me on a benchmarking odyssey with some surprising twists and discoveries.

Why It Matters

The answer to this question matters for two reasons — performance and ergonomics.

Performance

Passing by-copy should mean we copy 12 bytes per Vector3. Passing by-borrow should pass an 8-byte pointer per Vector3 (on 64-bit). That's close enough to maybe not matter.

But if we change f32 to f64, now it's 8 bytes (by-borrow) versus 24 bytes (by-copy). For code that uses a Vector4 of f64, we're suddenly talking about 8 bytes versus 32 bytes.
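These sizes are easy to sanity-check with `std::mem::size_of` (the struct names here are stand-ins for illustration, not code from the benchmark):

```rust
use std::mem::size_of;

// Sizes of what each calling convention actually moves (64-bit target).
#[allow(dead_code)]
struct Vec3f { x: f32, y: f32, z: f32 }
#[allow(dead_code)]
struct Vec3d { x: f64, y: f64, z: f64 }
#[allow(dead_code)]
struct Vec4d { x: f64, y: f64, z: f64, w: f64 }

fn main() {
    println!("{}", size_of::<Vec3f>());  // 12: by-copy moves 12 bytes
    println!("{}", size_of::<&Vec3f>()); // 8: by-borrow moves one pointer
    println!("{}", size_of::<Vec3d>());  // 24
    println!("{}", size_of::<Vec4d>());  // 32
}
```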

Ergonomics

In C++ I know exactly how I'd write this.

struct Vector3 {
    float x;
    float y;
    float z;
};

float dot_product(Vector3 const& a, Vector3 const& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

Easy peasy. Pass by const-reference and call it a day.

The problem with Rust is ergonomics. When passing by-copy you can combine mathematical operations cleanly and simply.

fn do_math(p1: Vector3, p2: Vector3, d1: Vector3, d2: Vector3, s: f32, t: f32) -> f32 {
    let a = p1 + s * d1;
    let b = p2 + t * d2;
    dot_product(b - a, b - a)
}

However when using borrow semantics it turns into this ugly mess:

fn do_math(p1: &Vector3, p2: &Vector3, d1: &Vector3, d2: &Vector3, s: f32, t: f32) -> f32 {
    let a = p1 + &(d1 * s);
    let b = p2 + &(d2 * t);
    dot_product(&(&b - &a), &(&b - &a))
}

Blech! Having to explicitly borrow temporary values is super gross. 🤮
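The clean by-copy style works because the operators can be implemented directly on the value type, which is possible when the type is Copy. A minimal sketch of what that looks like (these operator impls are my own illustration, not code from any of the crates mentioned):

```rust
use std::ops::{Add, Mul, Sub};

#[derive(Clone, Copy)]
struct Vector3 { x: f32, y: f32, z: f32 }

// Implementing the operators on the value type itself is what lets us
// write `p + d * s` with no explicit borrows: every operand is Copy.
impl Add for Vector3 {
    type Output = Vector3;
    fn add(self, o: Vector3) -> Vector3 {
        Vector3 { x: self.x + o.x, y: self.y + o.y, z: self.z + o.z }
    }
}

impl Sub for Vector3 {
    type Output = Vector3;
    fn sub(self, o: Vector3) -> Vector3 {
        Vector3 { x: self.x - o.x, y: self.y - o.y, z: self.z - o.z }
    }
}

impl Mul<f32> for Vector3 {
    type Output = Vector3;
    fn mul(self, s: f32) -> Vector3 {
        Vector3 { x: self.x * s, y: self.y * s, z: self.z * s }
    }
}

fn dot_product(a: Vector3, b: Vector3) -> f32 {
    a.x * b.x + a.y * b.y + a.z * b.z
}

fn main() {
    let p = Vector3 { x: 1.0, y: 0.0, z: 0.0 };
    let d = Vector3 { x: 0.0, y: 1.0, z: 0.0 };
    let q = p + d * 2.0; // no `&`, no temporaries to name
    println!("{}", dot_product(q, q)); // 1 + 4 + 0 = 5
}
```

The by-borrow style needs a second set of impls on `&Vector3`, and temporaries produced mid-expression must be borrowed explicitly, which is where the `&(&b - &a)` mess comes from.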

Building a Benchmark

So, should Rust pass small structs, like Vector3, by-copy or by-borrow?

None of Twitter, Reddit, or StackOverflow had a good answer. I checked popular crates like nalgebra (by-borrow) and cgmath (by-value) and found both ways are common.

I don't like the ergonomics of by-borrow. But what about performance? If by-copy is fast then none of this matters. So I did the only thing that seemed reasonable. I built a benchmark!

I wanted to test something slightly more than raw operator performance. It's still a silly synthetic benchmark that is not representative of a real application. But it's a good starting point. Here's roughly what I came up with.

let num_shapes = 4000;
for cycle in 0..5 {
    let (spheres, capsules, segments, triangles) = generate_shapes(num_shapes);
    for run in 0..5 {
        for (a, b) in collision_pairs {
            test_by_copy(a, b);
        }
        for (a, b) in collision_pairs {
            test_by_borrow(&a, &b);
        }
    }
}

I randomly generate 4000 spheres, capsules, segments, and triangles. Then I perform a simple overlap test for SphereSphere, SphereCapsule, CapsuleCapsule, and SegmentTriangle for all pairs. These tests are run by-copy and by-borrow. Only time spent inside test_by_copy and test_by_borrow is counted.
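The shape types and overlap tests themselves aren't shown in the post; a minimal sketch of what the simplest one, sphere–sphere, looks like in both styles (names and layout are my own assumptions):

```rust
#[derive(Clone, Copy)]
struct Sphere { x: f32, y: f32, z: f32, r: f32 }

// Two spheres overlap when the squared distance between their centers
// is no greater than the square of the sum of their radii.
fn overlap_by_copy(a: Sphere, b: Sphere) -> bool {
    let (dx, dy, dz) = (a.x - b.x, a.y - b.y, a.z - b.z);
    let r = a.r + b.r;
    dx * dx + dy * dy + dz * dz <= r * r
}

fn overlap_by_borrow(a: &Sphere, b: &Sphere) -> bool {
    let (dx, dy, dz) = (a.x - b.x, a.y - b.y, a.z - b.z);
    let r = a.r + b.r;
    dx * dx + dy * dy + dz * dz <= r * r
}

fn main() {
    let a = Sphere { x: 0.0, y: 0.0, z: 0.0, r: 1.0 };
    let b = Sphere { x: 1.5, y: 0.0, z: 0.0, r: 1.0 };
    // Centers are 1.5 apart, radii sum to 2.0, so they overlap.
    assert!(overlap_by_copy(a, b));
    assert!(overlap_by_borrow(&a, &b));
}
```

The bodies are identical; only the parameter-passing convention differs, which is exactly what the benchmark isolates.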

Each full benchmark performs 3.2 billion comparisons and finds ~220 million overlapping pairs. Here are some results running single-threaded on my beefy i7-8700k Windows desktop. All times are in milliseconds.

Rust

f32 by-copy: 7,109

f32 by-borrow: 7,172 (0.88% slower)



f64 by-copy: 9,642

f64 by-borrow: 9,601 (0.42% faster)

Well, this is mildly surprising. Passing by-copy or by-borrow barely makes a difference! These results are quite consistent, although a difference of less than 1% is well within the margin of error.

Is this the answer to our question? Should we pass by-copy and call it a day? I'm not ready to say.

Down the C++ Rabbit Hole

After my initial Rust benchmarks I decided to port my test suite to C++. The code is similar, but not identical. Both Rust and C++ implementations are what I would consider idiomatic in their respective languages.

C++

f32 by-copy: 14,526

f32 by-borrow: 13,880 (4.5% faster)



f64 by-copy: 13,439

f64 by-borrow: 13,942 (3.8% slower)

Wait, what?! At least two things are super weird here.

double by-value is faster than float by-value

C++ float is twice as slow as Rust f32

Inlining

Clearly something unexpected is going on. Using Visual Studio 2019 I grabbed a pair of quick CPU profiles.

C++ Profile

Rust Profile

Ah hah! Rust appears to be inlining almost everything. Let's copy Rust and throw a quick __forceinline in front of everything in our C++ impl.
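For reference, Rust exposes the same knobs from the other direction: rustc (via LLVM) already inlines aggressively in release builds, and you opt in or out per function with attributes. The function below is a stand-in, not code from the benchmark:

```rust
// Rust's rough analogs of MSVC's __forceinline / __declspec(noinline).
// In release builds rustc usually inlines small functions like this on
// its own, which is why the Rust benchmark needed no annotations.
#[inline(always)]
fn dot(ax: f32, ay: f32, az: f32, bx: f32, by: f32, bz: f32) -> f32 {
    ax * bx + ay * by + az * bz
}

#[inline(never)]
fn dot_no_inline(ax: f32, ay: f32, az: f32, bx: f32, by: f32, bz: f32) -> f32 {
    ax * bx + ay * by + az * bz
}

fn main() {
    // Same math either way; only the codegen differs.
    println!("{}", dot(1.0, 2.0, 3.0, 4.0, 5.0, 6.0)); // 32
    println!("{}", dot_no_inline(1.0, 2.0, 3.0, 4.0, 5.0, 6.0)); // 32
}
```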

C++ w/ inlining

f32 by-copy: 12,688

f32 by-borrow: 12,108 (4.5% faster)



f64 by-copy: 11,860

f64 by-borrow: 11,967 (0.9% slower)

Inlining the C++ provides a decent ~12% boost. But double is still faster than float, and C++ is still way slower than Rust.

Aliasing

I would consider my C++ and Rust implementations to both be idiomatic. However, they are different! The C++ version returns results through out-parameters taken by reference, while Rust returns a tuple. This is because Rust tuples are delightful to use and C++ tuples are a monstrosity. But I digress.

// Rust
fn closest_pt_segment_segment(p1: Vector3, q1: Vector3, p2: Vector3, q2: Vector3)
    -> (f32, f32, f32, Vector3, Vector3)
{
    // Do math
}

// C++
float closest_pt_segment_segment(
    Vector3 p1, Vector3 q1, Vector3 p2, Vector3 q2,
    float& s, float& t, Vector3& c1, Vector3& c2)
{
    // Do math
}

This subtle difference could have a huge impact on performance. In the C++ version, the compiler can't be sure the out-parameters aren't aliased, which may limit its ability to optimize. Meanwhile, Rust uses and returns local variables, which are known to be non-aliased.
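The contrast can be illustrated entirely in Rust (the function below is a toy, not the real segment–segment routine). Notably, in Rust even the out-parameter style carries aliasing guarantees, because `&mut` references are required to be unique:

```rust
// Two ways to return multiple values. In C++ the out-parameter style can
// block optimizations because the references might alias each other; in
// Rust, `&mut` references are guaranteed unique, so the compiler keeps
// its no-alias knowledge in both styles.
fn min_max_tuple(a: f32, b: f32) -> (f32, f32) {
    (a.min(b), a.max(b))
}

fn min_max_out(a: f32, b: f32, lo: &mut f32, hi: &mut f32) {
    *lo = a.min(b);
    *hi = a.max(b);
}

fn main() {
    let (lo, hi) = min_max_tuple(3.0, 1.0);
    let (mut lo2, mut hi2) = (0.0, 0.0);
    min_max_out(3.0, 1.0, &mut lo2, &mut hi2);
    assert_eq!((lo, hi), (lo2, hi2)); // (1.0, 3.0) both ways
}
```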

Interestingly, fixing the aliasing above doesn't make a difference! With inlining the compiler handles it already. Much to my surprise, what C++ does not handle well is the following:

void run_test(
    vector<TSphere> const& _spheres,
    vector<TCapsule> const& _capsules,
    vector<TSegment> const& _segments,
    vector<TTriangle> const& _triangles,
    int64_t& num_overlaps,
    int64_t& milliseconds)
{
    // perform overlaps
}

Changing run_test to return std::tuple<int64_t, int64_t> provides a small but noticeable improvement.

C++ w/ inlining, tuples

f32 by-copy: 12,863

f32 by-borrow: 11,555 (10.17% faster)



f64 by-copy: 11,832

f64 by-borrow: 11,524 (2.60% faster)

Compile Flags

At this point both C++ and Rust are compiling with default options. Visual Studio exposes a ton of flags. I tried tweaking a bunch of flags to improve performance.

Favor fast code (/Ot)

Disable exceptions

Advanced Vector Extensions 2 (/arch:AVX2)

Floating Point Mode: Fast (/fp:fast)

Enable Floating Point Exceptions: No (/fp:except-)

Disable security check (/GS-)

Control flow guard: No

The only flags that made a real difference were disabling exceptions and AVX2, each worth about 10%. I decided to leave AVX2 off to keep the comparison with Rust, which was also building without it, fair.
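For what it's worth, the Rust side can opt into the same vector extensions. A possible rough equivalent of /arch:AVX2 using standard rustc codegen flags (I did not re-run the benchmark this way):

```shell
# Enable AVX2 codegen for the Rust build (rough analog of /arch:AVX2).
RUSTFLAGS="-C target-feature=+avx2" cargo build --release

# Or tune codegen for the local machine entirely:
RUSTFLAGS="-C target-cpu=native" cargo build --release
```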

C++ w/ inlining, tuples, no C++ exceptions

f32 by-copy: 11,651

f32 by-borrow: 10,455 (10.27% faster)



f64 by-copy: 10,866

f64 by-borrow: 10,467 (3.67% faster)

We've made three C++ optimizations but our two mysteries remain. Why is double faster than float ? And why is C++ still so much slower than Rust?

Going Deeper

I tried looking at the disassembly in Godbolt. There are obvious differences, but I'm not smart enough to quantify them.