Thu 17 May 2018

I don't have the opportunity to code in Rust professionally (C++ is still king when you're at the intersection of graphics and scientific computing), but I've been following its development for quite some time now (and, shameless plug, also evaluating it for writing a sparse matrix library).

Recently, a benchmark made it to the top of /r/programming, featuring Rust among other languages, and I was a bit surprised to see that the idiomatic Rust program was not competitive with the best-tuned C++ solution. The benchmark implements a binary tree, and the C++ solution leverages raw pointers while Rust would use an Option<Box<Node>> to represent its tree. Since Option knows that Box is non-nullable, it should compile down to a raw pointer. A quick look at the Rust and C++ versions did not reveal where the performance difference came from.
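The claim about Option<Box<Node>> can be checked directly. The snippet below is a minimal sketch: the Node layout is an assumption made for illustration (the benchmark's real struct is not shown in this post), and it only compares sizes, which is the observable part of the non-nullable pointer optimization.

```rust
use std::mem::size_of;

// Hypothetical node layout, assumed for illustration only.
#[allow(dead_code)]
struct Node {
    x: i32,
    left: Option<Box<Node>>,
    right: Option<Box<Node>>,
}

fn main() {
    // `None` is encoded as the null pointer, so wrapping the Box in an
    // Option does not add a separate discriminant field.
    assert_eq!(size_of::<Option<Box<Node>>>(), size_of::<*const Node>());
    println!(
        "Option<Box<Node>>: {} bytes, *const Node: {} bytes",
        size_of::<Option<Box<Node>>>(),
        size_of::<*const Node>()
    );
}
```

So, as far as memory representation goes, the Rust tree and the raw-pointer C++ tree should be on equal footing.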

I was to take the train that evening for quite a long trip, so I decided I would investigate. I cloned the benchmark repo on my laptop, and quickly ran perf and flamegraph on the Rust version to get an idea of where the problem was. As a sanity check I also ran the C++ version under valgrind to check for memory issues and be sure the comparison would be fair. The real investigation would begin while on the train.

On the train, I realized the investigation would not be as easy as I had hoped: I had not installed all my usual development tools on my laptop, so I could not use perf or valgrind. I remembered from the earlier flamegraph that most of the time was spent inside the split_binary function.

Here's its implementation:

    fn split_binary(orig: NodeCell, value: i32) -> (NodeCell, NodeCell) {
        if let Some(mut orig_node) = orig {
            if orig_node.x < value {
                let split_pair = split_binary(orig_node.right.take(), value);
                orig_node.right = split_pair.0;
                (Some(orig_node), split_pair.1)
            } else {
                let split_pair = split_binary(orig_node.left.take(), value);
                orig_node.left = split_pair.1;
                (split_pair.0, Some(orig_node))
            }
        } else {
            (None, None)
        }
    }

and the C++ one:

    void Tree::split(NodePtr orig, NodePtr& lower, NodePtr& greaterOrEqual, int val) {
        if (!orig) {
            lower = greaterOrEqual = nullptr;
            return;
        }
        if (orig->x < val) {
            lower = orig;
            split(lower->right, lower->right, greaterOrEqual, val);
        } else {
            greaterOrEqual = orig;
            split(greaterOrEqual->left, lower, greaterOrEqual->left, val);
        }
    }
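To make the contract of the split concrete, here is a small self-contained sketch that runs the Rust function on a toy tree. The Node and NodeCell definitions and the insert and sorted_keys helpers are assumptions added for illustration; the benchmark's real definitions are not shown in this post and may differ.

```rust
// Assumed definitions: not taken from the benchmark.
type NodeCell = Option<Box<Node>>;

struct Node {
    x: i32,
    left: NodeCell,
    right: NodeCell,
}

// The function from the post: keys < value go to the first tree,
// keys >= value go to the second one.
fn split_binary(orig: NodeCell, value: i32) -> (NodeCell, NodeCell) {
    if let Some(mut orig_node) = orig {
        if orig_node.x < value {
            let split_pair = split_binary(orig_node.right.take(), value);
            orig_node.right = split_pair.0;
            (Some(orig_node), split_pair.1)
        } else {
            let split_pair = split_binary(orig_node.left.take(), value);
            orig_node.left = split_pair.1;
            (split_pair.0, Some(orig_node))
        }
    } else {
        (None, None)
    }
}

// Hypothetical helper: plain (unbalanced) binary search tree insertion.
fn insert(cell: NodeCell, x: i32) -> NodeCell {
    match cell {
        None => Some(Box::new(Node { x, left: None, right: None })),
        Some(mut node) => {
            if x < node.x {
                node.left = insert(node.left.take(), x);
            } else {
                node.right = insert(node.right.take(), x);
            }
            Some(node)
        }
    }
}

// Hypothetical helper: collect the keys with an in-order traversal.
fn sorted_keys(cell: &NodeCell) -> Vec<i32> {
    fn walk(cell: &NodeCell, out: &mut Vec<i32>) {
        if let Some(node) = cell {
            walk(&node.left, out);
            out.push(node.x);
            walk(&node.right, out);
        }
    }
    let mut out = Vec::new();
    walk(cell, &mut out);
    out
}

fn main() {
    let mut tree: NodeCell = None;
    for &x in &[5, 2, 8, 1, 3, 7, 9] {
        tree = insert(tree, x);
    }
    let (lower, greater_or_equal) = split_binary(tree, 5);
    println!("lower: {:?}", sorted_keys(&lower));
    println!("greater or equal: {:?}", sorted_keys(&greater_or_equal));
}
```

Note that no node is allocated or freed during the split: existing nodes are only relinked, which is why any destructor call inside this function is suspicious.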

Using objdump -d --demangle, I was able to compare the assembly for the Rust and C++ versions. In Rust:

    0000000000006f10 <rust::idiomatic::split_binary>:
        6f10:  41 57                  push   %r15
        6f12:  41 56                  push   %r14
        6f14:  41 54                  push   %r12
        6f16:  53                     push   %rbx
        6f17:  50                     push   %rax
        6f18:  48 89 fb               mov    %rdi,%rbx
        6f1b:  48 89 1c 24            mov    %rbx,(%rsp)
        6f1f:  48 85 db               test   %rbx,%rbx
        6f22:  74 34                  je     6f58 <rust::idiomatic::split_binary+0x48>
        6f24:  39 73 10               cmp    %esi,0x10(%rbx)
        6f27:  7d 3e                  jge    6f67 <rust::idiomatic::split_binary+0x57>
        6f29:  4c 8d 73 08            lea    0x8(%rbx),%r14
        6f2d:  48 8b 7b 08            mov    0x8(%rbx),%rdi
        6f31:  48 c7 43 08 00 00 00   movq   $0x0,0x8(%rbx)
        6f38:  00
        6f39:  e8 d2 ff ff ff         callq  6f10 <rust::idiomatic::split_binary>
        6f3e:  49 89 c7               mov    %rax,%r15
        6f41:  49 89 d4               mov    %rdx,%r12
        6f44:  4c 89 f7               mov    %r14,%rdi
        6f47:  e8 a4 fe ff ff         callq  6df0 <core::ptr::drop_in_place>
        6f4c:  4c 89 7b 08            mov    %r15,0x8(%rbx)
        6f50:  49 89 de               mov    %rbx,%r14
        6f53:  4c 89 e3               mov    %r12,%rbx
        6f56:  eb 2f                  jmp    6f87 <rust::idiomatic::split_binary+0x77>
        6f58:  48 89 e7               mov    %rsp,%rdi
        6f5b:  e8 90 fe ff ff         callq  6df0 <core::ptr::drop_in_place>
        6f60:  45 31 f6               xor    %r14d,%r14d
        6f63:  31 db                  xor    %ebx,%ebx
        6f65:  eb 20                  jmp    6f87 <rust::idiomatic::split_binary+0x77>
        6f67:  48 8b 3b               mov    (%rbx),%rdi
        6f6a:  48 c7 03 00 00 00 00   movq   $0x0,(%rbx)
        6f71:  e8 9a ff ff ff         callq  6f10 <rust::idiomatic::split_binary>
        6f76:  49 89 c6               mov    %rax,%r14
        6f79:  49 89 d7               mov    %rdx,%r15
        6f7c:  48 89 df               mov    %rbx,%rdi
        6f7f:  e8 6c fe ff ff         callq  6df0 <core::ptr::drop_in_place>
        6f84:  4c 89 3b               mov    %r15,(%rbx)
        6f87:  4c 89 f0               mov    %r14,%rax
        6f8a:  48 89 da               mov    %rbx,%rdx
        6f8d:  48 83 c4 08            add    $0x8,%rsp
        6f91:  5b                     pop    %rbx
        6f92:  41 5c                  pop    %r12
        6f94:  41 5e                  pop    %r14
        6f96:  41 5f                  pop    %r15
        6f98:  c3                     retq
        6f99:  0f 1f 80 00 00 00 00   nopl   0x0(%rax)

And in C++:

    00000000000061b0 <Tree::split(Tree::Node*, Tree::Node*&, Tree::Node*&, int)>:
        61b0:  48 85 ff               test   %rdi,%rdi
        61b3:  74 17                  je     61cc <Tree::split(Tree::Node*, Tree::Node*&, Tree::Node*&, int)+0x1c>
        61b5:  0f 1f 00               nopl   (%rax)
        61b8:  39 0f                  cmp    %ecx,(%rdi)
        61ba:  7d 24                  jge    61e0 <Tree::split(Tree::Node*, Tree::Node*&, Tree::Node*&, int)+0x30>
        61bc:  48 89 3e               mov    %rdi,(%rsi)
        61bf:  48 8d 77 10            lea    0x10(%rdi),%rsi
        61c3:  48 8b 7f 10            mov    0x10(%rdi),%rdi
        61c7:  48 85 ff               test   %rdi,%rdi
        61ca:  75 ec                  jne    61b8 <Tree::split(Tree::Node*, Tree::Node*&, Tree::Node*&, int)+0x8>
        61cc:  48 c7 02 00 00 00 00   movq   $0x0,(%rdx)
        61d3:  48 c7 06 00 00 00 00   movq   $0x0,(%rsi)
        61da:  c3                     retq
        61db:  0f 1f 44 00 00         nopl   0x0(%rax,%rax,1)
        61e0:  48 89 3a               mov    %rdi,(%rdx)
        61e3:  48 8d 57 08            lea    0x8(%rdi),%rdx
        61e7:  48 8b 7f 08            mov    0x8(%rdi),%rdi
        61eb:  48 85 ff               test   %rdi,%rdi
        61ee:  75 c8                  jne    61b8 <Tree::split(Tree::Node*, Tree::Node*&, Tree::Node*&, int)+0x8>
        61f0:  48 c7 02 00 00 00 00   movq   $0x0,(%rdx)
        61f7:  48 c7 06 00 00 00 00   movq   $0x0,(%rsi)
        61fe:  c3                     retq
        61ff:  90                     nop

Now I'm far from an expert at reading assembly, but here it looked like the main difference was coming from the drop_in_place calls. While the C++ version was not freeing any pointer, the Rust version was calling destructors. How come? I remembered that the valgrind run did not report any memory leak, so I could not blame the C++ code for forgetting to free nodes. The conclusion was clear: something in the Rust code was being dropped needlessly.

The problem lay in these lines:

    let split_pair = split_binary(orig_node.right.take(), value);
    orig_node.right = split_pair.0;

Option::take moves the right child of the node out of its Option, replacing it with None, and later that same field is overwritten by split_pair.0. But an assignment first drops the previous value of the field, and here the compiler emitted a drop_in_place call for it even though that value is always None at this point. I remembered that std::mem::forget was safe and designed for this kind of case. I therefore replaced the assignment with a swap, and asked Rust to forget the remaining None (shown here for the left branch):

    let mut split_pair = split_binary(orig_node.left.take(), value);
    ::std::mem::swap(&mut orig_node.left, &mut split_pair.1);
    ::std::mem::forget(split_pair.1);
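The difference between the two forms can be observed in isolation. The following is a minimal sketch, not code from the benchmark: Noisy, Slot, assign and swap_and_forget are hypothetical names, and the drop counter exists only to make the destructor calls visible.

```rust
use std::cell::Cell;
use std::mem;
use std::rc::Rc;

// Hypothetical payload that counts how many times it is dropped.
struct Noisy(Rc<Cell<usize>>);

impl Drop for Noisy {
    fn drop(&mut self) {
        self.0.set(self.0.get() + 1);
    }
}

// Hypothetical stand-in for a tree node with a single child slot.
struct Slot {
    child: Option<Box<Noisy>>,
}

// Plain assignment: the previous value of `slot.child` is dropped first.
fn assign(slot: &mut Slot, new: Option<Box<Noisy>>) {
    slot.child = new;
}

// Swap then forget: the previous value is moved out and never dropped.
// Only appropriate when the previous value is known to be `None`;
// forgetting a `Some` would leak its allocation.
fn swap_and_forget(slot: &mut Slot, mut new: Option<Box<Noisy>>) {
    mem::swap(&mut slot.child, &mut new);
    mem::forget(new);
}

fn main() {
    let drops = Rc::new(Cell::new(0));

    // Assigning over a `Some` runs the destructor of the old value.
    let mut slot = Slot { child: Some(Box::new(Noisy(drops.clone()))) };
    assign(&mut slot, None);
    println!("drops after assignment: {}", drops.get());

    // The slot now holds `None`, so swapping a new child in and
    // forgetting the old `None` runs no destructor at all.
    swap_and_forget(&mut slot, Some(Box::new(Noisy(drops.clone()))));
    println!("drops after swap + forget: {}", drops.get());
}
```

The swap and forget form trades a destructor call for an invariant the programmer has to uphold, namely that the overwritten slot really is None.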

With this optimization repeated everywhere I could find the issue, the execution time of the Rust variant went from 0.49s to 0.32s, closer to the C++ version which takes 0.26s. Rust should fare better in this benchmark once the PR is merged, but that is not the end of the story: C++ is still a bit faster. It looks like gcc did a huge amount of inlining. Maybe when I take the train back home I'll find out where the difference comes from. And this time, with perf installed...

Comments on /r/rust.