I can reproduce your results on my machine with the options you list in your post.

However, if I also enable link-time optimization (that is, I also pass the -flto flag to gcc 4.7.2), the results are identical:

(I am compiling your original code, with container.push_back(Item());)
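
For readers without the question open, this is roughly what I assume the benchmark looks like. The member names a and b show up in the assembly below; the exact struct layout and the iteration count are my guesses, not the original code:

    #include <vector>

    // Hypothetical reconstruction of the benchmark -- the real code is in
    // the question. The two zero-initialized ints match the movl $0 stores
    // visible in the assembly further down.
    struct Item {
        int a;
        int b;
        Item() : a(0), b(0) {}
    };

    int main() {
        std::vector<Item> container;
        for (int i = 0; i < 10000000; ++i)   // iteration count is a guess
            container.push_back(Item());
    }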

$ g++ -std=c++11 -O3 -flto regr.cpp && perf stat -r 10 ./a.out

 Performance counter stats for './a.out' (10 runs):

          35.426793 task-clock                #    0.986 CPUs utilized            ( +-  1.75% )
                  4 context-switches          #    0.116 K/sec                    ( +-  5.69% )
                  0 CPU-migrations            #    0.006 K/sec                    ( +- 66.67% )
             19,801 page-faults               #    0.559 M/sec
         99,028,466 cycles                    #    2.795 GHz                      ( +-  1.89% ) [77.53%]
         50,721,061 stalled-cycles-frontend   #   51.22% frontend cycles idle     ( +-  3.74% ) [79.47%]
         25,585,331 stalled-cycles-backend    #   25.84% backend cycles idle      ( +-  4.90% ) [73.07%]
        141,947,224 instructions              #    1.43  insns per cycle
                                              #    0.36  stalled cycles per insn  ( +-  0.52% ) [88.72%]
         37,697,368 branches                  # 1064.092 M/sec                    ( +-  0.52% ) [88.75%]
             26,700 branch-misses             #    0.07% of all branches          ( +-  3.91% ) [83.64%]

        0.035943226 seconds time elapsed                                          ( +-  1.79% )

$ g++ -std=c++98 -O3 -flto regr.cpp && perf stat -r 10 ./a.out

 Performance counter stats for './a.out' (10 runs):

          35.510495 task-clock                #    0.988 CPUs utilized            ( +-  2.54% )
                  4 context-switches          #    0.101 K/sec                    ( +-  7.41% )
                  0 CPU-migrations            #    0.003 K/sec                    ( +-100.00% )
             19,801 page-faults               #    0.558 M/sec                    ( +-  0.00% )
         98,463,570 cycles                    #    2.773 GHz                      ( +-  1.09% ) [77.71%]
         50,079,978 stalled-cycles-frontend   #   50.86% frontend cycles idle     ( +-  2.20% ) [79.41%]
         26,270,699 stalled-cycles-backend    #   26.68% backend cycles idle      ( +-  8.91% ) [74.43%]
        141,427,211 instructions              #    1.44  insns per cycle
                                              #    0.35  stalled cycles per insn  ( +-  0.23% ) [87.66%]
         37,366,375 branches                  # 1052.263 M/sec                    ( +-  0.48% ) [88.61%]
             26,621 branch-misses             #    0.07% of all branches          ( +-  5.28% ) [83.26%]

        0.035953916 seconds time elapsed

As for the reasons, one needs to look at the generated assembly code (g++ -std=c++11 -O3 -S regr.cpp). In C++11 mode the generated code is significantly more cluttered than in C++98 mode, and inlining the function

    void std::vector<Item, std::allocator<Item>>::_M_emplace_back_aux<Item>(Item&&)

fails in C++11 mode with the default inline limit.

This failed inline has a domino effect: not because this function is being called (it is not even called!) but because we have to be prepared: if it is called, the function arguments (Item.a and Item.b) must already be in the right place. This leads to pretty messy code, as the sketch below shows.
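
To see where the extra work comes from, it helps to picture what push_back boils down to inside libstdc++. This is a heavily simplified sketch of the structure, under my reading of the library, not the actual source:

    #include <new>
    #include <utility>

    struct Item { int a, b; };

    // Simplified sketch of the shape of vector::push_back(value_type&&)
    // in C++11 mode; member names follow libstdc++, everything else is
    // condensed for illustration.
    struct MiniVec {
        Item* _M_finish;           // one past the last element
        Item* _M_end_of_storage;   // one past the allocated buffer

        // out-of-line slow path: grow the buffer, then append
        void _M_emplace_back_aux(Item&& x);

        void push_back(Item&& x) {
            if (_M_finish != _M_end_of_storage) {
                // fast path: construct in place at the end
                ::new (static_cast<void*>(_M_finish)) Item(std::move(x));
                ++_M_finish;
            } else {
                // slow path: the call that failed to inline
                _M_emplace_back_aux(std::move(x));
            }
        }
    };

When the slow path is inlined as well, the compiler can prove that the temporary Item only ever ends up in the vector's storage and construct it there directly. When the call stays out of line, the temporary must sit at a real stack address on every iteration, because _M_emplace_back_aux takes it by reference; that is exactly the %rsp traffic visible below.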

Here is the relevant part of the generated code for the case where inlining succeeds:

    .L42:
            testq   %rbx, %rbx      # container$D13376$_M_impl$_M_finish
            je      .L3     #,
            movl    $0, (%rbx)      #, container$D13376$_M_impl$_M_finish_136->a
            movl    $0, 4(%rbx)     #, container$D13376$_M_impl$_M_finish_136->b
    .L3:
            addq    $8, %rbx        #, container$D13376$_M_impl$_M_finish
            subq    $1, %rbp        #, ivtmp.106
            je      .L41    #,
    .L14:
            cmpq    %rbx, %rdx      # container$D13376$_M_impl$_M_finish, container$D13376$_M_impl$_M_end_of_storage
            jne     .L42    #,

This is a nice, compact loop. Now let's compare it to the failed-inline case:

    .L49:
            testq   %rax, %rax      # D.15772
            je      .L26    #,
            movq    16(%rsp), %rdx  # D.13379, D.13379
            movq    %rdx, (%rax)    # D.13379, *D.15772_60
    .L26:
            addq    $8, %rax        #, tmp75
            subq    $1, %rbx        #, ivtmp.117
            movq    %rax, 40(%rsp)  # tmp75, container.D.13376._M_impl._M_finish
            je      .L48    #,
    .L28:
            movq    40(%rsp), %rax  # container.D.13376._M_impl._M_finish, D.15772
            cmpq    48(%rsp), %rax  # container.D.13376._M_impl._M_end_of_storage, D.15772
            movl    $0, 16(%rsp)    #, D.13379.a
            movl    $0, 20(%rsp)    #, D.13379.b
            jne     .L49    #,
            leaq    16(%rsp), %rsi  #,
            leaq    32(%rsp), %rdi  #,
            call    _ZNSt6vectorI4ItemSaIS0_EE19_M_emplace_back_auxIIS0_EEEvDpOT_       #

This code is cluttered, and there is a lot more going on in the loop than in the previous case. Before the function call (the last line shown), the arguments must be placed appropriately:

            leaq    16(%rsp), %rsi  #,
            leaq    32(%rsp), %rdi  #,
            call    _ZNSt6vectorI4ItemSaIS0_EE19_M_emplace_back_auxIIS0_EEEvDpOT_       #

Even though this call is never actually executed, the loop still has to arrange things beforehand:

            movl    $0, 16(%rsp)    #, D.13379.a
            movl    $0, 20(%rsp)    #, D.13379.b

This is what leads to the messy code. If there is no function call because inlining succeeds, there are only 2 move instructions in the loop and no fiddling with %rsp (the stack pointer). However, if inlining fails, we get 6 moves and a lot of traffic through %rsp.

Just to substantiate my theory, here are two more measurements, both in C++11 mode (note the -finline-limit values):

$ g++ -std=c++11 -O3 -finline-limit=105 regr.cpp && perf stat -r 10 ./a.out

 Performance counter stats for './a.out' (10 runs):

          84.739057 task-clock                #    0.993 CPUs utilized            ( +-  1.34% )
                  8 context-switches          #    0.096 K/sec                    ( +-  2.22% )
                  1 CPU-migrations            #    0.009 K/sec                    ( +- 64.01% )
             19,801 page-faults               #    0.234 M/sec
        266,809,312 cycles                    #    3.149 GHz                      ( +-  0.58% ) [81.20%]
        206,804,948 stalled-cycles-frontend   #   77.51% frontend cycles idle     ( +-  0.91% ) [81.25%]
        129,078,683 stalled-cycles-backend    #   48.38% backend cycles idle      ( +-  1.37% ) [69.49%]
        183,130,306 instructions              #    0.69  insns per cycle
                                              #    1.13  stalled cycles per insn  ( +-  0.85% ) [85.35%]
         38,759,720 branches                  #  457.401 M/sec                    ( +-  0.29% ) [85.43%]
             24,527 branch-misses             #    0.06% of all branches          ( +-  2.66% ) [83.52%]

        0.085359326 seconds time elapsed                                          ( +-  1.31% )

$ g++ -std=c++11 -O3 -finline-limit=106 regr.cpp && perf stat -r 10 ./a.out

 Performance counter stats for './a.out' (10 runs):

          37.790325 task-clock                #    0.990 CPUs utilized            ( +-  2.06% )
                  4 context-switches          #    0.098 K/sec                    ( +-  5.77% )
                  0 CPU-migrations            #    0.011 K/sec                    ( +- 55.28% )
             19,801 page-faults               #    0.524 M/sec
        104,699,973 cycles                    #    2.771 GHz                      ( +-  2.04% ) [78.91%]
         58,023,151 stalled-cycles-frontend   #   55.42% frontend cycles idle     ( +-  4.03% ) [78.88%]
         30,572,036 stalled-cycles-backend    #   29.20% backend cycles idle      ( +-  5.31% ) [71.40%]
        140,669,773 instructions              #    1.34  insns per cycle
                                              #    0.41  stalled cycles per insn  ( +-  1.40% ) [88.14%]
         38,117,067 branches                  # 1008.646 M/sec                    ( +-  0.65% ) [89.38%]
             27,519 branch-misses             #    0.07% of all branches          ( +-  4.01% ) [86.16%]

        0.038187580 seconds time elapsed                                          ( +-  2.05% )

Indeed, if we ask the compiler to try just a little bit harder to inline that function, the difference in performance goes away.

So what is the takeaway from this story? Failed inlines can cost you a lot, and you should make full use of the compiler's capabilities: I can only recommend link-time optimization. It gave a significant performance boost to my programs (up to 2.5x), and all I needed to do was pass the -flto flag. That's a pretty good deal! ;)

However, I do not recommend littering your code with the inline keyword; let the compiler decide what to do. (The optimizer is allowed to treat the inline keyword as whitespace anyway.)
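
If you ever do need to overrule the heuristics in a proven hot spot, GCC has function attributes for that. A GCC-specific illustration (not something that was needed here, and the function names are made up):

    // GCC-specific attributes that override the inlining heuristics;
    // plain "inline" is merely a hint that the optimizer may ignore.
    __attribute__((always_inline)) inline int fast_path(int x) { return x + 1; }
    __attribute__((noinline))             int slow_path(int x) { return x + 1; }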

Great question, +1!