In my last blog post, Boosting Hyper with MAY, you can see that the coroutine version is slower than the future version with a single worker thread. I was curious about why it is slow, and a few questions came to mind. Maybe the context-switching cost is too high? Or maybe the logic of the thread version is not as optimized as the future version? After all, the thread version of hyper is not actively developed compared with the master branch.

So I decided to profile the server to see what actually happens.

I used cargo-profiler. It is not actively developed anymore, but it is still usable.

I used the following commands to install it on my Ubuntu VM:

```
$ sudo apt-get install valgrind
$ cargo install cargo-profiler
```



Modify the echo server

We need to let the server exit normally to make the profiling tool happy. Just insert this code into the echo server example:

```rust
fn main() {
    // ......

    std::thread::spawn(|| {
        std::thread::sleep(std::time::Duration::from_secs(10));
        std::process::exit(0);
    });

    println!("Listening on http://127.0.0.1:3000");
}
```



Running the profiler

We profile the release build to see what happens inside hyper.

```
$ cd hyper
$ cargo build --example=hello --release
$ cargo profiler callgrind --bin target/release/examples/hello
```



And run the client in another terminal at the same time (`-d 10`: run for 10 seconds, `-t 2`: two client threads, `-c 20`: twenty open connections).

```
$ wrk http://127.0.0.1:3000 -d 10 -t 2 -c 20
```



After 10 seconds you will see the results printed by cargo-profiler.

The may version result

```
70,276,542 (8.9%) ???:_..std..io..Write..write_fmt..Adaptor....a$C$..T....as..core..fmt..Write..::write_str
-----------------------------------------------------------------------
64,212,448 (8.2%) memcpy-sse2-unaligned.S:__memcpy_sse2_unaligned
-----------------------------------------------------------------------
61,684,172 (7.8%) ???:_..hyper..header..NewlineReplacer....a$C$....b....as..core..fmt..Write..::write_str
-----------------------------------------------------------------------
55,352,952 (7.0%) ???:_..std..io..buffered..BufWriter..W....as..std..io..Write..::write
-----------------------------------------------------------------------
47,665,042 (6.1%) ???:_..hyper..http..h1..HttpWriter..W....as..std..io..Write..::write
-----------------------------------------------------------------------
```

The future version result

```
76,417,403 (12.0%) memcpy-sse2-unaligned.S:__memcpy_sse2_unaligned
-----------------------------------------------------------------------
71,438,100 (11.2%) memset.S:__GI_memset
-----------------------------------------------------------------------
32,188,192 (5.0%) ???:_..futures..future..map_err..MapErr..A$C$..F....as..futures..future..Future..::poll
-----------------------------------------------------------------------
27,514,102 (4.3%) ???:hyper::proto::h1::parse::_..impl..hyper..proto..Http1Transaction..for..hyper..proto..ServerTransaction..::parse
-----------------------------------------------------------------------
22,228,235 (3.5%) ???:tokio_core::reactor::Core::poll
-----------------------------------------------------------------------
19,713,445 (3.1%) ???:_..hyper..proto..dispatch..Dispatcher..D$C$..Bs$C$..I$C$..B$C$..T$C$..K....::poll_read
```

Conclusion

Apparently the hyper thread version is not fully optimized: it spends most of its time formatting strings. The future version, by contrast, spends most of its time copying memory, which makes sense since the server just echoes strings back.
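The `write_fmt` hot spot suggests that the thread version renders its response headers through the `core::fmt` machinery on every request, whereas a pre-rendered header is just a byte copy. Here is a minimal standalone sketch of the two paths (an illustration of the cost difference, not hyper's actual code):

```rust
use std::io::Write;

fn main() {
    // Formatting path: every call goes through the core::fmt machinery
    // (this is the write_fmt/write_str cost visible in the profile above).
    let mut fmt_buf = Vec::new();
    write!(fmt_buf, "HTTP/1.1 200 OK\r\nContent-Length: {}\r\n\r\n", 5).unwrap();

    // Pre-rendered path: the header is already bytes, so writing it is
    // essentially a single memcpy into the output buffer.
    let mut raw_buf = Vec::new();
    raw_buf
        .write_all(b"HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\n")
        .unwrap();

    // Both paths produce identical bytes; only the cost differs.
    assert_eq!(fmt_buf, raw_buf);
    println!("wrote {} identical bytes via both paths", raw_buf.len());
}
```

Caching fully rendered headers (or at least the invariant prefix) is the usual way to move work off the formatting path.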

We also notice that the future-based version spends a noticeable amount of time in the framework's poll method:

```
(3.5%) ???:tokio_core::reactor::Core::poll
```

The may version's context-switch runtime is not that heavy; I found this in the profile:

```
(1.8%) ???:_..F..as..generator..gen_impl..FnBox..::call_box
```

For a fair comparison on a simple HTTP echo server, I think the most appropriate candidates are tokio_minihttp and the http example in the may project. They don't involve much framework machinery and just do the echoing.

Below are the results on my Ubuntu VM.

tokio_minihttp (1 thread)

```
$ wrk http://127.0.0.1:8080 -d 10 -t 2 -c 20
Running 10s test @ http://127.0.0.1:8080
  2 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   262.43us  818.35us  17.52ms   97.63%
    Req/Sec    61.71k     9.21k   79.79k    84.50%
  1228552 requests in 10.01s, 124.19MB read
Requests/sec: 122691.03
Transfer/sec:     12.40MB
wrk http://127.0.0.1:8080 -d 10 -t 2 -c 20  1.80s user 7.25s system 90% cpu 10.017 total
```



may http example (1 thread)

```
$ wrk http://127.0.0.1:8080 -d 10 -t 2 -c 20
Running 10s test @ http://127.0.0.1:8080
  2 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   502.23us    2.12ms  28.01ms   95.76%
    Req/Sec    87.78k     8.60k  113.63k    79.50%
  1748461 requests in 10.02s, 155.07MB read
Requests/sec: 174571.22
Transfer/sec:     15.48MB
wrk http://127.0.0.1:8080 -d 10 -t 2 -c 20  1.80s user 9.59s system 113% cpu 10.019 total
```
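As a quick arithmetic check on the two Requests/sec figures above (my own calculation, not part of the benchmark output), the may example comes out roughly 42% ahead in this single-thread run:

```rust
fn main() {
    // Requests/sec reported by the two wrk runs above.
    let tokio_minihttp_rps = 122_691.03_f64;
    let may_rps = 174_571.22_f64;

    // Relative throughput of the may http example over tokio_minihttp.
    let speedup = may_rps / tokio_minihttp_rps;
    assert!(speedup > 1.42 && speedup < 1.43);
    println!("may throughput advantage: {:.1}%", (speedup - 1.0) * 100.0);
}
```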

