Hyper: Accepting connections could be faster with level polling

Created on 19 Apr 2018 · 19Comments · Source: hyperium/hyper

Disclaimer: I'm new to rust.

I've been developing a "sort of" http server (it just uses the same protocol, but in reality it only get GET requests, so no body, nothing complex really) which really needs high performance. The earlier version was written in Go, actually performs really well (benchmarks down), but I just really don't like the occasional spikes in CPU usage due to GC.

That led me to Rust. I wanted to try it out years ago, but the lack of tooling/docs just scared me away. But! I thought now is the time, let's do it. I heard many good things about it, specially it being really fast. Even saw benchmarks (like the TechEmpower one), where Rust really dominates. I was like YEAH, NICE! That's all I need.

Now here comes the problem: my app always closes the connection, sends the Connection: close HTTP header. But... those benchmarks uses keep-alive. So I did some simple "hello world" benchmarking with wrk (since my app is basically an http server), and here's my problem:

hyper

> wrk -t2 -c4 -d20s -H "Connection: close" http://127.0.0.1:8080
Running 20s test @ http://127.0.0.1:8080
  2 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   138.03us  332.44us   9.08ms   96.40%
    Req/Sec    11.95k     1.01k   13.53k    86.32%
  477757 requests in 20.10s, 43.28MB read
Requests/sec:  23769.81
Transfer/sec:      2.15MB

Avg. CPU usage during test: 360%.

> wrk -t2 -c4 -d20s -H "Connection: close" http://127.0.0.1:13370
Running 20s test @ http://127.0.0.1:13370
  2 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.03ms    9.72ms 173.51ms   98.64%
    Req/Sec    15.67k     2.63k   21.31k    76.88%
  621721 requests in 20.00s, 56.33MB read
Requests/sec:  31085.62
Transfer/sec:      2.82MB

Avg. CPU usage during test: 105% with occasional spikes to ~130% (GC).

Safe to say that's not really promising... So I went ahead, thought it might be an issue (?) in Hyper, and wrote a bare minimum implementation of the same thing on top of tokio to see if there's any difference.

Tokio

> wrk -t2 -c4 -d20s -H "Connection: close" http://127.0.0.1:8080
Running 20s test @ http://127.0.0.1:8080
  2 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   140.13us  347.23us  10.41ms   96.41%
    Req/Sec    11.98k     1.09k   16.11k    86.28%
  478077 requests in 20.10s, 43.31MB read
Requests/sec:  23786.07
Transfer/sec:      2.15MB

Avg. CPU usage during test: 360%.

Basically it's the same as hyper - which is not surprising since hyper is built on top of tokio.

Now, as I said earlier I'm very new to Rust, read the book yesterday, so I'm probably doing something wrong. Is there some kind of option I'm missing?

Thanks!

Edit: forgot to mention I'm using --release when building.

B-upstream S-performance

Source

orangesoup

Most helpful comment

So I've gone a little deeper now, wrote this little app with mio:

extern crate mio;

use std::io::{Write};
use std::net::Shutdown;

use mio::net::{TcpListener};
use mio::*;

const SERVER: Token = Token(0);

fn main() {
    let addr = "127.0.0.1:8080".parse().unwrap();
    let server = TcpListener::bind(&addr).unwrap();

    let poll = Poll::new().unwrap();

    poll.register(
        &server,
        SERVER,
        Ready::readable(),
        PollOpt::edge(),
    ).unwrap();

    let mut events = Events::with_capacity(1024);

    loop {
        poll.poll(&mut events, None).unwrap();

        for event in events.iter() {
            match event.token() {
                SERVER => {
                    let (mut stream, _) = server.accept().unwrap();
                    let _ = stream.set_keepalive(None);
                    let _ = stream.write_all(b"HTTP/1.1 200 OK\r\nContent-Length: 11\r\nContent-Type: text/plain\r\nConnection: close\r\n\r\nhello world");
                    let _ = stream.shutdown(Shutdown::Read);
                }
                _ => unreachable!(),
            }
        }
    }
}

And this got me to even lower: 22k req/s (although only ~50% cpu usage).

But... this example uses edge-triggered epoll. So I've swapped out PollOpt::edge() for PollOpt::empty(), and got... 53k req/s with only 90% avg cpu usage. This is just superb now.

I'm guessing under the hood tokio is using edge-triggered epoll, but I honestly really think this should be an available option to set at least in tokio. Probably wouldn't make much sense for hyper since most of its users will serve HTTP 1.1/2 clients with keep-alive.

orangesoup on 21 Apr 2018

🎉4 👍1

All 19 comments

Would you be able to share what versions of hyper and tokio are being used? The exact versions can be found in the Cargo.lock file.

seanmonstar on 19 Apr 2018

@seanmonstar Absolutely!

[[package]]
name = "hyper"
version = "0.11.25"
source = "registry+https://github.com/rust-lang/crates.io-index"
dependencies = [
 "base64 0.9.0 (registry+https://github.com/rust-lang/crates.io-index)",
 "bytes 0.4.6 (registry+https://github.com/rust-lang/crates.io-index)",
 "futures 0.1.21 (registry+https://github.com/rust-lang/crates.io-index)",
 "futures-cpupool 0.1.8 (registry+https://github.com/rust-lang/crates.io-index)",
 "httparse 1.2.4 (registry+https://github.com/rust-lang/crates.io-index)",
 "iovec 0.1.2 (registry+https://github.com/rust-lang/crates.io-index)",
 "language-tags 0.2.2 (registry+https://github.com/rust-lang/crates.io-index)",
 "log 0.4.1 (registry+https://github.com/rust-lang/crates.io-index)",
 "mime 0.3.5 (registry+https://github.com/rust-lang/crates.io-index)",
 "percent-encoding 1.0.1 (registry+https://github.com/rust-lang/crates.io-index)",
 "relay 0.1.1 (registry+https://github.com/rust-lang/crates.io-index)",
 "time 0.1.39 (registry+https://github.com/rust-lang/crates.io-index)",
 "tokio-core 0.1.17 (registry+https://github.com/rust-lang/crates.io-index)",
 "tokio-io 0.1.6 (registry+https://github.com/rust-lang/crates.io-index)",
 "tokio-proto 0.1.1 (registry+https://github.com/rust-lang/crates.io-index)",
 "tokio-service 0.1.0 (registry+https://github.com/rust-lang/crates.io-index)",
 "unicase 2.1.0 (registry+https://github.com/rust-lang/crates.io-index)",
]

[[package]]
name = "tokio"
version = "0.1.5"
source = "registry+https://github.com/rust-lang/crates.io-index"
dependencies = [
 "futures 0.1.21 (registry+https://github.com/rust-lang/crates.io-index)",
 "mio 0.6.14 (registry+https://github.com/rust-lang/crates.io-index)",
 "tokio-executor 0.1.2 (registry+https://github.com/rust-lang/crates.io-index)",
 "tokio-io 0.1.6 (registry+https://github.com/rust-lang/crates.io-index)",
 "tokio-reactor 0.1.1 (registry+https://github.com/rust-lang/crates.io-index)",
 "tokio-tcp 0.1.0 (registry+https://github.com/rust-lang/crates.io-index)",
 "tokio-threadpool 0.1.2 (registry+https://github.com/rust-lang/crates.io-index)",
 "tokio-timer 0.2.1 (registry+https://github.com/rust-lang/crates.io-index)",
 "tokio-udp 0.1.0 (registry+https://github.com/rust-lang/crates.io-index)",
]

orangesoup on 19 Apr 2018

This is just a hunch, and may not be the issue, but I've seen an impact on benchmarks before: try pinning to v0.1.12 of tokio-core. So, in your Cargo.toml, have the line for tokio be tokio-core = "=0.1.12".

After that version, tokio upgraded to a new reactor implementation. It's actually better in real world situations, since it more fairly distributes work on your system, but when benchmarking with wrk, where the work is always exactly the same, you only notice the cost of a fairer reactor/executor.

seanmonstar on 19 Apr 2018

Sadly it doesn't seem to compile with 0.1.12, throws a whole bunch of errors.

orangesoup on 19 Apr 2018

Hmmmm, I think for the TechEmpower benchmark, I also had to pin tokio-io directly, I remember some errors... As seen here: https://github.com/TechEmpower/FrameworkBenchmarks/blob/master/frameworks/Rust/hyper/Cargo.toml

seanmonstar on 19 Apr 2018

Yeah, that worked, made a huge difference as well. Still a ~15% off though.

wrk -t2 -c4 -d20s -H "Connection: close" http://127.0.0.1:8080
Running 20s test @ http://127.0.0.1:8080
  2 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   100.47us   63.45us   4.02ms   92.08%
    Req/Sec    13.60k     1.11k   16.16k    64.68%
  543793 requests in 20.10s, 77.27MB read
Requests/sec:  27055.20
Transfer/sec:      3.84MB

CPU usage went down from 360% to 100%.

orangesoup on 19 Apr 2018

Hm, how many threads is the Go version allowed to use? I assume if using tokio-core, you've stuck to 1 thread. That could be a difference.

seanmonstar on 20 Apr 2018

The Go version is single-threaded. Technically it uses runtime.GOMAXPROCS(runtime.NumCPU()), but it's using a single-threaded event loop and just barfing out hello world on the same thread. So even if I increase threads and connections in wrk, the CPU usage stays around 100% (increases when GC running, but that's on another thread afaik) and the throughput is around the same with less connections and threads:

wrk -t8 -c16 -d20s -H "Connection: close" http://127.0.0.1:13370
Running 20s test @ http://127.0.0.1:13370
  8 threads and 16 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.49ms   25.75ms 344.71ms   98.09%
    Req/Sec     4.80k   711.35     5.89k    86.02%
  758862 requests in 20.10s, 68.75MB read
Requests/sec:  37754.70
Transfer/sec:      3.42MB

orangesoup on 20 Apr 2018

I'd think that call would actually make the Go app use as many threads as you have CPUs. At least, that's what the docs of that method says.

As to hyper itself, I haven't really benchmarked with 1 request per connection, so it is possible there is some low hanging fruit. If I had to guess, it's probably not anything with accepting the connection, and probably something about setting up the HTTP state machine after accepted (and read and write buffers).

seanmonstar on 20 Apr 2018

Just to be sure I set GOMAXPROCS to 1, got the same result. Under the hood that app uses https://github.com/tidwall/evio, which - as states - is a single-threaded event loop.

To be fair, the pure tokio implementation (without hyper) does the same as hyper. I really don't have any deep knowledge about this, but wouldn't it suggest that the bottleneck is either in tokio or in mio?

orangesoup on 20 Apr 2018

Oh interesting. If the version not using hyper performs the same, then yea, it must be at a lower level than hyper.

seanmonstar on 20 Apr 2018

Do you reckon I should open an issue about this somewhere else? And if so, where?

orangesoup on 20 Apr 2018

What would be most helpful would require a bit more time from you, which would be graph a profile or flame graph of the version not using hyper, looking where the app is spending time, and then filing an issue on the repo that owns that method...

Another option which might take more or less time, would be to see if a small example could be made just using mio. That'd help to determine if its in tokio, or mio...

seanmonstar on 20 Apr 2018

@seanmonstar Tbh I don't feel comfortable enough in Rust yet to be sure that my tokio/mio implementation would be perfect, so I've ran multiple benchmarks for all test cases again except tokio - so hyper old/latest, and go. Done flamegraph for all options, and also used _strace_ if there's anything interesting happening there.
All these things were running separately, so while benchmarking throughput I didn't use _strace_ or _perf_ and vica versa.

Go

wrk -t2 -c4 -d20s -H "Connection: close" http://127.0.0.1:13370
Running 20s test @ http://127.0.0.1:13370
  2 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    55.62us   51.12us   3.09ms   87.85%
    Req/Sec    18.71k     2.47k   23.86k    60.60%
  746448 requests in 20.10s, 67.63MB read
Requests/sec:  37137.91
Transfer/sec:      3.36MB

Avg. CPU load: 100% (single thread).

strace

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 98.72    0.656000        7718        85         6 futex
  0.62    0.004110           0     23435           epoll_wait
  0.27    0.001764           0     23405           close
  0.19    0.001245           0     23403           write
  0.09    0.000565           0     46824           fcntl
  0.04    0.000292           0     70222           epoll_ctl
  0.04    0.000235           0     23412           accept
  0.02    0.000160           0     23407           read
  0.02    0.000127           0     23407           getsockname
  0.00    0.000006           2         4           sched_yield
  0.00    0.000000           0         2           pselect6
  0.00    0.000000           0         5           epoll_pwait
------ ----------- ----------- --------- --------- ----------------
100.00    0.664504                257611         6 total

graph

https://cdn.rawgit.com/orangesoup/09c33197de50615d0527a5088543affd/raw/351666be5c759a7c39780063208a9bc6a48f1a40/flamegraph_go.svg

hyper (tokio-core = "=0.1.12"; tokio-io = "=0.1.4")

wrk -t2 -c4 -d20s -H "Connection: close" http://127.0.0.1:8080
Running 20s test @ http://127.0.0.1:8080
  2 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    98.89us   65.99us   4.25ms   92.33%
    Req/Sec    13.47k     1.21k   15.62k    69.15%
  538551 requests in 20.10s, 76.53MB read
Requests/sec:  26793.80
Transfer/sec:      3.81MB

Avg. CPU load: 100% (as you said, this version is single-threaded).

strace

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 43.89    0.008857           0    124116           close
 33.18    0.006696           0    124112           writev
  5.62    0.001135           0    155298     31183 accept4
  4.65    0.000939           0    248230           setsockopt
  4.64    0.000937           0    124115           readv
  3.21    0.000648           0     93556           epoll_wait
  2.83    0.000572           0    124115           epoll_ctl
  1.96    0.000396           0    124115           ioctl
  0.00    0.000000           0         8         3 read
  0.00    0.000000           0         3           write
  0.00    0.000000           0         1           open
  0.00    0.000000           0         2           fstat
  0.00    0.000000           0         1           lseek
  0.00    0.000000           0         1           mmap
  0.00    0.000000           0         1           munmap
------ ----------- ----------- --------- --------- ----------------
100.00    0.020180               1117674     31186 total

graph

https://cdn.rawgit.com/orangesoup/c68ef2fe9cbe0fb43bf01f7b883d8096/raw/0de030d2f0e3465d7466ca9350b8982317b1a322/flamegraph_hyper_old.svg

hyper (latest, tokio 0.1.5)

wrk -t2 -c4 -d20s -H "Connection: close" http://127.0.0.1:8080
Running 20s test @ http://127.0.0.1:8080
  2 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   121.06us  105.47us   7.58ms   97.88%
    Req/Sec    12.00k     1.16k   13.51k    73.38%
  480046 requests in 20.10s, 68.21MB read
Requests/sec:  23883.08
Transfer/sec:      3.39MB

Avg. CPU load: 115%.

strace

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 40.55    0.009221           0    112527           close
 29.99    0.006820           0    112523           writev
  9.18    0.002087           0    225052           epoll_ctl
  5.94    0.001350           0    141657     29131 accept4
  4.49    0.001021           0    113994      1378 futex
  4.08    0.000928           0    225052           setsockopt
  3.94    0.000895           0    112526           readv
  1.83    0.000417           0    112526           ioctl
  0.00    0.000000           0         2           read
  0.00    0.000000           0         1           open
  0.00    0.000000           0         2           fstat
  0.00    0.000000           0         1           lseek
  0.00    0.000000           0         1           mmap
  0.00    0.000000           0         1           munmap
------ ----------- ----------- --------- --------- ----------------
100.00    0.022739               1155865     30509 total

graph

https://cdn.rawgit.com/orangesoup/dedaab5cd10a26ececb80b86e7b3e3aa/raw/9bcd8d92b8669f38cc75d7b7017776845b0d35f4/flamegraph_hyper_latest.svg

orangesoup on 20 Apr 2018

The flamegraphs suggest that a lot of time is spent in __libc_close... I'm just guessing wildly at this point, but I wonder if something like setting TCP_NODELAY would have any impact...

seanmonstar on 20 Apr 2018

I think I've tried using that option with the pure tokio implementation to see if there's any difference, but didn't change much really. But then again, it might be that I just messed that up totally. Is there like a pure tokio "hello world" example that's not an echo server?

orangesoup on 20 Apr 2018

So I've gone a little deeper now, wrote this little app with mio:

extern crate mio;

use std::io::{Write};
use std::net::Shutdown;

use mio::net::{TcpListener};
use mio::*;

const SERVER: Token = Token(0);

fn main() {
    let addr = "127.0.0.1:8080".parse().unwrap();
    let server = TcpListener::bind(&addr).unwrap();

    let poll = Poll::new().unwrap();

    poll.register(
        &server,
        SERVER,
        Ready::readable(),
        PollOpt::edge(),
    ).unwrap();

    let mut events = Events::with_capacity(1024);

    loop {
        poll.poll(&mut events, None).unwrap();

        for event in events.iter() {
            match event.token() {
                SERVER => {
                    let (mut stream, _) = server.accept().unwrap();
                    let _ = stream.set_keepalive(None);
                    let _ = stream.write_all(b"HTTP/1.1 200 OK\r\nContent-Length: 11\r\nContent-Type: text/plain\r\nConnection: close\r\n\r\nhello world");
                    let _ = stream.shutdown(Shutdown::Read);
                }
                _ => unreachable!(),
            }
        }
    }
}

And this got me to even lower: 22k req/s (although only ~50% cpu usage).

But... this example uses edge-triggered epoll. So I've swapped out PollOpt::edge() for PollOpt::empty(), and got... 53k req/s with only 90% avg cpu usage. This is just superb now.

orangesoup on 21 Apr 2018

🎉4 👍1

Wow, fantastic find. With that, a case could be made on the tokio repo that the TcpListener should perhaps register on level triggered events, instead of edge... Or at least be an option. Care would be needed that someone is actually accepting the sockets as fast as possible, or else epoll would start to waste CPU alerting to an event that the user isn't ready to handle...

seanmonstar on 21 Apr 2018

@orangesoup you probably get better performance w/ level because the edge version is broken.

With edge triggered, you must drain the resource. In your case, you only try accepting one socket. There are probably more sockets waiting to be accepted, but you loop back into poll. So, sockets are pending being accepted while you block in poll.