Disclaimer: I'm new to rust.
I've been developing a "sort of" http server (it just uses the same protocol, but in reality it only get GET requests, so no body, nothing complex really) which really needs high performance. The earlier version was written in Go, actually performs really well (benchmarks down), but I just really don't like the occasional spikes in CPU usage due to GC.
That led me to Rust. I wanted to try it out years ago, but the lack of tooling/docs just scared me away. But! I thought now is the time, let's do it. I heard many good things about it, specially it being really fast. Even saw benchmarks (like the TechEmpower one), where Rust really dominates. I was like YEAH, NICE! That's all I need.
Now here comes the problem: my app always closes the connection, sends the Connection: close HTTP header. But... those benchmarks uses keep-alive. So I did some simple "hello world" benchmarking with wrk (since my app is basically an http server), and here's my problem:
> wrk -t2 -c4 -d20s -H "Connection: close" http://127.0.0.1:8080
Running 20s test @ http://127.0.0.1:8080
2 threads and 4 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 138.03us 332.44us 9.08ms 96.40%
Req/Sec 11.95k 1.01k 13.53k 86.32%
477757 requests in 20.10s, 43.28MB read
Requests/sec: 23769.81
Transfer/sec: 2.15MB
Avg. CPU usage during test: 360%.
> wrk -t2 -c4 -d20s -H "Connection: close" http://127.0.0.1:13370
Running 20s test @ http://127.0.0.1:13370
2 threads and 4 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.03ms 9.72ms 173.51ms 98.64%
Req/Sec 15.67k 2.63k 21.31k 76.88%
621721 requests in 20.00s, 56.33MB read
Requests/sec: 31085.62
Transfer/sec: 2.82MB
Avg. CPU usage during test: 105% with occasional spikes to ~130% (GC).
Safe to say that's not really promising... So I went ahead, thought it might be an issue (?) in Hyper, and wrote a bare minimum implementation of the same thing on top of tokio to see if there's any difference.
> wrk -t2 -c4 -d20s -H "Connection: close" http://127.0.0.1:8080
Running 20s test @ http://127.0.0.1:8080
2 threads and 4 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 140.13us 347.23us 10.41ms 96.41%
Req/Sec 11.98k 1.09k 16.11k 86.28%
478077 requests in 20.10s, 43.31MB read
Requests/sec: 23786.07
Transfer/sec: 2.15MB
Avg. CPU usage during test: 360%.
Basically it's the same as hyper - which is not surprising since hyper is built on top of tokio.
Now, as I said earlier I'm very new to Rust, read the book yesterday, so I'm probably doing something wrong. Is there some kind of option I'm missing?
Thanks!
Edit: forgot to mention I'm using --release when building.
Would you be able to share what versions of hyper and tokio are being used? The exact versions can be found in the Cargo.lock file.
@seanmonstar Absolutely!
[[package]]
name = "hyper"
version = "0.11.25"
source = "registry+https://github.com/rust-lang/crates.io-index"
dependencies = [
"base64 0.9.0 (registry+https://github.com/rust-lang/crates.io-index)",
"bytes 0.4.6 (registry+https://github.com/rust-lang/crates.io-index)",
"futures 0.1.21 (registry+https://github.com/rust-lang/crates.io-index)",
"futures-cpupool 0.1.8 (registry+https://github.com/rust-lang/crates.io-index)",
"httparse 1.2.4 (registry+https://github.com/rust-lang/crates.io-index)",
"iovec 0.1.2 (registry+https://github.com/rust-lang/crates.io-index)",
"language-tags 0.2.2 (registry+https://github.com/rust-lang/crates.io-index)",
"log 0.4.1 (registry+https://github.com/rust-lang/crates.io-index)",
"mime 0.3.5 (registry+https://github.com/rust-lang/crates.io-index)",
"percent-encoding 1.0.1 (registry+https://github.com/rust-lang/crates.io-index)",
"relay 0.1.1 (registry+https://github.com/rust-lang/crates.io-index)",
"time 0.1.39 (registry+https://github.com/rust-lang/crates.io-index)",
"tokio-core 0.1.17 (registry+https://github.com/rust-lang/crates.io-index)",
"tokio-io 0.1.6 (registry+https://github.com/rust-lang/crates.io-index)",
"tokio-proto 0.1.1 (registry+https://github.com/rust-lang/crates.io-index)",
"tokio-service 0.1.0 (registry+https://github.com/rust-lang/crates.io-index)",
"unicase 2.1.0 (registry+https://github.com/rust-lang/crates.io-index)",
]
[[package]]
name = "tokio"
version = "0.1.5"
source = "registry+https://github.com/rust-lang/crates.io-index"
dependencies = [
"futures 0.1.21 (registry+https://github.com/rust-lang/crates.io-index)",
"mio 0.6.14 (registry+https://github.com/rust-lang/crates.io-index)",
"tokio-executor 0.1.2 (registry+https://github.com/rust-lang/crates.io-index)",
"tokio-io 0.1.6 (registry+https://github.com/rust-lang/crates.io-index)",
"tokio-reactor 0.1.1 (registry+https://github.com/rust-lang/crates.io-index)",
"tokio-tcp 0.1.0 (registry+https://github.com/rust-lang/crates.io-index)",
"tokio-threadpool 0.1.2 (registry+https://github.com/rust-lang/crates.io-index)",
"tokio-timer 0.2.1 (registry+https://github.com/rust-lang/crates.io-index)",
"tokio-udp 0.1.0 (registry+https://github.com/rust-lang/crates.io-index)",
]
This is just a hunch, and may not be the issue, but I've seen an impact on benchmarks before: try pinning to v0.1.12 of tokio-core. So, in your Cargo.toml, have the line for tokio be tokio-core = "=0.1.12".
After that version, tokio upgraded to a new reactor implementation. It's actually better in real world situations, since it more fairly distributes work on your system, but when benchmarking with wrk, where the work is always exactly the same, you only notice the cost of a fairer reactor/executor.
Sadly it doesn't seem to compile with 0.1.12, throws a whole bunch of errors.
Hmmmm, I think for the TechEmpower benchmark, I also had to pin tokio-io directly, I remember some errors... As seen here: https://github.com/TechEmpower/FrameworkBenchmarks/blob/master/frameworks/Rust/hyper/Cargo.toml
Yeah, that worked, made a huge difference as well. Still a ~15% off though.
wrk -t2 -c4 -d20s -H "Connection: close" http://127.0.0.1:8080
Running 20s test @ http://127.0.0.1:8080
2 threads and 4 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 100.47us 63.45us 4.02ms 92.08%
Req/Sec 13.60k 1.11k 16.16k 64.68%
543793 requests in 20.10s, 77.27MB read
Requests/sec: 27055.20
Transfer/sec: 3.84MB
CPU usage went down from 360% to 100%.
Hm, how many threads is the Go version allowed to use? I assume if using tokio-core, you've stuck to 1 thread. That could be a difference.
The Go version is single-threaded. Technically it uses runtime.GOMAXPROCS(runtime.NumCPU()), but it's using a single-threaded event loop and just barfing out hello world on the same thread. So even if I increase threads and connections in wrk, the CPU usage stays around 100% (increases when GC running, but that's on another thread afaik) and the throughput is around the same with less connections and threads:
wrk -t8 -c16 -d20s -H "Connection: close" http://127.0.0.1:13370
Running 20s test @ http://127.0.0.1:13370
8 threads and 16 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 3.49ms 25.75ms 344.71ms 98.09%
Req/Sec 4.80k 711.35 5.89k 86.02%
758862 requests in 20.10s, 68.75MB read
Requests/sec: 37754.70
Transfer/sec: 3.42MB
I'd think that call would actually make the Go app use as many threads as you have CPUs. At least, that's what the docs of that method says.
As to hyper itself, I haven't really benchmarked with 1 request per connection, so it is possible there is some low hanging fruit. If I had to guess, it's probably not anything with accepting the connection, and probably something about setting up the HTTP state machine after accepted (and read and write buffers).
Just to be sure I set GOMAXPROCS to 1, got the same result. Under the hood that app uses https://github.com/tidwall/evio, which - as states - is a single-threaded event loop.
To be fair, the pure tokio implementation (without hyper) does the same as hyper. I really don't have any deep knowledge about this, but wouldn't it suggest that the bottleneck is either in tokio or in mio?
Oh interesting. If the version not using hyper performs the same, then yea, it must be at a lower level than hyper.
Do you reckon I should open an issue about this somewhere else? And if so, where?
What would be most helpful would require a bit more time from you, which would be graph a profile or flame graph of the version not using hyper, looking where the app is spending time, and then filing an issue on the repo that owns that method...
Another option which might take more or less time, would be to see if a small example could be made just using mio. That'd help to determine if its in tokio, or mio...
@seanmonstar Tbh I don't feel comfortable enough in Rust yet to be sure that my tokio/mio implementation would be perfect, so I've ran multiple benchmarks for all test cases again except tokio - so hyper old/latest, and go. Done flamegraph for all options, and also used _strace_ if there's anything interesting happening there.
All these things were running separately, so while benchmarking throughput I didn't use _strace_ or _perf_ and vica versa.
wrk -t2 -c4 -d20s -H "Connection: close" http://127.0.0.1:13370
Running 20s test @ http://127.0.0.1:13370
2 threads and 4 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 55.62us 51.12us 3.09ms 87.85%
Req/Sec 18.71k 2.47k 23.86k 60.60%
746448 requests in 20.10s, 67.63MB read
Requests/sec: 37137.91
Transfer/sec: 3.36MB
Avg. CPU load: 100% (single thread).
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
98.72 0.656000 7718 85 6 futex
0.62 0.004110 0 23435 epoll_wait
0.27 0.001764 0 23405 close
0.19 0.001245 0 23403 write
0.09 0.000565 0 46824 fcntl
0.04 0.000292 0 70222 epoll_ctl
0.04 0.000235 0 23412 accept
0.02 0.000160 0 23407 read
0.02 0.000127 0 23407 getsockname
0.00 0.000006 2 4 sched_yield
0.00 0.000000 0 2 pselect6
0.00 0.000000 0 5 epoll_pwait
------ ----------- ----------- --------- --------- ----------------
100.00 0.664504 257611 6 total
wrk -t2 -c4 -d20s -H "Connection: close" http://127.0.0.1:8080
Running 20s test @ http://127.0.0.1:8080
2 threads and 4 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 98.89us 65.99us 4.25ms 92.33%
Req/Sec 13.47k 1.21k 15.62k 69.15%
538551 requests in 20.10s, 76.53MB read
Requests/sec: 26793.80
Transfer/sec: 3.81MB
Avg. CPU load: 100% (as you said, this version is single-threaded).
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
43.89 0.008857 0 124116 close
33.18 0.006696 0 124112 writev
5.62 0.001135 0 155298 31183 accept4
4.65 0.000939 0 248230 setsockopt
4.64 0.000937 0 124115 readv
3.21 0.000648 0 93556 epoll_wait
2.83 0.000572 0 124115 epoll_ctl
1.96 0.000396 0 124115 ioctl
0.00 0.000000 0 8 3 read
0.00 0.000000 0 3 write
0.00 0.000000 0 1 open
0.00 0.000000 0 2 fstat
0.00 0.000000 0 1 lseek
0.00 0.000000 0 1 mmap
0.00 0.000000 0 1 munmap
------ ----------- ----------- --------- --------- ----------------
100.00 0.020180 1117674 31186 total
wrk -t2 -c4 -d20s -H "Connection: close" http://127.0.0.1:8080
Running 20s test @ http://127.0.0.1:8080
2 threads and 4 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 121.06us 105.47us 7.58ms 97.88%
Req/Sec 12.00k 1.16k 13.51k 73.38%
480046 requests in 20.10s, 68.21MB read
Requests/sec: 23883.08
Transfer/sec: 3.39MB
Avg. CPU load: 115%.
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
40.55 0.009221 0 112527 close
29.99 0.006820 0 112523 writev
9.18 0.002087 0 225052 epoll_ctl
5.94 0.001350 0 141657 29131 accept4
4.49 0.001021 0 113994 1378 futex
4.08 0.000928 0 225052 setsockopt
3.94 0.000895 0 112526 readv
1.83 0.000417 0 112526 ioctl
0.00 0.000000 0 2 read
0.00 0.000000 0 1 open
0.00 0.000000 0 2 fstat
0.00 0.000000 0 1 lseek
0.00 0.000000 0 1 mmap
0.00 0.000000 0 1 munmap
------ ----------- ----------- --------- --------- ----------------
100.00 0.022739 1155865 30509 total
The flamegraphs suggest that a lot of time is spent in __libc_close... I'm just guessing wildly at this point, but I wonder if something like setting TCP_NODELAY would have any impact...
I think I've tried using that option with the pure tokio implementation to see if there's any difference, but didn't change much really. But then again, it might be that I just messed that up totally. Is there like a pure tokio "hello world" example that's not an echo server?
So I've gone a little deeper now, wrote this little app with mio:
extern crate mio;
use std::io::{Write};
use std::net::Shutdown;
use mio::net::{TcpListener};
use mio::*;
const SERVER: Token = Token(0);
fn main() {
let addr = "127.0.0.1:8080".parse().unwrap();
let server = TcpListener::bind(&addr).unwrap();
let poll = Poll::new().unwrap();
poll.register(
&server,
SERVER,
Ready::readable(),
PollOpt::edge(),
).unwrap();
let mut events = Events::with_capacity(1024);
loop {
poll.poll(&mut events, None).unwrap();
for event in events.iter() {
match event.token() {
SERVER => {
let (mut stream, _) = server.accept().unwrap();
let _ = stream.set_keepalive(None);
let _ = stream.write_all(b"HTTP/1.1 200 OK\r\nContent-Length: 11\r\nContent-Type: text/plain\r\nConnection: close\r\n\r\nhello world");
let _ = stream.shutdown(Shutdown::Read);
}
_ => unreachable!(),
}
}
}
}
And this got me to even lower: 22k req/s (although only ~50% cpu usage).
But... this example uses edge-triggered epoll. So I've swapped out PollOpt::edge() for PollOpt::empty(), and got... 53k req/s with only 90% avg cpu usage. This is just superb now.
I'm guessing under the hood tokio is using edge-triggered epoll, but I honestly really think this should be an available option to set at least in tokio. Probably wouldn't make much sense for hyper since most of its users will serve HTTP 1.1/2 clients with keep-alive.
Wow, fantastic find. With that, a case could be made on the tokio repo that the TcpListener should perhaps register on level triggered events, instead of edge... Or at least be an option. Care would be needed that someone is actually accepting the sockets as fast as possible, or else epoll would start to waste CPU alerting to an event that the user isn't ready to handle...
@orangesoup you probably get better performance w/ level because the edge version is broken.
With edge triggered, you must drain the resource. In your case, you only try accepting one socket. There are probably more sockets waiting to be accepted, but you loop back into poll. So, sockets are pending being accepted while you block in poll.
Most helpful comment
So I've gone a little deeper now, wrote this little app with mio:
And this got me to even lower:
22k req/s(although only ~50% cpu usage).But... this example uses edge-triggered epoll. So I've swapped out
PollOpt::edge()forPollOpt::empty(), and got... 53k req/s with only 90% avg cpu usage. This is just superb now.I'm guessing under the hood tokio is using edge-triggered epoll, but I honestly really think this should be an available option to set at least in tokio. Probably wouldn't make much sense for hyper since most of its users will serve HTTP 1.1/2 clients with keep-alive.