tokio: 0.1.22
tokio-current-thread: 0.1.6
tokio-executor: 0.1.8
tokio-reactor: 0.1.9
tokio-sync: 0.1.6
tokio-threadpool: 0.1.15
Linux jerry 5.1.15-arch1-1-ARCH #1 SMP PREEMPT Tue Jun 25 04:49:39 UTC 2019 x86_64 GNU/Linux
Probably somewhere in the interaction between tokio-sync, tokio-executor, and tokio-reactor.
I am using a bounded tokio-sync::mpsc channel to send requests from a number of clients (each with a Sender) to a single consumer (with a Receiver). The senders each only have one request outstanding at the time. With every request, they also include a oneshot::Sender for the receiver to return the response to the request on. The receiver is a spawned future that polls the receiver, and does some compute for every request before it returns the response on the included oneshot, and then goes on to handle the next request.
I have coded this little setup up in https://github.com/jonhoo/tower-buffer-benchmark/blob/master/src/channel.rs, along with some statistics tracking. In particular, this minimal example will run 150 senders for 10s, and then track the time from when a sender sent a request (after poll_ready returned Ready) until the receiver received that request, as well as the time from when the receiver send a response on the oneshot until the receiver sees that response.
With a service time of 10µs (see the sleep in impl Future for Svc), and a buffer length of 5, we would expect send -> recv time around 5*10µs (recall that we time from _after_ Sender::poll_ready returns). Instead, the results on a 40-core machine with two NUMA nodes are as follows:
------------------------------[ send -> recv ]------------------------------
mean: 347.9µs, p50: 352µs, p90: 368µs, p99: 385µs, p999: 401µs, max: 786µs, #: 142621
0µs - 1µs | | 0.0th %-ile
1µs - 2µs | | 0.0th %-ile
2µs - 4µs | | 0.0th %-ile
4µs - 8µs | | 0.0th %-ile
8µs - 16µs | | 0.0th %-ile
16µs - 32µs | | 0.0th %-ile
32µs - 64µs | * | 0.0th %-ile
64µs - 128µs | | 0.0th %-ile
128µs - 256µs | * | 0.0th %-ile
256µs - 512µs | **************************************** | 100.0th %-ile
512µs - 1024µs | * | 100.0th %-ile
------------------------------[ recv -> send ]------------------------------
mean: 4.1µs, p50: 8µs, p90: 8µs, p99: 8µs, p999: 16µs, max: 245µs, #: 142621
0µs - 1µs | **************************************** | 99.4th %-ile
1µs - 2µs | | 99.4th %-ile
2µs - 4µs | | 99.4th %-ile
4µs - 8µs | | 99.4th %-ile
8µs - 16µs | * | 100.0th %-ile
16µs - 32µs | * | 100.0th %-ile
32µs - 64µs | * | 100.0th %-ile
64µs - 128µs | * | 100.0th %-ile
128µs - 256µs | * | 100.0th %-ile
The results on my 8-HT laptop are:
------------------------------[ send -> recv ]------------------------------
mean: 368.7µs, p50: 385µs, p90: 385µs, p99: 409µs, p999: 442µs, max: 1556µs, #: 134878
0µs - 1µs | | 0.0th %-ile
1µs - 2µs | | 0.0th %-ile
2µs - 4µs | | 0.0th %-ile
4µs - 8µs | | 0.0th %-ile
8µs - 16µs | | 0.0th %-ile
16µs - 32µs | * | 0.0th %-ile
32µs - 64µs | | 0.0th %-ile
64µs - 128µs | * | 0.0th %-ile
128µs - 256µs | * | 0.0th %-ile
256µs - 512µs | **************************************** | 100.0th %-ile
512µs - 1024µs | * | 100.0th %-ile
1024µs - 2048µs | * | 100.0th %-ile
There is very clearly _something_ that causes the send/recv cycle to be significantly slower than expected.
If we set the service time to 0, the effect goes away:
------------------------------[ send -> recv ]------------------------------
mean: 17.8µs, p50: 24µs, p90: 32µs, p99: 40µs, p999: 49µs, max: 1212µs, #: 2708727
0µs - 1µs | * | 0.2th %-ile
1µs - 2µs | | 0.2th %-ile
2µs - 4µs | | 0.2th %-ile
4µs - 8µs | | 0.2th %-ile
8µs - 16µs | ******************* | 46.5th %-ile
16µs - 32µs | ********************* | 96.7th %-ile
32µs - 64µs | ** | 100.0th %-ile
64µs - 128µs | * | 100.0th %-ile
128µs - 256µs | * | 100.0th %-ile
256µs - 512µs | * | 100.0th %-ile
512µs - 1024µs | * | 100.0th %-ile
1024µs - 2048µs | * | 100.0th %-ile
This suggests that the issue manifests if the receiver does not yield for a time.
If we use the current thread runtime instead (with service time 10µs), we see:
------------------------------[ send -> recv ]------------------------------
mean: 137.1µs, p50: 139µs, p90: 270µs, p99: 278µs, p999: 286µs, max: 335µs, #: 146280
0µs - 1µs | ***** | 11.8th %-ile
1µs - 2µs | | 11.8th %-ile
2µs - 4µs | | 11.8th %-ile
4µs - 8µs | | 11.8th %-ile
8µs - 16µs | **** | 19.8th %-ile
16µs - 32µs | * | 20.1th %-ile
32µs - 64µs | * | 20.2th %-ile
64µs - 128µs | *********** | 47.5th %-ile
128µs - 256µs | **************** | 85.7th %-ile
256µs - 512µs | ****** | 100.0th %-ile
------------------------------[ recv -> send ]------------------------------
mean: 336.4µs, p50: 344µs, p90: 352µs, p99: 360µs, p999: 360µs, max: 376µs, #: 146280
0µs - 1µs | * | 0.0th %-ile
1µs - 2µs | | 0.0th %-ile
2µs - 4µs | | 0.0th %-ile
4µs - 8µs | | 0.0th %-ile
8µs - 16µs | * | 0.0th %-ile
16µs - 32µs | | 0.0th %-ile
32µs - 64µs | | 0.0th %-ile
64µs - 128µs | | 0.0th %-ile
128µs - 256µs | | 0.0th %-ile
256µs - 512µs | **************************************** | 100.0th %-ile
The issue is still present (with some more variance in the results), suggesting that this is not a lock contention issue. recv -> send is much slower, but this is as expected -- with only one executor, receiving all the responses takes time.
I think that this specific artifact (at least on my PC) might be partly caused by the sleep.
If you replace std::thread::sleep with a busy loop that runs for ~10k nanoseconds, something like this:
let start = time.now();
loop {
let now = time.now();
self.4 += now; // self.4 is u64 in Svc, just to prevent possible optimizations
if now - start >= 10_000 {
break;
}
}
the latencies get several times lower.
If I measure the "duration" of the sleep:
let start = time.now();
std::thread::sleep(std::time::Duration::from_nanos(10_000));
println!("{}", time.now() - start);
I consistently get numbers around 60k nanoseconds, so ~6 times more than the sleep intended (I get this number consistently using Quanta, Instant and SystemClock). This might also explain the numbers that you got.
I get these mean send -> recv times on my machine:
Single-threaded runtime:
Multi-threaded runtime:
The 6x higher latency seems to be consistent with the actual additional time spent in sleep.
I'm not sure why sleep is so sloppy (at least on my system), a context switch should be several microseconds, maybe 10 us is just too fast for my system. I have noticed that with the sleep there are usually 2-3x more CPU migrations (perf stat -e cpu-migrations) than with the busy loop, but that might be an unrelated artifact.
I plan on investigating this in more detail soon, thanks for the additional info :+1:
I'm going to close this as the new scheduler improves this case.
Most helpful comment
I think that this specific artifact (at least on my PC) might be partly caused by the sleep.
If you replace
std::thread::sleepwith a busy loop that runs for ~10k nanoseconds, something like this:the latencies get several times lower.
If I measure the "duration" of the sleep:
I consistently get numbers around 60k nanoseconds, so ~6 times more than the sleep intended (I get this number consistently using
Quanta,InstantandSystemClock). This might also explain the numbers that you got.I get these mean
send -> recvtimes on my machine:Single-threaded runtime:
Multi-threaded runtime:
The 6x higher latency seems to be consistent with the actual additional time spent in sleep.
I'm not sure why
sleepis so sloppy (at least on my system), a context switch should be several microseconds, maybe 10 us is just too fast for my system. I have noticed that with thesleepthere are usually 2-3x more CPU migrations (perf stat -e cpu-migrations) than with the busy loop, but that might be an unrelated artifact.