Tokio: sync/runtime: high latency between mpsc send and receive

Created on 3 Jul 2019 · 3Comments · Source: tokio-rs/tokio

Version

tokio: 0.1.22
tokio-current-thread: 0.1.6
tokio-executor: 0.1.8
tokio-reactor: 0.1.9
tokio-sync: 0.1.6
tokio-threadpool: 0.1.15

Platform

Linux jerry 5.1.15-arch1-1-ARCH #1 SMP PREEMPT Tue Jun 25 04:49:39 UTC 2019 x86_64 GNU/Linux

Subcrates

Probably somewhere in the interaction between tokio-sync, tokio-executor, and tokio-reactor.

Description

I am using a bounded tokio-sync::mpsc channel to send requests from a number of clients (each with a Sender) to a single consumer (with a Receiver). The senders each only have one request outstanding at the time. With every request, they also include a oneshot::Sender for the receiver to return the response to the request on. The receiver is a spawned future that polls the receiver, and does some compute for every request before it returns the response on the included oneshot, and then goes on to handle the next request.

I have coded this little setup up in https://github.com/jonhoo/tower-buffer-benchmark/blob/master/src/channel.rs, along with some statistics tracking. In particular, this minimal example will run 150 senders for 10s, and then track the time from when a sender sent a request (after poll_ready returned Ready) until the receiver received that request, as well as the time from when the receiver send a response on the oneshot until the receiver sees that response.

With a service time of 10µs (see the sleep in impl Future for Svc), and a buffer length of 5, we would expect send -> recv time around 5*10µs (recall that we time from _after_ Sender::poll_ready returns). Instead, the results on a 40-core machine with two NUMA nodes are as follows:

------------------------------[ send -> recv ]------------------------------
mean: 347.9µs, p50: 352µs, p90: 368µs, p99: 385µs, p999: 401µs, max: 786µs, #: 142621
   0µs -    1µs |                                          |  0.0th %-ile
   1µs -    2µs |                                          |  0.0th %-ile
   2µs -    4µs |                                          |  0.0th %-ile
   4µs -    8µs |                                          |  0.0th %-ile
   8µs -   16µs |                                          |  0.0th %-ile
  16µs -   32µs |                                          |  0.0th %-ile
  32µs -   64µs | *                                        |  0.0th %-ile
  64µs -  128µs |                                          |  0.0th %-ile
 128µs -  256µs | *                                        |  0.0th %-ile
 256µs -  512µs | **************************************** | 100.0th %-ile
 512µs - 1024µs | *                                        | 100.0th %-ile

------------------------------[ recv -> send ]------------------------------
mean: 4.1µs, p50: 8µs, p90: 8µs, p99: 8µs, p999: 16µs, max: 245µs, #: 142621
   0µs -    1µs | **************************************** | 99.4th %-ile
   1µs -    2µs |                                          | 99.4th %-ile
   2µs -    4µs |                                          | 99.4th %-ile
   4µs -    8µs |                                          | 99.4th %-ile
   8µs -   16µs | *                                        | 100.0th %-ile
  16µs -   32µs | *                                        | 100.0th %-ile
  32µs -   64µs | *                                        | 100.0th %-ile
  64µs -  128µs | *                                        | 100.0th %-ile
 128µs -  256µs | *                                        | 100.0th %-ile

The results on my 8-HT laptop are:

------------------------------[ send -> recv ]------------------------------
mean: 368.7µs, p50: 385µs, p90: 385µs, p99: 409µs, p999: 442µs, max: 1556µs, #: 134878
   0µs -    1µs |                                          |  0.0th %-ile
   1µs -    2µs |                                          |  0.0th %-ile
   2µs -    4µs |                                          |  0.0th %-ile
   4µs -    8µs |                                          |  0.0th %-ile
   8µs -   16µs |                                          |  0.0th %-ile
  16µs -   32µs | *                                        |  0.0th %-ile
  32µs -   64µs |                                          |  0.0th %-ile
  64µs -  128µs | *                                        |  0.0th %-ile
 128µs -  256µs | *                                        |  0.0th %-ile
 256µs -  512µs | **************************************** | 100.0th %-ile
 512µs - 1024µs | *                                        | 100.0th %-ile
1024µs - 2048µs | *                                        | 100.0th %-ile

There is very clearly _something_ that causes the send/recv cycle to be significantly slower than expected.

If we set the service time to 0, the effect goes away:

------------------------------[ send -> recv ]------------------------------
mean: 17.8µs, p50: 24µs, p90: 32µs, p99: 40µs, p999: 49µs, max: 1212µs, #: 2708727
   0µs -    1µs | *                                        |  0.2th %-ile
   1µs -    2µs |                                          |  0.2th %-ile
   2µs -    4µs |                                          |  0.2th %-ile
   4µs -    8µs |                                          |  0.2th %-ile
   8µs -   16µs | *******************                      | 46.5th %-ile
  16µs -   32µs | *********************                    | 96.7th %-ile
  32µs -   64µs | **                                       | 100.0th %-ile
  64µs -  128µs | *                                        | 100.0th %-ile
 128µs -  256µs | *                                        | 100.0th %-ile
 256µs -  512µs | *                                        | 100.0th %-ile
 512µs - 1024µs | *                                        | 100.0th %-ile
1024µs - 2048µs | *                                        | 100.0th %-ile

This suggests that the issue manifests if the receiver does not yield for a time.

If we use the current thread runtime instead (with service time 10µs), we see:

------------------------------[ send -> recv ]------------------------------
mean: 137.1µs, p50: 139µs, p90: 270µs, p99: 278µs, p999: 286µs, max: 335µs, #: 146280
   0µs -    1µs | *****                                    | 11.8th %-ile
   1µs -    2µs |                                          | 11.8th %-ile
   2µs -    4µs |                                          | 11.8th %-ile
   4µs -    8µs |                                          | 11.8th %-ile
   8µs -   16µs | ****                                     | 19.8th %-ile
  16µs -   32µs | *                                        | 20.1th %-ile
  32µs -   64µs | *                                        | 20.2th %-ile
  64µs -  128µs | ***********                              | 47.5th %-ile
 128µs -  256µs | ****************                         | 85.7th %-ile
 256µs -  512µs | ******                                   | 100.0th %-ile

------------------------------[ recv -> send ]------------------------------
mean: 336.4µs, p50: 344µs, p90: 352µs, p99: 360µs, p999: 360µs, max: 376µs, #: 146280
   0µs -    1µs | *                                        |  0.0th %-ile
   1µs -    2µs |                                          |  0.0th %-ile
   2µs -    4µs |                                          |  0.0th %-ile
   4µs -    8µs |                                          |  0.0th %-ile
   8µs -   16µs | *                                        |  0.0th %-ile
  16µs -   32µs |                                          |  0.0th %-ile
  32µs -   64µs |                                          |  0.0th %-ile
  64µs -  128µs |                                          |  0.0th %-ile
 128µs -  256µs |                                          |  0.0th %-ile
 256µs -  512µs | **************************************** | 100.0th %-ile

The issue is still present (with some more variance in the results), suggesting that this is not a lock contention issue. recv -> send is much slower, but this is as expected -- with only one executor, receiving all the responses takes time.

Source

jonhoo

😕2 👍1

Most helpful comment

I think that this specific artifact (at least on my PC) might be partly caused by the sleep.

If you replace std::thread::sleep with a busy loop that runs for ~10k nanoseconds, something like this:

let start = time.now();
loop {
    let now = time.now();
    self.4 += now;    // self.4 is u64 in Svc, just to prevent possible optimizations
    if now - start >= 10_000 {
        break;
    }
}

the latencies get several times lower.

If I measure the "duration" of the sleep:

let start = time.now();
std::thread::sleep(std::time::Duration::from_nanos(10_000));
println!("{}", time.now() - start);

I consistently get numbers around 60k nanoseconds, so ~6 times more than the sleep intended (I get this number consistently using Quanta, Instant and SystemClock). This might also explain the numbers that you got.

I get these mean send -> recv times on my machine:

Single-threaded runtime:

busy loop: 22 us
sleep: 146 us (~6 times higher)

Multi-threaded runtime:

busy loop: 64 us
sleep: 392 us (~6 times higher)

The 6x higher latency seems to be consistent with the actual additional time spent in sleep.

I'm not sure why sleep is so sloppy (at least on my system), a context switch should be several microseconds, maybe 10 us is just too fast for my system. I have noticed that with the sleep there are usually 2-3x more CPU migrations (perf stat -e cpu-migrations) than with the busy loop, but that might be an unrelated artifact.

Kobzol on 28 Aug 2019

👍2

All 3 comments

I think that this specific artifact (at least on my PC) might be partly caused by the sleep.

If you replace std::thread::sleep with a busy loop that runs for ~10k nanoseconds, something like this:

let start = time.now();
loop {
    let now = time.now();
    self.4 += now;    // self.4 is u64 in Svc, just to prevent possible optimizations
    if now - start >= 10_000 {
        break;
    }
}

the latencies get several times lower.

If I measure the "duration" of the sleep:

let start = time.now();
std::thread::sleep(std::time::Duration::from_nanos(10_000));
println!("{}", time.now() - start);

I get these mean send -> recv times on my machine:

Single-threaded runtime:

busy loop: 22 us
sleep: 146 us (~6 times higher)

Multi-threaded runtime:

busy loop: 64 us
sleep: 392 us (~6 times higher)

The 6x higher latency seems to be consistent with the actual additional time spent in sleep.

Kobzol on 28 Aug 2019

👍2

I plan on investigating this in more detail soon, thanks for the additional info :+1:

carllerche on 28 Aug 2019

❤1

I'm going to close this as the new scheduler improves this case.

carllerche on 21 Dec 2019

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Missing doc. No found SO_REUSEADDR solution for TcpListener

heretic13 · 3Comments

std::io::Error → tokio::io::Error conversion

jplatte · 3Comments

tokio: introduce I/O utility helpers

carllerche · 5Comments

Better error from tokio::spawn() if no `Runtime` is running

djc · 4Comments

trace-core: Pre-0.1 docs pass

hawkw · 5Comments