Tokio: Performance regression in tokio-threadpool 0.1.8

Created on 13 Nov 2018 · 27Comments · Source: tokio-rs/tokio

Version

tokio 0.1.11
tokio-codec 0.1.1
tokio-core 0.1.17
tokio-current-thread 0.1.3
tokio-executor 0.1.5
tokio-fs 0.1.4
tokio-io 0.1.10
tokio-openssl 0.2.1
tokio-pool 0.1.0
tokio-process 0.2.2
tokio-proto 0.1.1
tokio-reactor 0.1.6
tokio-service 0.1.0
tokio-signal 0.2.6
tokio-tcp 0.1.2
tokio-threadpool 0.1.8
tokio-timer 0.1.2
tokio-timer 0.2.7
tokio-tls 0.1.4
tokio-tls-api 0.1.20
tokio-udp 0.1.2
tokio-uds 0.1.7
tokio-uds 0.2.3

Platform

Linux, 64-bit.

Subcrates

tokio-threadpool

Description

One of our tests shows a large performance regression (20x slowdown in p95 request latency; p75 is unchanged) with tokio-threadpool 0.1.8. Rolling just that crate back to 0.1.7 restores the original performance.

Source

jsgf

Most helpful comment

I ran a test, but realized I'd tested with non-random placement (which shows a 2x regression in p95 latency vs 0.1.7).

I'll re-run with a more meaningful test.

Update: this change seems to completely mitigate the random placement regression.

jsgf on 2 Dec 2018

🎉2

All 27 comments

Cc stjepang

Any chance you can check master and provide a profile?

carllerche on 13 Nov 2018

So the diff between these two versions is:

753336d threadpool: Arc instead of Inner in Notifier (#702)
adb0ba7 threadpool: worker threads shouldn't respect keep_alive (#692)
e27b0a4 threadpool: spawn new tasks onto a random worker (#683)
d35d051 runtime: create reactor per worker (#660)

The first commit just eliminates some contention on task notification and should improve performance in all cases.

The second one is more of a bugfix and should not affect performance.

The third commit spawns new tasks onto random workers (rather than the current worker). I can imagine this hurting performance in some cases, but should overall be a win because it tends to distribute work among threads more equally. This commit could be the problem here, though. We should check.

The fourth commit is a big architectural change. Rather than having a single global reactor running in a dedicated thread, every worker is driving its own reactor in its own thread during idle time and occasionally between "task context switches" (the function that does this is called "sleep_light").

@jsgf My bet is on one of the last two commits being the culprit (e27b0a4 or d35d051). Can you help us figure out which one it is?

stjepang on 13 Nov 2018

@stjepang My guess was the random worker, but without any real basis. I agree the last two seem to be the most likely. I'll try dropping each out to see the effect (are they independent?).

@carllerche Not sure about profiling; it's a p95 latency tail of a specific measurement in a much more complex system, so it might be hard to extract a good signal (esp if it's not CPU-bound).

jsgf on 13 Nov 2018

@jsgf what does a 20x perf regression translate to in terms of ms?

carllerche on 13 Nov 2018

@jsgf

I'll try dropping each out to see the effect (are they independent?).

They should be independent. The fourth commit is big, but the third is tiny (see https://github.com/tokio-rs/tokio/pull/683/files - the only real change is in src/pool/mod.rs and src/sender.rs).

stjepang on 13 Nov 2018

It looks like it is random worker placement - rolling back just that change completely restores the performance. I have some nice graphs here; let me work out how to show you.

Here you are:
https://gist.github.com/jsgf/d916de2ce6107fb09da2458630541077

jsgf on 14 Nov 2018

❤1

@jsgf Thanks for the graphs!

So the reason why we spawn tasks onto random workers is to distribute tasks among workers (and IO handles among reactors) as equally as possible. tokio-io-pool spawns tasks in a round-robin fashion, which should have a similar effect.

Perhaps we should revert https://github.com/tokio-rs/tokio/pull/683 until we figure out a better strategy, but a question before we do that: Do you have any insight why spawning tasks onto random workers is not preferrable in your scenario?

stjepang on 21 Nov 2018

What we need to do is get tokio-trace landed so we can instrument the various points in the pool to try to debug this...

carllerche on 21 Nov 2018

No idea yet why there's so much impact here. We'll need to do a code audit to make sure there isn't anything obviously broken in the design but better tools would make it a lot easier, so +1 to tokio-tracing.

jsgf on 22 Nov 2018

Does 0.1.9 have a fix for this?

jsgf on 29 Nov 2018

@jsgf I don't think so, but if you were testing with 0.1.8, it would be worth testing with 0.1.9 as well.

carllerche on 29 Nov 2018

0.1.9 has basically the same problem. I'm trying again with random placement reverted.

jsgf on 29 Nov 2018

@jsgf your project is OSS no? Is there any way to get a repro even if it is in a large system?

carllerche on 29 Nov 2018

The revert is still useful. I'll make a PR.

@jsgf your project is OSS no? Is there any way to get a repro even if it is in a large system?

It's not usefully OSS right now. You can see a lot of the code at https://github.com/facebookexperimental/mononoke but it isn't buildable with standard tools, and doesn't have code for all the pieces you'd need (it uses bits of FB infra which we need to provide non-FB alternatives for).

jsgf on 29 Nov 2018

We could revert the random placement change. Although, I worry that the spawn-on-current-worker strategy might disproportionately put a lot of pressure on a single reactor while others get underutilized.

But maybe that's okay in typical real-world scenarios? Perhaps the benefits of balanced work among reactors get outweighted by the drawbacks of underutilized cache locality? I don't know - could be true.

Anyways, I'd be okay with reverting that change until we understand the problem a bit better.

@carllerche Thoughts?

stjepang on 29 Nov 2018

I really would like to understand what caused the regression so we can fix it. As mentioned, the issue with reverting is it will put pressure on the single reactor...

@jsgf Is there any way for you to tell if the latency is caused because of delay between a task being spawned and the first poll or not? i.e is the latency in the spawn process or is it during task execution.

carllerche on 29 Nov 2018

@jsgf are you able to ping us on Gitter?

carllerche on 29 Nov 2018

@jsgf Can you verify whether https://github.com/tokio-rs/tokio/pull/784 fixes your latency spikes?

stjepang on 29 Nov 2018

I ran a test, but realized I'd tested with non-random placement (which shows a 2x regression in p95 latency vs 0.1.7).

I'll re-run with a more meaningful test.

Update: this change seems to completely mitigate the random placement regression.

jsgf on 2 Dec 2018

🎉2

/cc @stjepang

Looks like the stealing change worked?

tobz on 5 Dec 2018

@jsgf Awesome, thank you for running a test!

We now know how to resolve the problem, I'll just have to write a proper PR instead of the simple hack in #784.

stjepang on 7 Dec 2018

@stjepang the real fix is switching to a single bounded mpmc queue per worker with an overflow segqueue is the strategy?

carllerche on 7 Dec 2018

@carllerche Correct. Alternatively, we could use unbounded mpmc queues, but bounded queues would be quite a bit faster.

stjepang on 8 Dec 2018

@carllerche Sorry, I misunderstood what Go does. It has bounded queues, but they're SPMC, not MPMC. Our unbounded crossbeam-deques are perfectly fine, but Go additionally has a single global MPMC queue. That is what we should do, too.

stjepang on 15 Dec 2018

@stjepang why are Go's bounded queues spmc? It seems like a bounded mpmc is a perfect fit here?

carllerche on 17 Dec 2018

@carllerche I'm not 100% sure, but my guess would be because the "steal-half" (aka steal_many) operation is more complicated and slower with MPMC queues.

stjepang on 18 Dec 2018

@stjepang I just realized I read it wrong... I read "mpsc" and not spmc... that makes more sense. (mpsc makes no sense which is why I was confused).

carllerche on 18 Dec 2018

Was this page helpful?

0 / 5 - 0 ratings

Related issues

async fn and async block not support task::current().id(),how to get Coroutines id?

zhuxiujia · 3Comments

Consider matching `std::process` behavior

carllerche · 4Comments

tokio-tls: programmatic control over certificate validation

izolyomi · 5Comments

async-await-preview feature missing in master.

peterhuene · 4Comments

Consider exposing simulation's types in Tokio

davidbarsky · 3Comments