tokio 0.1.11
tokio-codec 0.1.1
tokio-core 0.1.17
tokio-current-thread 0.1.3
tokio-executor 0.1.5
tokio-fs 0.1.4
tokio-io 0.1.10
tokio-openssl 0.2.1
tokio-pool 0.1.0
tokio-process 0.2.2
tokio-proto 0.1.1
tokio-reactor 0.1.6
tokio-service 0.1.0
tokio-signal 0.2.6
tokio-tcp 0.1.2
tokio-threadpool 0.1.8
tokio-timer 0.1.2
tokio-timer 0.2.7
tokio-tls 0.1.4
tokio-tls-api 0.1.20
tokio-udp 0.1.2
tokio-uds 0.1.7
tokio-uds 0.2.3
Linux, 64-bit.
tokio-threadpool
One of our tests shows a large performance regression (20x slowdown in p95 request latency; p75 is unchanged) with tokio-threadpool 0.1.8. Rolling just that crate back to 0.1.7 restores the original performance.
Cc stjepang
Any chance you can check master and provide a profile?
So the diff between these two versions is:
753336d threadpool: Arc instead of Inner in Notifier (#702)
adb0ba7 threadpool: worker threads shouldn't respect keep_alive (#692)
e27b0a4 threadpool: spawn new tasks onto a random worker (#683)
d35d051 runtime: create reactor per worker (#660)
The first commit just eliminates some contention on task notification and should improve performance in all cases.
The second one is more of a bugfix and should not affect performance.
The third commit spawns new tasks onto random workers (rather than the current worker). I can imagine this hurting performance in some cases, but should overall be a win because it tends to distribute work among threads more equally. This commit could be the problem here, though. We should check.
The fourth commit is a big architectural change. Rather than having a single global reactor running in a dedicated thread, every worker is driving its own reactor in its own thread during idle time and occasionally between "task context switches" (the function that does this is called "sleep_light").
@jsgf My bet is on one of the last two commits being the culprit (e27b0a4 or d35d051). Can you help us figure out which one it is?
@stjepang My guess was the random worker, but without any real basis. I agree the last two seem to be the most likely. I'll try dropping each out to see the effect (are they independent?).
@carllerche Not sure about profiling; it's a p95 latency tail of a specific measurement in a much more complex system, so it might be hard to extract a good signal (esp if it's not CPU-bound).
@jsgf what does a 20x perf regression translate to in terms of ms?
@jsgf
I'll try dropping each out to see the effect (are they independent?).
They should be independent. The fourth commit is big, but the third is tiny (see https://github.com/tokio-rs/tokio/pull/683/files - the only real change is in src/pool/mod.rs and src/sender.rs).
It looks like it is random worker placement - rolling back just that change completely restores the performance. I have some nice graphs here; let me work out how to show you.
Here you are:
https://gist.github.com/jsgf/d916de2ce6107fb09da2458630541077
@jsgf Thanks for the graphs!
So the reason why we spawn tasks onto random workers is to distribute tasks among workers (and IO handles among reactors) as equally as possible. tokio-io-pool spawns tasks in a round-robin fashion, which should have a similar effect.
Perhaps we should revert https://github.com/tokio-rs/tokio/pull/683 until we figure out a better strategy, but a question before we do that: Do you have any insight why spawning tasks onto random workers is not preferrable in your scenario?
What we need to do is get tokio-trace landed so we can instrument the various points in the pool to try to debug this...
No idea yet why there's so much impact here. We'll need to do a code audit to make sure there isn't anything obviously broken in the design but better tools would make it a lot easier, so +1 to tokio-tracing.
Does 0.1.9 have a fix for this?
@jsgf I don't think so, but if you were testing with 0.1.8, it would be worth testing with 0.1.9 as well.
0.1.9 has basically the same problem. I'm trying again with random placement reverted.
@jsgf your project is OSS no? Is there any way to get a repro even if it is in a large system?
The revert is still useful. I'll make a PR.
@jsgf your project is OSS no? Is there any way to get a repro even if it is in a large system?
It's not usefully OSS right now. You can see a lot of the code at https://github.com/facebookexperimental/mononoke but it isn't buildable with standard tools, and doesn't have code for all the pieces you'd need (it uses bits of FB infra which we need to provide non-FB alternatives for).
We could revert the random placement change. Although, I worry that the spawn-on-current-worker strategy might disproportionately put a lot of pressure on a single reactor while others get underutilized.
But maybe that's okay in typical real-world scenarios? Perhaps the benefits of balanced work among reactors get outweighted by the drawbacks of underutilized cache locality? I don't know - could be true.
Anyways, I'd be okay with reverting that change until we understand the problem a bit better.
@carllerche Thoughts?
I really would like to understand what caused the regression so we can fix it. As mentioned, the issue with reverting is it will put pressure on the single reactor...
@jsgf Is there any way for you to tell if the latency is caused because of delay between a task being spawned and the first poll or not? i.e is the latency in the spawn process or is it during task execution.
@jsgf are you able to ping us on Gitter?
@jsgf Can you verify whether https://github.com/tokio-rs/tokio/pull/784 fixes your latency spikes?
I ran a test, but realized I'd tested with non-random placement (which shows a 2x regression in p95 latency vs 0.1.7).
I'll re-run with a more meaningful test.
Update: this change seems to completely mitigate the random placement regression.
/cc @stjepang
Looks like the stealing change worked?
@jsgf Awesome, thank you for running a test!
We now know how to resolve the problem, I'll just have to write a proper PR instead of the simple hack in #784.
@stjepang the real fix is switching to a single bounded mpmc queue per worker with an overflow segqueue is the strategy?
@carllerche Correct. Alternatively, we could use unbounded mpmc queues, but bounded queues would be quite a bit faster.
@carllerche Sorry, I misunderstood what Go does. It has bounded queues, but they're SPMC, not MPMC. Our unbounded crossbeam-deques are perfectly fine, but Go additionally has a single global MPMC queue. That is what we should do, too.
@stjepang why are Go's bounded queues spmc? It seems like a bounded mpmc is a perfect fit here?
@carllerche I'm not 100% sure, but my guess would be because the "steal-half" (aka steal_many) operation is more complicated and slower with MPMC queues.
@stjepang I just realized I read it wrong... I read "mpsc" and not spmc... that makes more sense. (mpsc makes no sense which is why I was confused).
Most helpful comment
I ran a test, but realized I'd tested with non-random placement (which shows a 2x regression in p95 latency vs 0.1.7).
I'll re-run with a more meaningful test.
Update: this change seems to completely mitigate the random placement regression.