Tokio: Lost wakeups in threadpool

Created on 7 Aug 2018 · 35Comments · Source: tokio-rs/tokio

It has been reported (https://github.com/rust-lang-nursery/futures-rs/issues/1170) that there have been lost wake ups.

Downgrading to threadpool 0.1.4 reportedly solves the problem. #459 is the only meaningful change.

Source

carllerche

Most helpful comment

I have a potential fix for this issue. Doing some testing before I open a PR.

hawkw on 27 Aug 2018

🎉3

All 35 comments

After thinking a bunch about the description of the problem, I don't believe that the bug being fixed is real. From the PR:

When unparking a worker thread, we do the following steps:

1) Try transitioning IDLE -> NOTIFY.
2) Lock the mutex.
3) Try transitioning SLEEP -> NOTIFY and wakeup on success.

The problem occurs if the thread transitions from SLEEP to IDLE between steps 1 and 2. In that case, step 3 will fail to transition and simply return from unpark.

However, this case would not be a problem as the sleeping thread that transitions from SLEEP -> IDLE will establish a happens before relationship with the notifying thread (due to the SeqCst ordering). So, when the sleeping thread wakes up, it is guaranteed to see the change that the notifying thread is notifying for. This requires SeqCst and would not be true w/ acquire / release.

That said, that doesn't mean the PR introduced a bug...

carllerche on 7 Aug 2018

This line is suspicious to me now... it isn't acquiring anything from notifier threads...

It probably should be a swap.

Edit: This line too: https://github.com/tokio-rs/tokio/blob/0490280d662f000aff674593cc9a4f69a1cd1171/tokio-threadpool/src/park/default_park.rs#L138

carllerche on 7 Aug 2018

This line is suspicious to me now... it isn't acquiring anything from notifier threads...

Well, it's consuming the notification and going back to the IDLE state. The worst that could happen is a spurious wakeup. std::thread::park has the same store operation.

Edit: This line too:

std::thread::park uses a cmpxchg operation here, but again - the worst case scenario is just a spurious wakeup.

stjepang on 7 Aug 2018

@stjepang that isn't the issue. The issue is that there is no acquire ordering established with this line.

Acquire is needed to get access to the change that the unpark thread is notifying for.

carllerche on 7 Aug 2018

That said, on x86 assuming mfence is used it would be ok...

carllerche on 7 Aug 2018

This code should be arch-independent which makes it hard to analyze. On x86 locked stores are totally ordered WRT locked stores and loads from the same location, so it's reasonably easy to get right.

I'm assuming that because everything is SeqCst then that also applies to other architectures with weaker ordering rules.

(The heavy use of compare_and_swap makes me wonder whether the "check state outside of the lock" is actually worthwhile - how often do the "park when already notified" and "unpark while not sleeping" cases get hit? Depending on how mutex is implemented, the cmpxchg might be more expensive, esp if the lock is not contended.)

jsgf on 7 Aug 2018

@jsgf I understand that it _should_ be arch dependent... however, I'm trying to figure out if there currently is a bug on x86.

Either way, the issues pointed out here should be fixed.

carllerche on 7 Aug 2018

Getting a possibly-relevant deadlock in https://github.com/remexre/garlichat ; cd garlichat && cargo run, the first character typed prints in the top region (as it should), but further characters are lost. (Note: the terminal is put into raw mode, so it's necessary to kill the garlichat process.)

eprintln!() shows the getch-ing thread receives additional characters on keypresses, but the event loop doesn't get woken up.

EDIT: unbounded_send is here, corresponding poll is here.

remexre on 10 Aug 2018

I am debugging this case.

carllerche on 10 Aug 2018

It is looking like the issue in garlichat is not related :(

carllerche on 10 Aug 2018

I believe I have a potential reproduction for this issue: https://github.com/hawkw/tokio-deadlock-repro. I've written a simple program which spawns a task that sends numbers into an MPSC, and then repeatedly blocks on reading values out of the channel. After a varying number of calls to Runtime::block_on, this program will hang. This occurs with both the current master, and tokio 0.1.7.

I've tested this reproduction on macOS 10.12.6 with rustc 1.28.0.

hawkw on 22 Aug 2018

❤1 👍1

@stjepang Are you able to investigate this repro? ^^

carllerche on 23 Aug 2018

If it's useful, here's logs from the threadpool prior to and while it hits (what I believe is) the deadlock:

Logs

 DEBUG deadlock_repro            > Some(246) received
 TRACE tokio_threadpool::task    >     -> task complete
 TRACE tokio_threadpool::worker  > task complete; state=pool::State { lifecycle: Running, num_futures: 1 }
 INFO  deadlock_repro            > 246th block_on finished
 TRACE tokio_threadpool::worker  > try_steal_task -- signal_work; self=2; from=1
 TRACE tokio_threadpool::sender  > execute; count=2
 TRACE tokio_threadpool::pool    >   -> submitting to random; idx=3
 TRACE tokio_threadpool::worker  > Worker::check_run_state; delegate signal
 TRACE tokio_threadpool::worker  > found work while draining; signal_work
 TRACE tokio_threadpool::task    > Task::run; state=Running
 DEBUG deadlock_repro            > Some(247) received
 TRACE tokio_threadpool::task    >     -> task complete
 TRACE tokio_threadpool::worker  > task complete; state=pool::State { lifecycle: Running, num_futures: 1 }
 INFO  deadlock_repro            > 247th block_on finished
 TRACE tokio_threadpool::sender  > execute; count=2
 TRACE tokio_threadpool::pool    >   -> submitting to random; idx=0
 TRACE tokio_threadpool::worker  > Worker::check_run_state; delegate signal
 TRACE tokio_threadpool::worker  > found work while draining; signal_work
 TRACE tokio_threadpool::task    > Task::run; state=Running
 DEBUG deadlock_repro            > Some(248) received
 TRACE tokio_threadpool::task    >     -> task complete
 INFO  deadlock_repro            > 248th block_on finished
 TRACE tokio_threadpool::worker  > task complete; state=pool::State { lifecycle: Running, num_futures: 1 }
 TRACE tokio_threadpool::sender  > execute; count=2
 TRACE tokio_threadpool::pool    >   -> submitting to random; idx=7
 TRACE tokio_threadpool::worker  > Worker::sleep; worker=WorkerId(6)
 TRACE tokio_threadpool::worker  >   sleeping -- push to stack; idx=6
 TRACE tokio_threadpool::worker  > Worker::sleep; worker=WorkerId(4)
 TRACE tokio_threadpool::worker  >     -> starting to sleep; idx=6
 TRACE tokio_threadpool::worker  > Worker::sleep; worker=WorkerId(5)
 TRACE tokio_threadpool::worker  >   sleeping -- push to stack; idx=4
 TRACE tokio_threadpool::worker  > Worker::sleep; worker=WorkerId(1)
 TRACE tokio_threadpool::worker  > Worker::sleep; worker=WorkerId(2)
 TRACE tokio_threadpool::worker  > Worker::sleep; worker=WorkerId(3)
 TRACE tokio_threadpool::worker  > Worker::sleep; worker=WorkerId(0)
 TRACE tokio_threadpool::worker  >   sleeping -- push to stack; idx=5
 TRACE tokio_threadpool::worker  >     -> starting to sleep; idx=4
 TRACE tokio_threadpool::worker  >   sleeping -- push to stack; idx=1
 TRACE tokio_threadpool::worker  >   sleeping -- push to stack; idx=2
 TRACE tokio_threadpool::worker  >   sleeping -- push to stack; idx=3
 TRACE tokio_threadpool::worker  >   sleeping -- push to stack; idx=0
 TRACE tokio_threadpool::worker  >     -> starting to sleep; idx=5
 TRACE tokio_threadpool::worker  >     -> starting to sleep; idx=1
 TRACE tokio_threadpool::worker  >     -> starting to sleep; idx=2
 TRACE tokio_threadpool::worker  >     -> starting to sleep; idx=0
 TRACE tokio_threadpool::worker  >     -> starting to sleep; idx=3

After that point, it doesn't log anything else.

hawkw on 23 Aug 2018

@hawkw Are you running --release or just normal build? What kind of numbers of iterations do you see before a hang?

With:

RUST_LOG=info cargo run --release

I'm not able to repo after 1.2M block_on calls. (Rust 1.28, Linux, 32 cores.)

jsgf on 24 Aug 2018

@jsgf I've been running this in debug mode primarily, and I see it hang typically after <500 block_ons. That's interesting to hear, I'll see if it still hangs in release mode.

hawkw on 24 Aug 2018

Hmm, I'm still seeing it hang in release mode, after anywhere from between 350-2.5k block_ons...

hawkw on 24 Aug 2018

@hawkw can you try bisecting this?

carllerche on 24 Aug 2018

👍1

I think I've root-caused at least some of my hang to a deadlock between backtrace and jemalloc heap profiling (specifically, dl_iterate_phdr). I'm making some changes to confirm, and I'm not sure if it completely explains what we're seeing.

jsgf on 27 Aug 2018

👍2

@carllerche I did a bisect with my reproduction, and it appears that the first bad commit is 990186ec9d396a3cd2111c934d4dfc81d2aa866f ("Optimize spinning in Worker::run"). I haven't looked _too_ closely at that change, but it certainly seems plausible that it introduced the issue, based on what changed.

/cc @stjepang

hawkw on 27 Aug 2018

👍1

I have a potential fix for this issue. Doing some testing before I open a PR.

hawkw on 27 Aug 2018

🎉3

I've done some additional testing, and it looks like the issue I've reproduced and fixed is macOS-specific. So, if the issue reported in rust-lang-nursery/futures-rs#1170 was observed on Linux, there's probably two+ unrelated hangs going on here...

hawkw on 28 Aug 2018

That code doesn't look macOS specific - is there a difference in the behaviour of thread::yield_now() or something that makes it so?

jsgf on 28 Aug 2018

I think this dates back even further? Or is.... Unrelated in a way? :confused:

Locally I pulled 9aaa8f0 (going back further wont compile), and altered @hawkw code to do the oneshot itself (block_on doesn't exist) and this still reproduces (Linux Debian 4.9.110). This will reproduce (debug build) sometimes as early as the 1st/2nd iteration.

One funny thing I am noticing however. with unbounded() this reproduces right away. if i use channel(n) it runs totally fine... As you increase the channel size (i did 500K up to 25M) you will notice a pause and then it will continue running just fine. This pause seemingly happening right ~ the number i commonly see unbounded() stop. The larger the channel bound, the longer the pause. 10M ~ 5s, 25M ~ 10s.

Is it possible the send_all is _taking over_ or something? If you run with unbounded() you will see the memory usage continues to grow even after it _hangs_ leading me to believe the process isn't totally hung, and the sender is still sending/loading the channel?, but for some reason the new oneshot/block_on isn't getting called? But if the channel reaches its max (i am assuming) then it triggers it to end up getting called anyway?

I am new to rust so I am not sure how much, if this helps at all. The ticket intrigues me :)

zacheryph on 28 Aug 2018

I hope this will be helpful, even if I'm not quite sure yet about what the best way to solve this is
(or how many different underlying bugs there are regarding this issue).

Reproduction

To reproduce this easier:

remove the thread::yield
tune MAX_SPINS according to your OS scheduler and operating system, e.g. for windows I set it to 1000000000 to get a lost wakeup after the first iteration consistently
set threadpool pool_size for the tokio runtime to e.g. 2 (makes debugging easier)

Trace Output

 TRACE mio::poll > registering with poller
 TRACE tokio_threadpool::builder > build; num-workers=2
 DEBUG tokio_reactor::background > starting background reactor
 TRACE mio::sys::windows::selector > select; timeout=None
 TRACE tokio_threadpool::sender    > execute; count=1
 TRACE mio::sys::windows::selector > polling IOCP
 TRACE tokio_threadpool::pool      > submit to existing worker; idx=1; state=worker::State { lifecycle: Shutdown, is_pushed: true }
 TRACE tokio_threadpool::sender    > execute; count=2
 TRACE tokio_threadpool::worker    > found work while draining; signal_work
 TRACE tokio_threadpool::pool      > submit to existing worker; idx=0; state=worker::State { lifecycle: Shutdown, is_pushed: true }
 TRACE tokio_threadpool::worker    > try run WorkerId(1)
 TRACE tokio_threadpool::task      > Task::run; state=Running
 TRACE tokio_threadpool::worker    > found work while draining; signal_work
 TRACE tokio_threadpool::task      > Task::run; state=Running
 DEBUG deadlock_repro              > Some(1) received
 TRACE tokio_threadpool::task      >     -> task complete
 INFO  deadlock_repro              > 1th block_on finished
 TRACE tokio_threadpool::worker    > task complete; state=pool::State { lifecycle: Running, num_futures: 1 }
 TRACE tokio_threadpool::sender    > execute; count=2
 TRACE tokio_threadpool::pool      >   -> submitting to random; idx=0
 TRACE tokio_threadpool::worker    > Worker::check_run_state; delegate signal
 TRACE tokio_threadpool::worker    > found work while draining; signal_work
 TRACE tokio_threadpool::task      > Task::run; state=Running
 DEBUG deadlock_repro              > Some(2) received
 TRACE tokio_threadpool::task      >     -> task complete
 INFO  deadlock_repro              > 2th block_on finished
 TRACE tokio_threadpool::sender    > execute; count=2
 TRACE tokio_threadpool::pool      >   -> submitting to random; idx=1

Diagnosis

It actually isn't a deadlock but just unfortunate:

We submit to a random worker, which is worker 1 here /1/ .
This worker happens to also run the send_all of the writer task.
The task send_all is currently running and never stops, since data is always ready.
This leads to not exiting this loop here ever.
This is why the worker loop in tokio is never run (again), it never stops running this one task.
Now you might ask, but why do the other tasks not steal this one?
Well the task is still stuck in the .inbound (external) queue (it was pushed there in /1/) of the worker, but stealing only works from the .worker queue (internal).

steffengy on 28 Aug 2018

❤1

Thanks everyone for digging into the problem, this is super helpful! <3

I can confirm that @hawkw's code reproduces the issue on my Linux machine.

The diagnosis by @steffengy seems like a legitimate problem. I've tried replacing inbound per-worker queues with a single globally shared queue of tasks so that spawned tasks never get stuck in a busy worker's inbound queue. However, the bug issue was still reproducing.

After that I realized it is possible for a task to be spawned just before a worker pushes itself onto the sleeper stack and goes to sleep. That way the thread that spawned the task thinks someone will pick it up while the thread going to sleep is convinced there are no tasks available. We need an additional check just after the thread is pushed onto the sleeper stack. After adding the check I can no longer reproduce the issue.

The changes in question are submitted in #621. I hope this also fixes the originally reported problems, but we'll see.

stjepang on 7 Sep 2018

👍1

@stjepang

I can confirm that @hawkw's code reproduces the issue on my Linux machine.

Now _that_'s interesting --- I was only able to get that repro code to hang on macOS. Curious! It's possible I didn't try for long enough on Linux.

hawkw on 7 Sep 2018

@hawkw It hangs pretty much instantly in both debug and release mode with either 2 worker threads or 4 threads - doesn't really matter. So it's pretty consistent.

Sometimes it even hangs at the second block_on:

tokio-deadlock-repro master $ cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.06s
     Running `target/debug/deadlock-repro`
DEBUG 2018-09-07T16:53:48Z: deadlock_repro: Some(1) received
 INFO 2018-09-07T16:53:48Z: deadlock_repro: 1th block_on finished

stjepang on 7 Sep 2018

@stjepang huh, interesting! I haven't seen it hang that fast even on my Mac. I'll run it again on Linux and see what happens.

I'm also going to try running against your changes in #621 to make sure you've fixed the same issues I've been seeing. It seems like there could possibly be one or more platform-specific issues as well, so I want to try and rule that out.

hawkw on 7 Sep 2018

@stjepang It sounds like there are two separate. The first is the behavior described by @steffengy and the second an actual bug with submitting tasks to workers.

Is it possible to split up the two into separate PRs?

I'm not necessarily against the change, but a task that does not yield back to the executor is a rogue task (if there is only one thread, there is nothing to do). Also, this change would go against the goal of #424.

So, if possible, lets fix the bug you described. If it is broken into a dedicated PR, it will be easier for me to understand and review. Then, we can discuss moving to a single inbound queue separately.

carllerche on 7 Sep 2018

As I mentioned in https://github.com/tokio-rs/tokio/pull/621#pullrequestreview-153481767, I've tested @stjepang's fix against my repro, and it appears to fix the problem on my Mac as well. :)

hawkw on 7 Sep 2018

@hawkw were you able to repro the original issue using this code? https://github.com/facebookexperimental/mononoke/blob/06f619717f69cceeb856533bf57d79ead38c2597/cmds/blobimport.rs#L102

@StanislavGlebik are you still seeing the deadock? If so, we could use help reproducing it.

carllerche on 18 Sep 2018

@carllerche No, but IIRC, I didn't spend as much time trying to repro with mononoke as I did trying to write a minimal repro --- I can make another attempt to repro with that code.

hawkw on 18 Sep 2018

Also @jsgf, could you help repro the issue (if it exists) with mononoke?

carllerche on 18 Sep 2018

@carllerche The only deadlocks we've seen lately is a lock-ordering deadlock between the backtrace crate and jemalloc heap profiling. That could have been our deadlock all along.

jsgf on 19 Sep 2018

👍1

Ok, I'm going to close this issue since, as far as we know, there is no bug.

@sfackler if you find that you are still having problems and believe that it is due to the Tokio pool, please open a new issue with your findings and reference this one for context.

carllerche on 20 Sep 2018

👍2

Was this page helpful?

0 / 5 - 0 ratings