When messing around with Bevy I tried to run the breakout example on a single core. In debug mode this crashes with an subtraction overflow.
Specifically:
$ RUST_BACKTRACE=full taskset --cpu-list 1 cargo run --example breakout
Finished dev [unoptimized + debuginfo] target(s) in 0.09s
Running `target/debug/examples/breakout`
thread 'main' panicked at 'attempt to subtract with overflow', crates/bevy_app/src/task_pool_options.rs:122:13
stack backtrace:
0: 0x564228b6d1d5 - backtrace::backtrace::libunwind::trace::h14d338b30b3ea0a7
at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.46/src/backtrace/libunwind.rs:86
1: 0x564228b6d1d5 - backtrace::backtrace::trace_unsynchronized::h73ea91d74a3fd67f
at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.46/src/backtrace/mod.rs:66
2: 0x564228b6d1d5 - std::sys_common::backtrace::_print_fmt::hd42948c952866e12
at src/libstd/sys_common/backtrace.rs:78
3: 0x564228b6d1d5 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::ha8f928866ff7571e
at src/libstd/sys_common/backtrace.rs:59
4: 0x564228b992dc - core::fmt::write::he0c1e5f7426d2718
at src/libcore/fmt/mod.rs:1076
5: 0x564228b69382 - std::io::Write::write_fmt::hf3afc6cfd57d0033
at src/libstd/io/mod.rs:1537
6: 0x564228b6fe70 - std::sys_common::backtrace::_print::hfc0110703f3696fd
at src/libstd/sys_common/backtrace.rs:62
7: 0x564228b6fe70 - std::sys_common::backtrace::print::h3f77c6990ddfaa22
at src/libstd/sys_common/backtrace.rs:49
8: 0x564228b6fe70 - std::panicking::default_hook::{{closure}}::heae49580a8d62d75
at src/libstd/panicking.rs:198
9: 0x564228b6fbbc - std::panicking::default_hook::hecc34e3f729e213c
at src/libstd/panicking.rs:217
10: 0x564228b704b3 - std::panicking::rust_panic_with_hook::he82f5d0644692441
at src/libstd/panicking.rs:526
11: 0x564228b700ab - rust_begin_unwind
at src/libstd/panicking.rs:437
12: 0x564228b96f61 - core::panicking::panic_fmt::h09c929f06bb87c98
at src/libcore/panicking.rs:85
13: 0x564228b96ead - core::panicking::panic::h7ece43057e5422d4
at src/libcore/panicking.rs:50
14: 0x564228a2a8b0 - bevy_app::task_pool_options::DefaultTaskPoolOptions::create_default_pools::habcc68afd5fb0990
at crates/bevy_app/src/task_pool_options.rs:122
15: 0x564228a0dc61 - bevy_app::app::App::run::h6e074714d77a3dba
at crates/bevy_app/src/app.rs:67
16: 0x564228a22b65 - bevy_app::app_builder::AppBuilder::run::h67c1d0cf83da2c83
at crates/bevy_app/src/app_builder.rs:43
17: 0x564227702255 - breakout::main::h414756007391d499
at examples/game/breakout.rs:9
18: 0x56422770c6cb - std::rt::lang_start::{{closure}}::h89900d1c279f354c
at /rustc/04488afe34512aa4c33566eb16d8c912a3ae04f9/src/libstd/rt.rs:67
19: 0x564228b70883 - std::rt::lang_start_internal::{{closure}}::h5d3ea623498f5f43
at src/libstd/rt.rs:52
20: 0x564228b70883 - std::panicking::try::do_call::hac65e71be769a440
at src/libstd/panicking.rs:348
21: 0x564228b70883 - std::panicking::try::hd4706e264bcf6712
at src/libstd/panicking.rs:325
22: 0x564228b70883 - std::panic::catch_unwind::h948a0fb4a8b3ee82
at src/libstd/panic.rs:394
23: 0x564228b70883 - std::rt::lang_start_internal::h72cc068ed2d0ac53
at src/libstd/rt.rs:51
24: 0x56422770c6a7 - std::rt::lang_start::h805844f9977a4781
at /rustc/04488afe34512aa4c33566eb16d8c912a3ae04f9/src/libstd/rt.rs:67
25: 0x564227704eca - main
26: 0x7f745c198152 - __libc_start_main
27: 0x5642276f57ce - _start
28: 0x0 - <unknown>
While the crash only happens on master and not with the stable release (which makes me suspect it is related to the recent task queue replacing rayon), the example simply hangs on 2 or 3 cores as well.
Breakout works for me on two cores with commit db8ec7d5 and breaks on the next commit 17e76426.
This was introduced by PR 384.
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 39 bits physical, 48 bits virtual
CPU(s): 2
On-line CPU(s) list: 0,1
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 60
Model name: Intel(R) Pentium(R) CPU G3420 @ 3.20GHz
Stepping: 3
CPU MHz: 3137.792
CPU max MHz: 3200.0000
CPU min MHz: 800.0000
BogoMIPS: 6400.69
Virtualization: VT-x
L1d cache: 64 KiB
L1i cache: 64 KiB
L2 cache: 512 KiB
L3 cache: 3 MiB
NUMA node0 CPU(s): 0,1
Vulnerability Itlb multihit: KVM: Mitigation: Split huge pages
Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT disabled
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopolo
gy nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm pcid sse4_1 sse4_2 movbe popcnt tsc_deadline_timer xsave rdrand lahf_lm abm cpuid_fault invpcid_sin
gle pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust erms invpcid xsaveopt dtherm arat pln pts md_clear flush_l1d
I'll look at it, probably an easy fix. (It should make 3 1-thread task pools in the low-core-count case)
For a workaround, grab the PR or change your main() temporarily to something like this:
use bevy::app::DefaultTaskPoolOptions;
fn main() {
App::build()
.add_resource(DefaultTaskPoolOptions::with_num_threads(4))
.add_default_plugins()
Ok, got to the bottom of this and have some interesting results to share.
There are two issues:
fn main() {
rayon::ThreadPoolBuilder::default()
.num_threads(1) // <-- (1) Force only a single thread in the pool
.build_global();
let (tx, rx) = crossbeam_channel::unbounded();
rayon::scope(|scope| {
scope.spawn(|scope| {
println!("Inner scope started"); // <-- (3) This line never runs because no thread is available
tx.send("hi");
});
rx.recv(); // <-- (2) This line will execute first and block the pool's only thread
})
}
While on the surface it may seem like "always put at least 2 thread in the pool" would solve the problem, this is not a general solution. If you have N tasks blocking on a resource at the same time, you must have at least N+1 threads in the pool.
Two possible solutions:
As a game engine, we have strict latency requirements and we know our workload, so we should really be shooting for option 2.
In short, this problem exists with both schedulers, but the new scheduler permits the use of await as it is async aware, allowing us to properly solve it.
Near term, we need to replace blocking code in parallel_executor.rs with a non-blocking alternative (i.e. await). Long term, end users will need to be careful to avoid having systems blocking on other systems.
I think a solution for the "spawn work and wait for completion" case is to let the spawning task also do work:
Stjepan also recently deprecated multitask in favour of async-executor: https://docs.rs/crate/async-executor/0.2.1
The changes are very minimal: Ticker is removed, and Executor now has async functions for ticking and running tasks. I think the async functions allow to compose Executors, which is kinda neat, and could allow the spawning/waiting thread to tick the executor while waiting to ensure progress for a task pool.
I think (2) is reasonable given that as a game engine we have pretty tight control over the execution environment. We can (and usually should) provide whatever abstractions people need to interact with the outside world. Ex: networking, asset management, file-io.
That being said, we might also want to consider @kabergstrom's approach. Would that give us a nice way to support single-threaded environments? Ex: "main" thread can do work, we spawn zero threads for each thread pool, and the main thread cranks through all of the work?
It should be worth trying to replace the blocking crossbeam channel with something async-friendly, as a first step. This would effectively turn the scheduling itself into a task, with system tasks waking the scheduler upon completion.
A more long-term solution would be to experiment with fully async schedulers - spawn the entire stage's worth of systems, with system tasks awaiting their borrow acquisition and dependencies. @aclysma has prototyped something similar in an unrelated project, so it's not outright impossible.
@kabergstrom I think a solution for the "spawn work and wait for completion" case is to let the spawning task also do work
I agree, a more polished version of the scheduler would allow the caller to be a worker thread. This would avoid having to ship workload from one thread to another - and that has definite benefits.
That said, even if the calling thread was allowed to be a worker (as it does with rayon) if the rayon task pool is configured for 1 worker, the expected behavior would be just one worker (in this case the calling thread only - as @cart described.) This would still deadlock, just as rayon deadlocks in this case (see example code above).
@Ratysz A more long-term solution would be to experiment with fully async schedulers
While I mentioned those prototypes in discord, at this point I'm not recommending them as a good path.
I have been thinking about this more.. I personally haven't seen anyone deadlock an ECS before and most of them run on rayon - which we can see is clearly possible to deadlock with the example code. So I've been asking myself, why does this not happen more often? I think the answer is, writing code where one system blocks on another system completing is a very unusual thing to do. Generally speaking people are relying on the ECS to schedule their systems - which means systems just do their work and return without having to deal with synchronization primitives themselves. So at this point I think we need to focus on making ParallelExecutor not block.
Hence "experiment". It would be necessary to come up with something a lot less naive than what I sketched out, an approach that takes the concerns you listed into consideration. We'll either land close to where we are now, rendering the idea moot, or stumble onto interesting API and/or performance gains.
writing code where one system blocks on another system completing is a very unusual thing to do
It is indeed, to the point where I haven't seen that in the wild, ever; neither it is the issue here.
So at this point I think we need to focus on making ParallelExecutor not block.
What are your thoughts on making the scheduling into a non-blocking task?
Still looking at options, I've considered a few:
One of the reasons I like (3) is that it reminds me of Naughty Dog's GDC talk on fibers. Essentially they use a counter that can be awaited to hit 0. I think this maps well to systems depending on other systems to run. https://www.gdcvault.com/play/1022186/Parallelizing-the-Naughty-Dog-Engine
All of these are greedily executing a DAG of tasks and have the same potential issues I mentioned above. I'm thinking now that bevy's original solution probably had them too so we probably won't be any worse off than we were.
As far as I'm concerned, this is fixed by #437. Thanks @aclysma!
If anyone is still hitting this, just let me know.
Most helpful comment
Ok, got to the bottom of this and have some interesting results to share.
What is the root cause?
There are two issues:
Why wasn't this happening with rayon?
While on the surface it may seem like "always put at least 2 thread in the pool" would solve the problem, this is not a general solution. If you have N tasks blocking on a resource at the same time, you must have at least N+1 threads in the pool.
What to do about this?
Two possible solutions:
As a game engine, we have strict latency requirements and we know our workload, so we should really be shooting for option 2.
In short, this problem exists with both schedulers, but the new scheduler permits the use of await as it is async aware, allowing us to properly solve it.
Near term, we need to replace blocking code in parallel_executor.rs with a non-blocking alternative (i.e. await). Long term, end users will need to be careful to avoid having systems blocking on other systems.