On a recent kusama, I'm running:
kusama git:(master) ./target/release/polkadot --light
At some point, the sync panics with
thread 'tokio-runtime-worker' has overflowed its stack
fatal runtime error: stack overflow
[1] 23325 abort ./target/release/polkadot --light
Gist: https://gist.github.com/amaurymartiny/b5b85f88b3bf2e590409ac371acba288
The "at some point" depends, I tried running twice:
I'm not sure if it's related to https://github.com/paritytech/substrate/issues/4527, if yes, we can close this issue and continue there.
Specs:
Can you try to run with gdb (or similiar) to get a proper stack trace? I tried it on my PC and could not yet reproduce it.
macos has default stack size of 512k for non-main threads. So this should be reproducible on linux with something like ulimit -s 500
Yep, getting it ;)
(it is related to decoding the ForkTree which is a recursive data structure)
I'm getting this, on a full node when trying to sync for https://github.com/paritytech/substrate/pull/7225. The command I'm running is simply target/release/substrate --execution native --wasm-execution compiled.
So if the recursive nature of the ForkTree is the problem here, what's the solution? To rewrite it so that it's a flat array?
Maybe there is a bug and we should already have pruned some of the state. Not sure. @andresilva probably knows more.
We will prune BABE's epoch tree as we finalize new blocks, in this case we are not catching up to the latest finalized block and as we sync BABE blocks the epoch tree gets deeper and deeper. Since all of the operations on the fork tree are currently implemented using recursion we eventually reach the limits of the call stack. The solution is to rewrite the operations on the fork tree to not use recursion and instead use some auxiliary data structure as stack (my brain understands recursion better so that's why I wrote it like that the first time around).
It should also be possible to write Node::decode using a loop without recursion.
It should also be possible to write
Node::decodeusing a loop without recursion.
I just tried this and it had no effect, so this doesn't seem to be a problem with decoding. Here's the gdb backtrace:
Thread 68 "tokio-runtime-w" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffa30d4640 (LWP 28733)]
0x0000555557056a6c in <sc_client_db::BlockchainDb<Block> as sp_blockchain::header_metadata::HeaderMetadata<Block>>::header_metadata ()
(gdb) backtrace
#0 0x0000555557056a6c in <sc_client_db::BlockchainDb<Block> as sp_blockchain::header_metadata::HeaderMetadata<Block>>::header_metadata ()
#1 0x0000555556c996c2 in sp_blockchain::header_metadata::lowest_common_ancestor ()
#2 0x0000555556979bc1 in sc_client_api::utils::is_descendent_of::{{closure}} ()
#3 0x0000555556b47431 in fork_tree::node_implementation::Node<H,N,V>::import ()
#4 0x0000555556b4730c in fork_tree::node_implementation::Node<H,N,V>::import ()
#5 0x0000555556b4730c in fork_tree::node_implementation::Node<H,N,V>::import ()
#6 0x0000555556b4730c in fork_tree::node_implementation::Node<H,N,V>::import ()
#7 0x0000555556b4730c in fork_tree::node_implementation::Node<H,N,V>::import ()
#8 0x0000555556b4730c in fork_tree::node_implementation::Node<H,N,V>::import ()
#9 0x0000555556b4730c in fork_tree::node_implementation::Node<H,N,V>::import ()
#10 0x0000555556b4730c in fork_tree::node_implementation::Node<H,N,V>::import ()
#11 0x0000555556b4730c in fork_tree::node_implementation::Node<H,N,V>::import ()
#12 0x0000555556b4730c in fork_tree::node_implementation::Node<H,N,V>::import ()
#13 0x0000555556b4730c in fork_tree::node_implementation::Node<H,N,V>::import ()
#14 0x0000555556b4730c in fork_tree::node_implementation::Node<H,N,V>::import ()
#15 0x0000555556b4730c in fork_tree::node_implementation::Node<H,N,V>::import ()
#16 0x0000555556b4730c in fork_tree::node_implementation::Node<H,N,V>::import ()
#17 0x0000555556b4730c in fork_tree::node_implementation::Node<H,N,V>::import ()
#18 0x0000555556b4730c in fork_tree::node_implementation::Node<H,N,V>::import ()
#19 0x0000555556b4730c in fork_tree::node_implementation::Node<H,N,V>::import ()
#20 0x0000555556b4730c in fork_tree::node_implementation::Node<H,N,V>::import ()
#21 0x0000555556b4730c in fork_tree::node_implementation::Node<H,N,V>::import ()
#22 0x0000555556b4730c in fork_tree::node_implementation::Node<H,N,V>::import ()
#23 0x0000555556b4730c in fork_tree::node_implementation::Node<H,N,V>::import ()
#24 0x0000555556b4730c in fork_tree::node_implementation::Node<H,N,V>::import ()
#25 0x0000555556b4730c in fork_tree::node_implementation::Node<H,N,V>::import ()
#26 0x0000555556b4730c in fork_tree::node_implementation::Node<H,N,V>::import ()
#27 0x0000555556b4730c in fork_tree::node_implementation::Node<H,N,V>::import ()
#28 0x0000555556b4730c in fork_tree::node_implementation::Node<H,N,V>::import ()
#29 0x0000555556b4730c in fork_tree::node_implementation::Node<H,N,V>::import ()
#30 0x0000555556b4730c in fork_tree::node_implementation::Node<H,N,V>::import ()
#31 0x0000555556b4730c in fork_tree::node_implementation::Node<H,N,V>::import ()
#32 0x0000555556b4730c in fork_tree::node_implementation::Node<H,N,V>::import ()
#33 0x0000555556b4730c in fork_tree::node_implementation::Node<H,N,V>::import ()
#34 0x0000555556b4730c in fork_tree::node_implementation::Node<H,N,V>::import ()
#35 0x0000555556b4730c in fork_tree::node_implementation::Node<H,N,V>::import ()
...
#2436 0x0000555556b4730c in fork_tree::node_implementation::Node<H,N,V>::import ()
#2437 0x0000555556b4730c in fork_tree::node_implementation::Node<H,N,V>::import ()
#2438 0x00005555570db022 in fork_tree::ForkTree<H,N,V>::import ()
#2439 0x00005555569c2e93 in sc_consensus_epochs::EpochChanges<Hash,Number,E>::import ()
#2440 0x0000555556c8e31f in <sc_consensus_babe::BabeBlockImport<Block,Client,Inner> as sp_consensus::block_import::BlockImport<Block>>::import_block ()
#2441 0x0000555556c85371 in sp_consensus::import_queue::import_single_block_metered ()
#2442 0x0000555556a29c9c in <futures_util::future::poll_fn::PollFn<F> as core::future::future::Future>::poll ()
#2443 0x0000555556be40ba in <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll ()
#2444 0x0000555556df14c9 in <futures_util::future::future::flatten::Flatten<Fut,<Fut as core::future::future::Future>::Output> as core::future::future::Future>::poll ()
#2445 0x0000555556a2e94a in <futures_util::future::poll_fn::PollFn<F> as core::future::future::Future>::poll ()
#2446 0x000055555750e070 in <sc_service::task_manager::prometheus_future::PrometheusFuture<T> as core::future::future::Future>::poll ()
#2447 0x000055555750645c in <futures_util::future::select::Select<A,B> as core::future::future::Future>::poll ()
#2448 0x0000555557508df9 in <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll ()
#2449 0x0000555556c23f7c in std::thread::local::LocalKey<T>::with ()
#2450 0x0000555556ac4191 in futures_executor::local_pool::block_on ()
#2451 0x00005555570ccb73 in tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut ()
#2452 0x0000555556f8cd7d in tokio::runtime::task::core::Core<T,S>::poll ()
#2453 0x0000555556fb7449 in <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once ()
#2454 0x0000555556b21098 in tokio::runtime::task::harness::Harness<T,S>::poll ()
#2455 0x0000555558097d35 in tokio::runtime::blocking::pool::Inner::run ()
#2456 0x000055555809d1f9 in tokio::runtime::context::enter ()
#2457 0x00005555580a0820 in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#2458 0x00005555580a80a0 in core::ops::function::FnOnce::call_once{{vtable-shim}} ()
#2459 0x00005555585bd16a in <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once () at /rustc/397b390cc76ba1d98f80b2a24a371f708dcc9169/library/alloc/src/boxed.rs:1042
#2460 <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once () at /rustc/397b390cc76ba1d98f80b2a24a371f708dcc9169/library/alloc/src/boxed.rs:1042
#2461 std::sys::unix::thread::Thread::new::thread_start () at library/std/src/sys/unix/thread.rs:87
#2462 0x00007ffff39293e9 in start_thread () from /usr/lib/libpthread.so.0
#2463 0x00007ffff383d293 in clone () from /usr/lib/libc.so.6
@expenses you have rewritten the decode as iterative function? Perfect! Decode and import suffer from the same problem, so we need a solution for both of them ;)
Unfortunately, import is a lot harder to iterative-ize :thinking:
If we're going to re-write the fork tree, I think it makes sense to use a library for the tree structure. I'm not really sure which ones are commonly used, but I don't think we need to do this.indextree seemed decent.
I have written an iterative version of import: https://github.com/paritytech/substrate/compare/82867294110481d421f77bddb04aef10c0674e09
The node still panics though due to find_node_index_where being recursive.
Thanks! Yeah I think we can keep the forktree since it's small enough and specialized for our two use cases (BABE and GRANDPA), although I appreciate that you looked into alternatives. A PR for removing recursion out of it (the work that you already started here) would be greatly appreciated :)
Most helpful comment
I have written an iterative version of
import: https://github.com/paritytech/substrate/compare/82867294110481d421f77bddb04aef10c0674e09The node still panics though due to
find_node_index_wherebeing recursive.