Jormungandr: Leader failed to validate its own blocks

Created on 4 Oct 2019 · 22Comments · Source: input-output-hk/jormungandr

Mandatory Information

jcli 0.5.5 (master-c3f2dcc4+, release, linux [x86_64]) - [rustc 1.38.0 (625451e37 2019-09-23)]
jormungandr 0.5.5 (master-c3f2dcc4+, release, linux [x86_64]) - [rustc 1.38.0 (625451e37 2019-09-23)]

To Reproduce

$RUST_BACKTRACE=1 jormungandr --config node-config.yaml --genesis-block-hash adbdd5ede31637f6c9bad5c271eec0bc3d0cb9efb86a5b913bb55cba549d07
70 --secret node-secret.yaml --log-level debug

   0: error_chain::backtrace::imp::InternalBacktrace::new
   1: <error_chain::State as core::default::Default>::default
   2: jormungandr::blockchain::chain::Blockchain::post_check_header
   3: futures::future::chain::Chain<A,B,C>::poll
   4: futures::future::chain::Chain<A,B,C>::poll
   5: <futures::future::map::Map<A,F> as futures::future::Future>::poll
   6: futures::task_impl::std::set
   7: std::thread::local::LocalKey<T>::with
   8: futures::future::Future::wait
   9: jormungandr::blockchain::process::handle_input
  10: <futures::stream::for_each::ForEach<S,F,U> as futures::future::Future>::poll
  11: std::panicking::try::do_call
  12: __rust_maybe_catch_panic
             at src/libpanic_unwind/lib.rs:80
  13: <futures::future::catch_unwind::CatchUnwind<F> as futures::future::Future>::poll
  14: <futures::future::map_err::MapErr<A,F> as futures::future::Future>::poll
  15: futures::task_impl::std::set
  16: std::panicking::try::do_call
  17: __rust_maybe_catch_panic
             at src/libpanic_unwind/lib.rs:80
  18: tokio_threadpool::task::Task::run
  19: tokio_threadpool::worker::Worker::run_task
  20: tokio_threadpool::worker::Worker::run
  21: std::thread::local::LocalKey<T>::with
  22: std::thread::local::LocalKey<T>::with
  23: std::thread::local::LocalKey<T>::with
  24: tokio::runtime::threadpool::builder::Builder::build::{{closure}}
  25: std::thread::local::LocalKey<T>::with
  26: std::thread::local::LocalKey<T>::with
  27: std::sys_common::backtrace::__rust_begin_short_backtrace
  28: std::panicking::try::do_call
  29: __rust_maybe_catch_panic
             at src/libpanic_unwind/lib.rs:80
  30: core::ops::function::FnOnce::call_once{{vtable.shim}}
  31: <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once
             at /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54/src/liballoc/boxed.rs:787
  32: <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once
             at /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54/src/liballoc/boxed.rs:787
      std::sys_common::thread::start_thread
             at src/libstd/sys_common/thread.rs:13
      std::sys::unix::thread::Thread::new::thread_start
             at src/libstd/sys/unix/thread.rs:79
  33: start_thread
  34: __clone
) } })
thread 'leadership0' panicked at 'Expect the event to not close', src/libcore/option.rs:1166:5
stack backtrace:
   0: backtrace::backtrace::libunwind::trace
             at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.34/src/backtrace/libunwind.rs:88
   1: backtrace::backtrace::trace_unsynchronized
             at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.34/src/backtrace/mod.rs

bug consensus-Genesis Priority - High A-jormungandr

Source

oldcryptogeek

All 22 comments

Thanks, the issue is the one just above. We are looking into it. The second one Expect the event to not close is a side effect of the first issue. What are the few lines above?

NicolasDP on 4 Oct 2019

I captured this prior to running backtrace.

Oct 04 09:52:26.002 INFO leader event starting, date: 224.3566, leader: 1, task: leadership
Oct 04 09:52:26.003 DEBG processing block from leader event, date: 224.3566, parent: aa8f5206849d8c9ae5a8520a3fc03bcc0bfb1cfd2ac234ed269e80abc8f29210, hash: 0876e3f069bca
2ee133a5872803e857cfa7bddf1ad70fa2f32d6982d1c78b334, task: block
thread 'block0' panicked at 'called Result::unwrap() on an Err value: Error(BlockHeaderVerificationFailed("Block signature is invalid"), State { next_error: None, bac
ktrace: InternalBacktrace { backtrace: None } })', src/libcore/result.rs:1084:5
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace.
called Result::unwrap() on an Err value: Error(BlockHeaderVerificationFailed("Block signature is invalid"), State { next_error: None, backtrace: InternalBacktrace { b
acktrace: None } })thread 'leadership0' panicked at 'Expect the event to not close
', src/libcore/option.rs:1166:5
Expect the event to not close

oldcryptogeek on 4 Oct 2019

did you register a stakepool at the last epoch (yesterday)?

NicolasDP on 4 Oct 2019

My pool was registered several days ago, yesterday I delegated stake to it.
The pool ID:
0f30fcf32277b69afe537d6af47f9ebc04955e8ea138d703981cc7420bab5290

oldcryptogeek on 4 Oct 2019

I am experiencing the same issue. It only happens /w a registered stake pool.
But why does a node panic when a Block's header verification failed (message is corrupted, invalid or any other issues with any incoming data)?

As, every node, at least, should be highly resilient (would say silver-bullet-proof) to any message they're receiving from any (adversary or friendly) nodes or servers as otherwise the whole network can be shut down /w some malformed message.

ilap on 6 Oct 2019

👍1

@ilap You are correct, there should be proper error handling instead of a panic on invalid input, and we still have a few places that were left with a quick and dirty unwrap.

mzabaluev on 6 Oct 2019

👍1

i ran this passive - without secret node

Oct 06 14:32:33.249 INFO update current branch tip, date: 225.40769, parent: 1de18f2726aa138a83351f43d5d72e7ea8caee262a9f75acb8e5674a9cc7c7cd, hash: b4b9077be0c6e6305d2491759c7887558dd9f5ff0be36736343dff17b547f36c, task: block
thread 'block0' panicked at 'called Result::unwrap() on an Err value: Error(Msg("cannot check if the block is in the storage"), State { next_error: Some(BackendError(SqliteFailure(Error { code: CannotOpen, extended_code: 14 }, Some("unable to open database file")))), backtrace: InternalBacktrace { backtrace: None } })', src/libcore/result.rs:1084:5
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace.
called Result::unwrap() on an Err value: Error(Msg("cannot check if the block is in the storage"), State { next_error: Some(BackendError(SqliteFailure(Error { code: CannotOpen, extended_code: 14 }, Some("unable to open database file")))), backtrace: InternalBacktrace { backtrace: None } })
thread 'leadership0' panicked at 'Expect the event to not close', src/libcore/option.rs:1166:5

cliffc2 on 6 Oct 2019

Thanks, the issue is the one just above. We are looking into it. The second one Expect the event to not close is a side effect of the first issue. What are the few lines above?

Interestingly, it's working now. Is it due to the fact that we've entered to the next epoch?

ilap on 7 Oct 2019

It still crashes bit for different reason.

It seams to me some memory, memleak or ulimit issue as the rust crashed /w memory allocation of 694 bytes failedAborted.
So, how the mem is cleaned /w GC?
Also what is the recommended ulimit settings for a stake-pools.
As the srv has enough memory, 4GB, and jormangundr only used ~370MB total (~270MB heap, 128KB stack, and the others), about 15 mins before the crash.

ilap on 7 Oct 2019

What’s your --full-version?

NicolasDP on 7 Oct 2019

What’s your --full-version?

jormungandr 0.5.5 (HEAD-aec29a1, release, linux [x86_64]) - [rustc 1.38.0 (625451e37 2019-09-23)]
But, it's a bit misleading, as it's running on a 32bit OS, and I use the following binary:
jormungandr-v0.5.5-armv7-unknown-linux-gnueabihf.tar.gz

ilap on 7 Oct 2019

Here comes the total memory consumption of jormungandr (collected in every minute) since I started it:

Rss:               69692 kB
Rss:               86908 kB
Rss:              100616 kB
Rss:              114320 kB
Rss:              128164 kB
Rss:              149320 kB
Rss:              291524 kB
Rss:              292716 kB
Rss:              292968 kB
Rss:              309044 kB
Rss:              334752 kB
Rss:              366172 kB
Rss:              369904 kB
Rss:              371684 kB
Rss:              378924 kB
Rss:              381676 kB
Rss:              387240 kB
Rss:              393340 kB
Rss:              404864 kB
Rss:              410200 kB
Rss:              410208 kB
Rss:              410220 kB
Rss:              410236 kB
Rss:              404812 kB
Rss:              404836 kB
Rss:              404844 kB
Rss:              407488 kB

ilap on 7 Oct 2019

Ok, it works now and it seems to me stable (running more than an our /w same memory).
The issue was that I accidentally DID NOT download and use the
jormungandr-v0.5.5-armv7-unknown-linux-gnueabihf.tar.gz (for newer ARMHF CPUs) I mentioned above, but the other (the older one for ARMEL)
jormungandr-v0.5.5-arm-unknown-linux-gnueabi.tar.gz, which behaved very unstable.

ilap on 7 Oct 2019

i ran this passive - without secret node

Oct 06 14:32:33.249 INFO update current branch tip, date: 225.40769, parent: 1de18f2726aa138a83351f43d5d72e7ea8caee262a9f75acb8e5674a9cc7c7cd, hash: b4b9077be0c6e6305d2491759c7887558dd9f5ff0be36736343dff17b547f36c, task: block
thread 'block0' panicked at 'called Result::unwrap() on an Err value: Error(Msg("cannot check if the block is in the storage"), State { next_error: Some(BackendError(SqliteFailure(Error { code: CannotOpen, extended_code: 14 }, Some("unable to open database file")))), backtrace: InternalBacktrace { backtrace: None } })', src/libcore/result.rs:1084:5
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace.
called Result::unwrap() on an Err value: Error(Msg("cannot check if the block is in the storage"), State { next_error: Some(BackendError(SqliteFailure(Error { code: CannotOpen, extended_code: 14 }, Some("unable to open database file")))), backtrace: InternalBacktrace { backtrace: None } })
thread 'leadership0' panicked at 'Expect the event to not close', src/libcore/option.rs:1166:5

My node crashed with this same error on version:
jormungandr 0.5.6 (master-8d96e03c+, release, linux [x86_64]) - [rustc 1.38.0 (625451e37 2019-09-23)]

Chris-Graffagnino on 8 Oct 2019

Did you delete the storage file in between ?

NicolasDP on 8 Oct 2019

Did you delete the storage file in between ?

I deleted the contents of /tmp/jormungandr/ prior to updating to 0.5.6, with the expectation that the files would be repopulated. I can see now that I am missing blocks.sqlite-shm and blocks.sqlite-wal. Funny thing is that the newly updated node ran overnight with no problems (was receiving blocks).

Chris-Graffagnino on 8 Oct 2019

Hi,

im facing the same issue.

Oct 09 16:45:17.303 INFO Starting jormungandr 0.5.6+lock (master-5ddbe91, release, linux [x86_64]) - [rustc 1.38.0 (625451e37 2019-09-23)], task: init Oct 09 16:45:17.304 INFO storing blockchain in '"/root/IOHK/jor-test/jormungandr/scripts/jormungandr-storage-test/blocks.sqlite"', task: init Oct 09 16:45:17.305 WARN No trusted peers joinable to bootstrap the network, task: bootstrap Oct 09 16:45:17.306 INFO starting task, task: client-query Oct 09 16:45:17.306 INFO starting task, task: network Oct 09 16:45:17.306 INFO adding P2P Topology module: trusted-peers, task: network Oct 09 16:45:17.306 INFO start listening and accepting gRPC connections on 127.0.0.1:8299, task: network thread 'leadership0' panicked at 'Expect the event to not close', src/libcore/option.rs:1166:5 note: run withRUST_BACKTRACE=1environment variable to display a backtrace. Expect the event to not close

I did compine the sources multiple times, but it didn't work...
Any Idea?

Thank you

Cheers

pheelLikeWater on 9 Oct 2019

I reproduced this issue running a node with a secret.yaml file and not registering that stake pool on the blockchain.

Config of node:

{
  "log": {
    "format": "plain",
    "level": "info",
    "output": {
      "gelf": {
        "backend": "monitoring.stakepool.cardano-testnet.iohkdev.io:12201",
        "log_id": "steak-and-eggs"
      }
    }
  },
  "p2p": {
    "public_address": "/ip4/45.55.58.244/tcp/3100",
    "topics_of_interest": {
      "blocks": "high",
      "messages": "high"
    },
    "trusted_peers": [
      "/ip4/3.123.177.192/tcp/3000",
      "/ip4/52.57.157.167/tcp/3000",
      "/ip4/3.123.155.47/tcp/3000",
      "/ip4/3.115.57.216/tcp/3000",
      "/ip4/3.112.185.217/tcp/3000",
      "/ip4/18.139.40.4/tcp/3000",
      "/ip4/18.140.134.230/tcp/3000"
    ]
  },
  "rest": {
    "listen": "10.40.33.1:3101"
  },
  "storage": "/var/lib/jormungandr"
}

Ran with command:

/nix/store/ixdq1m73ahkk75d0r3bc57m5rdjvicyv-jormungandr-0.5.6+lock/bin/jormungandr --genesis-block-hash adbdd5ede31637f6c9bad5c271eec0bc3d0cb9efb86a5b913bb55cba549d0770 --config /nix/store/39fr7qyv161zr1085ylqyz0fav5v7ijy-config.yaml --secret /run/keys/jormungandr-secret.yaml

Started Jormungandr node service.
Oct 10 13:10:27.990 INFO Starting jormungandr 0.5.6+lock (, release, linux [x86_64]) - [rustc 1.37.0 (eae3437df 2019-08-13)], task: init
Oct 10 13:10:27.991 INFO storing blockchain in '"/var/lib/jormungandr/blocks.sqlite"', task: init
Oct 10 13:11:30.971 INFO connecting to bootstrap peer 3.115.57.216:3000, peer_addr: 3.115.57.216:3000, task: bootstrap
Oct 10 13:11:31.295 WARN with initial bootstrap, reason: bootstrap pull stream failed, peer_addr: 3.115.57.216:3000, task: bootstrap
Oct 10 13:11:31.297 INFO connecting to bootstrap peer 3.112.185.217:3000, peer_addr: 3.112.185.217:3000, task: bootstrap
Oct 10 13:11:31.624 WARN with initial bootstrap, reason: bootstrap pull stream failed, peer_addr: 3.112.185.217:3000, task: bootstrap
Oct 10 13:11:31.627 INFO connecting to bootstrap peer 52.57.157.167:3000, peer_addr: 52.57.157.167:3000, task: bootstrap
Oct 10 13:11:31.809 WARN with initial bootstrap, reason: bootstrap pull stream failed, peer_addr: 52.57.157.167:3000, task: bootstrap
Oct 10 13:11:31.811 INFO connecting to bootstrap peer 3.123.177.192:3000, peer_addr: 3.123.177.192:3000, task: bootstrap
Oct 10 13:11:31.996 WARN with initial bootstrap, reason: bootstrap pull stream failed, peer_addr: 3.123.177.192:3000, task: bootstrap
Oct 10 13:11:31.998 INFO connecting to bootstrap peer 18.139.40.4:3000, peer_addr: 18.139.40.4:3000, task: bootstrap
Oct 10 13:11:32.442 WARN with initial bootstrap, reason: bootstrap pull stream failed, peer_addr: 18.139.40.4:3000, task: bootstrap
Oct 10 13:11:32.445 INFO connecting to bootstrap peer 18.140.134.230:3000, peer_addr: 18.140.134.230:3000, task: bootstrap
Oct 10 13:11:32.897 WARN with initial bootstrap, reason: bootstrap pull stream failed, peer_addr: 18.140.134.230:3000, task: bootstrap
Oct 10 13:11:32.897 INFO connecting to bootstrap peer 3.123.155.47:3000, peer_addr: 3.123.155.47:3000, task: bootstrap
Oct 10 13:11:33.079 WARN with initial bootstrap, reason: bootstrap pull stream failed, peer_addr: 3.123.155.47:3000, task: bootstrap
Oct 10 13:11:33.081 INFO starting task, task: network
Oct 10 13:11:33.081 INFO adding P2P Topology module: trusted-peers, task: network
Oct 10 13:11:33.082 INFO start listening and accepting gRPC connections on 45.55.58.244:3100, task: network
Oct 10 13:11:33.082 INFO starting task, task: client-query
Oct 10 13:11:33.083 INFO connecting to initial gossip peer, peer_addr: 3.112.185.217:3000, task: network
Oct 10 13:11:33.085 INFO connecting to initial gossip peer, peer_addr: 3.115.57.216:3000, task: network
Oct 10 13:11:33.086 INFO connecting to initial gossip peer, peer_addr: 52.57.157.167:3000, task: network
Oct 10 13:11:33.086 INFO connecting to initial gossip peer, peer_addr: 18.139.40.4:3000, task: network
Oct 10 13:11:33.087 INFO connecting to initial gossip peer, peer_addr: 3.123.177.192:3000, task: network
Oct 10 13:11:33.088 INFO connecting to initial gossip peer, peer_addr: 3.123.155.47:3000, task: network
Oct 10 13:11:33.090 INFO connecting to initial gossip peer, peer_addr: 18.140.134.230:3000, task: network
thread 'leadership0' panicked at 'Expect the event to not close', src/libcore/option.rs:1034:5
stack backtrace:
   0:     0x5641add6c34b - std::panicking::default_hook::{{closure}}::h847a2eb38b396f14
   1:     0x5641add6c027 - std::panicking::default_hook::h2ca0f9a30a0e206b
   2:     0x5641add6cac0 - std::panicking::rust_panic_with_hook::hffcefc09751839d1
   3:     0x5641add6c642 - std::panicking::continue_panic_fmt::hc0f142c930c846fc
   4:     0x5641add6c526 - rust_begin_unwind
   5:     0x5641add897ed - core::panicking::panic_fmt::h2daf88b2616ca2b2
   6:     0x5641add89857 - core::option::expect_failed::h4b77ebe6e62ec3a1
   7:     0x5641ad759f03 - futures::future::chain::Chain<A,B,C>::poll::ha355895fbd52404c
   8:     0x5641ad6c2e68 - <futures::future::loop_fn::LoopFn<A,F> as futures::future::Future>::poll::h81adca4422d6b8a6
   9:     0x5641ad4dbca7 - std::panicking::try::do_call::hfba1fe95d49ab22e
  10:     0x5641add70e1a - __rust_maybe_catch_panic
  11:     0x5641ad649c8c - <futures::future::catch_unwind::CatchUnwind<F> as futures::future::Future>::poll::hcb3290538a78d67a
  12:     0x5641ad422163 - <futures::future::map_err::MapErr<A,F> as futures::future::Future>::poll::hf286e919e0bc929a
  13:     0x5641add46a4d - futures::task_impl::Spawn<T>::poll_future_notify::hb23b3373513e934c
  14:     0x5641add46641 - std::panicking::try::do_call::h228954743a9e9cf4
  15:     0x5641add70e1a - __rust_maybe_catch_panic
  16:     0x5641add44e59 - tokio_threadpool::task::Task::run::h3c1ea0ba7374fc64
  17:     0x5641add3dd79 - tokio_threadpool::worker::Worker::run_task::h5a2aca95cc3c9906
  18:     0x5641add3d4f7 - tokio_threadpool::worker::Worker::run::hc73017d12b48a593
  19:     0x5641add1ba1f - std::thread::local::LocalKey<T>::with::h9a23aa648baa86b7
  20:     0x5641add1b908 - std::thread::local::LocalKey<T>::with::h76110cfda97f3d8c
  21:     0x5641add1b1df - tokio_reactor::with_default::h492df3d33b38d57c
  22:     0x5641add21e6f - tokio::runtime::threadpool::builder::Builder::build::{{closure}}::h084c0bd1fdbad0f9
  23:     0x5641add444b6 - std::thread::local::LocalKey<T>::with::hd8efb6f6b3779307
  24:     0x5641add445cf - std::thread::local::LocalKey<T>::with::hda880bf1bdd22f0e
  25:     0x5641add3e671 - std::sys_common::backtrace::__rust_begin_short_backtrace::h90d0a2345cf9a3bb
  26:     0x5641add4668c - std::panicking::try::do_call::h31c1fb4fa4bc192b
  27:     0x5641add70e1a - __rust_maybe_catch_panic
  28:     0x5641add3f910 - core::ops::function::FnOnce::call_once{{vtable.shim}}::h8451465f67620f58
  29:     0x5641add5d36f - <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once::h42806b83647d4c79
  30:     0x5641add701c0 - std::sys::unix::thread::Thread::new::thread_start::h4570080769500bcd
  31:     0x7f8757154ef7 - start_thread
  32:     0x7f8756e7222f - __GI___clone
  33:                0x0 - <unknown>
Expect the event to not close
jormungandr.service: Main process exited, code=exited, status=66/NOINPUT
jormungandr.service: Failed with result 'exit-code'.
jormungandr.service: Consumed 1min 3.405s CPU time.
jormungandr.service: Service RestartSec=100ms expired, scheduling restart.
jormungandr.service: Scheduled restart job, restart counter is at 440.
Stopped Jormungandr node service.
jormungandr.service: Consumed 1min 3.405s CPU time.

disassembler on 10 Oct 2019

@disassembler Great, thanks.We need to repro with a debug build to see what's causing it, this backtrace looks like some call frames are optimized away.

mzabaluev on 10 Oct 2019

Seeing this panic on 0.6.0 too.

mmahut on 15 Oct 2019

This should be fixed in 0.7. Let me know if this is not the case.

NicolasDP on 24 Oct 2019

still there it seems. 0.7.0-rc1

Oct 24 19:37:21.001 INFO leader event starting, date: 1.386, leader: 1, task: leadership
thread 'block1' panicked at 'called `Result::unwrap()` on an `Err` value: Error(Msg("cannot process leadership block"), State { next_error: Some(Error(BlockHeaderVerificationFailed("Block proof is invalid"), State { next_error: None, backtrace: InternalBacktrace { backtrace: None } })), backtrace: InternalBacktrace { backtrace: None } })', src/libcore/result.rs:1084:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
called `Result::unwrap()` on an `Err` value: Error(Msg("cannot process leadership block"), State { next_error: Some(Error(BlockHeaderVerificationFailed("Block proof is invalid"), State { next_error: None, backtrace: InternalBacktrace { backtrace: None } })), backtrace: InternalBacktrace { backtrace: None } })thread 'leadership0' panicked at 'Expect the event to not close', src/libcore/option.rs:1166:5

ubuntu@ip-172-31-32-11:~/jor-test/jormungandr$ jormungandr --full-version
jormungandr 0.7.0-rc1 (HEAD-5583e5e8, release, linux [x86_64]) - [rustc 1.38.0 (625451e37 2019-09-23)]