Describe the bug
Node crashes instantly when jcli rest v0 utxo get -h http://127.0.0.1:3101/api is queried. Testet in the private Testnet, where lots of transactions were made. First it just gave nothing back, but after a few thousand txs it just crashed with following error:
Oct 24 01:42:41.423 INFO Starting jormungandr 0.7.0-alpha.dev.1 (HEAD-3af63d09+, release, linux [x86_64]) - [rustc 1.38.0 (625451e37 2019-09-23)], task: init
Oct 24 01:42:41.423 WARN Node started without path to the stored secret keys (not a stake pool or a BFT leader), task: init
Oct 24 01:42:41.423 INFO Starting 1 workers, scope: global
Oct 24 01:42:41.424 INFO Starting server on 127.0.0.1:3101, scope: global
Oct 24 01:42:41.424 INFO storing blockchain in '"/home/consuman/CARDANO/testnet_alt/storage/blocks.sqlite"', task: init
Oct 24 01:42:41.426 INFO fetching block c63a07f3e0db52c9abf886453316a1698d41a77023972c708345acf6645a8c0c, peer_address: 45.32.188.204:4000, block: c63a07f3e0db52c9abf886453316a1698d41a77023972c708345acf6645a8c0c, task: init
Oct 24 01:42:42.054 INFO initial bootstrap completed, peer_address: 45.32.188.204:4000, block: c63a07f3e0db52c9abf886453316a1698d41a77023972c708345acf6645a8c0c, task: init
Oct 24 01:42:42.140 INFO connecting to bootstrap peer 51.91.96.237:3100, peer_addr: 51.91.96.237:3100, task: bootstrap
Oct 24 01:42:54.629 ERRO Error occurred during request handling, status: 503 Service Unavailable Full REST context not available yet, scope: global
Oct 24 01:43:00.967 ERRO Error occurred during request handling, status: 503 Service Unavailable Full REST context not available yet, scope: global
Oct 24 01:43:07.976 INFO switching to new candidate branch, peer_addr: 51.91.96.237:3100, task: bootstrap
Oct 24 01:43:07.988 INFO initial bootstrap completed, peer_addr: 51.91.96.237:3100, task: bootstrap
Oct 24 01:43:07.988 INFO starting task, task: client-query
Oct 24 01:43:07.989 INFO adding P2P Topology module: trusted-peers, task: network
Oct 24 01:43:07.989 INFO start listening and accepting gRPC connections on 0.0.0.0:8855, task: network
Oct 24 01:43:07.989 INFO connecting to peer, node_id: 2da766df9d16682e45119194cbe5ccb92e03020625558e821201d6e319a37673, peer_addr: 51.91.96.237:3100, task: network
Oct 24 01:43:07.989 INFO connecting to peer, node_id: 7ae7730a9f19b8398e415c199d22035e48c4e220f75ea0a12870ad2ab309b7a0, peer_addr: 45.32.188.204:4000, task: network
Oct 24 01:43:07.989 INFO connecting to peer, node_id: e277491e0dfc7c23e51258cd84f66fc77b15fe9e1b71b8b3cb0ddce0b6de54b4, peer_addr: 35.235.111.217:9000, task: network
Oct 24 01:43:11.112 INFO incoming P2P connection on 138.201.181.94:8855, peer_addr: 35.235.111.217:39440, task: network
Oct 24 01:43:11.265 INFO client P2P connection closed, node_id: e277491e0dfc7c23e51258cd84f66fc77b15fe9e1b71b8b3cb0ddce0b6de54b4, peer_addr: 35.235.111.217:9000, task: network
Oct 24 01:43:13.556 INFO incoming P2P connection on 138.201.181.94:8855, peer_addr: 54.66.255.167:50069, task: network
Oct 24 01:43:13.581 INFO incoming P2P connection on 138.201.181.94:8855, peer_addr: 54.66.255.167:50068, task: network
Oct 24 01:43:14.106 INFO incoming P2P connection on 138.201.181.94:8855, peer_addr: 176.25.248.202:42848, task: network
Oct 24 01:43:14.106 INFO incoming P2P connection on 138.201.181.94:8855, peer_addr: 176.25.248.202:42846, task: network
Oct 24 01:43:14.179 INFO incoming P2P connection closed, peer_addr: 176.25.248.202:42848, task: network
Oct 24 01:43:14.673 INFO incoming P2P HTTP/2 connection error, reason: connection error: Connection reset by peer (os error 104), peer_addr: 54.66.255.167:50069, task: network
Oct 24 01:43:15.131 INFO incoming P2P connection on 138.201.181.94:8855, peer_addr: 94.195.82.31:43388, task: network
Oct 24 01:43:15.157 INFO incoming P2P connection on 138.201.181.94:8855, peer_addr: 94.195.82.31:43390, task: network
Oct 24 01:43:15.225 INFO incoming P2P connection closed, peer_addr: 94.195.82.31:43388, task: network
thread 'arbiter:f73a75ad-6a0a-40a2-8dbb-dbcd7a3e347c:actix-net-worker-0' has overflowed its stack
fatal runtime error: stack overflow
Mandatory Information
To Reproduce
Query jcli rest v0 utxo get -h http://127.0.0.1:3101/api in a network with a high tx amount.
I can't reproduce the issue. Are you able to reliably reproduce it on your machine? I've tried running a node in the network (and got plenty of txs on bootstrap) and running a self node with around 2000 TXs in UTXO, in both cases UTXO REST was rock stable.
Yes, I reproduced it several times in a private testnet with this genesis: https://github.com/rdlrt/Alternate-Jormungandr-Testnet/blob/master/README.md
I had the /utxo api included in our node monitor / explorer and it crashed my node multiple times until I realised it was comming from this request. Afterwords I asked someone else in thsi network to call it and his node crashed with the same error. We did it without software inbetween, only the API request. jcli rest v0 utxo get -h http://127.0.0.1:3101/api
I remember that in the beginning of this testnet the result was empty and after a few hours of tx spamming tests it made my node crash. But it was after several thousand txs.
(In the last 80h alone we pushed 260722 tx through the network. Find a running Node-Monitor here: http://privtest1.standardize.network/)
Other links:
We talked about it here: https://t.me/CardanoPrivateTestnet/847
Here is our software to query the node: https://github.com/standardize-network/querynode
Edit: I've not tested with the newest version 0.7.0.rc, only alpha-dev, but I dont want to crash my node right now, to keep mempool and data. So I would have to setup a seperate machine to do so. Tell me if its necessary :)
Is there a chance to obtain a backtrace of the crashed thread, perhaps even with a debug build?
I don't think so, AFAIK stack overflows terminate the whole process instantly. I've tried it with both debug and release builds of a toy program I wrote.
@CodeSandwich Sorry, I didn't mean a run-time Rust backtrace, but a postmortem from core. LLDB, anyone?
Sorry, I dont have it. Node is running in a long term test right now and I dont want to shut it down the next days.
We tested it again, with 0.7.0-rc and the issue still persists. Seems like when there are tens of thousands of txs, its causing internal problems, because too much data is requested all at once.
@consuman this end point should have been removed a long time ago. This is the least reliable way to obtain the current UTxO list. Better to play with the explorer mode if you wish to get the whole list of UTxO.
documentation is here if you want to run the explorer mode: https://input-output-hk.github.io/jormungandr/quickstart/04_explorer.html
adding #1048 though, as we believe providing a query by id UTxO list is acceptable for the node mode setting
@NicolasDP The bug is still there and we have a chance to track it down. I'm 99% confident that it's not in UTxO REST endpoint, because it's not doing anything except forming a JSON. I've been able to generate 500 MB UTxO JSON when I was looking for the bug and while it took some time, it was handled perfectly. This bug is somewhere deeper, which means that it's probably much more severe.
@CodeSandwich , I got this running under gdb (will try lldb shortly). While I check further the bt, does this tells you anything?
Thread 4 "arbiter:efd4e90" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff7495700 (LWP 1847)]
0x0000555556d0cba3 in <chain_impl_mockchain::utxo::Iter<V> as core::iter::traits::iterator::Iterator>::next (self=0x7ffff747dca8)
at chain-deps/chain-impl-mockchain/src/utxo.rs:169
169 None => match self.hamt_iter.next() {
(gdb) continue
Continuing.
thread 'arbiter:efd4e90d-7bff-48c4-ac9e-c5acde897e2a:actix-net-worker-0' has overflowed its stack
fatal runtime error: stack overflow
Oct 29 14:27:34.362 INFO leader event starting, date: 73017.366, leader: 1, task: leadership
Thread 4 "arbiter:efd4e90" received signal SIGABRT, Aborted.
under lldb
rocess 3316 stopped
* thread #2, name = 'arbiter:4e3e544', stop reason = signal SIGSEGV: address access protected (fault address: 0x7ffff7295ff8)
frame #0: 0x0000555556d0cba3 jormungandr`_$LT$chain_impl_mockchain..utxo..Iter$LT$V$GT$$u20$as$u20$core..iter..traits..iterator..Iterator$GT$::next::h4f2c4804c986a977(self=0x00007ffff747dca8) at utxo.rs:169
166
167 fn next(&mut self) -> Option<Self::Item> {
168 match &mut self.unspents_iter {
-> 169 None => match self.hamt_iter.next() {
170 None => None,
171 Some(unspent) => {
172 self.unspents_iter = Some((unspent.0, (unspent.1).0.iter()));
(lldb) continue
Process 3316 resuming
thread 'arbiter:4e3e544c-e052-483e-b6f5-2872229eece4:actix-net-worker-0' has overflowed its stack
fatal runtime error: stack overflow
Process 3316 stopped
* thread #2, name = 'arbiter:4e3e544', stop reason = signal SIGABRT
frame #0: 0x00007ffff7a587bb libc.so.6`__GI_raise(sig=6) at raise.c:51
Thank you, @rinor, that's brilliant! Can you print the stack trace from that moment? I can't see an obvious recursion in code that caused the overflow, the stack trace would help a lot.
Clean one:
(gdb) frame
#0 0x0000555556d0cba3 in <chain_impl_mockchain::utxo::Iter<V> as core::iter::traits::iterator::Iterator>::next (self=0x7ffff747dca8)
at chain-deps/chain-impl-mockchain/src/utxo.rs:169
169 None => match self.hamt_iter.next() {
(gdb) continue
Continuing.
thread 'arbiter:55a596ba-f7af-4a3d-b5c5-a6287abfe11b:actix-net-worker-0' has overflowed its stack
fatal runtime error: stack overflow
Thread 4 "arbiter:55a596b" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
gdb-info_threads.txt
gdb-bt_full.txt
gdb-thread_apply_all_bt_full.txt
@rinor Could you confirm if the issue is fixed on branch fix_utxo_iter?
@CodeSandwich Under the same scenario as before, there is NO Segmentation Fault anymore.
Most helpful comment
@consuman this end point should have been removed a long time ago. This is the least reliable way to obtain the current UTxO list. Better to play with the explorer mode if you wish to get the whole list of UTxO.
documentation is here if you want to run the explorer mode: https://input-output-hk.github.io/jormungandr/quickstart/04_explorer.html