Describe the bug
Initially after bootstrapping you can use the node stats via the rest api just fine. But for some reason as time goes on when you try to get node stats it starts to time out. This started happening in 8.7
Mandatory Information
1 .jcli 0.8.7 (HEAD-364cd84e, release, linux [x86_64]) - [rustc 1.40.0 (73528e339 2019-12-16)]
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The stats to be returned
@stanfieldr thanks for the report. Can you check and report what is the total memory usage of your machine at the moment the issue happens.
@rinor
I have the same issue and my total memory usage was 1,5Go-1,6Go out of 4Go.
When using 0.8.6, my memory usage is 1,3Go max
I have the same issue on one of my nodes so cannot do a proper failover. @stanfieldr could you please set a proper title for this issue which is actually meaningful?
Suggestion: "Stats timeout on v0.8.7"
I've seen this issue as well.

Interestingly when I had this happen the REST query leader logs was still working instantly, it was just the stats query which was slow or timing out.
On my pool the rest service becomes unavailable after CPU hits over 100% for a minute or so. Then the following messages start to appear in the logs:
Jan 24 00:17:20.744 WARN Peer GCed 41 nodes, number_gced: 41, task: network
Jan 24 00:17:33.744 WARN Peer GCed 62 nodes, number_gced: 62, task: network
Jan 24 00:18:12.743 WARN Peer GCed 61 nodes, number_gced: 61, task: network
Jan 24 00:18:25.743 WARN Peer GCed 60 nodes, number_gced: 60, task: network
Jan 24 00:18:38.748 WARN Peer GCed 55 nodes, number_gced: 55, task: network
Jan 24 00:18:51.742 WARN Peer GCed 47 nodes, number_gced: 47, task: network
Jan 24 00:19:04.743 WARN Peer GCed 49 nodes, number_gced: 49, task: network
Jan 24 00:19:17.747 WARN Peer GCed 37 nodes, number_gced: 37, task: network
Jan 24 00:19:30.743 WARN Peer GCed 27 nodes, number_gced: 27, task: network
Jan 24 00:19:43.743 WARN Peer GCed 15 nodes, number_gced: 15, task: network
...
...
Jan 24 00:21:02.637 WARN response canceled by the processing task, request: PushHeaders, sub_task: server, task: network
Jan 24 00:21:02.637 WARN response canceled by the processing task, request: PushHeaders, sub_task: server, task: network
Jan 24 00:21:02.653 ERRO cannot process network headers, reason: Error(Elapsed), sub_task: chain_pull, task: block
Jan 24 00:21:02.654 WARN response canceled by the processing task, request: PushHeaders, sub_task: server, task: network
Jan 24 00:21:02.666 ERRO cannot process network headers, reason: Error(Elapsed), sub_task: chain_pull, task: block
Jan 24 00:21:02.667 WARN response canceled by the processing task, request: PushHeaders, sub_task: server, task: network
Jan 24 00:21:02.686 ERRO cannot process network headers, reason: Error(Elapsed), sub_task: chain_pull, task: block
Jan 24 00:21:02.687 WARN response canceled by the processing task, request: PushHeaders, sub_task: server, task: network
Jan 24 00:21:02.688 ERRO cannot process network headers, reason: Error(Elapsed), sub_task: chain_pull, task: block
Jan 24 00:21:02.688 WARN response canceled by the processing task, request: PushHeaders, sub_task: server, task: network
Jan 24 00:21:02.733 ERRO cannot process network headers, reason: Error(Elapsed), sub_task: chain_pull, task: block
Before that node bootstrapped very quickly and it run approximately for 700 seconds before the messages above started to appear. Memory was OK - within limits.
I was testing out a new machine with the E3-1270v6 cpu and 32GB of RAM @ 2048 connections with a gossip interval of 1s and a 1gbps uplink. When I noticed I was forking the network due to connection time outs I stopped paying for the server immediately, so I'm not sure how much RAM I was using. I can however say I was, and still am running another server with much lower specs on commit ac36553c in master and I have not experienced this issue, so it must have been a change after that. JCLI on that machine was compiled back in 8.5 though, not sure if the two projects may impact each other. If useful, my script was calling the rest api directly through port 80, not through jcli. SSHing into the machine manually and running jcli had the same timeout problem though
I found hitting the REST API twice per second causes it to crash but for the last ~24 hours only hitting it once per second it did not crash.
Edit: correction - it just crashed a node.
Seems to only occur with the JCLI Node stats and Network stats get for me.
observation i made:
@jeroenvhove Not sure this is related to the peerUnreachableCnt.
I have two nodes well over 400 of this and they are running fine (no delay in node stats get)
blockRecvCnt: 3014
lastBlockContentSize: 0
lastBlockDate: "41.33627"
lastBlockFees: 0
lastBlockHash: d04d855dd918535ff2cb279ad67e8f3caad34925cde598c4c8c70ebbbdddf642
lastBlockHeight: "127419"
lastBlockSum: 0
lastBlockTime: "2020-01-24T13:54:31+00:00"
lastBlockTx: 0
lastReceivedBlockTime: "2020-01-24T13:54:31+00:00"
peerAvailableCnt: "62707"
peerQuarantinedCnt: "59330"
peerUnreachableCnt: "1972"
state: Running
txRecvCnt: 1330
uptime: 62484
version: jormungandr 0.8.7-364cd84
blockRecvCnt: 3154
lastBlockContentSize: 315
lastBlockDate: "41.33684"
lastBlockFees: 800000
lastBlockHash: 52853b9da1d28c531614d84304e1b64fd16e578d81a409ead8cba1a014c7b168
lastBlockHeight: "127425"
lastBlockSum: 1003912904971
lastBlockTime: "2020-01-24T13:56:25+00:00"
lastBlockTx: 1
lastReceivedBlockTime: "2020-01-24T13:56:25+00:00"
peerAvailableCnt: "74174"
peerQuarantinedCnt: "51835"
peerUnreachableCnt: "1955"
state: Running
txRecvCnt: 1384
uptime: 62604
version: jormungandr 0.8.7-364cd84
What I have observed is that if the server is running close to 100% CPU and Memory utilisation, this hang is more likely to happen (which I've seen on my other busy host)
I've just bootstrapped from yet another machine and I never got to see the node stats after the bootsrap status was gone.
jcli rest v0 leaders get returns without delay jcli rest v0 node stats get hangs for 20-30s then times out
After restarting jormungandr, the CPU utilisation seems to be normal and there is no issue with the node stats endpoint.

After a few minutes, node stats call times out again and I can see one of the CPU cores are always on 100% utilisation
found hitting the REST API twice per second causes it to crash
As I was using graphana to monitor my pool, I noticed that everything went back to normal (no timed-out error) after stopping it. So maybe the problem really comes from hitting the REST API too much as @rickymac68 said
Unlikely @ofakem-banock I'm not hitting the rest API periodically at all, and I still observe the issue.
Same here, 1 of the 6 CPU's is always at 100%.
With or without grafana/prometheus active issue seems to be same.
no special things found in the logs only this:

Thats about all i can tell.
There is a whole lot of synchronous locking over gratuitously shared state for the REST server.
The REST handlers for stats requests also lock the topology object; while this has been converted to use async-friendly locking, data extraction operations performed across the entire peer map lock out network threads.
This needs to be refactored, otherwise we'll keep running into these issues.
I think this issue applies to more than just node stats get. It just so happens that people use that one the most. Probably applies to all the jcli commands that call the REST interface.
I get it only on the Node Stats and Network Stats get. Probably has to do with the new network info that has been introduced in 0.8.7 ?
These log messages from 0.8.8 may be related:
Jan 30 09:59:34.557 INFO connection closed, peer_addr: 136.34.11.185:64518, local_addr: 0.0.0.0:3000, task: network
Jan 30 09:59:34.557 DEBG a client peer connection has been closed, active_connections: 255, local_addr: 0.0.0.0:3000, task: network
Jan 30 09:59:34.671 INFO connection closed, peer_addr: 68.75.78.164:51674, local_addr: 0.0.0.0:3000, task: network
Jan 30 09:59:34.671 DEBG a client peer connection has been closed, active_connections: 254, local_addr: 0.0.0.0:3000, task: network
Jan 30 09:59:34.691 DEBG subscription stream closed, direction: out, stream: block_events, node_id: 05a5641c2c26b07c7a7ef325987499e3150f88e6772c126c, sub_task: server, task
: network
Jan 30 09:59:34.691 DEBG subscription stream closed, direction: in, stream: block_events, node_id: 05a5641c2c26b07c7a7ef325987499e3150f88e6772c126c, sub_task: server, task:
network
Jan 30 09:59:34.691 DEBG subscription stream closed, direction: out, stream: fragments, node_id: 05a5641c2c26b07c7a7ef325987499e3150f88e6772c126c, sub_task: server, task: n
etwork
Jan 30 09:59:34.691 DEBG subscription stream closed, direction: in, stream: fragments, node_id: 05a5641c2c26b07c7a7ef325987499e3150f88e6772c126c, sub_task: server, task: ne
twork
Jan 30 09:59:34.691 DEBG subscription stream closed, direction: out, stream: gossip, node_id: 05a5641c2c26b07c7a7ef325987499e3150f88e6772c126c, sub_task: server, task: netw
ork
Jan 30 09:59:34.691 DEBG subscription stream closed, direction: in, stream: gossip, node_id: 05a5641c2c26b07c7a7ef325987499e3150f88e6772c126c, sub_task: server, task: netwo
rk
Jan 30 09:59:34.924 DEBG subscription stream closed, direction: in, stream: fragments, node_id: 6169eba2b337d52728053a27f36e0467024ac851dd62a3d9, sub_task: server, task: ne
twork
Jan 30 09:59:34.924 DEBG inbound subscription stream terminated by the peer, direction: in, stream: fragments, node_id: 6169eba2b337d52728053a27f36e0467024ac851dd62a3d9, su
b_task: server, task: network
Jan 30 09:59:34.924 DEBG subscription stream closed, direction: in, stream: gossip, node_id: 6169eba2b337d52728053a27f36e0467024ac851dd62a3d9, sub_task: server, task: netwo
rk
Jan 30 09:59:34.924 DEBG inbound subscription stream terminated by the peer, direction: in, stream: gossip, node_id: 6169eba2b337d52728053a27f36e0467024ac851dd62a3d9, sub_t
ask: server, task: network
Jan 30 09:59:34.926 INFO connection closed, peer_addr: 77.68.3.39:42534, local_addr: 0.0.0.0:3000, task: network
Jan 30 09:59:34.926 DEBG a client peer connection has been closed, active_connections: 253, local_addr: 0.0.0.0:3000, task: network
Jan 30 09:59:34.945 INFO connection closed, peer_addr: 126.2.173.74:52005, local_addr: 0.0.0.0:3000, task: network
Jan 30 09:59:34.945 DEBG a client peer connection has been closed, active_connections: 252, local_addr: 0.0.0.0:3000, task: network
Jan 30 09:59:37.049 ERRO cannot process network blocks, reason: Error(Elapsed), task: block
Jan 30 09:59:37.379 INFO received block announcement from network, from_node_id: 7ddf203c86a012e8863ef19d96aabba23d2445c492d86267, date: 47.24780, parent: f0fc5a51e44daf395ea6fe62b1185c266f0ee4151335da95c71248f4809451bd, hash: 68a0706353c6eb7492c4be8807a75070d590b4a5cd5dd6e4e902c4500b81180f, task: block
Jan 30 09:59:37.379 DEBG block is missing a locally stored parent, from_node_id: 7ddf203c86a012e8863ef19d96aabba23d2445c492d86267, date: 47.24780, parent: f0fc5a51e44daf395ea6fe62b1185c266f0ee4151335da95c71248f4809451bd, hash: 68a0706353c6eb7492c4be8807a75070d590b4a5cd5dd6e4e902c4500b81180f, task: block
@mark-stopka, you still have this issue?
For me on 8.8 it is as fast as lightning.
also, i just reverted back to 8.7 to test for this ticket and... its also super fast, even after 2000sec uptime -> problem is gone.
what i have found out though is following: recently i switched to a new provider. with new provider it went good untill suddenly they did a firmware update of the modem.
after that i had to limit max connections to max 150 to make it work more or less.
no good support from provider, a lot of calls to support team,...
long story short -> if found issue myself with some handy google work, modem had issue with NAT loopback
replaced modem with new one WITH NAT loopback support and presto.
super happy for the moment
maybe you should check the NAT loopback as well?
-> would mean that jormungandr is using the external IP as well to perform some actions
Jeroen
@jeroenvhove,
yes the issue is still present and will be until #1645 is resolved as the root-cause has been identified.
root cause = NAT loopback?
you can find lists online for what modem supports it and which one does'nt.
solution in #1645 seems more like a workaround and not the real root cause.
Anyway, for me it seems fixed. I hope you find a solution/fix for you as well.
I suspect this is a related to changing the p2p topology lock from RW to a standard Mutex lock. A revert was proposed in #1653 as there was no suitable asynchornous locking mechanism available without upgrading everything to tokio 0.2 at the time (unless just going back to an Arc ref lock is okay?).
It makes sense that its a deadlock to me given that peer stats were introduced around the same time and added to this specific rest handler. At the time, this would result in the handler thread requesting a read lock. Now it will request the same write lock as other threads querying or updating the p2p topology state. Will try to confirm later tonight.
Hi guys, good news I just installed v0.8.9 and the issue is resolved. In v0.8.8 I still had the problem. Also bootstrapping was really fast.
Sorry, looks like I was too fast. The bug is still there, the node stopped responding to JCLI request after 15 min. The number of incoming connections stayed the same 365, number of outgoing connections 0
In v8.8 I experienced it as well and noticed when it started having issues correlated with a large amount of connections being made. Went from 400 connections to 700 within the time it started spiking CPU and not responding. #1645 will probably fix it :crossed_fingers:
It is indeed only the jcli rest v0 node stats which times out. The other jcli commands are still working. The stats commands responses are becoming slower and slower until it times out after around 15min. At the same moment the number of outgoing connection also decrease to almost 0, 1 or max 3 connections. So according to me the stats commands also blocks the opening of new outgoing connections. Incoming connections remain the same, around 325 connections. Hopefully this helps the developers. Succes.
just curious, do you guys see
update current branch tip OR
INFO switching branch from
loggings after rest stats call stopped responding?
from the OP's screen shot, the cause of this issue looks similar to #1607,
at least 3 of my nodes has been experiencing it.
basically this will happen when the node receives many incoming block notifications on the same slot, and if that block is deemed missing local parent.. EACH incoming block will start it's own connection to find its parent.
And that's the start of a snowball effect from what I can tell, because most likely the next incoming block will start the same process because it'd take some time before the previous block is applied to the storage and it's usually too late.
To add a bit of maybe related info. We're seeing this happen mainly on nodes that differ only in having the max unreachable nodes set to 0. After increasing it to 100 the nodes seem much more stable now.
I consistently have the problem on nodes with max_unreachable_nodes_to_connect_per_event set to 16, 64, or 128.
On Friday, January 31, 2020, 4:54:01 AM PST, Michael Fellinger <[email protected]> wrote:
To add a bit of maybe related info. We're seeing this happen mainly on nodes that differ only in having the max unreachable nodes set to 0. After increasing it to 100 the nodes seem much more stable now.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub, or unsubscribe.
Just a heads-up; issue appears to be still present in 0.8.10.
The issue is still present in 0.8.11, rolling back to 0.8.6
I think this issue is reproducible only when running multiple Jormungandr instances on a same host with NAT without proper isolation between individual instances. When I set-up multiple instances but block traffic between them I am not getting this issue even though my HEALTHCHECK is querying the API every 1s.
I really think you should lookup NAT loopback like i mentioned before (twice). Your latest description seems to confirm my suspicion.
To check my theory I would suggest you add a static route on your server for your external IP pointing to a non-existing IP within your local subnet. (make sure its within your local subnet)
I'm not a specialist, just somebody who likes to help and uses google from time to time.
don't shoot at me if it does not work :)
@jeroenvhove, yes we are aware of RFC5128 that describes how applications should deal with the issue. hairpinning is only a workaround that many enterprise-grade NAT deployments don't support. The issue is with the software, as the implications are quite obvious, see #1839, #1072 and #839 where we discussed issues connected with dynamic IPs, user-configured NAT and Carrier Grade NAT quite extensively.
I am not running multiple nodes on a single server. I am not using NAT, I have a static IP.As mentioned in #1812 this still happens for me on v0.8.11 although much less often now.
On Tuesday, February 25, 2020, 9:18:37 AM PST, Mark Stopka <[email protected]> wrote:
@jeroenvhove, yes we are aware of RFC5128 that describes how applications should deal with the issue. hairpinning is only a workaround that many enterprise-grade NAT deployments don't support. The issue is with the software, as the implications are quite obvious, see #1839, #1072 and #839 where we discussed issues connected with dynamic IPs, user-configured NAT and Carrier Grade NAT quite extensively.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub, or unsubscribe.
Hi guys, good news I just installed v0.8.13 and the issue seems to be resolved for me. The node stays stable. I noticed that the number of incoming connections is much lower (~11) compared to v0.8.6 (~100). I am running a 1:1 NAT setup on VPS setup.
Altough the REST node stats keeps running, the node cannot keep in sync with the tip. My number of outgoing connections is still very low ~10 compared to the incoming ones ~100. I had to roll back to v.0.8.6, there the nodes stays in sync for most of the time, ~1 forced restart due to stalled block height. The number of outgoing connections is now ~100. It seems that for some reason the node running v0.8.13 is blocked somewhere in opening new outgoing connections? Anybody an idea what could be the reason for this?
@cardaspians
v0.8.13 seems to build up connections slower than previous versions, but supposedly the connections are better quality..
But you would need to give it a chance to build up the connections though.
So I would suggest run v0.8.13 as a passive node along with your v0.8.9 node.. For v0.8.13, set public ip/id(you can use the same IP but different port than your V0.8.9 node), max connection to 1028 or above, 15s gossip; and just let it run for couple days( of course you should restart it if the tip is off too much, sounds like you already have a restart script, just use it with a higher threshold like 20 mins).
Basically just try to get your ip/port out there and hopefully the new node will eventually connect to a ring that has lots of good peers..
Couple things you should monitor, for both your new 8.12 and 8.9 nodes:
you should monitor the INFO log for the number of incoming block announcements per each slot.. that's in additional to monitoring of the tip because not only you want your node getting the latest block, preferable there are more than a few of peers sending you the same block.
Good luck!
@fish2plain thanks for the useful info. Am I correct that when the v0.8.13 is running stable in passive mode I can promote the node to leader via jcli rest v0 leaders post -f node_secret.yaml -h and demote my other node to passive? I am using a stuck_node_mon.sh restart script with a block timeout of 10min. I will increase it to 20min. Any advice what I should use to monitor the incoming block anouncement per slot?
on multiple leaders setup I know couple popular choices
But the basic flow is as follows:
immediately demote nodes that you don’t want as leader after bootstrapped using jcli rest v0 leaders delete $slot (get the $slot using jcli rest v0 leaders get)
just ON or BEFORE the epoch switch time: 19:13 UTC, use jcli rest v0 leaders post -f to promote all those demoted nodes in step 2 as leader nodes. So they get the leader logs , and they all can be readily promote to leader for the new epoch if needed. After epoch switch is over, I think it will take less than 1 min.. demote all but one node.
couple things I usually look for in the INFO or DEBUG log..
$13, it’s the date slot in epoch.slot format that I’m interested in), grep "date:” jormungandr.log | grep "INFO” | awk '{print $13}’ | sort -g | uniq -c | sort -k 2
and if you have DEBUG logging, I like to see the following as well. “applying” and “applied” timestamp will tell you if there is latency in disk write. “missing” is to monitor for issue #1607, which still happens with new version but less often.
egrep “update current|switch|applied|applying|missing” jormungandr.log
Good luck and please let us know if you found better alternatives.
On Mar 3, 2020, at 11:52 AM, cardaspians notifications@github.com wrote:
@fish2plain https://github.com/fish2plain thanks for the useful info. Am I correct that when the v0.8.13 is running stable in passive mode I can promote the node to leader via jcli rest v0 leaders post -f node_secret.yaml -h and demote my other node to passive? I am using a stuck_node_mon.sh restart script with a block timeout of 10min. I will increase it to 20min. Any advice what I should use to monitor the incoming block anouncement per slot?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub https://github.com/input-output-hk/jormungandr/issues/1626?email_source=notifications&email_token=AGDR3MOJLDAFQSAVJ5GYOWLRFVNWDA5CNFSM4KKSPPSKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOENU45II#issuecomment-594136737, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGDR3MN3YJKSYJJJBRPBATDRFVNWDANCNFSM4KKSPPSA.
Most helpful comment
There is a whole lot of synchronous locking over gratuitously shared state for the REST server.
The REST handlers for stats requests also lock the topology object; while this has been converted to use async-friendly locking, data extraction operations performed across the entire peer map lock out network threads.
This needs to be refactored, otherwise we'll keep running into these issues.