Tendermint version (use tendermint version or git rev-parse --verify HEAD if installed from source): 0.26.4
ABCI app (name for built-in, URL for self-written if it's publicly available): Loom SDK
Environment:
What happened:
Our 4-node staging Tendermint cluster in a single geographic region gets stuck trying to finish a round, sometimes for 50 minutes straight; it accepts txs into the mempool but never completes a block. Any ideas how to debug this further, or how to let the cluster abandon a block when it can't come to consensus?
See the attached logs from all 4 nodes. At the end of the 50 minutes it finally commits, and the 5000 txs stuck in the mempool clear.
What you expected to happen:
Would expect the round to close within 5-10 seconds, even if agreement can't be reached with one of the nodes.
Have you tried the latest version: no, the version is only a few days old
How to reproduce it (as minimally and precisely as possible): no consistent way; it seems related to having dozens of users connected with a few transactions a second
Logs (paste a small part showing an error (< 10 lines) or link a pastebin, gist, etc. containing more of the log file):
Archive.zip
Attached logs from 4 nodes
Config (you can paste only the changes you've made):
replay = false
broadcast = true
create_empty_blocks = false
node command runtime flags:
/dump_consensus_state output for consensus bugs
This is from an hour after the event; the cluster came back to life with no outside intervention, so I'm not sure this is super relevant.
Anything else we need to know:
A few hundred connected clients to the cluster but still low transaction volume, maybe a few transactions a second
Thanks for the report. Do you have the logs for when it started making blocks again?
Have you seen this happen more than once, and is the timing (~50min) roughly the same?
We recently fixed an issue related to replay = false & create_empty_blocks = false, but in that case sending more txs caused blocks to get made again (#2961), whereas here there are lots of txs coming in and no progression in the consensus.
It looks like all nodes complete block 910590, and then halt on their entrance to 910591:
Dec 17 09:19:16 plasma-1-aws0 loom[23379]: I[17126-12-17|09:19:16.497] Finalizing commit of block with 84 txs module=consensus height=910590 hash=9756BA1DE69636060071C9540B6FEC9B9BE46E6803A27399FCB45377C7D7DCD4 root=07AB2A360F583CD53232A11F0D6BF0330723026DCC24ABD31C8AB8722B0D5582
Dec 17 09:19:16 plasma-1-aws0 loom[23379]: Height: 910590
Dec 17 09:19:20 plasma-1-aws0 loom[23379]: I[17126-12-17|09:19:20.364] Executed block module=state height=910590 validTxs=10 invalidTxs=74
Dec 17 09:19:20 plasma-1-aws0 loom[23379]: I[17126-12-17|09:19:20.468] Committed state module=state height=910590 txs=84 appHash=7DA1CE9D1BCE05A4B4A5E85811B23426A1EDE419EA62226ACA304A159FD0CD1E
Dec 17 09:19:20 plasma-1-aws0 loom[23379]: I[17126-12-17|09:19:20.482] Indexed block module=txindex height=910590
Dec 17 09:19:20 plasma-1-aws0 loom[23379]: I[17126-12-17|09:19:20.483] enterPropose: Not our turn to propose module=consensus height=910591 round=0 proposer=C74E5470D137C4152B48521472955127DEB0E414 privValidator="PrivValidator{F825EEB10F9CF76012FB155751B3A0437FC31442 LH:910590, LR:13, LS:3}"
Dec 17 09:19:22 plasma-1-aws0 loom[23379]: I[17126-12-17|09:19:22.174] Added good transaction module=mempool tx=447BB65DFC48EC79FF96E4D0A2D4DEA7E4EDFD9969F1859A69814D585D2AEC39 res=&{CheckTx:} height=910590 total=896 counter=31688
We'll have to keep digging.
Dec 17 09:19:22 plasma-1-aws0 loom[23379]: I[17126-12-17|09:19:22.174] Added good transaction module=mempool tx=447BB65DFC48EC79FF96E4D0A2D4DEA7E4EDFD9969F1859A69814D585D2AEC39 res=&{CheckTx:} height=910590 total=896 counter=31688
What is counter? Did you add that?
Also, it looks like there are log lines missing. Did you by chance just grep for 910590? Can we get complete logs for a couple of blocks around when this happened?
Big log file; it has the full log set, with a thousand lines before and after the failing block 910590.
Let me know if there is anything else that would help, metrics etc.!
Also, randomly: I believe the way we have the logger set up is creating the strange log line you mentioned
Dec 17 08:54:17 plasma-1-0 loom[23379]: I[17126-12-17|08:54:17.650] Added good transaction module=mempool tx=EC81012C31B4A6A8B446189B17CEF98D83AC01E38BD2FC25AE93F0CC2D29E228 res=&{CheckTx:} height=910581 total=3 counter=26642
Seems like the counter was there in the 0.26.3 release; you're right, it's release 0.26.3
loom_4node_0.26.3.tar.gz
mem.logger.Info("Added good transaction",
"tx", TxID(tx),
"res", r,
"height", memTx.height,
"total", mem.Size(),
"counter", memTx.counter,
)
Some notes:
It's possible we're seeing the consensus get overwhelmed by mempool activity since the same txs are getting re-broadcast over and over.
There is at least https://github.com/tendermint/tendermint/pull/3036 to fix, and https://github.com/tendermint/tendermint/pull/2778 will reduce how often we send txs to peers that already have them. In any case, your application should prevent the same tx from being valid multiple times - the mempool does not guarantee replay protection, it only offers a best effort. I suspect that if the app was rejecting duplicate transactions, it would take some of the pressure off.
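As an illustration, here is a minimal sketch of app-side replay protection against rebroadcasts, assuming the 0.26-era ABCI Go interface; DedupApp and its fields are hypothetical names, not the Loom SDK's actual code:

package app

import (
	"crypto/sha256"

	abci "github.com/tendermint/tendermint/abci/types"
)

// DedupApp remembers the hash of every tx it has accepted into the mempool
// and rejects it the next time it shows up, so rebroadcasts stop being "valid".
type DedupApp struct {
	abci.BaseApplication
	seen map[[sha256.Size]byte]bool
}

func NewDedupApp() *DedupApp {
	return &DedupApp{seen: make(map[[sha256.Size]byte]bool)}
}

func (a *DedupApp) CheckTx(tx []byte) abci.ResponseCheckTx {
	h := sha256.Sum256(tx)
	if a.seen[h] {
		return abci.ResponseCheckTx{Code: 1, Log: "duplicate tx"}
	}
	a.seen[h] = true
	return abci.ResponseCheckTx{Code: abci.CodeTypeOK}
}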
Ok, assuming this is due to some kind of DoS from the mempool overwhelming the consensus, there are a few things you could try immediately:
Change Priority: 5 to Priority: 1 in mempool/reactor.go (this will decrease the priority of mempool messages compared to consensus votes, as they are currently the same). We will merge #2778 soon as well, after a bit more testing.
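For context, that priority is set on the mempool's p2p channel descriptor in mempool/reactor.go; roughly (paraphrased from the 0.26.x layout, so treat it as a sketch rather than an exact diff):

// mempool/reactor.go (approximate): lowering the mempool channel's priority
// deprioritizes mempool gossip relative to the consensus reactor's channels.
func (memR *MempoolReactor) GetChannels() []*p2p.ChannelDescriptor {
	return []*p2p.ChannelDescriptor{
		{
			ID:       MempoolChannel,
			Priority: 1, // was 5
		},
	}
}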
Any possibility the block had 5k txs because it couldn't ever reach consensus, so it kept pulling more in from the mempool each time it tried?
5k is the default mempool max. Due to a combination of issues (#3036 and #2778), adding txs to the mempool when it already has a full cache can cause the same txs to be rebroadcast over and over again, ultimately filling up the mempool. So I suspect that while we were slowly trying to make block 910590, we were filling up the mempool. Once 910590 was committed, we grabbed all the txs from the mempool to propose block 910591 with 5k txs.
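For reference, that cap is configurable; a sketch of the relevant [mempool] section of config.toml, assuming this release exposes the same size and cache_size keys as later versions (values here are illustrative):

[mempool]
size = 5000        # maximum number of txs the mempool will hold (the 5k cap above)
cache_size = 10000 # recently-seen tx cache; once it rolls over, old txs can be re-accepted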
So you also mentioned the app didn't have replay protection; it does have nonce checks in CheckTx and DeliverTx. Is there anything else that is needed for replay protection?
Do the nonce checks work across multiple CheckTxs in a row? I.e. if you CheckTx({nonce:1}) and then CheckTx({nonce:1}) again, will the second fail?
Oh, we wouldn't stop multiple txs at the same height in CheckTx; I thought only DeliverTx mutated state to increment the nonce. Ok, we will fix that.
Yeah, this is a subtle point but important for a truly healthy mempool. Otherwise, if a user sends the same tx to a node twice, it could get included twice in a block. Sounds like the second one will be considered invalid in the block, but it's better for it not to get there at all, by implementing a more stateful CheckTx.
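To make that concrete, here is a rough sketch of a nonce-aware CheckTx that keeps mempool-side state; App, Tx, decodeTx and committedNonce are hypothetical illustrations, not the Loom SDK's actual types (abci is "github.com/tendermint/tendermint/abci/types", as in the sketch above):

type Tx struct {
	Sender string
	Nonce  uint64
}

type App struct {
	abci.BaseApplication
	pendingNonce map[string]uint64 // next expected nonce per sender, tracked across CheckTx calls
}

func (a *App) CheckTx(txBytes []byte) abci.ResponseCheckTx {
	tx, err := decodeTx(txBytes) // hypothetical decoder for the app's tx format
	if err != nil {
		return abci.ResponseCheckTx{Code: 1, Log: "cannot decode tx"}
	}
	expected, ok := a.pendingNonce[tx.Sender]
	if !ok {
		expected = a.committedNonce(tx.Sender) + 1 // fall back to committed state
	}
	if tx.Nonce != expected {
		return abci.ResponseCheckTx{Code: 1, Log: "bad nonce"}
	}
	a.pendingNonce[tx.Sender] = expected + 1 // committed state still only advances in DeliverTx
	return abci.ResponseCheckTx{Code: abci.CodeTypeOK}
}

With this, CheckTx({nonce:1}) followed by another CheckTx({nonce:1}) from the same sender fails the second time, even though DeliverTx hasn't run yet.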
This, together with
implement replay protection in the app to prevent the same tx from being valid more than once
Does that mean it is actually enough to guard against duplicates in CheckTx at the current height? In this case, we don't want to guard against clients sending duplicates but against txs that are rebroadcast because the cache rolled over.
Right, we can't prevent clients from sending duplicates, but apps should prevent the same tx from being valid in the mempool more than once
Overall things are much better. However, we found a new situation where a block took 6 minutes to converge; at least we're moving in the right direction. We have implemented the following:
1) Our CheckTx function is now stateful and follows your guidelines above
2) We forked master Tendermint 0.27.3: https://github.com/tendermint/tendermint/compare/master...mattkanwisher:update-timeout#diff-f0521764a882805b106091a2215d58c2R214
3) We added #3036 (move to front), but did not add #2778 as it does not seem to be stable
4) We increased the deliverTxTimeout back to 2 min instead of 10 seconds; I'll open a separate ticket, but this change from a few weeks ago seems to make things worse.
5) We made some tweaks to the config.toml:
flush_throttle_timeout = "10ms"
max_packet_msg_payload_size = 10240
send_rate = 20000000
recv_rate = 20000000
skip_timeout_commit = true
If you look at the attached logs, block 938211 takes 6 minutes to come to consensus. We are still seeing instances where it takes a long time to converge on a block.
GOOD THINGS
Mempool is behaving way better
Chains seem to be much more stable under load
Let me know if you have any other suggestions; I think there is still some underlying issue in block convergence, as we have found other situations, e.g. 1-node test machines have similar problems. I can open separate tickets for them.
Seems one node (node2.log) loses consensus, and then it tries to vote by itself and no one accepts it. Eventually the three other nodes join back in and create the next block 6 min later, when another transaction comes in.
I'm wondering if you have to run with empty blocks enabled with some reasonable interval to prevent these type of problems?
Thanks for this helpful update. Good to hear things are working better with the patch.
we added #3036 (move to front), but did not add #2778 as it does not seem to be stable
2778 doesn't have the proper LRU fix from #3036, so it would be expected to exhibit similar issues. Once it's fixed, it should provide even more mempool stability by reducing how much txs are gossiped.
we increased the deliverTxTimeout back to 2 min instead of 10 seconds; I'll open a separate ticket, but this change from a few weeks ago seems to make things worse.
You're using broadcast_tx_commit? We seem to be having some trouble there, as it seems it can overload the internal pub sub system - see https://github.com/tendermint/tendermint/issues/2826
we made some tweaks to the config.toml
How did you motivate/justify the numerical changes here and what differences are you seeing?
as we have found other situations, like 1 node test machines have similar problems. I can open separate tickets for them.
Please do!
I'm wondering if you have to run with empty blocks enabled with some reasonable interval to prevent these type of problems?
Possibly. The honest truth is that, unfortunately, we have not run or tested this setting enough as it is not currently used by the Cosmos testnets, which have been our main focus and testing grounds. Of course we do aim to stabilize this feature but simply haven't had the time. Your patience and help here is supremely appreciated :)
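If you do try it, the relevant knobs live in the [consensus] section of config.toml; a sketch, noting that the interval has been an integer number of seconds in some releases and a duration string in others, so check your version:

[consensus]
create_empty_blocks = true
create_empty_blocks_interval = "30s" # propose an empty block if no txs arrive for this long (illustrative value)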
Seems one node (node2.log) loses consensus, and then it tries to vote by itself, no one accepts it. Eventually the three other nodes join back in and create the next block 6 min later when another transaction comes in.
Not sure I understand the issue here. From a brief look at the logs, all nodes enter height 938211 around 11:05, but they don't receive a new tx until around 11:13, at which point they quickly make the next block.
Is the problem that there were actually txs in the mempool already and they should have made the block right away?
What suggests that node2 fell out of consensus?
Dec 19 11:05:14 plasma-1-2 loom[18834]: I[19126-12-19|11:05:14.582] Timed out module=consensus dur=893.128955ms height=938211 round=0 step=RoundStepNewHeight
Dec 19 11:05:14 plasma-1-2 loom[18834]: I[19126-12-19|11:05:14.582] enterNewRound(938211/0). Current: 938211/0/RoundStepNewHeight module=consensus height=938211 round=0
So Node2 falls out of consensus on block 938210, so for 6 minutes it has bad state but eventually catches up. I think Node2 actually tries to propose 938211 but all the other nodes ignore it.
I'm going to enable empty blocks; there is no sense in being too divergent from your configs. A bit nervous about switching a running chain over to empty blocks, though: we tried it on a testnet and had app hash issues switching. Maybe it needs a hard fork.
We use broadcast_tx_commit dozens of times a second under normal load, as we have mobile devices submitting transactions to the chain. For now we have hacked the timeout; I'll work on a better configuration pull request, the WriteTimeout and deliverTxTimeout have to be configurable.
https://github.com/mattkanwisher/tendermint/commit/9a8cb94fad835aeb4b8507f8ab34d59cb75f3926#diff-f0521764a882805b106091a2215d58c2R214
So Node2 falls out of consensus on block 938210, so for 6 minutes it has bad state but eventually catches up. I think Node2 actually tries to propose 938211 but all the other nodes ignore it.
Sorry I'm still not seeing it. Here's node2's enterNewRounds:
Dec 19 11:04:55 plasma-1-2 loom[18834]: I[19126-12-19|11:04:55.199] enterNewRound(938209/0). Current: 938209/0/RoundStepNewHeight module=consensus height=938209 round=0
Dec 19 11:05:13 plasma-1-2 loom[18834]: I[19126-12-19|11:05:13.536] enterNewRound(938210/0). Current: 938210/0/RoundStepNewHeight module=consensus height=938210 round=0
Dec 19 11:05:14 plasma-1-2 loom[18834]: I[19126-12-19|11:05:14.582] enterNewRound(938211/0). Current: 938211/0/RoundStepNewHeight module=consensus height=938211 round=0
Dec 19 11:13:43 plasma-1-2 loom[18834]: I[19126-12-19|11:13:43.906] enterNewRound(938212/0). Current: 938212/0/RoundStepNewHeight module=consensus height=938212 round=0
Dec 19 11:13:45 plasma-1-2 loom[18834]: I[19126-12-19|11:13:45.935] enterNewRound(938213/0). Current: 938213/0/RoundStepNewHeight module=consensus height=938213 round=0
And node0 for comparison:
Dec 19 11:04:55 plasma-1-0 loom[25973]: I[19126-12-19|11:04:55.198] enterNewRound(938209/0). Current: 938209/0/RoundStepNewHeight module=consensus height=938209 round=0
Dec 19 11:05:11 plasma-1-0 loom[25973]: I[19126-12-19|11:05:11.615] enterNewRound(938210/0). Current: 938210/0/RoundStepNewHeight module=consensus height=938210 round=0
Dec 19 11:05:13 plasma-1-0 loom[25973]: I[19126-12-19|11:05:13.699] enterNewRound(938211/0). Current: 938211/0/RoundStepNewHeight module=consensus height=938211 round=0
Dec 19 11:13:44 plasma-1-0 loom[25973]: I[19126-12-19|11:13:44.916] enterNewRound(938212/0). Current: 938212/0/RoundStepNewHeight module=consensus height=938212 round=0
Dec 19 11:13:45 plasma-1-0 loom[25973]: I[19126-12-19|11:13:45.934] enterNewRound(938213/0). Current: 938213/0/RoundStepNewHeight module=consensus height=938213 round=0
The timestamps match up quite closely. The reason for the ~8 min gap between 938211 and 938212 seems to be that we don't receive another good tx for about ~8 min. From node0:
Dec 19 11:04:51 plasma-1-0 loom[25973]: I[19126-12-19|11:04:51.501] Added good transaction module=mempool tx=6AA01FEAF75EB544B7F17F320965BDF1EAC1456BFD87886FFB3E510D6BB1B219 res=&{CheckTx:} height=938206 total=1
Dec 19 11:05:09 plasma-1-0 loom[25973]: I[19126-12-19|11:05:09.547] Added good transaction module=mempool tx=6B8C11615F0FD9AE921312BA8946410CE7C8CE9574563733AE6E5D9C40EA4DE7 res=&{CheckTx:} height=938208 total=1
Dec 19 11:13:42 plasma-1-0 loom[25973]: I[19126-12-19|11:13:42.595] Added good transaction module=mempool tx=3600DA1F8B7B018BFE721DA56ACA56F71B07C52837C790BE086A0A35A25F91D9 res=&{CheckTx:} height=938210 total=1
Dec 19 11:14:05 plasma-1-0 loom[25973]: I[19126-12-19|11:14:05.017] Added good transaction module=mempool tx=043CB72FC8DFA9E8BB1E515C5150131327186AF48E61328FCD3D683FDC957A2E res=&{CheckTx:} height=938212 total=1
What am I missing?
@mattkanwisher is there more to do here?
Sorry, just saw this. With a combination of 0.27.4 and tweaks to config.toml, I feel confident this is resolved.
Thanks a bunch for all the help!