Tendermint version (use tendermint version or git rev-parse --verify HEAD if installed from source): 0.26.4
ABCI app (name for built-in, URL for self-written if it's publicly available): Loom SDK
Environment:
What happened:
Our 4-node staging Tendermint cluster in a single geographic region gets stuck trying to finish a round, sometimes for 50 minutes straight; it accepts txs into the mempool but never completes a block. Any ideas how to debug this further, or how to let the cluster abandon a block when it can't come to consensus?
See the attached logs from all 4 nodes. At the end of the 50 minutes it finally commits, and the 5000 txs stuck in the mempool clear.
What you expected to happen:
Would expect the round to close within 5-10 seconds, even if agreement can't be reached with one of the nodes.
Have you tried the latest version: no, the version is only a few days old
How to reproduce it (as minimally and precisely as possible): no consistent way; it seems related to having dozens of users connected with a few transactions a second
Logs (paste a small part showing an error (< 10 lines) or link a pastebin, gist, etc. containing more of the log file):
Archive.zip
Attached logs from 4 nodes
Config (you can paste only the changes you've made):
replay = false
broadcast = true
create_empty_blocks = false
node command runtime flags:
/dump_consensus_state output for consensus bugs
This is from an hour after the event; the cluster came back to life with no outside intervention, so I'm not sure this is super relevant.
Anything else we need to know:
A few hundred connected clients to the cluster but still low transaction volume, maybe a few transactions a second
Thanks for the report. Do you have the logs for when it started making blocks again?
Have you seen this happen more than once, and is the timing (~50min) roughly the same?
We recently fixed an issue related to replay = false & create_empty_blocks = false, but in that case sending more txs caused blocks to get made again (#2961), whereas here there are lots of txs coming in and no progression in the consensus.
It looks like all nodes complete block 910590, and then halt on their entrance to 910591:
Dec 17 09:19:16 plasma-1-aws0 loom[23379]: I[17126-12-17|09:19:16.497] Finalizing commit of block with 84 txs module=consensus height=910590 hash=9756BA1DE69636060071C9540B6FEC9B9BE46E6803A27399FCB45377C7D7DCD4 root=07AB2A360F583CD53232A11F0D6BF0330723026DCC24ABD31C8AB8722B0D5582
Dec 17 09:19:16 plasma-1-aws0 loom[23379]: Height: 910590
Dec 17 09:19:20 plasma-1-aws0 loom[23379]: I[17126-12-17|09:19:20.364] Executed block module=state height=910590 validTxs=10 invalidTxs=74
Dec 17 09:19:20 plasma-1-aws0 loom[23379]: I[17126-12-17|09:19:20.468] Committed state module=state height=910590 txs=84 appHash=7DA1CE9D1BCE05A4B4A5E85811B23426A1EDE419EA62226ACA304A159FD0CD1E
Dec 17 09:19:20 plasma-1-aws0 loom[23379]: I[17126-12-17|09:19:20.482] Indexed block module=txindex height=910590
Dec 17 09:19:20 plasma-1-aws0 loom[23379]: I[17126-12-17|09:19:20.483] enterPropose: Not our turn to propose module=consensus height=910591 round=0 proposer=C74E5470D137C4152B48521472955127DEB0E414 privValidator="PrivValidator{F825EEB10F9CF76012FB155751B3A0437FC31442 LH:910590, LR:13, LS:3}"
Dec 17 09:19:22 plasma-1-aws0 loom[23379]: I[17126-12-17|09:19:22.174] Added good transaction module=mempool tx=447BB65DFC48EC79FF96E4D0A2D4DEA7E4EDFD9969F1859A69814D585D2AEC39 res=&{CheckTx:} height=910590 total=896 counter=31688
We'll have to keep digging.
Dec 17 09:19:22 plasma-1-aws0 loom[23379]: I[17126-12-17|09:19:22.174] Added good transaction module=mempool tx=447BB65DFC48EC79FF96E4D0A2D4DEA7E4EDFD9969F1859A69814D585D2AEC39 res=&{CheckTx:} height=910590 total=896 counter=31688
What is counter? Did you add that?
Also, it looks like there are log lines missing. Did you by chance just grep for 910590? Can we get complete logs for a couple of blocks around when this happened?
Big log file; it has the full log set, with a thousand lines before and after the failing block 910590.
Let me know if there is anything else that would help, metrics etc.!
Also, randomly: I believe the way we have the logger set up is creating the strange log line you mentioned
Dec 17 08:54:17 plasma-1-0 loom[23379]: I[17126-12-17|08:54:17.650] Added good transaction module=mempool tx=EC81012C31B4A6A8B446189B17CEF98D83AC01E38BD2FC25AE93F0CC2D29E228 res=&{CheckTx:} height=910581 total=3 counter=26642
Seems like the counter was there in the 0.26.3 release; you're right, it's release 0.26.3
loom_4node_0.26.3.tar.gz
mem.logger.Info("Added good transaction",
"tx", TxID(tx),
"res", r,
"height", memTx.height,
"total", mem.Size(),
"counter", memTx.counter,
)
Some notes:
It's possible we're seeing the consensus get overwhelmed by mempool activity since the same txs are getting re-broadcast over and over.
There is at least https://github.com/tendermint/tendermint/pull/3036 to fix, and https://github.com/tendermint/tendermint/pull/2778 will reduce how often we send txs to peers that already have them. In any case, your application should prevent the same tx from being valid multiple times - the mempool does not guarantee replay protection, it only offers a best effort. I suspect that if the app was rejecting duplicate transactions, it would take some of the pressure off.
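As an illustration, here is a minimal sketch of app-side replay protection against rebroadcasts, assuming the 0.26-era ABCI Go interface; DedupApp and its fields are hypothetical names, not the Loom SDK's actual code:

package app

import (
	"crypto/sha256"

	abci "github.com/tendermint/tendermint/abci/types"
)

// DedupApp remembers the hash of every tx it has accepted into the mempool
// and rejects it the next time it shows up, so rebroadcasts stop being "valid".
type DedupApp struct {
	abci.BaseApplication
	seen map[[sha256.Size]byte]bool
}

func NewDedupApp() *DedupApp {
	return &DedupApp{seen: make(map[[sha256.Size]byte]bool)}
}

func (a *DedupApp) CheckTx(tx []byte) abci.ResponseCheckTx {
	h := sha256.Sum256(tx)
	if a.seen[h] {
		return abci.ResponseCheckTx{Code: 1, Log: "duplicate tx"}
	}
	a.seen[h] = true
	return abci.ResponseCheckTx{Code: abci.CodeTypeOK}
}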
Ok, assuming this is due to some kind of DoS from the mempool overwhelming the consensus, there are a few things you could try immediately:
Change Priority: 5 to Priority: 1 in mempool/reactor.go (this will decrease the priority of mempool messages compared to consensus votes, as they are currently the same). We will merge #2778 soon as well, after a bit more testing.
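For context, that priority is set on the mempool's p2p channel descriptor in mempool/reactor.go; roughly (paraphrased from the 0.26.x layout, so treat it as a sketch rather than an exact diff):

// mempool/reactor.go (approximate): lowering the mempool channel's priority
// deprioritizes mempool gossip relative to the consensus reactor's channels.
func (memR *MempoolReactor) GetChannels() []*p2p.ChannelDescriptor {
	return []*p2p.ChannelDescriptor{
		{
			ID:       MempoolChannel,
			Priority: 1, // was 5
		},
	}
}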
Any possibility the block had 5k txs because it couldn't ever reach consensus, so it kept pulling more in from the mempool each time it tried?
5k is the default mempool max. Due to a combination of issues (#3036 and #2778), adding txs to the mempool when it already has a full cache can cause the same txs to be rebroadcast over and over again, ultimately filling up the mempool. So I suspect that while we were slowly trying to make block 910590, we were filling up the mempool. Once 910590 was committed, we grabbed all the txs from the mempool to propose block 910591 with 5k txs.
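For reference, that cap is configurable; a sketch of the relevant [mempool] section of config.toml, assuming this release exposes the same size and cache_size keys as later versions (values here are illustrative):

[mempool]
size = 5000        # maximum number of txs the mempool will hold (the 5k cap above)
cache_size = 10000 # recently-seen tx cache; once it rolls over, old txs can be re-accepted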
So you also mentioned the app didn't have replay protection; it does have nonce checks in CheckTx and DeliverTx. Is there anything else that is needed for replay protection?
Do the nonce checks work across multiple CheckTxs in a row? I.e. if you CheckTx({nonce:1}) and then CheckTx({nonce:1}) again, will the second fail?
Oh, we wouldn't stop multiple txs at the same height in CheckTx; I thought only DeliverTx mutated state to increment the nonce. Ok, we will fix that.
Yeah, this is a subtle point but important for a truly healthy mempool. Otherwise, if a user sends the same tx to a node twice, it could get included twice in a block. Sounds like the second one will be considered invalid in the block, but it's better for it not to get there at all, by implementing a more stateful CheckTx.
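To make that concrete, here is a rough sketch of a nonce-aware CheckTx that keeps mempool-side state; App, Tx, decodeTx and committedNonce are hypothetical illustrations, not the Loom SDK's actual types (abci is "github.com/tendermint/tendermint/abci/types", as in the sketch above):

type Tx struct {
	Sender string
	Nonce  uint64
}

type App struct {
	abci.BaseApplication
	pendingNonce map[string]uint64 // next expected nonce per sender, tracked across CheckTx calls
}

func (a *App) CheckTx(txBytes []byte) abci.ResponseCheckTx {
	tx, err := decodeTx(txBytes) // hypothetical decoder for the app's tx format
	if err != nil {
		return abci.ResponseCheckTx{Code: 1, Log: "cannot decode tx"}
	}
	expected, ok := a.pendingNonce[tx.Sender]
	if !ok {
		expected = a.committedNonce(tx.Sender) + 1 // fall back to committed state
	}
	if tx.Nonce != expected {
		return abci.ResponseCheckTx{Code: 1, Log: "bad nonce"}
	}
	a.pendingNonce[tx.Sender] = expected + 1 // committed state still only advances in DeliverTx
	return abci.ResponseCheckTx{Code: abci.CodeTypeOK}
}

With this, CheckTx({nonce:1}) followed by another CheckTx({nonce:1}) from the same sender fails the second time, even though DeliverTx hasn't run yet.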
This, together with
implement replay protection in the app to prevent the same tx from being valid more than once
Does that mean it is actually enough to guard against duplicates in CheckTx at the current height? In this case, we don't want to guard against clients sending duplicates but against txs that are rebroadcast because the cache rolled over.
Right, we can't prevent clients from sending duplicates, but apps should prevent the same tx from being valid in the mempool more than once
Overall things are much better. However, we found a new situation where a block took 6 minutes to converge; at least we're moving in the right direction. We have implemented the following:
1) Our CheckTx function is now stateful and follows your guidelines above
2) We forked master Tendermint 0.27.3: https://github.com/tendermint/tendermint/compare/master...mattkanwisher:update-timeout#diff-f0521764a882805b106091a2215d58c2R214
3) We added #3036 (move to front), but did not add #2778 as it does not seem to be stable
4) We increased the deliverTxTimeout back to 2 min instead of 10 seconds; I'll open a separate ticket, but this change from a few weeks ago seems to make things worse.
5) We made some tweaks to the config.toml:
flush_throttle_timeout = "10ms"
max_packet_msg_payload_size = 10240
send_rate = 20000000
recv_rate = 20000000
skip_timeout_commit = true
If you look at the attached logs, block 938211 takes 6 minutes to come to consensus. We are still seeing instances where it takes a long time to converge on a block.
GOOD THINGS
Mempool is behaving way better
Chains seem to be much more stable under load
Let me know if you have any other suggestions; I think there is still some underlying issue in block convergence, as we have found other situations, e.g. 1-node test machines have similar problems. I can open separate tickets for them.
Seems one node (node2.log) loses consensus, and then it tries to vote by itself and no one accepts it. Eventually the three other nodes join back in and create the next block 6 min later, when another transaction comes in.
I'm wondering if you have to run with empty blocks enabled with some reasonable interval to prevent these type of problems?
Thanks for this helpful update. Good to hear things are working better with the patch.
we added #3036 (move to front), but did not add #2778 as it does not seem to be stable
2778 doesn't have the proper LRU fix from #3036, so it would be expected to exhibit similar issues. Once it's fixed, it should provide even more mempool stability by reducing how much txs are gossiped.
we increased the deliverTxTimeout back to 2 min instead of 10 seconds; I'll open a separate ticket, but this change from a few weeks ago seems to make things worse.
You're using broadcast_tx_commit? We seem to be having some trouble there, as it seems it can overload the internal pub sub system - see https://github.com/tendermint/tendermint/issues/2826
we made some tweaks to the config.toml
How did you motivate/justify the numerical changes here and what differences are you seeing?
as we have found other situations, like 1 node test machines have similar problems. I can open separate tickets for them.
Please do!
I'm wondering if you have to run with empty blocks enabled with some reasonable interval to prevent these type of problems?
Possibly. The honest truth is that, unfortunately, we have not run or tested this setting enough as it is not currently used by the Cosmos testnets, which have been our main focus and testing grounds. Of course we do aim to stabilize this feature but simply haven't had the time. Your patience and help here is supremely appreciated :)
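If you do try it, the relevant knobs live in the [consensus] section of config.toml; a sketch, noting that the interval has been an integer number of seconds in some releases and a duration string in others, so check your version:

[consensus]
create_empty_blocks = true
create_empty_blocks_interval = "30s" # propose an empty block if no txs arrive for this long (illustrative value)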
Seems one node (node2.log) loses consensus, and then it tries to vote by itself, no one accepts it. Eventually the three other nodes join back in and create the next block 6 min later when another transaction comes in.
Not sure I understand the issue here. From a brief look at the logs, all nodes enter height 938211 around 11:05, but they don't receive a new tx until around 11:13, at which point they quickly make the next block.
Is the problem that there were actually txs in the mempool already and they should have made the block right away?
What suggests that node2 fell out of consensus?
Dec 19 11:05:14 plasma-1-2 loom[18834]: I[19126-12-19|11:05:14.582] Timed out module=consensus dur=893.128955ms height=938211 round=0 step=RoundStepNewHeight
Dec 19 11:05:14 plasma-1-2 loom[18834]: I[19126-12-19|11:05:14.582] enterNewRound(938211/0). Current: 938211/0/RoundStepNewHeight module=consensus height=938211 round=0
So Node2 falls out of consensus on block 938210, so for 6 minutes it has bad state but eventually catches up. I think Node2 actually tries to propose 938211 but all the other nodes ignore it.
I'm going to enable empty blocks; there is no sense in being too divergent from your configs. A bit nervous about switching a running chain over to empty blocks, though: we tried it on a testnet and had app hash issues switching. Maybe it needs a hard fork.
We use broadcast_tx_commit dozens of times a second under normal load, as we have mobile devices submitting transactions to the chain. For now we have hacked the timeout; I'll work on a better configuration pull request, the WriteTimeout and deliverTxTimeout have to be configurable.
https://github.com/mattkanwisher/tendermint/commit/9a8cb94fad835aeb4b8507f8ab34d59cb75f3926#diff-f0521764a882805b106091a2215d58c2R214
So Node2 falls out of consensus on block 938210, so for 6 minutes it has bad state but eventually catches up. I think Node2 actually tries to propose 938211 but all the other nodes ignore it.
Sorry I'm still not seeing it. Here's node2's enterNewRounds:
Dec 19 11:04:55 plasma-1-2 loom[18834]: I[19126-12-19|11:04:55.199] enterNewRound(938209/0). Current: 938209/0/RoundStepNewHeight module=consensus height=938209 round=0
Dec 19 11:05:13 plasma-1-2 loom[18834]: I[19126-12-19|11:05:13.536] enterNewRound(938210/0). Current: 938210/0/RoundStepNewHeight module=consensus height=938210 round=0
Dec 19 11:05:14 plasma-1-2 loom[18834]: I[19126-12-19|11:05:14.582] enterNewRound(938211/0). Current: 938211/0/RoundStepNewHeight module=consensus height=938211 round=0
Dec 19 11:13:43 plasma-1-2 loom[18834]: I[19126-12-19|11:13:43.906] enterNewRound(938212/0). Current: 938212/0/RoundStepNewHeight module=consensus height=938212 round=0
Dec 19 11:13:45 plasma-1-2 loom[18834]: I[19126-12-19|11:13:45.935] enterNewRound(938213/0). Current: 938213/0/RoundStepNewHeight module=consensus height=938213 round=0
And node0 for comparison:
Dec 19 11:04:55 plasma-1-0 loom[25973]: I[19126-12-19|11:04:55.198] enterNewRound(938209/0). Current: 938209/0/RoundStepNewHeight module=consensus height=938209 round=0
Dec 19 11:05:11 plasma-1-0 loom[25973]: I[19126-12-19|11:05:11.615] enterNewRound(938210/0). Current: 938210/0/RoundStepNewHeight module=consensus height=938210 round=0
Dec 19 11:05:13 plasma-1-0 loom[25973]: I[19126-12-19|11:05:13.699] enterNewRound(938211/0). Current: 938211/0/RoundStepNewHeight module=consensus height=938211 round=0
Dec 19 11:13:44 plasma-1-0 loom[25973]: I[19126-12-19|11:13:44.916] enterNewRound(938212/0). Current: 938212/0/RoundStepNewHeight module=consensus height=938212 round=0
Dec 19 11:13:45 plasma-1-0 loom[25973]: I[19126-12-19|11:13:45.934] enterNewRound(938213/0). Current: 938213/0/RoundStepNewHeight module=consensus height=938213 round=0
The timestamps match up quite closely. The reason for the ~8 min gap between 938211 and 938212 seems to be that we don't receive another good tx for about ~8 min. From node0:
Dec 19 11:04:51 plasma-1-0 loom[25973]: I[19126-12-19|11:04:51.501] Added good transaction module=mempool tx=6AA01FEAF75EB544B7F17F320965BDF1EAC1456BFD87886FFB3E510D6BB1B219 res=&{CheckTx:} height=938206 total=1
Dec 19 11:05:09 plasma-1-0 loom[25973]: I[19126-12-19|11:05:09.547] Added good transaction module=mempool tx=6B8C11615F0FD9AE921312BA8946410CE7C8CE9574563733AE6E5D9C40EA4DE7 res=&{CheckTx:} height=938208 total=1
Dec 19 11:13:42 plasma-1-0 loom[25973]: I[19126-12-19|11:13:42.595] Added good transaction module=mempool tx=3600DA1F8B7B018BFE721DA56ACA56F71B07C52837C790BE086A0A35A25F91D9 res=&{CheckTx:} height=938210 total=1
Dec 19 11:14:05 plasma-1-0 loom[25973]: I[19126-12-19|11:14:05.017] Added good transaction module=mempool tx=043CB72FC8DFA9E8BB1E515C5150131327186AF48E61328FCD3D683FDC957A2E res=&{CheckTx:} height=938212 total=1
What am I missing?
@mattkanwisher is there more to do here?
Sorry, just saw this. With a combination of 0.27.4 and tweaks to config.toml, I feel confident this is resolved.
Thanks a bunch for all the help!