Tendermint: Regression with chain catchup behavior introduced in commit 1000378

Created on 13 Jul 2020 · 22Comments · Source: tendermint/tendermint

We are running Cosmos gaiad 2.0.11 which uses v0.32.11 of the tendermint sdk. This version of the tendermint sdk introduced a bugfix via https://github.com/tendermint/tendermint/pull/4709/commits/100037877080bf73c951531732363451d51b5236 which is causing some of our nodes to fall behind in chain state.

We use the remote signing feature of a nodes to connect several nodes to a single backing validator (a scheme that has worked excellently for us since cosmoshub launch and we made available via our open-source validator here: https://gitlab.com/polychainlabs/tendermint-validator). With this approach the remote validator maintains a high water mark and will reject signatures requests for heights below the watermark. Prior to the change in https://github.com/tendermint/tendermint/pull/4709/commits/100037877080bf73c951531732363451d51b5236 our nodes would all keep up with the blockchain equally.

After the change - if a node happens to fall behind by 1 block (due to having lower block propagation time or network connectivity - it will continue to fall behind and fall out of sync with the chain and never catch up. This happens because a "rejected" signing request is handled by returning an error via the remote signing protocol.

Given how fast new blocks are produced on the network - we see retrying as rarely useful and certainly not useful with such a long timeout which causes nodes to fall behind.

My suggestion is to revert the commit and allow the next block to attempt a new signature and move forward.

bug validator

Source

defunctzombie

All 22 comments

cc @melekes

defunctzombie on 13 Jul 2020

We (Interchain Gmbh) discussed this yesterday.

Our conclusion was to keep the retry functionality. Otherwise, if remote signer is down,
a) proposer might miss a chance to create a block and receive a reward
b) validator might miss a chance to precommit for a block and get slashed for not signing

However, the current implementation is not flexible enough and led to the above bug due to big timeouts and number of retries.

The proposal is to a) use exponential backoff b) use context.Context and terminate it once the proposing/voting window is over. This will require us to modify the PrivValidator interface

type PrivValidator interface {
    GetPubKey(ctx context.Context) (crypto.PubKey, error)

    SignVote(ctx context.Context, chainID string, vote *tmproto.Vote) error
    SignProposal(ctx context.Context, chainID string, proposal *tmproto.Proposal) error
}

Example:

when validator enters propose stage https://github.com/tendermint/tendermint/blob/909163afa87f93a865e48fa81259a06d6437a043/consensus/state.go#L977
instead of straight GetPubKey() call we call it with timeout propose cs.privValidator.GetPubKey(context.WithTimeout(context.Background(), timeoutPropose))

melekes on 21 Jul 2020

👍1

@defunctzombie does ^ work for you? also, we can differentiate the errors and only retry connection errors. when signer refuses to sign, we don't retry at all. what error do you return from the server (code, description)?

melekes on 21 Jul 2020

@defunctzombie can you try https://github.com/tendermint/tendermint/pull/5140 and see if it fixes your problem?

melekes on 21 Jul 2020

What's the best way to give #5140 a try? Change the go.mod file in gaia? Will the version of tendermint that change is based on work with the latest stable gaia?

I'd have a stronger preference for a config option to just disable the retry functionality altogether (ideally default no retry and allow folks who feel strongly that they need to tune retry to turn it on). I prefer simpler connection flow logic over pre-tuned retries especially on a fast moving chain.

Some thoughts on the earlier comments:

a) proposer might miss a chance to create a block and receive a reward

To me that seems like it is on the validator to ensure their connections are reliable to avoid missing the reward. The reward is also not that outsized that missing it as a proposer is significantly detrimental. Maybe the answer here is to have 1 immediate retry - if you failed that too then I still believe the right move is to move on.

b) validator might miss a chance to precommit for a block and get slashed for not signing

Missing one precommit won't get you slashed. If you are disconnected and come back online - the node would catch up and you would start signing current blocks which would prevent you from being slashed. You would need to be disconnected for ~16 hrs before liveliness slashing hits - if you are down that long - this retry logic would actually do more to negatively impact your ability to come back online and be at the head of the chain.

If you want to help folks avoid liveliness slashing - that should be done at a higher design/protocol level rather than per single block.

defunctzombie on 21 Jul 2020

What's the best way to give #5140 a try? Change the go.mod file in gaia? Will the version of tendermint that change is based on work with the latest stable gaia?

I can probably create a backport release v0.32.12 w/ the patch. How does this sound?

melekes on 22 Jul 2020

A backport would be 👍 since the bugfix that was applied to the v0.32 family is what caused the specific regression - so getting v0.32 back to a similar state for chain syncing would help operators with remote signers that are connected to several front nodes.

defunctzombie on 22 Jul 2020

Looks like it might not be possible. On v0.32, GetPubKey does not return an error => it's not possible to differentiate between HSM error and connection error. Not sure what to do.

melekes on 24 Jul 2020

What about a config option to disable?

I still don't think the retry logic as introduced in the bugfix is appropriate for how the Cosmos chain operates so being able to disable it would allow our nodes to stay updated with the chain.

defunctzombie on 25 Jul 2020

https://github.com/tendermint/tendermint/pull/5160

melekes on 28 Jul 2020

What's the best way to give #5140 a try? Change the go.mod file in gaia?

Yes, change Tendermint version to v0.32.13-rc1

github.com/tendermint/tendermint v0.32.13-rc1

melekes on 28 Jul 2020

@defunctzombie did u have a chance to try ^?

melekes on 31 Jul 2020

Hi @defunctzombie. Any chance you could try the fix out today? We're preparing to cut 0.33.7 as well, and would like to include this fix if we can get it confirmed.

erikgrinaker on 3 Aug 2020

I’ll give it a try today. Apologies for the delay in responding.

On Mon, Aug 3, 2020 at 3:02 AM Erik Grinaker notifications@github.com
wrote:

Hi @defunctzombie https://github.com/defunctzombie. Any chance you
could try the fix out today? We're preparing to cut 0.33.7 as well, and
would like to include this fix if we can get it confirmed.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/tendermint/tendermint/issues/5112#issuecomment-667933570,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AAAUWOCYFOHMFCHCBAZKRH3R62DMNANCNFSM4OYBFOPA
.

defunctzombie on 3 Aug 2020

👍1

I've deployed a patched gaiad - I'll report back in 12 hrs and see if the nodes are all keeping up with the chain as I would expect - it takes some time for them to fall out of sync since after a restart they all catch up to the chain head first before continuing other actions.

defunctzombie on 4 Aug 2020

So far it looks like the nodes are keeping up with each other as I would expect.

defunctzombie on 4 Aug 2020

Thanks @defunctzombie, we've released 0.32.13 with this fix. Let us know if you still experience any issues.

erikgrinaker on 5 Aug 2020

Awesome! Do you know if gaia will be updated with this new release or is it something I should request?

defunctzombie on 5 Aug 2020

I'll be doing releases for sdk and gaia tomorrow.