We are running Cosmos gaiad 2.0.11 which uses v0.32.11 of the tendermint sdk. This version of the tendermint sdk introduced a bugfix via https://github.com/tendermint/tendermint/pull/4709/commits/100037877080bf73c951531732363451d51b5236 which is causing some of our nodes to fall behind in chain state.
We use the remote signing feature of a nodes to connect several nodes to a single backing validator (a scheme that has worked excellently for us since cosmoshub launch and we made available via our open-source validator here: https://gitlab.com/polychainlabs/tendermint-validator). With this approach the remote validator maintains a high water mark and will reject signatures requests for heights below the watermark. Prior to the change in https://github.com/tendermint/tendermint/pull/4709/commits/100037877080bf73c951531732363451d51b5236 our nodes would all keep up with the blockchain equally.
After the change - if a node happens to fall behind by 1 block (due to having lower block propagation time or network connectivity - it will continue to fall behind and fall out of sync with the chain and never catch up. This happens because a "rejected" signing request is handled by returning an error via the remote signing protocol.
Given how fast new blocks are produced on the network - we see retrying as rarely useful and certainly not useful with such a long timeout which causes nodes to fall behind.
My suggestion is to revert the commit and allow the next block to attempt a new signature and move forward.
cc @melekes
We (Interchain Gmbh) discussed this yesterday.
Our conclusion was to keep the retry functionality. Otherwise, if remote signer is down,
a) proposer might miss a chance to create a block and receive a reward
b) validator might miss a chance to precommit for a block and get slashed for not signing
However, the current implementation is not flexible enough and led to the above bug due to big timeouts and number of retries.
The proposal is to a) use exponential backoff b) use context.Context and terminate it once the proposing/voting window is over. This will require us to modify the PrivValidator interface
type PrivValidator interface {
GetPubKey(ctx context.Context) (crypto.PubKey, error)
SignVote(ctx context.Context, chainID string, vote *tmproto.Vote) error
SignProposal(ctx context.Context, chainID string, proposal *tmproto.Proposal) error
}
Example:
when validator enters propose stage https://github.com/tendermint/tendermint/blob/909163afa87f93a865e48fa81259a06d6437a043/consensus/state.go#L977
instead of straight GetPubKey() call we call it with timeout propose cs.privValidator.GetPubKey(context.WithTimeout(context.Background(), timeoutPropose))
@defunctzombie does ^ work for you? also, we can differentiate the errors and only retry connection errors. when signer refuses to sign, we don't retry at all. what error do you return from the server (code, description)?
@defunctzombie can you try https://github.com/tendermint/tendermint/pull/5140 and see if it fixes your problem?
What's the best way to give #5140 a try? Change the go.mod file in gaia? Will the version of tendermint that change is based on work with the latest stable gaia?
I'd have a stronger preference for a config option to just disable the retry functionality altogether (ideally default no retry and allow folks who feel strongly that they need to tune retry to turn it on). I prefer simpler connection flow logic over pre-tuned retries especially on a fast moving chain.
Some thoughts on the earlier comments:
a) proposer might miss a chance to create a block and receive a reward
To me that seems like it is on the validator to ensure their connections are reliable to avoid missing the reward. The reward is also not that outsized that missing it as a proposer is significantly detrimental. Maybe the answer here is to have 1 immediate retry - if you failed that too then I still believe the right move is to move on.
b) validator might miss a chance to precommit for a block and get slashed for not signing
Missing one precommit won't get you slashed. If you are disconnected and come back online - the node would catch up and you would start signing current blocks which would prevent you from being slashed. You would need to be disconnected for ~16 hrs before liveliness slashing hits - if you are down that long - this retry logic would actually do more to negatively impact your ability to come back online and be at the head of the chain.
If you want to help folks avoid liveliness slashing - that should be done at a higher design/protocol level rather than per single block.
What's the best way to give #5140 a try? Change the go.mod file in gaia? Will the version of tendermint that change is based on work with the latest stable gaia?
I can probably create a backport release v0.32.12 w/ the patch. How does this sound?
A backport would be 👍 since the bugfix that was applied to the v0.32 family is what caused the specific regression - so getting v0.32 back to a similar state for chain syncing would help operators with remote signers that are connected to several front nodes.
Looks like it might not be possible. On v0.32, GetPubKey does not return an error => it's not possible to differentiate between HSM error and connection error. Not sure what to do.
What about a config option to disable?
I still don't think the retry logic as introduced in the bugfix is appropriate for how the Cosmos chain operates so being able to disable it would allow our nodes to stay updated with the chain.
What's the best way to give #5140 a try? Change the go.mod file in gaia?
Yes, change Tendermint version to v0.32.13-rc1
github.com/tendermint/tendermint v0.32.13-rc1
@defunctzombie did u have a chance to try ^?
Hi @defunctzombie. Any chance you could try the fix out today? We're preparing to cut 0.33.7 as well, and would like to include this fix if we can get it confirmed.
I’ll give it a try today. Apologies for the delay in responding.
On Mon, Aug 3, 2020 at 3:02 AM Erik Grinaker notifications@github.com
wrote:
Hi @defunctzombie https://github.com/defunctzombie. Any chance you
could try the fix out today? We're preparing to cut 0.33.7 as well, and
would like to include this fix if we can get it confirmed.—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/tendermint/tendermint/issues/5112#issuecomment-667933570,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AAAUWOCYFOHMFCHCBAZKRH3R62DMNANCNFSM4OYBFOPA
.
I've deployed a patched gaiad - I'll report back in 12 hrs and see if the nodes are all keeping up with the chain as I would expect - it takes some time for them to fall out of sync since after a restart they all catch up to the chain head first before continuing other actions.
So far it looks like the nodes are keeping up with each other as I would expect.
Thanks @defunctzombie, we've released 0.32.13 with this fix. Let us know if you still experience any issues.
Awesome! Do you know if gaia will be updated with this new release or is it something I should request?
I'll be doing releases for sdk and gaia tomorrow.
@marbar3778 any updates on gaia release?
Finally got the sdk approval yesterday and last night got gaia. Will do the release this morning, sorry for the delay