Cosmos-sdk: Failed to slash a validator for double sign at the first block

Created on 21 Sep 2018  Â·  19Comments  Â·  Source: cosmos/cosmos-sdk

Summary of Bug

When someone double signed at the first block, the slashing module will not handle it correctly.

Steps to Reproduce

When generating the first block, someone may double sign and trigger the slashing module.
log:

Confirmed double sign from FDC8E91E2F361F4BFCCEFA80C1ABAD138E5175AE at height 1, age of 0 less than max age of 120 module=x/slashing 
E[09-21|13:27:34.653] CONSENSUS FAILURE!!!                         module=consensus err="impossible attempt to slash future infraction at height 1 but we are at height 0" stack="goroutine 1193 [running]:\nruntime/debug.Stack(0xc424c4cea0, 0xd5e1c0, 0xc426777690)\n\t/usr/lib/go-

For Admin Use

  • [ ] Not duplicate issue
  • [ ] Appropriate labels applied
  • [ ] Appropriate contributors tagged
  • [ ] Contributor assigned/self-assigned
bug slashing

Most helpful comment

@cwgoes we did some analysis and figured out double-sign evidence was included in the first block, then got broadcasted. However, there's mismatch to treat double-sign evidence. in tendermint the infractionHeight is 1 but when slashing module process the evidence, ctx.BlockHeight() is still 0.
We tried to solve the issue with this PR: https://github.com/irisnet/cosmos-sdk/pull/109
Hope to get more input from you.

All 19 comments

CC @cwgoes

That's unusual, can you replicate this in a testcase or describe more about the order in which you ran the daemons and double-signed? I don't understand why you would get evidence for a double-sign at height 1 before block 0 was committed.

@cwgoes we did some analysis and figured out double-sign evidence was included in the first block, then got broadcasted. However, there's mismatch to treat double-sign evidence. in tendermint the infractionHeight is 1 but when slashing module process the evidence, ctx.BlockHeight() is still 0.
We tried to solve the issue with this PR: https://github.com/irisnet/cosmos-sdk/pull/109
Hope to get more input from you.

@cwgoes
I think it is possible to include evidence at the same height block. For instance, if the network is poor and the network come to consensus of a block in multiple rounds,then then double sign evidence maybe happen to be included by a block proposal. Then in applying the block,the infractionHeight will equal to ctx.BlockHeight()+1.
In this situation, cosmos will throw panic error.

If a new validator is slashed at the first height after it becomes the real validator, cosmos may panic. Because currently the validator has no sign info which is necessary for figuring out jail period. I think this may be a bug too.

If a new validator is slashed at the first height after it becomes the real validator, cosmos may panic.

This definitely seems possible.

I'm still a bit lost on how evidence reaches the block height +1 except maybe the case of the first block?

@alexanderbez
Suppose the blockchain height is A. The block store height is A. ctx.blockHeight() in delieverTx is A. Now the consensus height is A+1. In the consensus process, due to poor network, the network may come to consensus of new block in multiple rounds. Suppose the new validator has done double sign in round N, but in this round, the consensus is failed. Then in next round, the double sign evidence may be included by a block proposal. Thus in applying block, evidence height euqals to ctx.blockHeight() + 1 .
How do you think about the above assumption?

Thus in applying block, evidence height equals to ctx.blockHeight() + 1 .

Yes, but wont the application also be onto the next block?

@alexanderbez
This is low probability thing. Only when the evidence is broadcasted to the block proposal and then it is included in the proposal block, then this will happed. In most case, the evidence will be included in next block.

Suppose the blockchain height is A. The block store height is A. ctx.blockHeight() in delieverTx is A. Now the consensus height is A+1. In the consensus process, due to poor network, the network may come to consensus of new block in multiple rounds. Suppose the new validator has done double sign in round N, but in this round, the consensus is failed. Then in next round, the double sign evidence may be included by a block proposal. Thus in applying block, evidence height euqals to ctx.blockHeight() + 1 .

We don't care about the consensus round, just the consensus height. If we're attempting to come to consensus at some height n, and I double-sign in round 1 but we don't come to consensus until round 3, the evidence passed to the ABCI app when the block is executed after round 3 should still have height n. Have you found otherwise?

If a new validator is slashed at the first height after it becomes the real validator, cosmos may panic. Because currently the validator has no sign info which is necessary for figuring out jail period. I think this may be a bug too.

This is why evidence handling is run after validator signature handling in the slashing tick function. Do you mean as a result of slashing from elsewhere (e.g. governance), or do you have a testcase which replicate this (which seems like a separate issue to evidence at a future height)?

We tried to solve the issue with this PR: irisnet#109

Although this will probably fix the symptom, I don't think this is the right approach. It's illogical to handle double-signature evidence for a block at a height we haven't started processing yet - if Tendermint is indeed sending such evidence, that's a bug in Tendermint.

Do you know how I might replicate either issue, either by running commands locally or with a test case? That's the next step to debug the root cause.

  1. Clone my code: https://github.com/HaoyangLiu/cosmos-sdk.git
  2. Checkout branch develop and build gaiad
make get_vendor_deps build-linux

I have synced my develop with the latest cosmos-sdk develop branch

  1. Checkout branch develop-byzantine and build gaiad-byzantine. gaiad-byzantine will always vote nil. Here is my change
  2. Create five full nodes with normal gaiad, mark them as node0, node1, node2, node3 and node4
  3. Create two byzantine nodes with gaiad-byzantine, make them as node-byz-0 and node-byz-1.
  4. node1 and node-byz-0 share one validator address. node2 and node-byz-1 share another one validator address. So there will be two groups conflict voting.
  5. I have upload all the node home directory which include genesis file and executable binary: iris.zip
  6. Start these node in turns: node1 , node-byz-0, node2, node-byz-1
  7. Wait half a minute, then start node0
  8. Wait half a minute, then start node4.
  9. Then after the first block is applied, a panic error will arise:
E[10-10|16:26:26.205] CONSENSUS FAILURE!!!                         module=consensus err="Expected signing info for validator cosmosvalcons15ad46aj0u39n6grh7gjyk494d380jen6xp89w4 but not found" stack="goroutine 709 [running]:\nruntime/debug.Stack(0xc42302eae0, 0xd6c6e0, 0xc42300d600)
    /usr/local/go/src/runtime/debug/stack.go:24 +0xa7\ngithub.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).receiveRoutine.func2(0xc4200f0900, 0x105f450)
    /home/lhy/go/src/github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus/state.go:585 +0x57\npanic(0xd6c6e0, 0xc42300d600)
    /usr/local/go/src/runtime/panic.go:502 +0x229\ngithub.com/cosmos/cosmos-sdk/x/slashing.Keeper.handleDoubleSign(0x10dc4e0, 0xc420b55720, 0xc420a46cb0, 0x10eb700, 0xc4200dd1d0, 0xc420a46cb0, 0x10dc4e0, 0xc420b55750, 0xa, 0x10e4060, ...)
    /home/lhy/go/src/github.com/cosmos/cosmos-sdk/x/slashing/keeper.go:84 +0xe75\ngithub.com/cosmos/cosmos-sdk/x/slashing.BeginBlocker(0x10e4060, 0xc422fbcde0, 0xc420413040, 0xc, 0xc42301b800, 0x14, 0x20, 0xc422e5e710, 0x5, 0x1, ...)
    /home/lhy/go/src/github.com/cosmos/cosmos-sdk/x/slashing/tick.go:33 +0x6ac\ngithub.com/cosmos/cosmos-sdk/cmd/gaia/app.(*GaiaApp).BeginBlocker(0xc4200c8000, 0x10e4060, 0xc422fbcde0, 0xc420413040, 0xc, 0xc42301b800, 0x14, 0x20, 0xc422e5e710, 0x5, ...)
    /home/lhy/go/src/github.com/cosmos/cosmos-sdk/cmd/gaia/app/app.go:140 +0xea\ngithub.com/cosmos/cosmos-sdk/cmd/gaia/app.(*GaiaApp).BeginBlocker-fm(0x10e4060, 0xc422fbcde0, 0xc420413040, 0xc, 0xc42301b800, 0x14, 0x20, 0xc422e5e710, 0x5, 0x1, ...)
    /home/lhy/go/src/github.com/cosmos/cosmos-sdk/cmd/gaia/app/app.go:111 +0xba\ngithub.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).BeginBlock(0xc420f7f680, 0xc42301b800, 0x14, 0x20, 0xc422e5e710, 0x5, 0x1, 0x3aca512d, 0xed34faeaa, 0x0, ...)
    /home/lhy/go/src/github.com/cosmos/cosmos-sdk/baseapp/baseapp.go:433 +0x23a\ngithub.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/abci/client.(*localClient).BeginBlockSync(0xc420b670e0, 0xc42301b800, 0x14, 0x20, 0xc422e5e710, 0x5, 0x1, 0x3aca512d, 0xed34faeaa, 0x0, ...)
    /home/lhy/go/src/github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/abci/client/local_client.go:206 +0x9e\ngithub.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/proxy.(*appConnConsensus).BeginBlockSync(0xc420480b80, 0xc42301b800, 0x14, 0x20, 0xc422e5e710, 0x5, 0x1, 0x3aca512d, 0xed34faeaa, 0x0, ...)
    /home/lhy/go/src/github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/proxy/app_conn.go:69 +0x6b\ngithub.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/state.execBlockOnProxyApp(0x10e4ce0, 0xc420b2b260, 0x10e9d80, 0xc420480b80, 0xc4228ec8c0, 0xc422fbc870, 0x10ee220, 0xc42000e918, 0x0, 0x0, ...)
    /home/lhy/go/src/github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/state/execution.go:209 +0x421\ngithub.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/state.(*BlockExecutor).ApplyBlock(0xc420b29860, 0xc420b388a0, 0x5, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /home/lhy/go/src/github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/state/execution.go:77 +0x111\ngithub.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).finalizeCommit(0xc4200f0900, 0x1)
    /home/lhy/go/src/github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus/state.go:1334 +0xa7f\ngithub.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).tryFinalizeCommit(0xc4200f0900, 0x1)
    /home/lhy/go/src/github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus/state.go:1265 +0x462\ngithub.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).enterCommit.func1(0xc4200f0900, 0x5, 0x1)
    /home/lhy/go/src/github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus/state.go:1213 +0x98\ngithub.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).enterCommit(0xc4200f0900, 0x1, 0x5)
    /home/lhy/go/src/github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus/state.go:1242 +0x72c\ngithub.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).addVote(0xc4200f0900, 0xc422f8caa0, 0xc420a18360, 0x28, 0x1060690, 0xf5, 0x7fc4f3379000)
    /home/lhy/go/src/github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus/state.go:1649 +0xb7f\ngithub.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).tryAddVote(0xc4200f0900, 0xc422f8caa0, 0xc420a18360, 0x28, 0xc4208dc960, 0xc422f98200, 0xf5)
    /home/lhy/go/src/github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus/state.go:1507 +0x59\ngithub.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).handleMsg(0xc4200f0900, 0xdab4a0, 0xc422a4d888, 0xc420a18360, 0x28)
    /home/lhy/go/src/github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus/state.go:659 +0x667\ngithub.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).receiveRoutine(0xc4200f0900, 0x0)
    /home/lhy/go/src/github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus/state.go:616 +0x6a2\ncreated by github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).OnStart
    /home/lhy/go/src/github.com/cosmos/cosmos-sdk/vendor/github.com/tendermint/tendermint/consensus/state.go:310 +0x140\n"
  1. To reproduce the issue, the key point is to make sure the first proposer has received the evidence. Here node0 is the first proposer. We start byzantine nodes first, then when the proposer is started, it will include the received evidence into the first block.

@cwgoes
Currently, tendermint does include evidence in blocks with the same height. If you think my fix is not the right approach, maybe we should change tendermint to ensure evidence will only be included in later blocks.

Thanks @HaoyangLiu - I'm trying to replicate this, working on a shell script at https://gist.github.com/cwgoes/3ab18585a72966bb9e68e8a9428a28b8 to make it easier.

So far I can replicate the Byzantine duplicate-voting but not the crash - the VoteInfos are processed before the evidence, as expected, and blocks continue onwards.

@cwgoes
Maybe it is much diffcult to reproduce this issue just by your script. Because, to reproduce this issue, we have to control the node starting order, as well as the validator order in genesis.json. It should be the same as the node names. Here the name order is node0, node1, node2, node3, node4. But the genesis.json which is generated by command gaiad init will make the order random.
Here I just created a repository to provide a easier way to reproduce.
https://github.com/HaoyangLiu/cosmos_issue_reproduction
Just clone this repository and run run.sh.

Thanks @HaoyangLiu, I can replicate the issue now. On the first block, it looks like the vote information sent by Tendermint is empty (which I suppose makes sense, since that's for the last commit) - but the SDK currently expects to have seen a validator signature in LastCommit before evidence can be submitted for the validator.

I wonder if this is also a problem in general - does Tendermint send evidence for e.g. double-signatures at the current height which the SDK will not necessarily have seen yet in the LastCommit vote info list?

Ah OK - we should set a SigningInfo for a validator whenever we instruct Tendermint to add that validator to the validator set, including at genesis. That should solve this issue, this was just subtly flawed design on the SDK end.

Was this page helpful?
0 / 5 - 0 ratings