Edited by @AdityaSripal
With the new pruning changes, IAVL only flushes to disk at each snapshot interval, which is defined by the SDK KeepEvery parameter. On restart, the application should replay blocks from the last persisted version (or from an empty state if nothing has been persisted). However, the CommitInfo needs to contain the last persisted commit rather than the latest commit, so that the Tendermint process can restart the application correctly.
A couple of changes need to be integrated into the SDK:
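The gap between the latest commit and the last persisted commit can be sketched as below. This is only an illustration, assuming a flush happens exactly at multiples of KeepEvery. The helper names lastPersistedVersion and firstReplayHeight are hypothetical and not part of the SDK or IAVL.

```go
package main

import "fmt"

// lastPersistedVersion returns the most recent version assumed to be on disk,
// given that a flush happens at every multiple of keepEvery.
// Hypothetical helper; a non-positive keepEvery is treated here as
// "nothing persisted", which is an assumption of this sketch.
func lastPersistedVersion(latest, keepEvery int64) int64 {
	if keepEvery <= 0 {
		return 0
	}
	return latest - latest%keepEvery
}

// firstReplayHeight is the first block that has to be replayed on restart
// when the application reports lastPersistedVersion as its last commit.
func firstReplayHeight(latest, keepEvery int64) int64 {
	return lastPersistedVersion(latest, keepEvery) + 1
}

func main() {
	// Latest commit is 10, but nothing has been flushed yet: replay from 1.
	fmt.Println(lastPersistedVersion(10, 100), firstReplayHeight(10, 100)) // 0 1
	// Latest commit is 250, last flush was at 200: replay from 201.
	fmt.Println(lastPersistedVersion(250, 100), firstReplayHeight(250, 100)) // 200 201
}
```

If the application reported the latest version (10 or 250) instead, Tendermint would only replay the blocks after it, on top of state that is no longer available.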
KeepRecent int64 // how many recent versions should we persist in memory
FlushEvery int64 // how often do we flush to disk
SnapshotEvery int64 // how often do we snapshot a version
Here {KeepRecent, FlushEvery} form the IAVL PruningOptions {KeepRecent, KeepEvery}.
On each commit of a FlushEvery version, the SDK will remove the previous FlushEvery version, unless that previous version is a snapshot version as defined by the SnapshotEvery parameter.
Thanks to @ethanfrey and @zmanian for helping to diagnose the issue and work out the solution.
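The rule above can be sketched as a small decision function. This is a sketch under assumptions, not the SDK implementation: the type pruningOptions and the method versionToPrune are hypothetical names, and "a FlushEvery version" is taken to mean a version that is a multiple of FlushEvery (likewise for SnapshotEvery).

```go
package main

import "fmt"

// pruningOptions mirrors the three parameters listed above.
// Hypothetical type for illustration only.
type pruningOptions struct {
	KeepRecent    int64 // how many recent versions to keep in memory
	FlushEvery    int64 // how often a version is flushed to disk
	SnapshotEvery int64 // how often a flushed version is kept as a snapshot
}

// isFlush reports whether v is flushed to disk (assumed: multiples of FlushEvery).
func (po pruningOptions) isFlush(v int64) bool {
	return po.FlushEvery > 0 && v%po.FlushEvery == 0
}

// isSnapshot reports whether v is a snapshot (assumed: multiples of SnapshotEvery).
func (po pruningOptions) isSnapshot(v int64) bool {
	return po.SnapshotEvery > 0 && v%po.SnapshotEvery == 0
}

// versionToPrune returns the previously flushed version that should be
// removed when v is committed, or false if nothing should be removed.
func (po pruningOptions) versionToPrune(v int64) (int64, bool) {
	if !po.isFlush(v) {
		return 0, false // not a flush version, nothing on disk changes
	}
	prev := v - po.FlushEvery
	if prev <= 0 || po.isSnapshot(prev) {
		return 0, false // no earlier flush, or it must be kept as a snapshot
	}
	return prev, true
}

func main() {
	po := pruningOptions{KeepRecent: 10, FlushEvery: 100, SnapshotEvery: 500}
	for _, v := range []int64{100, 150, 300, 600} {
		if prune, ok := po.versionToPrune(v); ok {
			fmt.Printf("commit %d: remove flushed version %d\n", v, prune)
		} else {
			fmt.Printf("commit %d: nothing to remove\n", v)
		}
	}
}
```

With these example values, committing version 300 removes version 200, while committing version 600 keeps version 500 because it is a snapshot.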
End of edit
I started the migration of cyber to the latest SDK v0.38.0.
After refactoring the application and modules, it built and ran, but I found that after a node restart it crashes with a consensus failure every time. I spent the holidays trying to fix this, thinking it was an application problem, but when I checked a Gaia version bumped to v0.38 I hit the same issue.
Upgraded to 0.38.0 code, single node, start, stop, restart -> failure.
Stack trace when restarting the Gaia node:
I[2020-01-27|14:46:12.284] starting ABCI with Tendermint module=main
panic: stored minter should not have been nil
goroutine 1 [running]:
github.com/cosmos/cosmos-sdk/x/mint/internal/keeper.Keeper.GetMinter(0xc00013c000, 0x52282e0, 0xc000b8caf0, 0xc00013c000, 0x52282e0, 0xc000b8cb30, 0x5228320, 0xc000b8cb70, 0xc000b95a20, 0x4, ...)
github.com/cosmos/cosmos-sdk@v0.38.0/x/mint/internal/keeper/keeper.go:57 +0x18f
github.com/cosmos/cosmos-sdk/x/mint.BeginBlocker(0x5238820, 0xc0000d8008, 0x524c360, 0xc0000a8e80, 0xa, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
github.com/cosmos/cosmos-sdk@v0.38.0/x/mint/abci.go:11 +0x8e
github.com/cosmos/cosmos-sdk/x/mint.AppModule.BeginBlock(...)
github.com/cosmos/cosmos-sdk@v0.38.0/x/mint/module.go:130
github.com/cosmos/cosmos-sdk/types/module.(*Manager).BeginBlock(0xc000139260, 0x5238820, 0xc0000d8008, 0x524c360, 0xc0000a8e80, 0xa, 0x0, 0x0, 0x0, 0x0, ...)
github.com/cosmos/cosmos-sdk@v0.38.0/types/module/module.go:297 +0x1ca
github.com/cosmos/gaia/app.(*GaiaApp).BeginBlocker(...)
/Users/litvintech/Projects/gaia/app/app.go:299
github.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).BeginBlock(0xc000b9fe00, 0xc000dd8680, 0x20, 0x20, 0xa, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
github.com/cosmos/cosmos-sdk@v0.38.0/baseapp/abci.go:136 +0x469
github.com/tendermint/tendermint/abci/client.(*localClient).BeginBlockSync(0xc0000cf620, 0xc000dd8680, 0x20, 0x20, 0xa, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
github.com/tendermint/[email protected]/abci/client/local_client.go:231 +0x101
github.com/tendermint/tendermint/proxy.(*appConnConsensus).BeginBlockSync(0xc000cce940, 0xc000dd8680, 0x20, 0x20, 0xa, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
github.com/tendermint/[email protected]/proxy/app_conn.go:69 +0x6b
github.com/tendermint/tendermint/state.execBlockOnProxyApp(0x52391e0, 0xc000b7cd00, 0x5245c00, 0xc000cce940, 0xc000c121c0, 0x524e260, 0xc000d6c000, 0x6, 0xc000dc04b0, 0xc)
github.com/tendermint/[email protected]/state/execution.go:280 +0x3e1
github.com/tendermint/tendermint/state.(*BlockExecutor).ApplyBlock(0xc000b5a380, 0xa, 0x0, 0xc000dc0496, 0x6, 0xc000dc04b0, 0xc, 0x6, 0xc0000e4d80, 0x20, ...)
github.com/tendermint/[email protected]/state/execution.go:131 +0x17a
github.com/tendermint/tendermint/consensus.(*Handshaker).replayBlock(0xc000d130b0, 0xa, 0x0, 0xc000dc0496, 0x6, 0xc000dc04b0, 0xc, 0x6, 0xc0000e4d80, 0x20, ...)
github.com/tendermint/[email protected]/consensus/replay.go:475 +0x233
github.com/tendermint/tendermint/consensus.(*Handshaker).ReplayBlocks(0xc000ab90b0, 0xa, 0x0, 0xc000dc0496, 0x6, 0xc000dc04b0, 0xc, 0x6, 0xc0000e4d80, 0x20, ...)
github.com/tendermint/[email protected]/consensus/replay.go:394 +0xe03
github.com/tendermint/tendermint/consensus.(*Handshaker).Handshake(0xc000d130b0, 0x524ef60, 0xc000ac6310, 0x80, 0x4d037c0)
github.com/tendermint/[email protected]/consensus/replay.go:269 +0x485
github.com/tendermint/tendermint/node.doHandshake(0x524e260, 0xc000d6c000, 0xa, 0x0, 0xc000dc0496, 0x6, 0xc000dc04b0, 0xc, 0x6, 0xc0000e4d80, ...)
github.com/tendermint/[email protected]/node/node.go:281 +0x19a
github.com/tendermint/tendermint/node.NewNode(0xc000b9f540, 0x5232e60, 0xc000b44000, 0xc000b8d350, 0x5217380, 0xc000ace920, 0xc000b8d4d0, 0x5032578, 0xc000b8d4e0, 0x52391e0, ...)
github.com/tendermint/[email protected]/node/node.go:597 +0x343
github.com/cosmos/cosmos-sdk/server.startInProcess(0xc0000ef360, 0x5032dd8, 0x1d, 0x0, 0x0)
github.com/cosmos/cosmos-sdk@v0.38.0/server/start.go:157 +0x4c1
github.com/cosmos/cosmos-sdk/server.StartCmd.func1(0xc000370780, 0xc0000b5db0, 0x0, 0x1, 0x0, 0x0)
github.com/cosmos/cosmos-sdk@v0.38.0/server/start.go:67 +0xb4
github.com/spf13/cobra.(*Command).execute(0xc000370780, 0xc0000b5d90, 0x1, 0x1, 0xc000370780, 0xc0000b5d90)
github.com/spf13/[email protected]/command.go:826 +0x460
github.com/spf13/cobra.(*Command).ExecuteC(0xc0000f1900, 0x4ecdc0e, 0xc000ab5e90, 0x4185832)
github.com/spf13/[email protected]/command.go:914 +0x2fb
github.com/spf13/cobra.(*Command).Execute(...)
github.com/spf13/[email protected]/command.go:864
github.com/tendermint/tendermint/libs/cli.Executor.Execute(0xc0000f1900, 0x5033220, 0x4eb1b22, 0x10)
github.com/tendermint/[email protected]/libs/cli/setup.go:89 +0x3c
main.main()
/Users/litvintech/Projects/gaia/cmd/gaiad/main.go:72 +0x8cb
This looks like a storage issue. It first halts in the mint module during BeginBlock, but I checked that the same happens with the other modules in OrderBeginBlockers.
store := ctx.KVStore(k.storeKey)
b := store.Get(types.MinterKey)
if b == nil {
panic("stored minter should not have been nil")
}
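A toy model may help show why a lookup like the one above can return nil after a restart. It is a sketch under assumptions, not the IAVL or SDK code: toyStore and all of its methods are hypothetical, and it assumes that state is written to disk only at versions that are multiples of flushEvery.

```go
package main

import "fmt"

// toyStore is a hypothetical stand-in for a store that commits every version
// in memory but writes only every flushEvery-th version to disk.
type toyStore struct {
	flushEvery int64             // must be > 0 in this sketch
	version    int64             // latest committed version (in memory)
	mem        map[string]string // working state
	diskVer    int64             // last version written to disk
	disk       map[string]string // state as of diskVer
}

func newToyStore(flushEvery int64) *toyStore {
	return &toyStore{
		flushEvery: flushEvery,
		mem:        map[string]string{},
		disk:       map[string]string{},
	}
}

func copyMap(m map[string]string) map[string]string {
	out := make(map[string]string, len(m))
	for k, v := range m {
		out[k] = v
	}
	return out
}

func (s *toyStore) Set(k, v string) { s.mem[k] = v }

// Get reports whether the key exists; a missing key plays the role of the
// nil result that triggers the panic in the snippet above.
func (s *toyStore) Get(k string) (string, bool) {
	v, ok := s.mem[k]
	return v, ok
}

// Commit bumps the version and flushes to disk only on flush versions.
func (s *toyStore) Commit() int64 {
	s.version++
	if s.version%s.flushEvery == 0 {
		s.disk = copyMap(s.mem)
		s.diskVer = s.version
	}
	return s.version
}

// restart drops everything held in memory and reloads what is on disk.
func (s *toyStore) restart() *toyStore {
	return &toyStore{
		flushEvery: s.flushEvery,
		version:    s.diskVer,
		mem:        copyMap(s.disk),
		diskVer:    s.diskVer,
		disk:       copyMap(s.disk),
	}
}

func main() {
	s := newToyStore(100)
	s.Set("minter", "genesis minter") // written once, at genesis
	var latest int64
	for i := 0; i < 10; i++ {
		latest = s.Commit()
	}

	r := s.restart()
	_, ok := r.Get("minter")
	fmt.Println(latest, r.version, ok) // 10 0 false
}
```

In this model, an application that tells Tendermint its last commit was version 10 would have only later blocks replayed on top of an empty store, so the key written at genesis is never restored. Reporting version 0 instead would cause a replay from the start.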
Cosmos-SDK release v0.38.0
Gaia b2f508950d11897fdc89924fad81b1045379a937
Check out the Gaia commit given in the version section, then:
./gaiad testnet --v=1 --output-dir=./mytestnet
./gaiad start --home=./mytestnet/node0/gaiad
stop node
./gaiad start --home=./mytestnet/node0/gaiad
I initially asked @ethanfrey about this, and he confirmed it is an SDK issue in the wasmd project: https://github.com/cosmwasm/wasmd/issues/54
@ethanfrey provided more in-depth details: https://github.com/cosmwasm/wasmd/issues/54
ref: https://github.com/cosmwasm/wasmd/issues/54#issuecomment-578748710
@AdityaSripal is working on a fix for this.
I agree that the fix will work for the current sdk. However, it does break the generality of MultiStore.
A lot of work went into making such an abstract store that can use multiple sub-dbs, like the Ethereum Patricia tree, under one root. This pruning approach only applies to the IAVL substores. In 99+% of cases this is currently the only substore used, so please make the fix and get v0.38.1 out. But also note that this adds tech debt (making rootmultistore only usable by the IAVL substore), so please open an issue for that and start working on a proper design that doesn't couple the two so closely.
I think we can tackle this w/o introducing tech debt, which I've spent the better part of the last two months trying to reduce, so I know the pain. Instead of introducing changes to the root multistore, can we push the fix down to the IAVL store, most likely in SaveVersion, and do the custom logic and tracking there (essentially like we used to before we updated IAVL)? Surely there must be a way.