Cosmos-sdk: Restarting node causes DataCorruptionError

Created on 11 Feb 2019  路  2Comments  路  Source: cosmos/cosmos-sdk

Summary of Bug


On GoS6, I changed the values in gaiad.toml in two of my nodes (one validating node, one not-validating node). After stopping and restarting the node again immediately, I get the following:

I[2019-02-11|15:53:34.544] Starting ABCI with Tendermint                module=main
E[2019-02-11|15:53:35.878] Error dialing peer                           module=p2p err="dial tcp X:X:X:X:12345: i/o timeout"
E[2019-02-11|15:53:36.032] Corrupted entry. Skipping...                 module=consensus wal=/home/ubuntu/.gaiad/data/cs.wal/wal err="DataCorruptionError[failed to read data: EOF]"
E[2019-02-11|15:53:36.202] data has been corrupted in last height of consensus WAL module=consensus err="DataCorruptionError[failed to read data: EOF]" height=1944
E[2019-02-11|15:53:36.202] Encountered corrupt WAL file                 module=consensus err="DataCorruptionError[failed to read data: EOF]"
E[2019-02-11|15:53:36.202] Please repair the WAL file before restarting module=consensus
You can attempt to repair the WAL as follows:

----
WALFILE=~/.tendermint/data/cs.wal/wal
cp $WALFILE ${WALFILE}.bak # backup the file
go run scripts/wal2json/main.go $WALFILE > wal.json # this will panic, but can be ignored
rm $WALFILE # remove the corrupt file
go run scripts/json2wal/main.go wal.json $WALFILE # rebuild the file without corruption
----
E[2019-02-11|15:53:36.202] Error starting conS                          module=consensus err="DataCorruptionError[failed to read data: EOF]"
E[2019-02-11|15:53:36.669] Error dialing peer                           module=p2p err="dial tcp X:X:X:X:12345: connect: connection refused"
E[2019-02-11|15:53:37.559] Error dialing peer                           module=p2p err="dial tcp X:X:X:X:12345: i/o timeout"
E[2019-02-11|15:53:37.813] Connection failed @ recvRoutine (reading byte) module=p2p peer=censoredcensored@X:X:X:X:12345 conn=MConn{X:X:X:X:12345} err=EOF
E[2019-02-11|15:53:37.813] Stopping peer for error                      module=p2p peer="Peer{MConn{X:X:X:X:12345} censoredcensored out}" err=EOF
E[2019-02-11|15:53:38.146] Connection failed @ recvRoutine (reading byte) module=p2p peer=censoredcensored@X:X:X:X:12345 conn=MConn{X:X:X:X:12345} err=EOF
E[2019-02-11|15:53:38.146] Stopping peer for error                      module=p2p peer="Peer{MConn{X:X:X:X:12345} censoredcensored out}" err=EOF
E[2019-02-11|15:53:38.146] MConnection flush failed                     module=p2p peer=censoredcensored@X:X:X:X:12345 err="write tcp X:X:X:X:59324->X:X:X:X:12345: use of closed network connection"
E[2019-02-11|15:53:38.479] Connection failed @ recvRoutine (reading byte) module=p2p peer=censoredcensored@X:X:X:X:12345 conn=MConn{X:X:X:X:12345} err=EOF

This caused me to miss a few blocks on Game of Stakes 6. The issue was solved by stopping the node again, waiting for a few seconds and then restarting it.

Steps to Reproduce

cosmos-sdk: 0.31.1
git commit: b9e523212ec47910a00db00be2f1b7935e201ee7
vendor hash: 85e6c5c7a700e822cccf169c97f4a3974312dfd1
go version go1.11.5 linux/amd64
  1. Change value in gaiad.toml (although I don't think it has anything to do with this)
  2. Stop Node
  3. Immediately start node again with gaiad start
  4. See corrupted WAL Errors

For Admin Use

  • [ ] Not duplicate issue
  • [ ] Appropriate labels applied
  • [ ] Appropriate contributors tagged
  • [ ] Contributor assigned/self-assigned
bug needs-more-info tendermint

Most helpful comment

It looks like this is a tendermint bug. I'm going to go ahead and re-open this issue over there.

All 2 comments

I've been able to reproduce this issue:

sudo systemctl stop gaiad && sudo systemctl start gaiad && journalctl -u gaiad -f
Feb 11 16:49:29 gaiaeleven gaiad[22266]: I[2019-02-11|16:49:29.917] Committed state                              module=state height=24937 txs=0 appHash=42DA9DE2D2A0B110652AEC2A8D0F063B9F67463C7DACBF9AE87F0F4B4C863859
Feb 11 16:49:35 gaiaeleven gaiad[22266]: I[2019-02-11|16:49:35.918] Executed block                               module=state height=24938 validTxs=1 invalidTxs=0
Feb 11 16:49:35 gaiaeleven gaiad[22266]: I[2019-02-11|16:49:35.933] Committed state                              module=state height=24938 txs=1 appHash=91FCE175DC0C3F44B30FDFB6F40F17C157984473E5A5FE3569142410F13B9FBC
Feb 11 16:49:46 gaiaeleven gaiad[22266]: I[2019-02-11|16:49:46.537] Executed block                               module=state height=24939 validTxs=0 invalidTxs=0
Feb 11 16:49:46 gaiaeleven gaiad[22266]: I[2019-02-11|16:49:46.552] Committed state                              module=state height=24939 txs=0 appHash=E376AF81A12BE4556D89786FFF3B6A479D3F36D77F479992960BC4B7738A21EB
Feb 11 16:49:52 gaiaeleven systemd[1]: Stopping Cosmos Gaia Node...
Feb 11 16:49:52 gaiaeleven systemd[1]: gaiad.service: Main process exited, code=exited, status=143/n/a
Feb 11 16:49:52 gaiaeleven systemd[1]: gaiad.service: Failed with result 'exit-code'.
Feb 11 16:49:52 gaiaeleven systemd[1]: Stopped Cosmos Gaia Node.
Feb 11 16:49:52 gaiaeleven systemd[1]: Started Cosmos Gaia Node.
Feb 11 16:49:52 gaiaeleven gaiad[6922]: I[2019-02-11|16:49:52.535] Starting ABCI with Tendermint                module=main
Feb 11 16:49:53 gaiaeleven gaiad[6922]: E[2019-02-11|16:49:53.560] Dialing failed                               module=pex addr=4e013719b1a3769455ff858a20b771bcd1c08596@[2a05:d014:60:7302:d020:36fe:9d7c:cbce]:26656 err="dial tcp [2a05:d014:60:7302:d020:36fe:9d7c:cbce]:26656: connect: network is unreachable" attempts=0
Feb 11 16:49:53 gaiaeleven gaiad[6922]: E[2019-02-11|16:49:53.563] Dialing failed                               module=pex addr=ed840a7f45e998de1189088504c8bb7ab71be2b4@[2a05:d014:60:7302:84d8:f984:23e1:8077]:26656 err="dial tcp [2a05:d014:60:7302:84d8:f984:23e1:8077]:26656: connect: network is unreachable" attempts=0
Feb 11 16:49:53 gaiaeleven gaiad[6922]: E[2019-02-11|16:49:53.563] Dialing failed                               module=pex addr=8d00cb7315bccfbe02c4cd7415086a1b47718a6f@[2a05:d014:60:7302:d6ab:5d68:75c4:3bb0]:26656 err="dial tcp [2a05:d014:60:7302:d6ab:5d68:75c4:3bb0]:26656: connect: network is unreachable" attempts=0
Feb 11 16:49:53 gaiaeleven gaiad[6922]: E[2019-02-11|16:49:53.580] Dialing failed                               module=pex [email protected]:26656 err="dial tcp 35.247.46.115:26656: connect: connection refused" attempts=0
Feb 11 16:49:54 gaiaeleven gaiad[6922]: E[2019-02-11|16:49:54.063] Dialing failed                               module=pex [email protected]:26656 err="incompatible: Peer is on a different network. Got game_of_stakes_6, expected gaia-11001" attempts=0
Feb 11 16:49:54 gaiaeleven gaiad[6922]: E[2019-02-11|16:49:54.515] Corrupted entry. Skipping...                 module=consensus wal=/home/johnzampolin/.gaiad/data/cs.wal/wal err="DataCorruptionError[failed to read data: EOF]"
Feb 11 16:49:54 gaiaeleven gaiad[6922]: E[2019-02-11|16:49:54.597] Dialing failed                               module=pex [email protected]:26656 err="dial tcp 123.56.250.241:26656: i/o timeout" attempts=0
Feb 11 16:49:54 gaiaeleven gaiad[6922]: E[2019-02-11|16:49:54.598] Dialing failed                               module=pex [email protected]:32656 err="dial tcp 35.196.99.139:32656: i/o timeout" attempts=0
Feb 11 16:49:54 gaiaeleven gaiad[6922]: E[2019-02-11|16:49:54.598] Dialing failed                               module=pex [email protected]:26656 err="dial tcp 188.3.142.201:26656: i/o timeout" attempts=0
Feb 11 16:49:54 gaiaeleven gaiad[6922]: E[2019-02-11|16:49:54.598] Dialing failed                               module=pex [email protected]:26656 err="dial tcp 35.198.162.20:26656: i/o timeout" attempts=0
Feb 11 16:49:54 gaiaeleven gaiad[6922]: E[2019-02-11|16:49:54.797] Corrupted entry. Skipping...                 module=consensus wal=/home/johnzampolin/.gaiad/data/cs.wal/wal err="DataCorruptionError[failed to read data: EOF]"
Feb 11 16:49:54 gaiaeleven gaiad[6922]: E[2019-02-11|16:49:54.800] Corrupted entry. Skipping...                 module=consensus wal=/home/johnzampolin/.gaiad/data/cs.wal/wal err="DataCorruptionError[failed to read data: EOF]"
Feb 11 16:49:55 gaiaeleven gaiad[6922]: E[2019-02-11|16:49:55.084] data has been corrupted in last height of consensus WAL module=consensus err="DataCorruptionError[failed to read data: EOF]" height=24940
Feb 11 16:49:55 gaiaeleven gaiad[6922]: E[2019-02-11|16:49:55.085] Encountered corrupt WAL file                 module=consensus err="DataCorruptionError[failed to read data: EOF]"
Feb 11 16:49:55 gaiaeleven gaiad[6922]: E[2019-02-11|16:49:55.086] Please repair the WAL file before restarting module=consensus
Feb 11 16:49:55 gaiaeleven gaiad[6922]: You can attempt to repair the WAL as follows:
Feb 11 16:49:55 gaiaeleven gaiad[6922]: ----
Feb 11 16:49:55 gaiaeleven gaiad[6922]: WALFILE=~/.tendermint/data/cs.wal/wal
Feb 11 16:49:55 gaiaeleven gaiad[6922]: cp $WALFILE ${WALFILE}.bak # backup the file
Feb 11 16:49:55 gaiaeleven gaiad[6922]: go run scripts/wal2json/main.go $WALFILE > wal.json # this will panic, but can be ignored
Feb 11 16:49:55 gaiaeleven gaiad[6922]: rm $WALFILE # remove the corrupt file
Feb 11 16:49:55 gaiaeleven gaiad[6922]: go run scripts/json2wal/main.go wal.json $WALFILE # rebuild the file without corruption
Feb 11 16:49:55 gaiaeleven gaiad[6922]: ----
Feb 11 16:49:55 gaiaeleven gaiad[6922]: E[2019-02-11|16:49:55.086] Error starting conS                          module=consensus err="DataCorruptionError[failed to read data: EOF]"

It looks like this is a tendermint bug. I'm going to go ahead and re-open this issue over there.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

rigelrozanski picture rigelrozanski  路  3Comments

mossid picture mossid  路  3Comments

fedekunze picture fedekunze  路  3Comments

rigelrozanski picture rigelrozanski  路  3Comments

hendrikhofstadt picture hendrikhofstadt  路  3Comments