As reported in #749, consensus was blocked because state synchronization was unable to make progress, since networking had not been started. There should be an integration test for this.
More details:
We have two databases - consensus db and state db. State db keeps transactions and all the data, while consensus db keeps some metadata related to blocks, quorum certificates and safety rules.
There is an invariant: whenever something changes, consensus db is updated first, and then the data is committed to state db. So state db may run behind consensus db, but not vice versa.
Since information in those dbs is committed independently, they might get out of sync if a node crashes at a bad moment. When this happens, consensus db has references to blocks that are not in state db.
When a node starts in such a state, a special synchronization is invoked during consensus startup. State synchronization downloads the missing information from peers into state db.
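To make the invariant and the startup sync concrete, here is a minimal toy model in Rust. It is only a sketch: each database is reduced to the highest block round it contains, and every name in it (`Node`, `commit_block`, `sync_on_startup`, the `peers` slice) is made up for illustration and is not taken from the Libra codebase.

```rust
// Toy model only: each "database" is reduced to the highest block round it holds.
// All names here are hypothetical and do not mirror real Libra types.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct Node {
    consensus_db_round: u64, // highest block recorded in consensus db
    state_db_round: u64,     // highest block committed to state db
}

impl Node {
    /// Commit one block in the required order: consensus db first, state db second.
    /// `crash_between` simulates a crash between the two writes.
    fn commit_block(&mut self, crash_between: bool) {
        self.consensus_db_round += 1;
        if crash_between {
            return; // state db never receives this block
        }
        self.state_db_round += 1;
    }

    /// State db may lag behind consensus db, but never the other way around.
    fn invariant_holds(&self) -> bool {
        self.state_db_round <= self.consensus_db_round
    }

    /// Number of blocks that consensus db references but state db lacks.
    fn missing_blocks(&self) -> u64 {
        self.consensus_db_round - self.state_db_round
    }

    /// Catch state db up to consensus db. `peers` holds the highest round each
    /// reachable peer can serve; an empty slice models networking not being up.
    /// Returns how many blocks were downloaded.
    fn sync_on_startup(&mut self, peers: &[u64]) -> Result<u64, String> {
        let target = self.consensus_db_round;
        if self.state_db_round >= target {
            return Ok(0); // already in sync, nothing to download
        }
        let best = peers
            .iter()
            .copied()
            .max()
            .ok_or_else(|| "no reachable peers".to_string())?;
        if best < target {
            return Err(format!("best peer is at round {best}, need {target}"));
        }
        let downloaded = target - self.state_db_round;
        self.state_db_round = target;
        Ok(downloaded)
    }
}
```

In this model, calling `commit_block(true)` leaves `missing_blocks()` at 1, and `sync_on_startup(&[])` returns an error, which is roughly the shape of the failure described in #749: the sync has nobody to download from.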
At some point we had a problem where the component startup order broke this procedure in a real environment, while all unit tests were passing (because those tests don't use the same startup sequence as a real node).
Since DBs running out of sync is a special case, it does not reproduce regularly; however, it did break production eventually.
To make sure this situation does not happen again, we need a special integration test that verifies this condition.
For this we have a special system called smoke tests - unlike unit tests, they run libra_node binaries on the same host, so those nodes go through the same startup sequence as prod nodes.
Note that smoke tests are different from cluster tests - smoke tests run on CI before merge, and can also be easily run locally:
cargo test -p test_suite
So in a way smoke tests are a subset of cluster tests that can give signal faster because they can run on CI. (However, historically smoke tests came before cluster tests, so perhaps a better way to see it is cluster tests being an extension of smoke tests :D )
The easiest way to simulate the situation when consensus db is behind state db is to simply stop the libra node and delete the state db directory. In this case it will be re-created and the sync-on-startup procedure will be invoked.
See test_basic_restartability in smoke_test.rs as an example.
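The "delete state db directory" step could look roughly like the helper below. This is a sketch using only the Rust standard library. The directory name `state_db` and the helper names are assumptions made for illustration; the real on-disk layout of a node and the helpers used in smoke_test.rs may differ.

```rust
use std::fs;
use std::io;
use std::path::Path;

// Hypothetical directory name; a real node's data layout may be different.
const STATE_DB_DIR: &str = "state_db";

/// Delete the state db directory of a *stopped* node, leaving consensus db
/// untouched. On restart the node should re-create state db and go through
/// the sync-on-startup procedure.
fn wipe_state_db(node_dir: &Path) -> io::Result<()> {
    let state_db = node_dir.join(STATE_DB_DIR);
    if state_db.exists() {
        fs::remove_dir_all(&state_db)?;
    }
    Ok(())
}

/// True if the node directory currently contains a state db directory.
fn state_db_present(node_dir: &Path) -> bool {
    node_dir.join(STATE_DB_DIR).is_dir()
}
```

A test would stop one node, call `wipe_state_db` on its data directory, restart it, and then check that the node catches up with the rest of the swarm.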
Easiest way to simulate situation when consensus db is behind state db
@andll Did you mean state db is behind consensus db?
Oops, yes :)