@zmanian, I and a lot of other community members are starting to see extremely long startup times.
Starting the gaiad process can take anywhere from 2 to 20 minutes.
adrian@validator-gaia-5001:~$ ./gaiad start
I[06-22|21:35:17.382] Starting ABCI with Tendermint module=main
This is the only output until it picks up.
I'm running with
# Output level for logging, including package level options
#log_level = "main:info,state:info,*:error"
log_level = "*:debug"
I[06-22|22:18:11.769] Starting ABCI with Tendermint module=main
I[06-22|22:27:37.976] Starting multiAppConn module=proxy impl=multiAppConn
I just saw 9 minutes pass between the 1st and 2nd lines of the log.
This is also an issue on one of my sentries. Note that resetting the DB made startup instant again, so this has something to do with database size.
Maybe it's that we aren't pruning the state db?
Basically it's trying to load the state tree for every single height during startup. Surely we don't need to do that :)
Here we load the version from the IAVL lib:
And then in the IAVL lib, we load every single past version:
So I think the SDK needs to set a strategy for calling DeleteVersion on the IAVL tree, or we need some way to not load all these roots @Liamsi
(Why didn't that second link give me a snippet like the first one did?)
Maybe it only works if it's within the same repo
I just discussed this with @zmanian.
A solution we were thinking about was implementing a lazy-loading versioned tree.
Since old versions are only needed when the state for an old height is queried via ABCI, we don't usually need to keep them loaded permanently (on a node that is not a public API node).
However, lazy loading the state introduces a DoS vulnerability on the ABCI query interface, since a malicious actor could send requests for heights that are not loaded by default, causing high I/O load on the node.
I would propose the following solution:
Implement a lazy-loading versioned tree that loads only the latest versions (latest + last x) into the cache and loads other versions on demand.
Add a TTL to the versions in the tree that purges them from the cache (implemented as map versions) once the time has passed, to free RAM. At the moment the memory usage of the tree just grows steadily without real purpose. The versioned tree currently has a limited cache size, but that applies only to the nodeDB storage layer and not to the versions cached in the struct (at least as far as I could see).
Recommend that validators shield the ABCI query interface from the public on their sentry nodes.
API nodes will have to introduce appropriate rate limits to prevent abuse of this I/O-heavy operation.
If a node operator wanted to hold all versions in cache, we could introduce a config option to load the version history in a separate goroutine (so only the first version blocks the node launch), although I don't really see a case where someone would want this, especially given the RAM requirements it would introduce on a production chain.
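To make the first two points more concrete, here is a minimal sketch in Go of an on-demand version cache with TTL-based eviction. Everything in it (`LazyVersions`, `Loader`, `Tree`, the `demo` helper) is hypothetical and is not the IAVL API; the loader only stands in for the real disk read, and the sketch does not pin the latest versions the way the proposal suggests.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Tree stands in for one loaded version of the state tree (hypothetical type).
type Tree struct{ Version int64 }

// Loader reads one version from disk (hypothetical stand-in for the real I/O).
type Loader func(version int64) (*Tree, error)

type entry struct {
	tree     *Tree
	lastUsed time.Time
}

// LazyVersions loads versions on demand and evicts those idle longer than ttl.
type LazyVersions struct {
	mu       sync.Mutex
	load     Loader
	ttl      time.Duration
	versions map[int64]*entry
}

func NewLazyVersions(load Loader, ttl time.Duration) *LazyVersions {
	return &LazyVersions{load: load, ttl: ttl, versions: make(map[int64]*entry)}
}

// Get returns the cached version, loading it through the Loader on a miss.
// The lock is held during the load, which keeps the sketch simple but
// serializes concurrent queries.
func (l *LazyVersions) Get(version int64, now time.Time) (*Tree, error) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if e, ok := l.versions[version]; ok {
		e.lastUsed = now
		return e.tree, nil
	}
	t, err := l.load(version)
	if err != nil {
		return nil, err
	}
	l.versions[version] = &entry{tree: t, lastUsed: now}
	return t, nil
}

// Purge drops every version not used within ttl and returns how many were removed.
func (l *LazyVersions) Purge(now time.Time) int {
	l.mu.Lock()
	defer l.mu.Unlock()
	removed := 0
	for v, e := range l.versions {
		if now.Sub(e.lastUsed) > l.ttl {
			delete(l.versions, v)
			removed++
		}
	}
	return removed
}

// Len returns the number of versions currently held in memory.
func (l *LazyVersions) Len() int {
	l.mu.Lock()
	defer l.mu.Unlock()
	return len(l.versions)
}

// demo queries two heights and then purges with a one-minute TTL.
func demo() (loads, purged, cached int) {
	lv := NewLazyVersions(func(v int64) (*Tree, error) {
		loads++ // count simulated disk reads
		return &Tree{Version: v}, nil
	}, time.Minute)
	start := time.Now()
	lv.Get(42, start)                                   // miss: loaded from "disk"
	lv.Get(42, start.Add(10*time.Second))               // hit: served from memory
	lv.Get(43, start.Add(4*time.Minute+30*time.Second)) // miss
	purged = lv.Purge(start.Add(5 * time.Minute))       // 42 is stale, 43 is not
	return loads, purged, lv.Len()
}

func main() {
	loads, purged, cached := demo()
	fmt.Println("disk loads:", loads, "purged:", purged, "still cached:", cached)
}
```

Passing `now` explicitly is only there to make the eviction behaviour easy to reason about; a real implementation would more likely run `Purge` from a ticker and would still need the rate limiting mentioned above, since every miss costs a disk read.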
In the past we discussed adopting the following strategy for pruning state history: keep the last 100 states, plus every 10,000th state. Everything else would be deleted.
Then on startup, we would load only H/10000 + 100 states, which should only take a few seconds.
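As a rough sketch of that rule in Go (the constant and function names are made up for illustration and are not taken from the SDK or the IAVL lib), the retention check could look like this:

```go
package main

import "fmt"

// Hypothetical parameters taken from the strategy described above.
const (
	keepRecent = 100   // number of most recent states to keep
	keepEvery  = 10000 // also keep every state whose height is a multiple of this
)

// shouldKeep reports whether the state at the given height survives pruning
// when the chain is at height latest.
func shouldKeep(height, latest int64) bool {
	if height <= 0 || height > latest {
		return false
	}
	if height > latest-keepRecent {
		return true // one of the most recent keepRecent states
	}
	return height%keepEvery == 0 // periodic snapshot
}

// retainedCount returns how many states remain after pruning at height latest.
func retainedCount(latest int64) int64 {
	var n int64
	for h := int64(1); h <= latest; h++ {
		if shouldKeep(h, latest) {
			n++
		}
	}
	return n
}

func main() {
	latest := int64(255000)
	fmt.Println("states kept at height", latest, ":", retainedCount(latest))
}
```

At H = 255,000 this keeps 125 states, i.e. H/10000 + 100. The count can be one lower when a multiple of 10,000 already falls inside the most recent 100 heights.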
Lazy loading would be cool too, but not sure it's worth the work right now since we want to be pruning old state from the tree anyways. We could then of course offer a flag for archive nodes that want to keep all versions of the tree.
This should be fixed by pruning. The current code does not prune by default; with #1533 the default will be set to the strategy @ebuchman laid out above, with options to prune everything or nothing.