We are using the https://github.com/cosmos/gaia/tree/jack/gaiav3.0 branch
This is a snapshot of block 3557667 of the Cosmos Hub. The export is available here: https://storage.googleapis.com/stargate-genesis/3557667.cosmos_hub_3.json
There is a full copy of a Cosmos Hub full node here: https://storage.googleapis.com/stargate-genesis/snapshot.tgz
Using Gaia 2.0 and the full node above, the export was produced with:

```
gaiad export > 3557667.cosmos_hub_3.json
```
Using the migrate command from the Stargate repo:

```
build/gaiad migrate ~/3557667.cosmos_hub_3.json --chain-id=cosmoshub-test-stargate --initial-height 3557668
```
This genesis file can be downloaded from https://storage.googleapis.com/stargate-genesis/cosmoshub-test-stargate.json
`gaiad start` takes about 2 hours for InitChain to run.
Looks like we need to do some profiling.
I will try to look at it.
I guess the _full copy of a cosmos hub full node_ (100GB+) is not needed?
> This genesis file can be downloaded from https://storage.googleapis.com/stargate-genesis/cosmoshub-test-stargate.json
@zmanian What do you mean by a genesis file? Is this a migrated snapshot?
Yes that file is a migrated snapshot.
> I will try to look at it.
> I guess the _full copy of a cosmos hub full node_ (100GB+) is not needed?
No, just this migrated genesis file.
One reason for the snapshot is in case someone wants to redo the export as part of the testing process.
This export was not a "zero height" export. There is more staking state to process as a result, and that might be a factor in the long start-up time.
I started looking at it today:
- simapp from master fails to load the JSON file
- gaia takes several minutes to parse the JSON, and a lot of time to finish the state sync (I killed it after 3+ hours)
- What's the expected time to finish the state sync normally (what was the state import time on the previous update)? This process consumes 2 CPU threads the whole time.
It's also worth mentioning that the 2 CPU threads noted above are mostly used by gnome-keyring-daemon and gaiad.
Shouldn't the key be loaded into the runtime, freeing the key manager? This probably depends on the key manager; some key managers will not give any access to their private keys.
Also, it's strange because when doing a sync we don't need a private key, so we shouldn't need to access a key manager at all.
UPDATE: I did a second run and killed the keyring daemon. It was started by simapp and for some reason was staying 100% busy (I was running simapp and gaia in parallel).
Typically what we've done to debug these regressions is enable CPU profiling in the binary.
All we need is a simple pprof here to see where the bottleneck is. Loading genesis state into memory shouldn't take too long (we've done this on previous upgrades).
The Cosmos SDK app has an option to enable a CPU profile through --cpu-profile, but it doesn't help here because we get stuck before the profiling starts.
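For context, that flag essentially boils down to standard file-based profiling with Go's runtime/pprof package. A minimal sketch (the output path and the surrounding program are illustrative):

```go
package main

import (
	"os"
	"runtime/pprof"
)

func main() {
	// Write the CPU profile to an arbitrary file path.
	f, err := os.Create("cpu.pprof")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Sample CPU usage from here until StopCPUProfile is called.
	if err := pprof.StartCPUProfile(f); err != nil {
		panic(err)
	}
	defer pprof.StopCPUProfile()

	// ... run the workload to profile (e.g. the genesis loading) ...
}
```

The resulting file can then be inspected with `go tool pprof -top cpu.pprof`.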
The loading gets stuck during the Tendermint node creation, when it calls the app to initialize the genesis. This starts in server/startInProcess. The reason I mention it is that this step precedes the start of other routines, like cpuProfile or the rpc.pprof HTTP handler. Because of this, we can't use the built-in profiling options.
Instead, I had to add a new goroutine to export the pprof profiles.
I linked the gaiav3 code against a local copy of the cosmos-sdk module, so I can modify the gaia or cosmos-sdk code and build them together.
Following that, I added a new goroutine to serve the CPU pprof profile.
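Roughly, the workaround looks like this (a sketch: the helper name and listen address are my own choices here, not the exact code on the branch):

```go
package profiling

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

// StartPprofServer launches the pprof HTTP endpoint in its own goroutine,
// so profiles can be collected while InitChain is still running.
func StartPprofServer(addr string) {
	go func() {
		// e.g. addr = "localhost:6060"
		if err := http.ListenAndServe(addr, nil); err != nil {
			log.Printf("pprof server stopped: %v", err)
		}
	}()
}
```

With this in place, a CPU profile can be pulled while InitChain runs, e.g. `go tool pprof http://localhost:6060/debug/pprof/profile?seconds=120`.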
I will create a PR to change the order of some of the calls, so that at least the CPU profile is enabled before we start the TM node.
As noted above, the InitChain process is the crux. Attached are 2 SVG files depicting a CPU profile; unfortunately GitHub doesn't support attaching SVG files, so I had to pack them into a ZIP.
As you can see, most of the work in this part is spent in x/bank/keeper InitGenesis. There are essentially 2 calls, ValidateBalance and SetBalances, and we make both calls for every account in the snapshot file. We can get rid of the first call, because the second one already does k.ClearBalances(ctx, addr) for each account from the snapshot. This alone drops the time by 50%.
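For illustration, the hot loop has roughly this shape (a simplified sketch, not the verbatim SDK source; signatures follow the 0.40 release candidates):

```go
package banksketch

import (
	sdk "github.com/cosmos/cosmos-sdk/types"
	"github.com/cosmos/cosmos-sdk/x/bank/types"
)

// balanceKeeper captures just the two x/bank keeper methods discussed above.
type balanceKeeper interface {
	ValidateBalance(ctx sdk.Context, addr sdk.AccAddress) error
	SetBalances(ctx sdk.Context, addr sdk.AccAddress, balances sdk.Coins) error
}

// initGenesisBalances sketches the hot loop in x/bank InitGenesis as it
// shows up in the profiles.
func initGenesisBalances(ctx sdk.Context, k balanceKeeper, genState types.GenesisState) {
	for _, balance := range genState.Balances {
		addr, err := sdk.AccAddressFromBech32(balance.Address)
		if err != nil {
			panic(err)
		}
		// Call 1: walks the account's existing balances to validate them.
		if err := k.ValidateBalance(ctx, addr); err != nil {
			panic(err)
		}
		// Call 2: internally starts with k.ClearBalances(ctx, addr), which
		// makes the validation above redundant for every imported account.
		if err := k.SetBalances(ctx, addr, balance.Coins); err != nil {
			panic(err)
		}
	}
}
```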
Stats for a 2-minute run:

```
Showing top 15 nodes out of 127
      flat  flat%   sum%        cum   cum%
    34.54s 26.59% 26.59%     34.54s 26.59%  aeshashbody
    18.17s 13.99% 40.58%     18.17s 13.99%  memeqbody
    14.99s 11.54% 52.12%     65.20s 50.19%  runtime.mapaccess1_faststr
     5.66s  4.36% 56.47%     13.74s 10.58%  runtime.mapiternext
     5.22s  4.02% 60.49%     12.69s  9.77%  runtime.mallocgc
     4.07s  3.13% 63.63%    104.78s 80.66%  github.com/cosmos/cosmos-sdk/store/cachekv.(*Store).dirtyItems
     3.10s  2.39% 66.01%      3.10s  2.39%  cmpbody
     2.93s  2.26% 68.27%      7.37s  5.67%  runtime.scanobject
     2.24s  1.72% 69.99%      2.24s  1.72%  runtime.memclrNoHeapPointers
     2.13s  1.64% 71.63%      2.47s  1.90%  runtime.findObject
     1.88s  1.45% 73.08%      1.89s  1.45%  github.com/enigmampc/btcutil/bech32.bech32Polymod
     1.73s  1.33% 74.41%      5.87s  4.52%  github.com/tendermint/tm-db.IsKeyInDomain
     1.61s  1.24% 75.65%     15.85s 12.20%  runtime.stringtoslicebyte
```
Stats for a 10-minute run:

```
Showing top 15 nodes out of 84
      flat  flat%   sum%        cum   cum%
   200.77s 31.38% 31.38%    200.77s 31.38%  aeshashbody
    98.02s 15.32% 46.69%     98.02s 15.32%  memeqbody
    91.30s 14.27% 60.96%    384.03s 60.02%  runtime.mapaccess1_faststr
    27.81s  4.35% 65.31%     67.29s 10.52%  runtime.mapiternext
    22.53s  3.52% 68.83%     52.17s  8.15%  runtime.mallocgc
    22.26s  3.48% 72.31%    574.32s 89.75%  github.com/cosmos/cosmos-sdk/store/cachekv.(*Store).dirtyItems
    14.99s  2.34% 74.65%     14.99s  2.34%  runtime.evacuated (inline)
    14.54s  2.27% 76.92%     14.54s  2.27%  cmpbody
    11.09s  1.73% 78.66%     31.56s  4.93%  runtime.scanobject
    10.73s  1.68% 80.33%     11.78s  1.84%  runtime.findObject
     7.79s  1.22% 81.55%      7.79s  1.22%  runtime.memclrNoHeapPointers
     7.66s  1.20% 82.75%      7.66s  1.20%  runtime.nextFreeFast
     7.58s  1.18% 83.93%     75.39s 11.78%  runtime.stringtoslicebyte
     6.90s  1.08% 85.01%     61.88s  9.67%  runtime.rawbyteslice
     6.40s  1.00% 86.01%     25.14s  3.93%  github.com/tendermint/tm-db.IsKeyInDomain
```
Looking at the profiles:

- 89% of cumulative time is in github.com/cosmos/cosmos-sdk/store/cachekv.(*Store).dirtyItems
- aeshashbody is a primary function used in Go hash maps
- 60.02% in runtime.mapaccess1_faststr: a lot of map accesses (probably getting an item from a map)
- string to []byte conversion also shows up a lot!

I removed the ClearBalances call and am now running a long profiling session to find out something about the other modules. For the moment I have a feeling that there is something wrong in IAVL.
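To make the dirtyItems / map access / string-to-[]byte trio concrete, this is the pattern (an illustrative sketch, not the SDK source): every iterator over a cachekv store scans the entire unsorted write cache, converting each string key to []byte and range-checking it, then sorts the matches.

```go
package cachesketch

import (
	"bytes"
	"sort"
)

type kv struct {
	key, value []byte
}

// dirtyItemsInDomain mimics what cachekv.(*Store).dirtyItems does on every
// iterator creation: a full scan of the write cache plus a sort.
func dirtyItemsInDomain(cache map[string][]byte, start, end []byte) []kv {
	items := make([]kv, 0)
	for k, v := range cache { // full map scan -> the map access time above
		key := []byte(k) // -> runtime.stringtoslicebyte in the profiles
		if bytes.Compare(key, start) >= 0 && (end == nil || bytes.Compare(key, end) < 0) {
			items = append(items, kv{key: key, value: v})
		}
	}
	sort.Slice(items, func(i, j int) bool {
		return bytes.Compare(items[i].key, items[j].key) < 0
	})
	return items
}
```

With millions of cached writes from InitGenesis, repeating this per iterator can make the overall work effectively quadratic.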
I modified the code slightly, and it seems that there is something wrong with the IAVL iterator (see the new profiler graph). It looks like setting new balances for each account takes only a fraction of the time of removing the balances. Both operations update the Merkle paths, so I'm wondering if there were any related changes in IAVL. Is anyone aware of any?
Should we inspect it further here?
I'm also wondering if we should be hitting the cacheStore at all during initChain?
@alexanderbez here is the latest pprof.
@erikgrinaker - could you advise here if there are any IAVL changes which could be related here?
@zmanian - so what is the loading time we should anticipate? What is the "baseline"?
I will do a PR with the x/bank optimization.
Well, a 2-hour node start-up time is too much of a usability issue for the Hub to upgrade to Stargate.
I think we need InitChain to complete in under 10 minutes.
> @erikgrinaker - could you advise here if there are any IAVL changes which could be related here?
Well, there have been a lot of changes since the version the Hub is currently running, so it's hard to say. But we know that IAVL has pretty poor performance in general, so it's not entirely unexpected that importing the state is taking a long time. I'll dig into this a bit more tomorrow and see if there's any low-hanging fruit.
I'll also mention that we now have export/import APIs in IAVL, which should be much faster at importing data than creating nodes one by one, but these were introduced in versions more recent than what the Hub is running.
> @alexanderbez here is the latest pprof.
> @erikgrinaker - could you advise here if there are any IAVL changes which could be related here?
So the bulk of this time is coming from the x/crisis invariant assertion, which operators are not obligated to run. Still, that it takes so long to run is very worrisome. Thankfully it's not serialization-related like it was in the past with Amino; now it's most likely something to do with IAVL.
We should run a profile @robert-zaremba with invariant checking off. I believe you can pass this as a CLI flag or config? If not, just hard-code it to zero during app construction.

If we could make invariant checking optional during node startup, we might be okay.
> We should run a profile @robert-zaremba with invariant checking off. I believe you can pass this as a CLI flag or config? If not, just hard-code it to zero during app construction.
I already did that earlier today; copy-pasting the response from Discord:
> Without invariant checking it's much faster: 10-15 min. Most of the work is done by an IAVL iterator, in both bank and crisis.
> The invariants check takes about 30 min on my computer.
@zmanian - currently there is no way to disable the invariant checks through configuration. I will add that option tomorrow.
@erikgrinaker - I don't know if loading directly into IAVL is possible. We are loading a JSON snapshot into the modules, and that JSON is converted upfront - the object serialization changed in 0.40.x.
> @erikgrinaker - I don't know if loading directly into IAVL is possible. We are loading a JSON snapshot into the modules, and that JSON is converted upfront - the object serialization changed in 0.40.x.
Ah, right, good point. Direct IAVL import would basically require the data to remain unchanged, since the exported data contains the internal IAVL tree structure. It might be possible to change the values as long as keys don't change, but I wouldn't really recommend modifying it.
Okay, I think 10-15 min is survivable.
Thanks guys!
I had a closer look, and as far as I can tell this isn't related to IAVL at all, but rather to the cachekv package in the SDK. There is some CPU usage in tm-db.IsKeyInDomain, but interestingly this function is not used at all in tm-db or IAVL (not sure why it's even there); it is only used in the SDK's cachekv package.
Let me know if I've missed anything, or you'd like a hand with optimizing cachekv.
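For reference, the function itself is a trivial half-open range check (reproduced below as I understand tm-db's helper); the cost in the profiles comes from how often cachekv's iterator setup calls it, not from the check itself:

```go
package kvsketch

import "bytes"

// isKeyInDomain mirrors tm-db's IsKeyInDomain: true if key falls in the
// half-open range [start, end), where a nil end means unbounded above.
func isKeyInDomain(key, start, end []byte) bool {
	if bytes.Compare(key, start) < 0 {
		return false
	}
	if end != nil && bytes.Compare(key, end) >= 0 {
		return false
	}
	return true
}
```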
Thanks @erikgrinaker. I will see what's in cachekv - it's new to me.
@zmanian, @jackzampolin - I've pushed updates to both the gaia and cosmos-sdk branches to make the crisis invariants check customizable. This is done in the app constructor, NewGaiaApp:
```go
crisis.NewAppModule(&app.CrisisKeeper, false),
```
Let me know if this is OK, or if you want to make it customizable through a CLI parameter.
With invariants disabled, the genesis loading takes ~10 min.
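If we do want the CLI parameter, one possible wiring reads the toggle through the server's AppOptions. A sketch, where both the helper and the flag name are assumptions for illustration:

```go
package appsketch

import (
	"github.com/spf13/cast"

	servertypes "github.com/cosmos/cosmos-sdk/server/types"
)

// skipGenesisInvariants reads a hypothetical start-up flag; the flag name
// below does not exist yet and is only an example.
func skipGenesisInvariants(appOpts servertypes.AppOptions) bool {
	return cast.ToBool(appOpts.Get("x-crisis-skip-assert-invariants"))
}
```

The returned boolean would then replace the hard-coded false passed to crisis.NewAppModule above.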
Interesting. I'll take a look at cachekv. But I don't recall any changes to the core kv store logic, so this is concerning.
Here is a PR for the Cosmos SDK (a new branch, since the previous one was causing conflicts when merging to master): https://github.com/cosmos/cosmos-sdk/pull/7764
Here is the Gaia-3 PR: https://github.com/cosmos/gaia/pull/487
Here is the followup task: https://github.com/cosmos/cosmos-sdk/issues/7766