Cosmos-sdk: Gaia 3.0 takes a long time to start on migrated hub3 state

Created on 26 Oct 2020 · 32 Comments · Source: cosmos/cosmos-sdk

We are using the https://github.com/cosmos/gaia/tree/jack/gaiav3.0 branch

This is a snapshot of block 3557667 of the Cosmoshub. This export is available here: https://storage.googleapis.com/stargate-genesis/3557667.cosmos_hub_3.json

There is a full copy of a cosmos hub full node here: https://storage.googleapis.com/stargate-genesis/snapshot.tgz

Using Gaia 2.0 and the full node above, the export was produced with: gaiad export > 3557667.cosmos_hub_3.json

Using the migrate command from the Stargate repo:

build/gaiad migrate ~/3557667.cosmos_hub_3.json --chain-id=cosmoshub-test-stargate --initial-height 3557668 

This genesis file can be downloaded from https://storage.googleapis.com/stargate-genesis/cosmoshub-test-stargate.json

gaiad start takes about 2 hours for InitChain to run.

For Admin Use

  • [ ] Not duplicate issue
  • [ ] Appropriate labels applied
  • [ ] Appropriate contributors tagged
  • [ ] Contributor assigned/self-assigned
bug

All 32 comments

Looks like we need to do some profiling.

I will try to look at it.
I guess the _full copy of a cosmos hub full node_ (100GB+) is not needed?

This genesis file can be downloaded from https://storage.googleapis.com/stargate-genesis/cosmoshub-test-stargate.json

@zmanian What do you mean with a genesis file? Is this a migrated snapshot?

Yes that file is a migrated snapshot.

I will try to look at it.

I guess the _full copy of a cosmos hub full node_ (100GB+) is not needed?

No, just this migrated genesis file.

One reason for providing the snapshot is so that someone can redo the export as part of the testing process.

This export was not a "zero height" export. There is more staking state to process as a result, and that might be a factor in the long start-up time.

I started looking at it today:

  1. simapp from master fails to load the JSON file
  2. gaia takes several minutes to parse the JSON, and a lot of time to finish the state sync (I killed it after 3+h)
  3. I started porting gaia into a branch of cosmos-sdk, to allow easy testing.

What's the expected time to finish the state sync normally (what was the state import time on the previous update)? This process consumes 2 CPU threads all the time.

It's also worth mentioning that the 2 CPU threads mentioned above are mostly used by gnome-keyring-daemon and gaiad.
Shouldn't the key be loaded into the runtime, freeing the key manager? This probably depends on the key manager; some key managers will not give any access to their private keys.
Also, it's strange, because when doing a sync we don't need a private key, so we shouldn't need to access a key manager at all.

UPDATE: I did a second run and killed the keyring daemon. It was started by simapp and for some reason was keeping a CPU core 100% busy (I was running simapp and gaia in parallel).

Typically what we've done to debug these regressions is enable cpu profiling in the binary.

https://blog.golang.org/pprof

All we need is a simple pprof here to see where the bottleneck is. Loading genesis state into memory shouldn't take too long (we've done this on previous upgrades).
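For reference, a minimal sketch of the technique from the linked post (plain runtime/pprof; the file name and placement are illustrative, not the SDK's actual --cpu-profile code):

```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	// Write the CPU profile to a file; "cpu.pprof" is an arbitrary name for this sketch.
	f, err := os.Create("cpu.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	// ... run the slow code path (e.g. the node start-up / InitChain) here ...
}
```

The resulting file can then be inspected with go tool pprof.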

Note about profiling options

The Cosmos-SDK app has an option to enable a CPU profile through --cpu-profile, but it doesn't help here because we get stuck before the profiling starts.

Preliminaries

The loading gets stuck during Tendermint node creation, when it calls the app to initialize the genesis. This starts in server/startInProcess. The reason I mention it is that this step runs before other routines are started, like cpuProfile or the rpc.pprof HTTP handler. Because of this, we can't use the built-in profiling options.
Instead, I had to add a new goroutine and export the pprof profiles.

I linked the gaiav3 code into the cosmos-sdk module so that I can easily modify gaia or cosmos-sdk code and build them together.

Following that, I added a new goroutine to create the CPU pprof profile.
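A minimal sketch of that workaround (a fragment to drop into the server start code; the port and function name are assumptions, the point is just to expose the pprof handlers before the Tendermint node, and hence InitChain, is created):

```go
import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

// startPprofServer exposes the pprof HTTP endpoints in a background goroutine,
// so a CPU profile can be captured while InitChain is still running.
func startPprofServer() {
	go func() {
		// localhost:6060 is an arbitrary choice for this sketch.
		if err := http.ListenAndServe("localhost:6060", nil); err != nil {
			log.Printf("pprof server stopped: %v", err)
		}
	}()
}
```

A profile can then be pulled with e.g. go tool pprof http://localhost:6060/debug/pprof/profile?seconds=120.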

Action Points:

I will create a PR to change the order of some of the calls so that at least the CPU profile can be enabled before we start the TM node.

Profiling

As noted above, the InitChain process is the crux. Attached you can find 2 SVG files depicting a CPU profile (unfortunately GitHub doesn't support attaching SVG files, so I had to pack them in a ZIP).
As you can see, most of the work in this part is spent in x/bank/keeper InitGenesis. There are essentially 2 calls: ValidateBalance and SetBalances, and we call these 2 functions only on the accounts from the snapshot file. We can get rid of the first call, because the second call already does k.ClearBalances(ctx, addr) on every account from the snapshot. This alone drops the time by 50%.
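For context, a rough reconstruction of the loop in question (paraphrasing x/bank InitGenesis on the 0.40 line; the exact signatures may differ between SDK versions):

```go
import (
	sdk "github.com/cosmos/cosmos-sdk/types"
	"github.com/cosmos/cosmos-sdk/x/bank/keeper"
	"github.com/cosmos/cosmos-sdk/x/bank/types"
)

// InitGenesis (simplified): both calls below end up iterating over the
// account's existing balances, which is wasted work during a genesis import
// where every account starts out empty.
func InitGenesis(ctx sdk.Context, k keeper.Keeper, genState types.GenesisState) {
	for _, balance := range genState.Balances {
		addr, err := sdk.AccAddressFromBech32(balance.Address)
		if err != nil {
			panic(err)
		}
		// First call: walks the account's (empty) balances to validate them.
		if err := k.ValidateBalance(ctx, addr); err != nil {
			panic(err)
		}
		// Second call: SetBalances starts with k.ClearBalances(ctx, addr),
		// i.e. another walk over balances that don't exist yet at genesis.
		if err := k.SetBalances(ctx, addr, balance.Coins); err != nil {
			panic(err)
		}
	}
}
```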

Top called functions

Stats for 2min run

Showing top 15 nodes out of 127
      flat  flat%   sum%        cum   cum%
    34.54s 26.59% 26.59%     34.54s 26.59%  aeshashbody
    18.17s 13.99% 40.58%     18.17s 13.99%  memeqbody
    14.99s 11.54% 52.12%     65.20s 50.19%  runtime.mapaccess1_faststr
     5.66s  4.36% 56.47%     13.74s 10.58%  runtime.mapiternext
     5.22s  4.02% 60.49%     12.69s  9.77%  runtime.mallocgc
     4.07s  3.13% 63.63%    104.78s 80.66%  github.com/cosmos/cosmos-sdk/store/cachekv.(*Store).dirtyItems
     3.10s  2.39% 66.01%      3.10s  2.39%  cmpbody
     2.93s  2.26% 68.27%      7.37s  5.67%  runtime.scanobject
     2.24s  1.72% 69.99%      2.24s  1.72%  runtime.memclrNoHeapPointers
     2.13s  1.64% 71.63%      2.47s  1.90%  runtime.findObject
     1.88s  1.45% 73.08%      1.89s  1.45%  github.com/enigmampc/btcutil/bech32.bech32Polymod
     1.73s  1.33% 74.41%      5.87s  4.52%  github.com/tendermint/tm-db.IsKeyInDomain
     1.61s  1.24% 75.65%     15.85s 12.20%  runtime.stringtoslicebyte

Stats for 10min run

Showing top 15 nodes out of 84
      flat  flat%   sum%        cum   cum%
   200.77s 31.38% 31.38%    200.77s 31.38%  aeshashbody
    98.02s 15.32% 46.69%     98.02s 15.32%  memeqbody
    91.30s 14.27% 60.96%    384.03s 60.02%  runtime.mapaccess1_faststr
    27.81s  4.35% 65.31%     67.29s 10.52%  runtime.mapiternext
    22.53s  3.52% 68.83%     52.17s  8.15%  runtime.mallocgc
    22.26s  3.48% 72.31%    574.32s 89.75%  github.com/cosmos/cosmos-sdk/store/cachekv.(*Store).dirtyItems
    14.99s  2.34% 74.65%     14.99s  2.34%  runtime.evacuated (inline)
    14.54s  2.27% 76.92%     14.54s  2.27%  cmpbody
    11.09s  1.73% 78.66%     31.56s  4.93%  runtime.scanobject
    10.73s  1.68% 80.33%     11.78s  1.84%  runtime.findObject
     7.79s  1.22% 81.55%      7.79s  1.22%  runtime.memclrNoHeapPointers
     7.66s  1.20% 82.75%      7.66s  1.20%  runtime.nextFreeFast
     7.58s  1.18% 83.93%     75.39s 11.78%  runtime.stringtoslicebyte
     6.90s  1.08% 85.01%     61.88s  9.67%  runtime.rawbyteslice
     6.40s  1.00% 86.01%     25.14s  3.93%  github.com/tendermint/tm-db.IsKeyInDomain

Findings:

  • Most of the time (89%) is spent in github.com/cosmos/cosmos-sdk/store/cachekv.(*Store).dirtyItems.
    I haven't checked yet what it does; it sounds like cache management.
  • A lot of time is spent in map functions (the builtin map). It looks like we do a lot of serialization into maps.

    • aeshashbody is a primary function used by Go's built-in hash maps.

    • I'm curious in particular about this one: runtime.mapaccess1_faststr at 60.02%, i.e. a lot of map accesses (probably getting items from a map).

  • A significant amount of time (9%) is spent on memory allocation, which is really a lot!
  • 11% is spent converting string to []byte, also a lot!

Next steps

I removed a ClearBalances call and am now running a long profile to find out something about the other modules. For the moment I have a feeling that there is something wrong in IAVL.

Attachments

cpu-pprof-graphs.zip

I modified the code slightly, and it seems that there is something wrong with the IAVL iterator (see the new profiler graph). It looks like setting new balances for each account takes only a fraction of the time spent removing the balances. Both operations update the Merkle paths, so I'm wondering if there were any related changes in IAVL. Is anyone aware of any?

Should we inspect it further here?

cpu-35min-mod.pdf

I'm wondering if we should be hitting the cacheStore at all during initChain?

@alexanderbez here is the latest pprof.
@erikgrinaker - could you advise here if there are any IAVL changes which could be related here?

cpu-35min.zip

@zmanian - so what loading time should we anticipate? What is the "baseline"?

I will do a PR with the x/bank optimization.

Well a 2 hour node start up time is too much of a usability issue for the hub to upgrade to stargate.

I think we need to complete InitChain < 10 min.

@erikgrinaker - could you advise here if there are any IAVL changes which could be related here?

Well, there have been a lot of changes since the version the Hub is currently running, so it's hard to say. But we know that IAVL has pretty poor performance in general, so it's not entirely unexpected that importing the state is taking a long time. I'll dig into this a bit more tomorrow and see if there's any low-hanging fruit.

I'll also mention that we now have export/import APIs in IAVL, which should be much faster at importing data than creating nodes one-by-one, but this was introduced in more recent versions than the Hub is running.

@alexanderbez here is the latest pprof.
@erikgrinaker - could you advise here if there are any IAVL changes which could be related here?

cpu-35min.zip

So the bulk of this time is coming from the x/crisis invariant assertion, which operators are not obligated to run. Still, given that it takes so long to run, it is very worrisome. Thankfully it's not serialization-related like it was in the past with Amino; now it's most likely something to do with IAVL.

We should run a profile @robert-zaremba with invariant checking off. I believe you can pass this as a CLI flag or config? If not, just hard-code it to zero during app construction.


If we could make invariant checking optional during node startup we might be okay

We should run a profile @robert-zaremba with invariant checking off. I believe you can pass this as a CLI flag or config? If not, just hard-code it to zero during app construction.

I already did that earlier today. Copy-pasting the response from Discord:
Without invariant checking it's much faster: 10-15 min. Most of the work is done by an IAVL iterator, in both bank and crisis.
The invariants check takes about 30 min on my computer.

@zmanian - currently it's not configurable to disable the invariant checks. I will add that option tomorrow.

@erikgrinaker - I don't know if directly loading into IAVL is possible. We are loading a snapshot into the modules using JSON, and that JSON is converted upfront - the object serialization in 0.40.x changed.

@erikgrinaker - I don't know if directly loading into IAVL is possible. We are loading a snapshot into the modules using JSON, and that JSON is converted upfront - the object serialization in 0.40.x changed.

Ah, right, good point. Direct IAVL import would basically require the data to remain unchanged, since the exported data contains the internal IAVL tree structure. It might be possible to change the values as long as keys don't change, but I wouldn't really recommend modifying it.

okay. I think 10-15 min is survivable.

thanks guys!

I had a closer look, and as far as I can tell this isn't related to IAVL at all, but rather to the cachekv package in the SDK. There is some CPU usage in tm-db.IsKeyInDomain, but interestingly this function is not used at all in tm-db or IAVL (not sure why it's even there); it is only used in the SDK's cachekv package.
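For illustration, the hot path looks roughly like this (a simplified paraphrase of the cachekv iterator preparation, with stand-in types; not the exact SDK code):

```go
package cachekvsketch

import (
	dbm "github.com/tendermint/tm-db"
)

// Simplified stand-ins for the real cachekv.Store fields.
type cValue struct {
	value []byte
}

type Store struct {
	cache         map[string]*cValue  // cached reads/writes, keyed by string(key)
	unsortedCache map[string]struct{} // keys written since the last sort
	sortedKeys    [][]byte            // sorted view handed to the iterator (simplified)
}

// dirtyItems, roughly: every time an iterator is created it walks the whole
// unsorted cache, converting map keys back to []byte and checking them against
// the iteration domain. With hundreds of thousands of cached writes during
// InitChain this repeated scan is what the profile shows: mapiternext,
// mapaccess1_faststr, stringtoslicebyte and tm-db.IsKeyInDomain all near the top.
func (store *Store) dirtyItems(start, end []byte) {
	for key := range store.unsortedCache {
		if dbm.IsKeyInDomain([]byte(key), start, end) {
			_ = store.cache[key] // value lookup for the pair handed to the iterator
			store.sortedKeys = append(store.sortedKeys, []byte(key))
		}
	}
	// ...the collected keys are then sorted and merged into the iterator's view...
}
```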

Let me know if I've missed anything, or you'd like a hand with optimizing cachekv.

Thanks @erikgrinaker. I will see what's in cachekv. It's new for me.

@zmanian , @jackzampolin - I've pushed the update to both the gaia and cosmos-sdk branches to make the crisis invariants check customizable. This is done in the app constructor, NewGaiaApp:

crisis.NewAppModule(&app.CrisisKeeper, false),

Let me know if this is OK, or if you want to make it customizable through a CLI parameter.
With invariants disabled, the genesis loading takes ~10min.
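If a CLI parameter is preferred, here is a minimal sketch of one way to wire it up (the flag name is hypothetical and the appOpts plumbing is assumed to follow the usual server/app-creator pattern):

```go
import (
	"github.com/spf13/cast"

	servertypes "github.com/cosmos/cosmos-sdk/server/types"
)

// Hypothetical flag name for this sketch.
const flagSkipGenesisInvariants = "x-crisis-skip-assert-invariants"

// skipGenesisInvariants reads the flag from the application options;
// it defaults to false when the flag is unset.
func skipGenesisInvariants(appOpts servertypes.AppOptions) bool {
	return cast.ToBool(appOpts.Get(flagSkipGenesisInvariants))
}

// Inside NewGaiaApp, instead of hard-coding false:
//   crisis.NewAppModule(&app.CrisisKeeper, skipGenesisInvariants(appOpts)),
```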

Interesting. I'll take a look at cachekv. But I don't recall any changes to the core kv store logic, so this is concerning.

Here is a PR for the Cosmos-SDK (new branch, previous one was causing conflicts to merge to master): https://github.com/cosmos/cosmos-sdk/pull/7764

