Renovate: git submodules do not need to be initialized in order to be updated

Created on 14 Oct 2020 · 11Comments · Source: renovatebot/renovate

What would you like Renovate to be able to do?

I just want to start a discussion on this, because I think it would optimize runs for both self-hosted and the app. Currently, renovate attempts to initialize all submodules for all discovered/listed repos during every run, even though the submodule manager is disabled by default and isn't going to be part of updates for most projects.

This causes a lot of timeouts due to private submodules, proxies, SSH protocols etc. Imagine your bot accidentally picks up a mirror/fork of the boost repo with 150 submodules during auto-discovery. Submodule initialization is also not shallow by default, so this is an expensive operation. With a bit of git plumbing, a lot of this could be avoided.

Did you already have any implementation ideas?

submodules should not always be initialized in this initial block IMO, just check for the presence of .gitmodules to see if anything needs to be done (this is true even if you still wanted to initialize submodules for whatever reason).
https://github.com/renovatebot/renovate/blob/cc9d256ca5cd64255d28e25f525c0e1765ad2f75/lib/util/git/index.ts#L265-L273
You don't need to have a submodule initialized to get its currentValue, this only needs a tiny change to handle the -:
https://github.com/renovatebot/renovate/blob/cc9d256ca5cd64255d28e25f525c0e1765ad2f75/lib/manager/git-submodules/extract.ts#L107-L109
See docs:

Each SHA-1 will possibly be prefixed with - if the submodule is not initialized, + if the currently checked out submodule commit does not match the SHA-1 found in the index of the containing repository and U if the submodule has merge conflicts.
The recently introduced ls-remote approach that picks up the HEAD branch can be extended to get the hash of the remote HEAD. The second line always contains the hash.

https://github.com/renovatebot/renovate/blob/cc9d256ca5cd64255d28e25f525c0e1765ad2f75/lib/manager/git-submodules/extract.ts#L39-L42

Example:

$ git ls-remote --symref https://github.com/renovatebot/renovate.git HEAD
ref: refs/heads/master  HEAD
3b1ed2f4656ea216192f21ec3200ec079098d8aa        HEAD

This block is not needed, because you can update a submodule with update-index without initializing it:

https://github.com/renovatebot/renovate/blob/cc9d256ca5cd64255d28e25f525c0e1765ad2f75/lib/manager/git-submodules/update.ts#L11-L20

The equivalent command would be, reusing the hash from above if renovate was at deps/renovate (which will stage the submodule change):

git update-index --cacheinfo 160000,3b1ed2f4656ea216192f21ec3200ec079098d8aa,deps/renovate

See https://stackoverflow.com/q/62588847 for a more general discussion on this.

I feel like this would also narrow things down for failures in private submodules (https://github.com/renovatebot/renovate/issues/5471). It would be easier to rewrite URLs to http and reuse API tokens for specific ls-remote calls only.
Does all of this make sense or did I miss anything? Any specific reason why people would need submodules initialized?

Are there any workarounds or alternative ideas you've tried to avoid needing this feature?

Not sure, I think this approach would actually be simpler and much faster than the current workflow.

Is this a feature you'd be interested in implementing yourself?

Yes, but I'm slow :) if you don't see a draft PR by me, feel free to hack away

git-submodule priority-3-normal

Source

nejch

Most helpful comment

I agree, trying to detect when a package manager needs submodules loaded is going to be too difficult.

I'll send out a PR with the following changes:

Add an option cloneSubmodules (or similarly named) that defaults to true so it's not a breaking change
Add - as an allowed character here, so we can extract the current hash in cases where the submodule is not initialised
Change the git command here from git submodule to git update-index (This actually fixes a race condition as well!)
Initialise the submodules in git-submodules/update.ts if not already. This will give us more control to initialise individual submodules, instead of all at the same time.

Then in a separate PR, I'll alter the config option cloneSubmodules to be a string option eager/lazy/never where:

eager initialises submodules when cloning the repo
lazy initialises submodules before updateDependency
never only initialises submodules when updateing git-submodules dependencies.

I think that covers everything we discussed.

JamieMagee on 16 Oct 2020

👍2

All 11 comments

Thanks for the detailed report. I'll dive deeper into this tomorrow, but it all looks reasonable.

I've self assigned this issue for now.

JamieMagee on 14 Oct 2020

👍1

I've had a more in depth look at this. Here are my comments:

There are some valid reasons why you might want to initialise submodules. See #1356 for the initial issue, and #4353 for the initial implementation. Though I agree that it's not necessary for everyone.

I think I'll add a new configuration option, cloneSubmodules, which allows users to control whether submodules are initialised when cloning the repository. I'll set it to true by default, so as to not make this a breaking change
Makes sense. I love 1-character changes :+1:
You're talking about getting the latest version of the HEAD ref? That's out of scope for what the manager should do. Getting the updated ref is the concern of the datasource, and we currently use ls-remote there already.

https://github.com/renovatebot/renovate/blob/692eea18d441aa8ce657f959b2e9e540ae14f8fe/lib/datasource/git-submodules/index.ts#L29-L33
Thanks for the tip. I know that most git operations are just wrappers around update-index, but I never got around to fixing this.

JamieMagee on 15 Oct 2020

Thanks a lot for the quick analysis! That's great, just one more thought.

1. There are some valid reasons why you might want to initialise submodules. See #1356 for the initial issue, and #4353 for the initial implementation. Though I agree that it's not necessary for everyone.
   I think I'll add a new configuration option, `cloneSubmodules`, which allows users to control whether submodules are initialised when cloning the repository. I'll set it to true by default, so as to not make this a breaking change

I read the use case in the initial issue and personally I still think it's not necessary (we happen to do the same npm + submodule approach in a few repos). I think that this:

"dependencies": {
  "foo-bar": "file:./my-local-version-of-foo-bar"
}

is completely unaware of versioning, it just expects files there. It would be perfectly happy with a tarball unpacked/repo copy-pasted in there, so all of the versioning needs to be done with submodules separately and I think the same process described above could apply here. Then it's up to the consumers to do initalizing/npm installing in the right order in their own process.

But I understand that you'd want to be careful in order not to break things. Otherwise the only use-case I can think of where submodules would need to be initialized is if people had packagefiles as symlinks to something in the submodule... wait no, even then the submodule would need to be updated instead 😁

Edit: Hmm maybe I'm missing some use-cases with lock-files here, although mixing that with submodules sounds like the worst idea but maybe some people do that 😋

3. You're talking about getting the latest version of the `HEAD` ref? That's out of scope for what the manager should do. Getting the updated ref is the concern of the datasource, and we currently use `ls-remote` there already.
   https://github.com/renovatebot/renovate/blob/692eea18d441aa8ce657f959b2e9e540ae14f8fe/lib/datasource/git-submodules/index.ts#L29-L33

Ah, oops I see 🤦 even better, so this is already done :grin: Thanks again!

nejch on 15 Oct 2020

I think submodule checkout is nessesary when it comes to lockfile updates, as npm and other need the all transitive deps to resolve transitive packages

viceice on 15 Oct 2020

I think submodule checkout is nessesary when it comes to lockfile updates, as npm and other need the all transitive deps to resolve transitive packages

I realized that just after writing this, sorry my bad :) but having this as an option would be great

nejch on 15 Oct 2020

Can we delay submodule initialization until we need it, e.g. only on runs when artifacts like lock files are updated?

Also we could have the new option with options lazy, eager and never to represent the three use cases of lazily initializing before artifact updates, eager initializing at the start, or never initializing.

rarkins on 15 Oct 2020

should be possible, can maybe done in worker layer before artifacts generation (if there is any update which uses artifacts)

viceice on 15 Oct 2020

Okay, so I've spent a bit more time digging into this, as well as git internals.

I like the idea of lazy/eager loading of git submodules, but where is the correct place to lazily load? I think it will have to be before updateDependency to replicate the current behaviour, while still providing some performance benefits.

JamieMagee on 16 Oct 2020

crazy idea: can you match the submodule paths extracted from .gitmodules against local paths in packagefiles/lockfiles and initialize them based on that as needed? Just throwing it out there, but that would probably require some insane amounts of parsing and would be brittle, though :P not sure how involved your current implementation is for this.

Otherwise I guess there are 2 almost separate issues here, I hope I'm not conflating this too much.
a) submodules as their own candidates for updates via the submodules manager (my use case above)
b) submodules that need to be initialized purely for other managers to function properly (the issue you linked)

nejch on 16 Oct 2020

I don't think we can reliably do that matching because we don't always know when some package file has made a relative requires statement pointing into a submodule directory. i.e. use case (b) you mentioned.

@JamieMagee maybe implementing a new option initalizeGitSubmodules with values always (current behavior) and never (for @nejch's use case) would be a good first step, and lazy initialization can be deferred into another feature request?

rarkins on 16 Oct 2020

I agree, trying to detect when a package manager needs submodules loaded is going to be too difficult.

I'll send out a PR with the following changes:

Add an option cloneSubmodules (or similarly named) that defaults to true so it's not a breaking change
Add - as an allowed character here, so we can extract the current hash in cases where the submodule is not initialised
Change the git command here from git submodule to git update-index (This actually fixes a race condition as well!)
Initialise the submodules in git-submodules/update.ts if not already. This will give us more control to initialise individual submodules, instead of all at the same time.

Then in a separate PR, I'll alter the config option cloneSubmodules to be a string option eager/lazy/never where:

eager initialises submodules when cloning the repo
lazy initialises submodules before updateDependency
never only initialises submodules when updateing git-submodules dependencies.

I think that covers everything we discussed.

JamieMagee on 16 Oct 2020

👍2

Was this page helpful?

0 / 5 - 0 ratings