Terraform: Terraform sparse checkout module

Created on 4 Nov 2018 · 13 Comments · Source: hashicorp/terraform

Current Terraform Version

v0.11.7

Use-cases

Support sparse checkout of a Terraform module, with the ability to specify clone depth. While Terraform suggests one module per repo, some orgs prefer to manage multiple related modules together. This also gives a faster feedback cycle, keeps related pull requests in one repo, etc.

Attempted Solutions

Couldn't find anything relevant.

Proposal

Possibly we can evolve source in a backward-compatible way, so that the existing form keeps working:

module "dynamo-auto" {
     source = "git::https://github.com/cloudposse/terraform-aws-dynamodb-autoscaler.git?ref=master"
}

and also

# Proposed syntax only; Terraform does not accept a map for source.
module "dynamo-auto" {
     source = {
       repo  = "git::https://github.com/cloudposse/terraform-aws-dynamodb-autoscaler.git?ref=master"
       depth = 1
       path  = "/modules/dynamo"
     }
}

which would allow a sparse checkout of /modules/dynamo as the relevant Terraform module.

cli config enhancement

Most helpful comment

Hi @rverma-nikiai! Thanks for sharing this use case.

The git module source actually already supports a syntax for selecting a sub-path from a repository, like this:

module "dynamo-auto" {
     source = "git::https://github.com/cloudposse/terraform-aws-dynamodb-autoscaler.git//modules/dynamo?ref=master"
}

The extra //... portion of the path at the end is interpreted as a subdirectory within the repository.

That then just leaves the request for shallow cloning. The git handling is all done by a component which parses only the source string, so additional git-related settings must be packed inside that pseudo-query-string argument at the end, which means a hypothetical new option might look something like this:

module "dynamo-auto" {
     source = "git::https://github.com/cloudposse/terraform-aws-dynamodb-autoscaler.git//modules/dynamo?ref=master&depth=1"
}

However, I think Terraform's module installer does a full clone by default just because when it was written the "shallow clone" functionality was relatively new and limited, and we wanted to be sure of proper behavior on subsequent commands such as upgrading the module, which requires running operations like git fetch.

I think we should investigate whether the improved shallow clone behavior added in Git 1.9 (now several years old) is featureful enough that we could enable shallow cloning _by default_ in a future release, since the module installer's goal is always to install just the single version you requested, rather than to create a fully-fledged development environment for that repository. Before making that decision, we'll need to prototype it to make sure the upgrading behavior is well-behaved after a shallow, single-branch clone.

We are in the early stages of planning some other changes to how Terraform manages configuration dependencies for a future release, so I'm going to label this one to remind us to consider this use-case as part of that work.

All 13 comments

@apparentlymart, though Terraform module sources support a subdirectory like

 source = "git::https://github.com/cloudposse/terraform-aws-dynamodb-autoscaler.git//modules/dynamo?ref=master&depth=1"

it still clones the whole repo and then references the path locally, which is useful. But it still defeats the purpose of sparse checkout, which provides various benefits.

Just some thoughts: consider three module definitions in main.tf

module "dynamo-auto" {
     source = "git::https://github.com/cloudposse/terraform-aws-autoscaler.git//modules/dynamo?ref=master"
}
module "rds-auto" {
     source = "git::https://github.com/cloudposse/terraform-aws-autoscaler.git//modules/rds?ref=master"
}



module "es-auto" {
     source = "git::https://github.com/cloudposse/terraform-aws-autoscaler.git//modules/es?ref=master"
}

Currently this will cause 3 complete clones of terraform-aws-autoscaler.git, so we have at least 2 redundant copies of each module on disk, and git clone is run 3 times as well.

Possibly the init step could prebuild the sparse-checkout info, resulting in 1 clone containing just the 3 module directories.
One major flaw I can see is that if we misspell any module path, sparse checkout will ignore it without warning.

Anyway, shallow cloning on its own would be a huge improvement.

Hi again @rverma-nikiai! Thanks for the additional context.

It looks like you are interested in several slightly different (but related) problems here:

  1. Terraform clones exactly the same git repository multiple times over the network, which is slow.
  2. Terraform clones the entire history of the repository, even though it only uses the latest commit on the given branch.
  3. Terraform clones the entire source tree in the repository, even though only a sub-path is requested.
  4. The same repository is stored on local disk multiple times.

The first of these has already been addressed in master and will be included in the forthcoming v0.12.0 release: Terraform will now detect that all of these are coming from the same repository and only run git clone once.

Point 2 I think we can solve after we do some testing to make sure that --depth 1 doesn't have any unwanted consequences for the update step. This is what my previous comment was about.

Point 3 here is intentional because in a multi-module repository the different modules will often refer to one another with references like source = "../es" and so we need to have the entire repository on disk to resolve references like that.
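As a rough illustration of why the whole tree is needed, a module in such a repository might contain a block like the one below. The file location and the module name "es" are assumptions made for this sketch, not taken from a real repository.

```hcl
# modules/dynamo/main.tf (hypothetical file in a multi-module repository)

# A sibling module referenced by relative path. Terraform resolves "../es"
# against the files already on disk, so a sparse checkout containing only
# modules/dynamo would leave this source pointing at a missing directory.
module "es" {
  source = "../es"
}
```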

Point 4 is another one we can address eventually. For v0.12.0 we've switched to a directory naming scheme that reflects the module names in source code so that error messages (which now contain source location references) are more easily understandable. The new mechanism I mentioned for point 1 doesn't yet address this, since we wanted to keep things relatively simple for the first pass, but that mechanism could also potentially use additional techniques to share the files on disk between multiple copies of the same source. We intend to investigate that further in a later release.

In order to keep things focused, let's say that this issue is about the second point, since I think that's the one that is in most need of some further study/prototyping. I expect we will also make a separate issue for the 4th point at a later date, once we've got some experience with this new download optimization fix in v0.12 and can potentially address any other concerns related to it at the same time.

Point 1 here was originally discussed in #11435, which is now closed due to the fix being ready for release.

@apparentlymart Now that https://github.com/hashicorp/go-getter/pull/140 has been merged, any chance we can get terraform's vendoring updated to add support for shallow clones? I'm happy to open a PR including a docs update, I just need to know which target branch would be most appropriate at the moment.

It looks like #20411 updated the go-getter version to include the shallow clone functionality. It is in 0.12 beta1. I'm looking forward to using this in 0.12, thanks! We can probably close this issue out.
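For anyone wanting to try it, here is a minimal sketch of what the option might look like in a module block. It assumes the depth argument is accepted in the same pseudo-query-string as ref, as proposed earlier in this thread, and it reuses the modules/dynamo sub-path from that example. Whether ref and depth combine cleanly may depend on the Terraform and go-getter versions in use.

```hcl
# Sketch only: assumes a Terraform build whose vendored go-getter
# understands the "depth" argument (per this thread, 0.12 beta1 onward).
module "dynamo-auto" {
  # "//modules/dynamo" selects the subdirectory within the repository;
  # "depth=1" asks for a shallow clone instead of the full history.
  source = "git::https://github.com/cloudposse/terraform-aws-dynamodb-autoscaler.git//modules/dynamo?ref=master&depth=1"
}
```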

Since that other PR wasn't intentionally updating go-getter to address this issue, it therefore didn't update Terraform's module sources documentation to mention this new option. We'll need to do _that_ at least before considering this done.

I'd also still like to investigate whether that option is necessary at all or if we can just make that behavior the default. Since we're not cloning the repository for _development_ it seems unnecessary to produce a fully-functioning work tree by default, and in the rare case where someone _does_ want to work directly with the cloned repository in .terraform/modules it only takes a couple git commands to fetch the full history if needed.

Ah you're right I forgot about documentation, and I would fully support using depth=1 by default. I can't think of any reasonable situations where a user would need the full history in .terraform/modules.

Surely the source can be cached as well as put in a lookup during execution. If the lookup contains the same URL key, just use the cached copy; if not, source it and add it to the cache... this surely can't be hard to do versus downloading an identical copy from the internet over and over and over each time :/

@apparentlymart

Is there a GitHub issue tracking the following point?

2. Terraform clones the entire history of the repository, even though it only uses the latest commit on the given branch.

I'm using the terraform-google-modules/gcloud/google module which ends up downloading hundreds of megabytes of history since the github repo contains gcloud binaries and grows in size significantly with every version bump.

That is the way Git works: it is distributed, so it copies all history locally. There is no way around this the first time.

It would be really nice to have shallow clone as default option for cloning Terraform modules. @apparentlymart can you tell whether there are any plans to implement it anytime soon?

Nobody on the Terraform team at HashiCorp is currently working on this, because our attentions are currently elsewhere.

As I mentioned before, the main trick here is making sure that shallow clone won't break the ability for terraform init -upgrade to roll forward to a newer commit when a shallow tree is already present on disk. I don't know yet how that will behave, and I think understanding that behavior is the main blocker for deciding whether we can make this change. If someone is motivated to work on this, I'd suggest the following approach to get into a state where it's possible to test and experiment:

  • Create a local branch of go-getter, which is the library that implements the Git fetching in Terraform.
  • In your Terraform work tree, temporarily edit go.mod to include a replace directive referring to your local go-getter tree, so your local Terraform builds will see the go-getter changes you're making locally:

    replace github.com/hashicorp/go-getter => ../go-getter
    
  • Change the logic in GitGetter to enable shallow cloning unconditionally. (I don't have exact details on this step, because I've not looked closely at the logic in there yet.)

  • Build Terraform against the locally-modified go-getter and experiment with terraform init and terraform init -upgrade to make sure they are both still working as expected.
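For the last step, a throwaway configuration along these lines could serve as the test fixture. This is only a sketch: the repository URL, sub-path, and module name are placeholders, and it assumes you control a test repository that you can push new commits to.

```hcl
# main.tf in a scratch directory, used only to exercise the module installer.
# The URL and sub-path below are hypothetical; substitute a repository you control.
module "shallow_test" {
  # Track a branch rather than a fixed tag, so that after pushing a new commit
  # to that branch, `terraform init -upgrade` has something to roll forward to.
  # No depth argument is given: the locally modified GitGetter is expected to
  # shallow-clone unconditionally.
  source = "git::https://example.com/your-org/shallow-test.git//modules/example?ref=master"
}
```

Running terraform init, pushing a commit to the tracked branch, and then running terraform init -upgrade against the same directory should show whether the shallow tree under .terraform/modules can be updated in place.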

If the above is fruitful and it seems like making shallow clone the default would work, I expect the final change to go-getter would need to make it conditional via a flag field in the GitGetter type so that Terraform can enable it without forcing that behavior on other go-getter callers. We can then change Terraform's own instantiation of that getter to set the new flag, making that behavior always be activated for Terraform's module installer.

If someone is interested in working on this but needs some more guidance, please let me know what specific questions you have and I can try to answer them as best I can with what I know already.

Since https://github.com/hashicorp/terraform/issues/11435 is closed, I'd like to go slightly off-topic here and share a small pre-Terraform routine utility that optimizes init (module downloads) for git modules: https://github.com/hayorov/terraform-init-booster
