I'm starting to place our infrastructure into a CI/CD pipeline with steps for validate -> plan (with planfile output) -> apply.
At first I thought I could simply run terragrunt plan-all -out "planfile" in one step, save some plan file, and then run terragrunt apply-all -input=false "planfile" in another step. However, it seems that terragrunt actually places these plan files into the individual module .terragrunt-cache directories.
Next I provided an absolute path like terragrunt plan-all -out /my/plan/file which seems to generate the single file I'm after.
However, on the next step of trying to apply, since it's a single plan I expect I should use terragrunt apply /my/plan/file as opposed to the apply-all syntax. Doing this actually fails:
$ cd $CI_COMMIT_REF_NAME
$ terragrunt apply -input=false "$CI_PROJECT_DIR/planfile"
[terragrunt] [/builds/path/to/infrastructure-live/dev] 2019/05/31 22:39:40 Running command: terraform --version
[terragrunt] 2019/05/31 22:39:40 Reading Terragrunt config file at /builds/path/to/infrastructure-live/dev/terraform.tfvars
[terragrunt] 2019/05/31 22:39:40 Did not find any Terraform files (*.tf) in /builds/path/to/infrastructure-live/dev
[terragrunt] 2019/05/31 22:39:40 Unable to determine underlying exit code, so Terragrunt will exit with error code 1
ERROR: Job failed: exit code 1
I'm not sure if I should actually be letting the individual modules deal with plan files or trying to consolidate into this single file.
Is there a best practice for how to handle this situation?
plan-all simply iterates over the subdirectories, looking for those that have tfvars files and running the equivalent of terragrunt plan. Each CLI arg that isn't understood by terragrunt is passed through to the individual plan.
Given that, when you give an absolute path for the plan output, each plan is outputting to that file with replacement. This means that the last plan that runs is the one that is actually stored there, which isn't what you expect.
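To make the overwrite concrete, here is a small shell simulation that does not call terragrunt at all. The `fake_plan` function is a hypothetical stand-in for one module's plan step, and `vpc` and `eks` are used purely as illustrative module names. It only shows what happens when every module is handed the same output path:

```shell
#!/bin/sh
# Simulation only: fake_plan is a hypothetical stand-in for one module's
# `terragrunt plan -out <path>`; it just records which module wrote the file.
fake_plan() {
    module="$1"
    out="$2"
    # ">" truncates the target, mimicking a plan replacing an existing plan file.
    printf 'plan for %s\n' "$module" > "$out"
}

workdir="$(mktemp -d)"

# Every module receives the same absolute path, as happens when the
# -out argument is passed through to each individual plan.
for module in vpc eks; do
    fake_plan "$module" "$workdir/planfile"
done

# Only the last module's plan is left in the file.
cat "$workdir/planfile"
```

Running this prints `plan for eks`: the `vpc` plan was written first and then replaced.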
This use case that you are describing (having a single plan output from plan-all) is a feature that isn't currently supported. In fact, we generally discourage the usage of the xxx-all flavor for day to day operational usage. We designed the xxx-all variants with the specific purpose of standing up the stack from scratch. Once the infra is up, we recommend running the individual plans and applies manually.
There are a few reasons for this:
plan-all output is broken for certain use cases
If you have interdependencies between the modules, then the plan output from plan-all will be wrong. This is because plan-all does not do any apply in between. So let's consider the following scenario. Suppose you had the following folder structure, where vpc manages a VPC and eks will manage an EKS cluster in that VPC. The two are linked through a terragrunt dependency chain.
.
├── account.tfvars
├── eks
│   └── terraform.tfvars
└── vpc
    └── terraform.tfvars
In this scenario, when you run terragrunt plan-all, what you will see is the output of running terragrunt plan in the vpc folder, and then the output of running terragrunt plan in the eks folder.
Now suppose that nothing is deployed. In this case, the latter plan will fail because the vpc isn't deployed yet, so it doesn't know what to do for some resources (the VPC lookup will fail if you are using terraform_remote_state).
What about if we had already deployed resources? In this case the plan command will successfully complete because it has all the information available in the state and so there will be no errors.
However, if the vpc has some changes that affect the eks module, then this plan is WRONG. This is simply because the plan for the eks module is based off of the current state, not the future state. When you do apply-all, the vpc changes will apply, which will modify the state and thus change what terraform will do for the eks module when it applies that module.
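The stale-plan problem can be sketched with a toy simulation. Nothing below calls terraform or terragrunt: a plain file stands in for the remote state, and the subnet values are made up for illustration. The point is only the ordering, where both plans are computed before anything is applied:

```shell
#!/bin/sh
# Simulation only: a plain file stands in for remote state, and the
# subnet names are hypothetical values chosen for illustration.
state="$(mktemp -d)"
echo "subnet-old" > "$state/vpc.output"    # what the deployed vpc currently exports

# "plan-all": both plans are computed before anything is applied.
vpc_plan="subnet-new"                                     # pending vpc change
eks_plan="attach cluster to $(cat "$state/vpc.output")"   # eks reads the CURRENT state

# "apply-all": vpc is applied first, which changes the state eks depends on.
echo "$vpc_plan" > "$state/vpc.output"
eks_actual="attach cluster to $(cat "$state/vpc.output")"

echo "planned: $eks_plan"
echo "actual:  $eks_actual"
```

The planned line mentions `subnet-old` while the actual line mentions `subnet-new`, which is the mismatch described above: the eks plan was reviewed against a state that no longer exists by the time eks is applied.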
So in practice, plan-all is something that you really shouldn't use unless you know exactly what you are looking for.
We have discussed ways to address this concern in the past. You can see https://github.com/gruntwork-io/terragrunt/issues/262 for the relevant thread.
If you are always running xxx-all, should that be a single module?
The whole reason to componentize your infrastructure is to localize changes and the blast radius. You are trying to minimize operator error from changes caused in one component to spill over to other components. Using the xxx-all variants is roughly the equivalent of having all the code defined in a single stack, only you lose all the benefits of doing so (can't cross reference modules, not treated as a single atomic operation, etc.) and get all the problems (slow to do anything, harder to read / parse the output, less secure because you now need permissions for everything not just the small subset, etc.). Of course, there are certain situations where this is useful to do (a massive version bump across the stack, updating OS AMI across the stack, etc.) and for those cases the xxx-all variant exists, but it is usually not recommended to do this all the time.
You can see our post 5 Lessons Learned From Writing Over 300,000 Lines of Infrastructure Code for more reasons why you don't want a single stack.
More error prone
Using the xxx-all variant on a regular basis depends on having all the dependencies between all your modules defined clearly in the terragrunt config. Otherwise, you can have very subtle bugs where the apply happens in the wrong order, leading to partial rollout. Since there is no syntactic support for this, you are much more likely to get the dependency chain wrong. Basically, the added complexity introduces a lot more room to make mistakes with the xxx-all variants of the commands.
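As a rough illustration of why the declared dependencies matter, the sketch below uses the standard `tsort` utility to derive an apply order from a list of "must come before" pairs. This is not how terragrunt computes its order internally. The pair list is a hypothetical stand-in for the dependencies declared in the terragrunt config, and `app` is a made-up third module:

```shell
#!/bin/sh
# Illustration only: each input line reads "<dependency> <dependent>".
# The pairs are hypothetical stand-ins for dependencies declared in the
# terragrunt config; "app" is a made-up module that depends on eks.
declared_pairs='vpc eks
eks app'

# tsort reads the pairs on stdin and prints an order that respects them,
# one module per line.
apply_order="$(printf '%s\n' "$declared_pairs" | tsort)"

echo "$apply_order"
```

With both pairs declared, the order comes out as vpc, then eks, then app. If the `vpc eks` line were left out, nothing would force vpc to be applied before eks, which is the kind of subtle ordering bug described above.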
CI/CD Pipeline
So going back to your original question about CI/CD best practices, this is an area that we are actively working on. In general, we find that it is much harder to set up a generic CI/CD pipeline for arbitrary infrastructure changes. This is mostly because:
Right now, our general recommendation is to build very limited CI/CD pipelines for infrastructure that changes often. For example, you can have a CI/CD pipeline just for rolling out your application. Or you can have one just for making changes to the ASG.* This can help you work around a lot of the issues mentioned above.
In this model, you are unlikely to use the plan-all variant, and instead will be using plan and apply on a single module. In this case, it will work very similarly to plain terraform, although the main difference is that you will most likely want to use absolute paths to store the plan output so that it is findable (as you discovered).
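A minimal sketch of that single-module flow, assuming the plan and apply steps run as separate CI jobs that share a workspace. The function names, the `dev/vpc` module path, and the `vpc.tfplan` file name are all hypothetical; only the `plan -out` and `apply -input=false` invocations come from the commands quoted in the question:

```shell
#!/bin/sh
# Sketch only: plan and apply ONE module, handing the plan file between
# CI steps via an absolute path. Function and file names are hypothetical.

plan_module() {
    module_dir="$1"
    plan_file="$2"    # absolute path, so the later apply step can find it
    ( cd "$module_dir" && terragrunt plan -input=false -out "$plan_file" )
}

apply_module() {
    module_dir="$1"
    plan_file="$2"
    # Run from the module directory itself. The question ran apply from the
    # parent dev folder, which appears to be why no *.tf files were found.
    ( cd "$module_dir" && terragrunt apply -input=false "$plan_file" )
}

# Example calls (not executed here; paths are assumptions):
#   plan_module  "$CI_PROJECT_DIR/dev/vpc" "$CI_PROJECT_DIR/vpc.tfplan"
#   apply_module "$CI_PROJECT_DIR/dev/vpc" "$CI_PROJECT_DIR/vpc.tfplan"
```

Each function runs in a subshell, so the caller's working directory is left unchanged between steps.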
* Note that we don't have any special tooling around implementing such a pipeline (e.g., we don't have scripts that autodetect a change and route it to the relevant pipeline).
I really appreciate the extremely thorough writeup! You've made some awesome points.
You're right that most CI/CD doesn't have manual triggering as a first class citizen. We have the capability, but it seems that there are plenty of other reasons why we should avoid this approach for now.
Again, thanks for the top to bottom explanation!