Terragrunt: CI/CD Best Practice for plan-all output?

Created on 1 Jun 2019 · 2Comments · Source: gruntwork-io/terragrunt

I'm starting to place our infrastructure into a CI/CD pipeline with steps for validate -> plan (with planfile output) -> apply.

At first I thought I could simply run terragrunt plan-all -out "planfile" in one step, save some plan file, and then run terragrunt apply-all -input=false "planfile" in another step. However, it seems that terragrunt actually places these plan files into the individual module .terragrunt-cache directories.

Next I provided an absolute path like terragrunt plan-all -out /my/plan/file which seems to generate the single file I'm after.

However, on the next step of trying to apply, since it's a single plan I expect I should use terragrunt apply /my/plan/file as opposed to the apply-all syntax. Doing this actually fails:

$ cd $CI_COMMIT_REF_NAME
$ terragrunt apply -input=false "$CI_PROJECT_DIR/planfile"
[terragrunt] [/builds/path/to/infrastructure-live/dev] 2019/05/31 22:39:40 Running command: terraform --version
[terragrunt] 2019/05/31 22:39:40 Reading Terragrunt config file at /builds/path/to/infrastructure-live/dev/terraform.tfvars
[terragrunt] 2019/05/31 22:39:40 Did not find any Terraform files (*.tf) in /builds/path/to/infrastructure-live/dev
[terragrunt] 2019/05/31 22:39:40 Unable to determine underlying exit code, so Terragrunt will exit with error code 1
ERROR: Job failed: exit code 1

I'm not sure if I should actually be letting the individual modules deal with plan files or trying to consolidate into this single file.

Is there a best practice for how to handle this situation?

question

Source

wollerman

Most helpful comment

plan-all simply iterates over the subdirectories, looking for those that have tfvars files and running the equivalent of terragrunt plan. Each CLI arg that isn't understood by terragrunt is passed through to the individual plan.

Given that, when you give an absolute path for the plan output, each plan is outputting to that file with replacement. This means that the last plan that runs is the one that is actually stored there, which isn't what you expect.

This use case that you are describing (having a single plan output from plan-all) is a feature that isn't currently supported. In fact, we generally discourage the usage of the xxx-all flavor for day to day operational usage. We designed the xxx-all variants with the specific purpose of standing up the stack from scratch. Once the infra is up, we recommend running the individual plans and applies manually.

There are a few reasons for this:

plan-all output is broken for certain use cases

If you have interdependencies between the modules, then the plan output from plan-all will be wrong. This is because plan-all does not do any apply in between. So let's consider the following scenario. Suppose you had the following folder structure, where vpc manages a VPC and eks will manage an EKS cluster in that VPC. The two are linked through a terragrunt dependency chain.

.
├── account.tfvars
├── eks
│   └── terraform.tfvars
└── vpc
    └── terraform.tfvars

In this scenario, when you run terragrunt plan-all, what you will see is the output of running terragrunt plan in the vpc folder, and then the output of running terragrunt plan in the eks folder.

Now suppose that nothing is deployed. In this case, the latter plan will fail because the vpc isn't deployed yet, so it doesn't know what to do for some resources (the VPC lookup will fail if you are using terraform_remote_state).

What about if we had already deployed resources? In this case the plan command will successfully complete because it has all the information available in the state and so there will be no errors.

However, if the vpc has some changes that affect the eks module, then this plan is WRONG. This is simply because the plan for the eks module is based off of the current state, not the future state. When you do apply-all, the vpc changes will apply, which will modify the state and thus change what terraform will do for the eks module when it applies that module.

So in practice, plan-all is something that you really shouldn't use unless you know exactly what you are looking for.

We have discussed in the past of trying to address this concern. You can see https://github.com/gruntwork-io/terragrunt/issues/262 for the relevant thread.

If you are always running `xxx-all`, should that be a single module?

The whole reason to componentize your infrastructure is to localize changes and the blast radius. You are trying to minimize operator error from changes caused in one component to spill over to other components. Using the xxx-all variants is roughly the equivalent of having all the code defined in a single stack, only you lose all the benefits of doing so (can't cross reference modules, not treated as a single atomic operation, etc) and get all the problems (slow to do anything, harder to read / parse the output, less secure because you now need permissions for everything not just the small subset, etc). Of course, there are certain situations where this is useful to do (a massive version bump across the stack, updating OS AMI across the stack, etc) and for those cases the xxx-all variant exists, but it is usually not recommended to do this all the time.

You can see our post 5 Lessons Learned From Writing Over 300,000 Lines of Infrastructure Code for more reasons why you don't want a single stack.

More error prone

Using the xxx-all variant on a regular basis depends on having all the dependencies between all your modules defined clearly in the terragrunt config. Otherwise, you can have very subtle bugs where the apply happens in the wrong order, leading to partial rollout. Since there is no syntactic support for this, you are much more likely to get the dependency chain wrong. Basically, the added complexity introduces a lot more room to make mistakes with the xxx-all variants of the commands.

CI/CD Pipeline

So going back to your original question about CI/CD best practices, this is an area that we are actively working on. In general, we find that it is much harder to setup a generic CI/CD pipeline for arbitrary infrastructure changes. This is mostly because:

The cost of a mistake is catastrophic. If you rollout the wrong version of your app, you can oftentimes rollback to a previous version. If you rollout the wrong infra code that destroys your VPC (or worse, your database), you have wiped out your entire app, potentially all the data with it.
Violates principle of least privileges. If you want to support arbitrary infra code CI/CD, your CI/CD system needs the privileges to deploy that. This ranges all the way from creating EC2 instances, to read/write access to IAM roles/users/etc. Pretty soon your CI system will need admin level privilege on your AWS account, which is probably not something you want for too long.
Terraform will fail. Very often. Failed deployments are VERY common with infra code. Most CI/CD systems do not have first class support for error handling and retries.
Manual steps are necessary. Unlike an app deploy, infra deploys typically require many gates and approval chains. This means that your pipeline needs to be broken up with approval flows that wait for approving steps (approving plan output, approval from multiple entities, etc). This is again not something that most CI/CD systems have first class support for.

Right now, our general recommendation is to build very limited CI/CD pipelines for infrastructure that changes often. For example, you can have a CI/CD pipeline just for rolling out your application. Or you can have one just for making changes to the ASG.* This can help you workaround a lot of the issues mentioned above.

In this model, you are unlikely to use the plan-all variant, and instead will be using plan and apply on a single module. In this case, it will work very similar to plain terraform, although the main difference is that you will most likely want to use absolute paths to store the plan output so that it is findable (as you discovered).

* Note that we don't have any special tooling around implementing such a pipeline (e.g we don't have scripts that autodetects the change to route to the relevant pipeline).

yorinasub17 on 1 Jun 2019

👍35 ❤11

All 2 comments

There are a few reasons for this:

plan-all output is broken for certain use cases

.
├── account.tfvars
├── eks
│   └── terraform.tfvars
└── vpc
    └── terraform.tfvars

What about if we had already deployed resources? In this case the plan command will successfully complete because it has all the information available in the state and so there will be no errors.

So in practice, plan-all is something that you really shouldn't use unless you know exactly what you are looking for.

We have discussed in the past of trying to address this concern. You can see https://github.com/gruntwork-io/terragrunt/issues/262 for the relevant thread.

If you are always running `xxx-all`, should that be a single module?

You can see our post 5 Lessons Learned From Writing Over 300,000 Lines of Infrastructure Code for more reasons why you don't want a single stack.

More error prone

CI/CD Pipeline

The cost of a mistake is catastrophic. If you rollout the wrong version of your app, you can oftentimes rollback to a previous version. If you rollout the wrong infra code that destroys your VPC (or worse, your database), you have wiped out your entire app, potentially all the data with it.
Violates principle of least privileges. If you want to support arbitrary infra code CI/CD, your CI/CD system needs the privileges to deploy that. This ranges all the way from creating EC2 instances, to read/write access to IAM roles/users/etc. Pretty soon your CI system will need admin level privilege on your AWS account, which is probably not something you want for too long.
Terraform will fail. Very often. Failed deployments are VERY common with infra code. Most CI/CD systems do not have first class support for error handling and retries.
Manual steps are necessary. Unlike an app deploy, infra deploys typically require many gates and approval chains. This means that your pipeline needs to be broken up with approval flows that wait for approving steps (approving plan output, approval from multiple entities, etc). This is again not something that most CI/CD systems have first class support for.

* Note that we don't have any special tooling around implementing such a pipeline (e.g we don't have scripts that autodetects the change to route to the relevant pipeline).

yorinasub17 on 1 Jun 2019

👍35 ❤11

I really appreciate the extremely thorough writeup! You've made some awesome points.

You're right that most CI/CD doesn't have manual triggering as a first class citizen. We have the capability, but it seems that there are plenty of other reasons why we should avoid this approach for now.

Again, thanks for the top to bottom explanation!

wollerman on 3 Jun 2019

👍2

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Terraform "-no-color" option not passed to terraform init

mpkerr · 3Comments

Treat variables defined in leaf terraform.tfvars as the highest precedent variables

skang0601 · 4Comments

Add an http backend provider that can do encryption

brikis98 · 4Comments

Terragrunt does not allow new `-chdir` option

zachwhaley · 3Comments

Terragrunt moves stuff around in backend.tf between command runs

dingobar · 3Comments

Terragrunt: CI/CD Best Practice for plan-all output?

Most helpful comment

plan-all output is broken for certain use cases

If you are always running xxx-all, should that be a single module?

More error prone

CI/CD Pipeline

All 2 comments

plan-all output is broken for certain use cases

If you are always running xxx-all, should that be a single module?

More error prone

CI/CD Pipeline

Related issues

If you are always running `xxx-all`, should that be a single module?

If you are always running `xxx-all`, should that be a single module?