Terragrunt: Best Practice Question: Single "stack" module with tfvars versus current recommendations?

Created on 21 Dec 2018 · 11 Comments · Source: gruntwork-io/terragrunt

So the "best practice" layout, as defined in the repository I've seen with terragrunt, has each module in its own folder with its own tfvars per environment.

My problem here is I have a lot of inter-module dependencies, and while Terraform handles those just fine together, doing it the "terragrunt" way means I have to specify the dependencies manually per the documentation.

So what I've been doing is creating a "stack" module at the root that defines all my downstream variables and configuration for the various child modules, and defining just a single .tfvars for that stack module per environment: one each for prod, dev, and test.

The only potential downside I saw with this is the larger state files I suppose vs. breaking it out, but my problem was not a large terraform state, I just wanted to easily define separate environments.

The other downside that was mentioned is that you can't "easily" add new modules to this configuration without affecting other environments, but I disagree, because I can peg prod or test to a version or branch and then modify the "stack" module just fine.

What other downsides are there to this "stack" approach, and why is the current best practice recommended when, in practice so far, I run into more "breaking" changes because I have to manually specify the dependencies instead of terraform just "figuring it out"? Am I not doing it right?

Thanks!

question

JustinGrote

Most helpful comment

We generally have a repository per stack, more for access control than anything else (different teams are responsible for different stacks); there's no reason they couldn't all be one big repo. Examples of stacks would be "network-core", "app-HR", "app-website", etc., in some cases separated into frontend and backend, but again only for access control if that app doesn't have full-stack developers.

There's usually a "terraform" root folder, as some stack repositories use other tools (ansible, etc.) but it's not required, we leave that up to the stack developer.

- terraform
  - modules
    - MyEC2Instance
      - variables.tf
      - main.tf
      - output.tf
      - dependencies.tf (optional) - all "data" references and remote_state references go here
      - MyEC2Instance.tests.ps1 - Pester Unit Tests to validate setup
    - MyS3Pipeline
    - etc.
  - variables.tf (stack)
  - main.tf (stack)

The stack main.tf is pretty much all module calls, with the variables in variables.tf. These are the "knobs" like size, instance count, etc., and usually 99% have defaults (except for things like "name"). This ensures all the terraform-y dependencies are there, without having the entire infrastructure defined in one terraform configuration, which can lead to hours-long plans and applies in a really large infrastructure. Then you can use terragrunt's dependencies as "loose" ordering, just to enforce the order of certain actions (e.g. core networking is processed before app stacks).
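
A minimal sketch of what such a stack could look like, written in Terraform 0.12+ syntax. The `MyNetwork` module, the variable names and the `subnet_id` output are hypothetical and not taken from the author's repositories. Because one module's input references another module's output, Terraform can work out the ordering itself, with no manual dependency declaration.

```hcl
# variables.tf (stack): the "knobs", nearly all with defaults
variable "name" {
  type = string # no default, so it must be set per environment
}

variable "instance_count" {
  type    = number
  default = 1
}

variable "instance_type" {
  type    = string
  default = "t3.micro"
}

# main.tf (stack): module calls only
module "network" {
  source = "./modules/MyNetwork" # hypothetical child module
  name   = var.name
}

module "ec2" {
  source         = "./modules/MyEC2Instance"
  name           = var.name
  instance_count = var.instance_count
  instance_type  = var.instance_type

  # Referencing another module's output creates an implicit dependency,
  # so "network" is planned and applied before "ec2".
  # Assumes MyNetwork declares an output named subnet_id.
  subnet_id = module.network.subnet_id
}
```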

Then our "terraform-deploy" terragrunt repository references these stacks with their git tag version pinned, and environments in separate folders much like the default example, except the only thing allowed in each environment folder is the terraform.tfvars file to control the "knobs", and then they use the terragrunt shared backend. The commit log in terraform-deploy is pretty much updating version tags on stacks and adding/tweaking tfvars variables, that's it.
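
As an illustration only, an environment's `terraform.tfvars` in that layout could have looked like the following under the Terragrunt releases of the time (before 0.19, when Terragrunt settings lived in a `terragrunt = { ... }` block inside `terraform.tfvars`; newer releases use a `terragrunt.hcl` file with an `inputs` map instead). The file path, repository URL, tag and variable values are all hypothetical.

```hcl
# prod/app-website/terraform.tfvars (hypothetical path)
terragrunt = {
  terraform {
    # Stack repository pinned to a git tag; URL and tag are made up
    source = "git::git@example.com:acme/app-website.git//terraform?ref=v1.4.0"
  }

  include {
    # Pull in the shared backend settings from a parent folder
    path = "${find_in_parent_folders()}"
  }
}

# The "knobs" for this environment
name           = "website-prod"
instance_count = 4
instance_type  = "m5.large"
```

With a file like this, promoting a new stack version to an environment is a one-line change to the `ref` tag.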

Side Note Alternative

Now that terraform actually has some of the features we went to terragrunt for (workspaces, locking, etc.), we've been using the same structure but a different deployment method for "terraform-deploy".

- Everyone has rights to run a "plan" locally, and a cached state file, updated hourly by a scheduled task, is available so they don't have to do a full terraform refresh. This makes testing and "diffs" easy.
- We use the GitLab Flow branching strategy. Users develop in feature branches and can run plan anytime.
- "master" and "production" have branch policies that only allow commits via pull request.
- "master" is analogous to staging. When a feature branch PR is created for master, the CI pipeline runs tflint on the terraform and then runs a terraform plan for the entire environment (using the periodically refreshed cached state file to keep things fast). It also creates a terraform workspace per branch to keep the state files separate, then runs the "unit tests" defined in each stack against the plan output (we use Pester for this) and reports back to the PR via webhook, so each PR must pass tests and show a summary terraform plan of what will "change".
- We discuss and merge the PR into "master", which triggers the release pipeline to staging and sets up staging with the new terraform config. We have a commit flag that allows a full destroy-and-refresh if required (in staging only, never in production; by policy, all deletions in production have to be done manually for safety). We also have a commit flag that uses the production tfvar sizes if the team wants to do performance testing; otherwise most things get created at a smaller size to keep costs low.
- When staging is ready for release, a PR from staging goes to production, and the process repeats. Pretty much the only difference is that the release pipeline always does a full state refresh, and the plan output has an approval step that has to be approved by 75% of the change control board (CCB) before it applies.
- No one has direct "apply" rights except an emergency support admin account; only the CI/CD pipeline can run terraform applies.
- Emergency hotfix branches get applied directly to master; they follow the same process except they only require two CCB approvers.

With this infrastructure in place, our users can just use the native terraform commands after running their Invoke-Build setup.

You can do all this at a much smaller scale, we've been thinking about publishing our tools for it.

JustinGrote on 1 Feb 2019

👍4

All 11 comments

Deploying a giant "stack" as a single unit has quite a few downsides:

1. Slow: If all your infrastructure is defined in one place, running any command will take a long time. We’ve seen companies where terraform plan takes 5–6 minutes to run!
2. Insecure: If all your infrastructure is managed together, then to change _anything_ you need permissions to access _everything_. That means that almost every user has to be an admin, which is a Bad Idea.
3. Risky: If all your eggs are in one basket, then a mistake anywhere could break everything. You might be making a minor change to a frontend app in dev, but due to a typo or running the wrong command, you delete the production DB.
4. Hard to understand: The more code you have in one place, the harder it is for any one person to understand it all. But if it’s all bundled together, the parts you don’t understand could hurt you.
5. Hard to test: Testing infrastructure code is hard; testing a large amount of infrastructure code is nearly impossible.
6. Hard to review: The output of commands such as terraform plan becomes useless, as no one bothers to look through thousands of lines of plan output.

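Building, deploying, and managing your code as a series of small modules is one of the more important lessons learned from writing over 300,000 lines of infrastructure code (https://blog.gruntwork.io/5-lessons-learned-from-writing-over-300-000-lines-of-infrastructure-code-36ba7fadeac1).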

brikis98 on 21 Dec 2018

Thanks for taking the time to reply.

I'm not quite sure you understood where I'm coming from. Everything in my
"stack" is in many small modules for all the reasons you say above. That's
part of the problem, there's a lot of dependencies involved. Right now
normally terraform just handles the dependencies implicitly and it just
"works", whereas if I build it with terragrunt and define the modules each
with their own .tfvars files, I now have to manually specify the
dependencies with a dependency block.

If I instead define a "stack" module that has all the child modules, and
the variables I want to control as variables that just get passed to the
child modules, then I have a simple single place to define my environments
with terragrunt, and the dependency handling still works. I can still
run/taint/scope/etc. the submodules, and my dependencies are implicit again.

So is there some other way of defining dependencies in terragrunt I'm
missing? Should I be using data sources between modules rather than
referencing them directly? Seems like that's a lot of extra queries that
would slow things down. Just wondering why this isn't the best practice,
seems so much simpler, at the minor cost of a larger state file because the
"stack" is being processed rather than the individual modules, but so far
at the scale I'm running that's not a big deal, and I can still terraform
plan -target if I want to just update a portion.
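
For readers unfamiliar with it, the manual declaration being discussed looked roughly like this in Terragrunt of that period (pre-0.19 syntax, placed in a module's `terraform.tfvars`). The paths are hypothetical. The block only controls the order in which the `*-all` commands process modules; it does not pass any values between them, which is why data sources come up as the companion mechanism.

```hcl
terragrunt = {
  dependencies {
    # Modules that must be applied before this one (hypothetical paths)
    paths = ["../vpc", "../mysql"]
  }
}
```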

JustinGrote on 21 Dec 2018

If you're deploying it all with a single apply, then it's subject to all the problems I listed before. I'd estimate a 99% chance that you will end up shooting yourself in the foot and regretting it if your infrastructure grows.

If you split up into separate deployables with separate state files, you avoid many of these issues. The trade-off is managing dependencies between them, as you pointed out. IMO, this is the lesser evil. You can use data sources, including terraform_remote_state (https://www.terraform.io/docs/providers/terraform/d/remote_state.html).
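
A rough sketch of that approach, in Terraform 0.12+ syntax: one deployable reads another's outputs through a `terraform_remote_state` data source. The S3 backend, bucket, key, the `ami_id` variable and the `subnet_id` output are assumptions made for illustration.

```hcl
variable "ami_id" {
  type = string # assumed input, not from the thread
}

# In the app deployable: read outputs published by the separately deployed VPC
data "terraform_remote_state" "vpc" {
  backend = "s3"

  config = {
    bucket = "example-terraform-state"    # hypothetical bucket
    key    = "prod/vpc/terraform.tfstate" # hypothetical state path
    region = "us-east-1"
  }
}

resource "aws_instance" "app" {
  ami           = var.ami_id
  instance_type = "t3.micro"

  # Only root-module outputs of the VPC state are visible here,
  # so the VPC deployable must declare an output named subnet_id.
  subnet_id = data.terraform_remote_state.vpc.outputs.subnet_id
}
```

The cost mentioned in the thread is visible here: the coupling between the two deployables is now an explicit name and state path that has to be kept in sync by hand.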

brikis98 on 21 Dec 2018

Again, thanks for the feedback, it doesn't sound like I'm doing anything
inherently "wrong", just a different approach and a minor additional layer
of abstraction.

Currently I feel the manual dependency issue is more likely to shoot myself
in the foot by introducing hard-to-troubleshoot race conditions and
whatnot, along with unnecessary data source complexity. I absolutely get
what you are saying at scale however, and breaking it into "mini stacks"
that generally have light dependencies on each other would be required,
especially when multiple teams get involved to manage and segregate.

Just wanted to make sure I wasn't missing something obvious or some new
feature. I tried the dumpster fire that is terraform workspaces and
terragrunt is just such a more elegant approach. Thanks!

JustinGrote on 21 Dec 2018

There isn't really a "right" or "wrong," just a question of which set of trade-offs you prefer. The vast majority of projects I've seen that start with one "deployable" end up regretting it and moving to multiple small ones. If you're working alone on a small project, that extra complexity isn't worth it; but if you expect to scale up, doing lots of terraform import and other migrations isn't fun, so you may want to start thinking through this in advance.

brikis98 on 21 Dec 2018

I also found it too difficult to manage depends blocks manually, and too difficult to organize the use of remote state data sources to pass values between modules (modules in modules again). I like the one stack approach, to a point anyway. For a given team, a single stack works great for us. If the stack crosses team boundaries, probably would be too much risk for me.

lorengordon on 21 Dec 2018

Right, and I want to be clear when I say "stack" I don't mean my entire
organizational infrastructure, it's pretty much broken out by application,
with shared dependencies defined in their own stacks. Modules are in a
separate repository and sourced in via version tags. Child modules are
sourced in from git with version tags.

That's easy enough to manage with terragrunt dependencies, I just don't see
the value in defining every single instance, attachment, etc. in Terragrunt
and dealing with all the manual dependencies and remote state when it "just
works" in terraform proper.

JustinGrote on 21 Dec 2018

> Right, and I want to be clear when I say "stack" I don't mean my entire
> organizational infrastructure, it's pretty much broken out by application,
> with shared dependencies defined in their own stacks. Modules are in a
> separate repository and sourced in via version tags. Child modules are
> sourced in from git with version tags.

👍

> That's easy enough to manage with terragrunt dependencies, I just don't see
> the value in defining every single instance, attachment, etc. in Terragrunt
> and dealing with all the manual dependencies and remote state when it "just
> works" in terraform proper.

I'm definitely not arguing that every single resource should be deployed separately. The general recommendation is to break it up into a small handful of "deployables," grouping them by (a) degree of risk and (b) items that are usually deployed together. In practical terms, this usually means:

- The VPC is separated from everything else, as it's high risk (i.e., a deployment error could take down your entire infra), and it's deployed/updated infrequently.
- Data stores (e.g., relational databases, caches, NoSQL stores, etc.) are separated from everything else, as they are high risk (i.e., a deployment error could result in data loss), and they're deployed/updated infrequently.
- All your apps/services are then grouped together. If you have one or a few apps, perhaps all of them are deployed together; if you have 100 microservices, then they might be more broken down than that.

brikis98 on 21 Dec 2018

Great, sounds like we are on the same page. In prod data stores are all
flagged as "no destroy" as well, someone has to manually kill those. Good discussion, this can be closed.
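
One way such a "no destroy" flag can be expressed in Terraform is the `prevent_destroy` lifecycle setting, sketched below. Whether the author used this or a provider-level option such as RDS deletion protection is not stated in the thread, and the resource, its arguments and the variables are hypothetical.

```hcl
variable "db_username" {
  type = string
}

variable "db_password" {
  type = string
}

resource "aws_db_instance" "main" {
  identifier        = "app-prod-db" # hypothetical name
  engine            = "postgres"
  instance_class    = "db.t3.medium"
  allocated_storage = 50
  username          = var.db_username
  password          = var.db_password

  lifecycle {
    # Any plan that would destroy this resource fails with an error,
    # so removal requires a deliberate manual step.
    prevent_destroy = true
  }
}
```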

JustinGrote on 21 Dec 2018

@JustinGrote, your thoughts sound reasonable to me. Would you mind sharing a typical structure you follow to break down a project into stack modules?

mskutin on 1 Feb 2019
