Hello,
I'm provisioning resources in a single tfstate. It's been growing steadily and I'm now at ~2900 resources. The time it takes to run a plan has increased as I've added more resources - most of that time it just seems to hang before refreshing the state (I suppose it's walking the graph to work out resource dependencies), and some more after the refresh ends (before printing out the plan).
I've reached the point where it takes 20-25 minutes just to run a plan, and I'm starting to wonder whether this is normal behavior or a bug.
I've also tried running the plan with TF_LOG set to trace, and I noticed that many times it just sits there without printing anything - most noticeably between these log lines:
2018/10/03 08:38:54 [TRACE] Graph after step *terraform.graphTransformerMulti:
2018/10/03 08:42:41 [TRACE] Graph after step *terraform.ModuleVariableTransformer:
2018/10/03 08:42:53 [TRACE] Graph after step *terraform.graphTransformerMulti:
2018/10/03 08:46:22 [TRACE] Graph after step *terraform.ModuleVariableTransformer:
2018/10/03 08:46:34 [TRACE] Graph after step *terraform.OutputTransformer:
2018/10/03 08:50:03 [TRACE] Graph after step *terraform.ModuleVariableTransformer:
2018/10/03 08:50:51 [TRACE] Graph after step *terraform.graphTransformerMulti:
2018/10/03 08:54:18 [TRACE] Graph after step *terraform.ModuleVariableTransformer:
Has anyone else noticed this? If anyone from the HashiCorp team wants to have a look, I can provide my TF files for review...
Terraform version v0.11.7
I’ve experienced this in previous versions of terraform. Since then I’ve reworked my setup to consist of multiple separate plans by grouping resources by lifecycle. This made my terraform runs quicker, and also made updating things a lot safer since every plan has a limited scope.
Whether or not this is a bug: I don’t think so. Refreshing the state in your case means a ton of API calls. That takes time. If you’re unlucky: a lot of time.
Thanks @bennycornelissen - I get that making a ton of API calls takes time, but the majority of my wait isn't while the API calls are being made; it's before that, when TF is walking the graph...
I'll reiterate that even with trace logging there are several stretches where it just hangs, whereas I'd expect trace logging to always print what it's doing. I can't say TF isn't doing anything during those stretches, since the CPU is at 100% the whole time...
I'll edit my initial report with some trace lines...
@yzargari I apologize if my comment seemed disrespectful. I meant no disrespect.
Reading your initial post again, am I right that the waiting occurs before anything is actually refreshed? So like this:
Refreshing Terraform state in-memory prior to plan...
The refreshed state will be used to calculate this plan, but will not be
persisted to local or remote state storage.
==> Long wait occurs here <==
aws_s3_bucket.blog: Refreshing state... (ID: some-blog-bucket)
...
Am I correct? I would think you're right that Terraform is busy walking the graph. With 2900 resources (and possibly a ton of dependencies between them?) I can imagine that taking some time. Whether 25 minutes is usual behavior, I don't know. In my case it was stuck at that specific point for anywhere between 5 and 10 minutes (far fewer resources, though), but it needed to consume a fair bit of remote state.
@bennycornelissen your comment was more than OK and did not seem disrespectful. I appreciate any feedback I can get...
Hi @yzargari,
Sorry you're having trouble with this. Terraform isn't currently optimized for large sets of resources, primarily because managing large sets of resources in a single config has other operational challenges, so we tend to recommend smaller, more modular configs.
Large graphs, especially ones that are highly connected (a few thousand nodes can easily create many millions of edges), take considerable resources to traverse and manipulate. There will be some more performance improvements in this area eventually, but it's generally advised to avoid such large configs for now.
Hello @jbardin, thanks for the feedback.
The idea behind such a large config is to manage an entire environment and verify it's set up properly in a single TF execution... We have a single set of Terraform configuration which we apply to different environments, so we get the same config across our dev, staging, and production environments (with certain changes to scale according to need). If I split my 50+ microservices into 50+ sets of Terraform configs - I'd have to run terraform 50 times to check if changes are needed.
I just wanted to make sure this use case is something the Hashicorp team is aware of...
I'd have to run terraform 50 times to check if changes are needed.
@yzargari But then you could run them in parallel!
I want to let the developers know that I too am having this problem, and I have fewer than 1000 resources. My configuration just creates hundreds of buckets using the Google provider, and it is not desirable to split the management of those resources into multiple state files. I too am noticing high processor usage while the application is in a state where nothing is printed by the trace logging.
Sounds like the same issue I was having. I ended up building my own workaround: a wrapper script for terraform that splits the large state file into a much smaller state containing just the resources I want to apply, creates a temp dir with only the necessary files (vars, locals, providers, etc., plus the resources I want to apply), and applies in that temp dir. When done, the script merges the smaller state back into the bigger one. This doesn't help when you need to apply the entire stack, but we usually only need to run against a single module.
Bottom line: most runs went from ~30 minutes down to ~1 minute...
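In case it's useful, the core of the idea sketched with standard terraform state subcommands (a sketch only - the file names, the module.foo address, and the lack of error handling are placeholders, not my actual script):

#!/usr/bin/env sh
# Hypothetical sketch of the split/apply/merge workflow described above.
set -e
WORKDIR=$(mktemp -d)

# Copy only the files the target module needs (vars, locals, providers,
# plus the module call itself) into a temporary directory.
cp variables.tf locals.tf providers.tf module-foo.tf "$WORKDIR/"

# Pull the full state and extract just the target module into a small state.
terraform state pull > big.tfstate
terraform state mv -state=big.tfstate -state-out="$WORKDIR/terraform.tfstate" \
  module.foo module.foo

# Plan/apply against the small state only (backend config skipped on purpose).
(cd "$WORKDIR" && terraform init -backend=false && terraform apply -state=terraform.tfstate)

# Merge the (possibly changed) resources back and push the merged state.
# Note: state push checks lineage/serial, so real use needs some care here.
terraform state mv -state="$WORKDIR/terraform.tfstate" -state-out=big.tfstate \
  module.foo module.foo
terraform state push big.tfstate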
An interesting approach, especially if most of the work is additive in nature. I will look into it, thanks.
I too am suffering from this: I'm currently at 2100 resources and it takes 15 minutes to perform an apply. Splitting up the configuration isn't impossible, but it's not desirable either. I don't really care about the refresh phase since I can run with -refresh=false, but the calculation of the DAG still takes too long.
I tried to debug this a bit myself for a better understanding, but I'm at the limit of what my knowledge of Go can do. All I can see from the various traces and profiles I ran is that a lot of the CPU time is actually spent in runtime.futex. In fact, the Terraform functions don't show up at all when I create a CPU profile; it's all runtime, io and syscall (although, granted, this might be because of my inexperience with the tools).
If I can provide assistance to debug this I'll gladly do so.
This issue seems to be exacerbated to an extreme degree by the use of for_each.
My test case is a real-world scenario - managing identical environments for a set of developers - converted to use no-op null_resource resources instead of actual cloud resources.
In my test case I've observed a >7x difference in wall-clock time for the creation of the _same number of resources_ between one approach that heavily uses for_each and another that uses a bunch of instantiations of the same module. There is also a significant difference between the two approaches in the time taken for a follow-up no-op plan.
(An anecdote of how impactful and aggravating this issue is in the real world: The real, non-null_resource version of my test case is impossible to work with in Terraform Cloud due to even no-op _plan_ operations hitting the 30 minute time limit and being killed.)
In the scenario using for_each heavily:
Initial terraform apply -auto-approve:
1186.34s user 45.84s system 237% cpu 8:38.79 total
Follow-up (no-op) terraform apply -auto-approve:
135.61s user 8.69s system 908% cpu 15.886 total
In the scenario using modules with less connectivity, for the exact same resources:
Initial terraform apply -auto-approve:
124.12s user 23.29s system 203% cpu 1:12.49 total
Follow-up (no-op) terraform apply -auto-approve:
16.41s user 6.90s system 521% cpu 4.471 total
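To make the contrast concrete, the two shapes look roughly like this (an illustrative null_resource-only sketch, not the actual test case; all names here are made up):

# Heavily-for_each variant: one flat configuration fans everything out, and
# later resources reference instances of earlier ones, so the graph is large
# and densely connected.
variable "developers" {
  type    = set(string)
  default = ["alice", "bob", "carol"]
}

resource "null_resource" "network" {
  for_each = var.developers
  triggers = { owner = each.value }
}

resource "null_resource" "service" {
  for_each = var.developers
  triggers = {
    owner   = each.value
    network = null_resource.network[each.key].id
  }
}

# Module variant: the same pair of resources lives in a hypothetical
# ./modules/dev-env module that is instantiated once per developer, keeping
# each sub-graph small and self-contained.
module "env_alice" {
  source = "./modules/dev-env"
  owner  = "alice"
}

module "env_bob" {
  source = "./modules/dev-env"
  owner  = "bob"
}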
@yzargari But then you could run them in parallel!
Terraform already has -parallelism (which, this bug aside, is sufficient, assuming most of the time is spent in remote checks/updates).
This issue seems to be exacerbated to an extreme degree by the use of for_each.
@eriksw Thanks for detailed notes and the test case!
I've encountered a similar issue while trying to create a large number of users (~1k) in AWS. After some experiments I ended up with another test case demonstrating a performance issue with a certain usage of for_each: https://github.com/mlosev/tf-for_each-performance-test
The issue manifests in both Terraform 0.12.29 and 0.13.0: it takes Terraform an order of magnitude longer (10x+) to plan the "slow" configuration versus the "fast" one, for the same number of users and groups (~1000) and an empty local state (i.e. Terraform wants to create all resources).
The key difference between the two configurations is:
--- users-0.13-fast/main.tf 2020-08-26 11:06:16.000000000 +0100
+++ users-0.13-slow/main.tf 2020-08-26 11:06:26.000000000 +0100
@@ -15,7 +15,7 @@
for_each = local.users
triggers = {
- name = each.key
+ name = null_resource.user[each.key].triggers.name
groups = join(",", each.value)
}
}
On my laptop, the difference in plan time is as follows:
Plan: 2100 to add, 0 to change, 0 to destroy.
------------------------------------------------------------------------
300.20 real 467.68 user 19.38 sys
Plan: 2100 to add, 0 to change, 0 to destroy.
------------------------------------------------------------------------
2.71 real 7.77 user 1.66 sys
This is quite unexpected behaviour for Terraform, especially considering that there are no API calls involved with null_resource in this case...
PS. I also tried -parallelism=100 (just in case) for the "slow" configuration, and it didn't change anything time-wise.
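For context, the shape implied by the diff is roughly the following sketch. Only the identifiers visible in the diff are taken from the real test case; the rest (the second resource's name, how local.users is built) are assumptions - the linked repo has the actual files.

locals {
  # assumed structure: map of user name => list of group names
  users = {
    "alice" = ["admins"]
    "bob"   = ["developers"]
  }
}

resource "null_resource" "user" {
  for_each = local.users
  triggers = {
    name = each.key
  }
}

# Hypothetical name for the resource the diff modifies.
resource "null_resource" "membership" {
  for_each = local.users

  triggers = {
    # "fast" variant: no reference to the other resource's instances
    name   = each.key
    # "slow" variant (what the diff switches to): every instance dereferences
    # null_resource.user, which is what makes the plan so much slower here
    # name = null_resource.user[each.key].triggers.name
    groups = join(",", each.value)
  }
}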
Thanks for the reproduction case @mlosev!
I've moved that to a new issue, as it's not related to the original performance problems posted above.
This specific issue has been taken care of by various optimizations in 0.13 and later, so I'm going to close it. Other performance concerns should be filed as new issues.
Thanks!
I'm going to lock this issue because it has been closed for _30 days_ ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.