I've been having problems with how long a converge takes where the environment is 1,500+ aws_instance
resources, as many EBS volumes, and quite a few other resources (currently averages around 10 minutes to do a terraform refresh
with Terraform 0.10.8, with provider 1.2.0). The speed is not so much a problem as that we're bumping against AWS rate limits.
When looking into aws_instance resource's code it looks like it's doing a single DescribeInstances API call for each instance: https://github.com/terraform-providers/terraform-provider-aws/blob/4d7ccced1862ef5323427a91048fee093291e8f6/aws/resource_aws_instance.go#L1069 Specifying multiple instances per call and doing any additional processing locally might also alleviate some of the timeout issues I've seen some folks have.
Before I (or anyone else) starts working on a patch, are there gotchas that I'm missing that has kept the number of requests O(N) to this point?
Hi @dafyddcrosby!
You're right that today Terraform does a separate read call for each resource. This is because when Terraform is walking the graph it currently attempts to deal with each resource as soon as it is ready (all dependencies are complete), whereas optimal batching would require collecting together a number of "similar" operations (that are mutually-batchable) and possibly delaying some of them so that they can all be processed together. This is effectively an optimization problem, to trade off the artificial delay against the overhead of many separate requests while ensuring that the graph traversal can still complete in cases where transitive dependencies prevent batching[1].
This sort of optimization is something we intend to look into eventually, but since it's a pretty big shift in Terraform's approach we've been deferring it to work on some other issues that have more general impact to users at all scales, such as the limitations of the configuration language. You can see some discussion about this in hashicorp/terraform#7388.
We _are_ planning to take one step in this direction in some forthcoming work that will change the resource provider API for other reasons, to prepare the API so that batching can be supported in future without further breaking changes to the API. We'll need to do some more thinking, however, about how to give Terraform Core enough information to make smart decisions about how to optimize the graph traversal for batching, which we've not yet been able to delve into in detail.
[1] For example, if there is a dependency chain like aws_instance.foo
→ aws_ebs_snapshot.bar
→ aws_instance.baz
, it's not possible to fetch both aws_instance.foo
and aws_instance.baz
in the same batch without either deadlocking or violating the guarantee that a resource will always be dealt with before its dependents.
Thanks for the great response @apparentlymart - I'll hold off on poking at this for a bit, and look at some other way of dealing with the current scaling problems.
I've got a related issue: #3230.
It's not only an optimization problem. Not combining the calls actually causes outages. Usually route53 changes are atomic (delete this A record, add that new CNAME record). But with terraform, first the A record is deleted. Then DNS clients may get an NXDOMAIN error and cache the result for some time, then the CNAME record is added.
@apparentlymart we've been looking into this problem pretty extensively at Segment because most of our terraform plans are taking too long to run and preventing teams from releasing changes quickly enough.
I wonder if this could be solved in the provider without any changes to terraform itself. I see two aspects to this problem:
Plans are making inefficient use of the AWS APIs because they operate on single resources where the APIs are designed to support interacting with groups of resources. The plans are also read-only which makes the problem easier to approach.
Atomic changes issues reported by @joernheissler, in this case it's about combining updates into a single API call, which has stronger requirements when it comes to correctness, and a caching approach cannot be employed since those are write operations. I won't discuss this point in too much details because I'm not super familiar with the problem tho.
For the plan use case, the provider could use a local cache that gets populated by API calls. For example, when the provider issues a DescribeClusters to the ECS API, it would instead issue a ListClusters followed by a batched DescribeClusters to describe all clusters, then populate its local cache with the results. Next call to DescribeClusters would lookup the result directly from the cache instead of going back to the AWS API.
I can see how that may be inefficient for some small terraform states that only manage a few resources, but for states that have hundreds of resources (like we do) it could cut plan times by orders of magnitudes because less API calls would be issued to AWS and we wouldn't be hitting rate limits which cause terraform to enter the backoff retry loops (some of our plans take up to 30 minutes to run due to throttling errors from the AWS APIs).
We could implement this by switching the AWSClient
from using concrete types to using interfaces (for example switch from *ec2.EC2
to ec2iface.EC2API
), and have the caching layer implement those interfaces to provide the caching behavior. It would also make it easy to opt-in based on some runtime configuration flag, as well as progressively add caching to new endpoints without having to go all-in at once.
Plan times are enough of a pain that we would likely be able to contribute a pull request to implement this. I'd love to get your feedback on the idea and whether you have any concerns about accepting such a change in the AWS provider.
Hi @achille-roussel! Thanks for describing an alternative approach here.
The strategy of fetching everything to prime a cache and then serving reads from that cache is something that we've discussed and prototyped before. It's a plausible direction, but it has some limitations of its own to navigate:
As you mention, there is the potential to make this sort of behavior opt-in. This would require maintaining two different codepaths where only one is required today, and only one would be required in the future world where Terraform Core supports batching as a first-class feature. Since I don't work on the AWS provider day-to-day (my focus is on Terraform Core) I'll leave the AWS provider team to weigh in on how impactful that might be to the ongoing maintenance of the provider.
I'm curious, what is preventing a provider today from queueing resource actions internally for 1-3s before actually issuing an API request, and if more than one action is requested it can then merge these into a single batch API call? If more actions are required at a later time, these would trigger new single or batch API calls down the road, but this would probably at least cover 90%+ of the pain?
It seems as if something like that could be fully handled internal to any provider wanting to do so without affecting the overall API or interaction with core?
Marking this issue as stale due to inactivity. This helps our maintainers find and focus on the active issues. If this issue receives no comments in the next 30 days it will automatically be closed. Maintainers can also remove the stale label.
If this issue was automatically closed and you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thank you!
Putting a message here to shake off the bot because this is still an important issue, and with TF 0.13 and the v3 of the AWS provider out the door, I think now is a good time to re-approach it.
Most helpful comment
@apparentlymart we've been looking into this problem pretty extensively at Segment because most of our terraform plans are taking too long to run and preventing teams from releasing changes quickly enough.
I wonder if this could be solved in the provider without any changes to terraform itself. I see two aspects to this problem:
Plans are making inefficient use of the AWS APIs because they operate on single resources where the APIs are designed to support interacting with groups of resources. The plans are also read-only which makes the problem easier to approach.
Atomic changes issues reported by @joernheissler, in this case it's about combining updates into a single API call, which has stronger requirements when it comes to correctness, and a caching approach cannot be employed since those are write operations. I won't discuss this point in too much details because I'm not super familiar with the problem tho.
For the plan use case, the provider could use a local cache that gets populated by API calls. For example, when the provider issues a DescribeClusters to the ECS API, it would instead issue a ListClusters followed by a batched DescribeClusters to describe all clusters, then populate its local cache with the results. Next call to DescribeClusters would lookup the result directly from the cache instead of going back to the AWS API.
I can see how that may be inefficient for some small terraform states that only manage a few resources, but for states that have hundreds of resources (like we do) it could cut plan times by orders of magnitudes because less API calls would be issued to AWS and we wouldn't be hitting rate limits which cause terraform to enter the backoff retry loops (some of our plans take up to 30 minutes to run due to throttling errors from the AWS APIs).
We could implement this by switching the
AWSClient
from using concrete types to using interfaces (for example switch from*ec2.EC2
toec2iface.EC2API
), and have the caching layer implement those interfaces to provide the caching behavior. It would also make it easy to opt-in based on some runtime configuration flag, as well as progressively add caching to new endpoints without having to go all-in at once.Plan times are enough of a pain that we would likely be able to contribute a pull request to implement this. I'd love to get your feedback on the idea and whether you have any concerns about accepting such a change in the AWS provider.