Terraform v0.11.7
If you're running Terraform and you briefly lose Internet connectivity, Terraform will fail to write state to the remote backend, leave the state lock unreleased, and save the state locally in `errored.tfstate`.
There's obviously nothing you can do to prevent the connectivity issues, but when they happen, you have to go fix things manually by:
1. Finding the `errored.tfstate` file.
2. Running `terraform state push errored.tfstate`.
3. Running `terraform apply` to get the error about the lock being unreleased and to get the lock ID.
4. Running `terraform force-unlock <LOCK_ID>`.
However, this solution has a number of problems:
I propose adding a simple retry mechanism with exponential back-off. That is, if Terraform fails to write state to a remote backend, it retries after 1 second, 2 seconds, 4 seconds, etc., up to some reasonable (and configurable) max, such as 5 minutes. This way, at least for transient connectivity issues, Terraform can resolve the issue itself.
This issue is exacerbated by:
1. Various timeout, connectivity, and TLS handshake issues that crop up from time to time in Terraform. For example, see https://github.com/hashicorp/terraform/issues/16448, https://github.com/hashicorp/terraform/issues/15817, https://github.com/hashicorp/terraform/issues/10779
2. Running `apply` in multiple modules concurrently using a tool such as Terragrunt.
I think it would additionally be valuable to add retries for other API calls, including reading states. We use S3 remote states, and our automation for Terraform deploys pulls a lot of values from them. At least a few times a week I see a job fail because it could not read a state from S3, where a retry would have succeeded.