v0.12.0
We have our CI system set up so that pull requests create Terraform plans before they are merged. We also use an automated dependency update system that runs on a fixed schedule. This means that sometimes we have several PRs that are created at once. When this happens, our Terraform plans compete with each other to acquire state locks.
We've set a lock timeout on all of our plans so that they can try again later. This works well when there are two consecutive PRs, but fails for more than two, as the lock-timeout argument only leads to a single retry.
It would be awesome if there was an additional parameter that would allow us to specify the number of retries that a Terraform operation makes. Then we could set our PRs to retry more than once if they can't acquire a state lock, in order to reduce the churn from our CI system!
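To make the request concrete, the invocation might look something like this. `-lock-timeout` is a real Terraform flag today; the retry-count flag shown commented out is purely hypothetical, invented here only to illustrate the idea:

```shell
# Current behavior: wait up to 5 minutes for the lock, then fail.
terraform plan -lock-timeout=5m

# Hypothetical (NOT a real flag today): also retry lock acquisition up to 5 times.
# terraform plan -lock-timeout=5m -lock-retries=5
```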
None that I'm aware of.
Hi @edahlseng! Thanks for this feature request.
Thinking through what you are suggesting here made me think about the fact that a speculative plan (terraform plan with no intent of actually applying afterwards) doesn't really get any value out of acquiring and holding a lock anyway: it could just plan against the state snapshot present at the instant the operation begins.
The purpose of the locking is to ensure that multiple processes can't be applying competing changes to real infrastructure at the same time, which would otherwise then collide at the end by trying to write conflicting state updates. But for commands that don't actually _make_ changes to remote infrastructure, nor create any new persisted state snapshots, the locking doesn't really add any value.
What do you think about instead re-framing this as the idea that multiple speculative terraform plan operations ought to be able to run concurrently, and thus not be competing for locks in the first place?
Hi @apparentlymart! Making that fix would certainly help a ton in our case! It wouldn't solve the problem entirely, however. As part of our CI process, we also deploy temporary environments, and this means that there are some Terraform commands that _are_ changing resources that are also competing for locks. It seems like fixing both of these cases will be needed!
I'll create a new issue to cover the needs of concurrent speculative plans!
This would be a nice feature. I am also working with a CI system that is deploying ephemeral environments and we have to use state locking to prevent problems, but this means our pipeline will fail occasionally when competing for locks.
I'm thinking that I'll have to add a retry mechanism to the shell command that starts the apply, but it would be much cleaner in the future to be able to tell Terraform the number of times to retry acquiring a lock, as well as the interval to wait between retries.
Also, this can probably be improved upon, but for anyone looking to retry a terraform apply (or plan) when running Terraform in an automated fashion, I'm using this bash function, which seems to work pretty well:
```bash
execute_with_retries() {
  command=$1
  number_of_retries=$2
  sleep_interval=$3

  if [[ -z $command ]]; then
    echo "Expected the command to execute as the first parameter."
    exit 1
  fi
  if [[ -z $number_of_retries ]]; then
    echo "Expected the number of times to retry the command as the second parameter."
    exit 1
  fi
  if [[ -z $sleep_interval ]]; then
    echo "Expected the sleep interval between retries as the third parameter."
    exit 1
  fi

  echo "Attempting the following command with $number_of_retries retries and $sleep_interval seconds between each retry: '$command'"

  # Run the command up to $number_of_retries times, sleeping between attempts;
  # stop early on success, and propagate the last exit status.
  for i in $(seq 1 "$number_of_retries"); do
    [ "$i" -gt 1 ] && sleep "$sleep_interval"
    $command && s=0 && break || s=$?
  done
  (exit $s)
}
```
And you can call it like so:

```bash
source ./retry-utils.sh
execute_with_retries "terraform apply -var-file=$var_file_path -auto-approve" 20 30
```
It will then retry the terraform apply every 30 seconds, up to 20 times, before giving up (so it effectively waits up to 10 minutes for the lock to become available).
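One possible refinement on the function above: retry only when the failure is actually lock contention, so unrelated errors surface immediately. This is a minimal sketch; the function name is my own, and while "Error acquiring the state lock" is the message Terraform prints today, matching on it is fragile and not a stable interface:

```shell
#!/usr/bin/env bash
# Retry a command only when its output indicates a failed state-lock
# acquisition; any other failure (or success) returns immediately.
retry_on_lock_error() {
  local cmd=$1 retries=$2 interval=$3 attempt output status
  for attempt in $(seq 1 "$retries"); do
    [ "$attempt" -gt 1 ] && sleep "$interval"
    output=$($cmd 2>&1)
    status=$?
    printf '%s\n' "$output"
    # Succeeded, or failed for a reason other than lock contention: stop here.
    if [ "$status" -eq 0 ] || ! printf '%s' "$output" | grep -q "Error acquiring the state lock"; then
      return "$status"
    fi
  done
  return "$status"
}
```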
This is causing me some major headaches, also with CI. I'm essentially going to have to implement this myself. If I could get some guidance on how to do that within the Terraform codebase, I'd like to make a PR (searching around and guessing at function names isn't really something I can justify doing). Since, like others above, I'm basically going to have to create this logic anyway, I may as well put it into the tool itself. I just need someone familiar with the project to say "this is where we ask the user to set the flags you'd want to add, and this is where we actually retry with the existing flag; you'd want to change that."
Which backend are people using here? It appears that different backends behave differently, with some waiting on a lock in certain situations and others immediately failing with an error (e.g. TFC with local execution enabled).
Running against an AzureRM storage account backend, it appears that lock-timeout does take effect and will wait for the lock (lease) to be released in most situations. However, there are instances where we run into the following error:

```
Error: Error locking state: Error acquiring the state lock: 2 errors occurred:
	* state blob is already locked
	* blob metadata "terraformlockid" was empty
```