Terraform v0.11.1
Terraform should have written the contents of the input -state file to the output -state-out file. Alternatively and less intuitively, Terraform should have deleted the -state-out file.
The -state-out file was created but never written to. This is a problem because a user's automation may store the empty output statefile in place of the input statefile, losing track of resources.
Run `terraform init`, then run `terraform apply -state=INPUT.tfstate -state-out=OUTPUT.tfstate` when there is a change to be made. `OUTPUT.tfstate` will have been created but will be empty.

We run Terraform within Concourse pipelines, and manage remote state using Concourse resources rather than Terraform's remote state support. This is why we encountered the issue, but it is not necessary to reproduce it.
Hi @46bit! Sorry for this strange behavior and thanks for the detailed description.
Unfortunately I wasn't able to reproduce this on my system here. I applied a simple configuration containing just a null_resource, like this:
$ terraform apply -state-out=INPUT.tfstate
This created INPUT.tfstate as expected, with my null_resource instance inside it. I then made a change to the resource and ran a command like you showed in your example:
$ terraform apply -state=INPUT.tfstate -state-out=OUTPUT.tfstate
...
Plan: 1 to add, 0 to change, 1 to destroy.
Do you want to perform these actions?
Terraform will perform the actions described above.
Only 'yes' will be accepted to approve.
Enter a value: ^C
Interrupt received.
Please wait for Terraform to exit or data loss may occur.
Gracefully shutting down...
After I got my prompt back, I only see INPUT.tfstate in my filesystem:
$ ls
INPUT.tfstate test.tf
Is there anything special about the context in which Terraform is being run here? You mentioned that Terraform is running in automation, so I wonder if perhaps there are any unusual environment variables set that might be affecting Terraform's behavior, for example.
Regarding the expected behavior you described: in principle we could copy the input file to the output file in this scenario, but generally we want ctrl+c to perform as few additional actions as possible to allow users to abort; an incorrectly-entered -state-out filename might well be the mistake the user was trying to escape from, after all.
I expect that having Terraform write the file in this case would be convenient for automation since it can just always take the output state file and assume it'll be updated, but I'd prefer to compromise here and guarantee no changes to the file in this case, and expect the wrapping automation to check for the presence of the file to recognize when the operation was aborted. What do you think?
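The check-for-presence approach suggested above could be sketched roughly like this. This is an illustrative Python helper, not part of Terraform; the function name and the decision to also treat a zero-byte file as "aborted" are assumptions made for the example:

```python
import os
import tempfile


def state_written(path):
    """Return True only if the -state-out file exists and is non-empty.

    A missing file would mean Terraform was aborted before confirming;
    a zero-byte file is the buggy case described above, so wrapping
    automation should treat both as "no usable output state".
    """
    try:
        return os.path.getsize(path) > 0
    except OSError:
        return False
```

Wrapping automation would call this after `terraform apply` returns and keep the old input state if it reports False.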
Hi @apparentlymart, sorry I didn't provide replication instructions first time. I'm using a single null_resource and an empty input statefile (see https://r.46b.it/terraform-bug-inv.zip.)
Have Terraform wait for confirmation:
$ terraform apply -state in.tfstate -state-out out.tfstate
An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
+ create
Terraform will perform the following actions:
+ null_resource.cluster
id: <computed>
Plan: 1 to add, 0 to change, 0 to destroy.
Do you want to perform these actions?
Terraform will perform the actions described above.
Only 'yes' will be accepted to approve.
Enter a value:
With Terraform still waiting, SIGTERM each of its processes:
$ ps aux | grep '[t]erraform'
fortysix 24794 0.0 0.0 556610096 7452 s001 S+ 11:40am 0:00.01 /Users/fortysix/gds/terraform-bug-inv/.terraform/plugins/darwin_amd64/terraform-provider-null_v1.0.0_x4
fortysix 24793 0.0 0.1 556668892 17100 s001 S+ 11:40am 0:00.07 /usr/local/Cellar/terraform/0.11.1/bin/terraform apply -state in.tfstate -state-out out.tfstate
fortysix 24792 0.0 0.1 556668252 12400 s001 S+ 11:40am 0:00.03 terraform apply -state in.tfstate -state-out out.tfstate
$ kill -SIGTERM 24792 24793 24794
Terraform outputs its graceful shutdown message, but does not exit:
Enter a value: Interrupt received.
Please wait for Terraform to exit or data loss may occur.
Gracefully shutting down...
Two of Terraform's processes are still running:
$ ps aux | grep '[t]erraform'
fortysix 24803 0.0 0.1 556667484 17332 s001 S+ 11:43am 0:00.09 /usr/local/Cellar/terraform/0.11.1/bin/terraform apply -state in.tfstate -state-out out.tfstate
fortysix 24802 0.0 0.1 556666972 13348 s001 S+ 11:43am 0:00.57 terraform apply -state in.tfstate -state-out out.tfstate
If these are sent SIGTERM again, Terraform exits with `Two interrupts received. Exiting immediately. Note that data loss may have occurred.`
An empty -state-out file `out.tfstate` remains on the filesystem.
After reaching that reproduction above, I have mixed feelings about whether this is a bug worth fixing. On the one hand, having all of Terraform's processes killed is not the common running-by-hand case. On the other hand, automation that sends SIGTERM (and, after a timeout, SIGKILL) to all of Terraform's processes is not that rare; Concourse, for instance, does exactly this.
If you have any immediate ideas what could be wrong, I'm happy to spend a bit of time trying to patch :-)
Hi @46bit! Thanks for the additional details.
I will try to reproduce your results here with this additional information when I get a moment. (Unfortunately I'm about to stop work for the day, so I can't dig in right now.)
It is definitely a goal to support running Terraform in automation systems like Concourse, and indeed I've encountered similar problems in a prior role (before I was at HashiCorp!) with our automation system unceremoniously killing Terraform while it was trying to shut down gracefully.
It seems common that such systems are designed to, as you said, send a SIGTERM or SIGINT to give an opportunity for graceful shutdown, but then to abort more aggressively after a short timeout. I'm not totally sure what's best for Terraform to do in this situation, since both currently-feasible options have disadvantages:
- SIGTERM requests that all outstanding operations stop gracefully, and then Terraform exits as normal. Unfortunately, stopping outstanding operations is usually not fast enough for the standard timeouts imposed by automation systems, since a lot of the operations Terraform runs are time-consuming and provide no means to proactively cancel.
- SIGTERM aborts suddenly and writes the current state of the world to the state file. This would likely complete within the automation system's timeout, but any Create requests in flight would probably still complete after Terraform exits and leave resources dangling. This isn't really an option then, because it violates one of Terraform's core promises.

I believe the current behavior has a subtlety that makes it not as bad as it could be: when we get the first SIGTERM/SIGINT, I believe we immediately flush the current in-progress state to the backend (which might be local disk, as in your case) so that we at least capture the results of operations that have already completed, but any _in-flight_ operations will be lost if Terraform subsequently gets SIGKILL and thus has no opportunity to re-persist the state.
All of that aside, I definitely _would_ like to fix this behavior of Terraform writing an invalid/empty state file in this case. Terraform should either write a valid and updated state file to the given path, or leave the output file entirely untouched.
It still feels best to me to leave it untouched if the user aborts before confirming, since I think that best matches user intent. Once the apply process has begun though, we should always write the latest state to the given path on exit, even if we're interrupted with SIGTERM or SIGINT.
Based on the fact that you saw Please wait for Terraform to exit or data loss may occur. in your repro, it sounds like Terraform here was trying to apply the subtlety I described above of flushing the current state on first interruption, which is not expected (because operations aren't in progress yet) and so is perhaps finding Terraform at an inconvenient moment where it's unable to complete that write properly. I'll investigate further when I have some time to dig in deeper with your repro steps here.
Thanks again for these additional details!
In the referenced bug I have a report of this happening without the use of -state-out, as well.
Thanks for pulling that report upstream, @ndmckinley. I'm looking at this now to try to understand what's going on here.
I've successfully reproduced the issue by following @46bit's process, and also confirmed it happens to the default local state file name terraform.tfstate if you don't set -state-out.
Interestingly the behavior seems to differ for SIGTERM vs. SIGINT: the former behaves as reported, whereas the latter seems to exit immediately. I also noticed that sometimes SIGINT leaves the blank file and other times it doesn't, and I suspect it's got something to do with what order the signals are received by the processes and whether it's sent to the processes individually or to the process group.
Another thing I noted is that the blank file exists already when Terraform is prompting for confirmation, but that blank file seems to get deleted if Terraform is able to exit "cleanly". It seems that the issue here is that certain types of interruption prevent that cleanup step from occurring, leaving the empty file behind.
As discussed previously, I think the ideal behavior is for Terraform to not write the file at all until it is ready to put some content in it (after any apply actions have been taken), so I'm going to follow that avenue to start.
Hi Martin, that's great!
When we first discovered this, it made sense to us that Terraform had created the file: creating it up front goes some way toward guaranteeing that Terraform will actually be able to write to that output path later, which is fairly important.
If Terraform only attempts to create the file after actions have been performed, there seems to be a risk that it will have nowhere to record the resulting state. What do you think?
That's a good argument @46bit. I also think this might have something to do with our locking behavior, where we use flock on the local file. Perhaps instead we should write out a _valid_ empty state in this case, just to make sure we always leave the file parsable.
Hi @46bit and @ndmckinley!
I spent a few hours looking into the codepaths around this today, and I can now explain what's going on here:
The local backend implements its state locking with a call to flock on the state file itself, and so the file needs to exist at this early stage.

The above specifically relates to the situation where the state file does not already exist on disk, and so it's _particularly_ visible for -state-out (where you're generally _intentionally_ using a separate file, rather than writing over the old one in place) but can also be seen for just -state, or for the default filename, if the state file does not already exist.
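The constraint flock imposes, that the lock is taken on an open file descriptor and therefore the file must be created before any state content exists to put in it, can be seen in a small sketch. Python is used here for brevity; Terraform's local backend is written in Go, and this illustrates only the general mechanism, not Terraform's code:

```python
import fcntl
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "out.tfstate")

# flock operates on an open file descriptor, so the state file has to
# exist (empty) before there is any state to write into it. This is the
# zero-byte file that gets left behind if the cleanup step never runs.
f = open(path, "a+")
fcntl.flock(f.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)

size = os.path.getsize(path)  # the locked file exists but is 0 bytes

fcntl.flock(f.fileno(), fcntl.LOCK_UN)
f.close()
```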
Terraform can also _intentionally_ write out an empty state file if it's asked to write out a state that is nil. While I don't believe this ever happens in any real situation, as a result of that the ReadState function actually specifically handles this situation and returns a nil state, and so the empty file is at least not _harmful_, even though it is indeed confusing.
With all of that said, my conclusion is that I will not attempt to change this behavior right now, since it is at least internally-consistent. I realize it remains annoying if you're using Terraform in an automated workflow where you want to store the output state as a replacement for the input state. In practice, that can be worked around with a slightly different workflow that is more consistent with "normal" Terraform usage, which mutates the file in-place:
$ cp INPUT.tfstate OUTPUT.tfstate
$ terraform apply -state=OUTPUT.tfstate
In this case, since the file already exists it should be left untouched in the case where Terraform is interrupted, and overwritten in-place (leaving INPUT.tfstate untouched) when writing the result of any apply changes.
We're planning to shift our focus to fixing up some rough edges in Terraform's primary CLI workflow later this year, once we've finished with the configuration language improvements that are the current focus. At that point we may revisit this flow to try to make it easier to work with. For now though, I'm planning to leave this behavior unchanged. @46bit you'd previously made a similar assertion, so I assume you're okay with that but please do let me know if you've changed your mind and have some concerns. :grinning:
terraform-providers/terraform-provider-google#982 is interesting because it seems to suggest that this symptom can surface also when the file already exists. However, based on the tracing I've done with the existing reproduction steps I think that problem has a different cause.
When Terraform opens an existing state file, it opens it in read/write mode without truncating it, and so the codepath I was analyzing above should not be the cause of the file being emptied. However, Terraform _does_ truncate the file immediately before writing it, so in principle it may be possible to interrupt Terraform during the short window between the truncate and the write. If that were to happen, the result would be an empty file on disk.
Terraform doesn't currently do the usual "create new file and rename over old file" safe-write approach because that would defeat the flock-based locking mechanism. In future I think we should rework this to use flock on a separate file from the state itself so that we can update the state file atomically, but since that's a significant and risky change I'd like to gather more information before jumping to that solution.
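The safe-write approach mentioned above, writing a temporary file and renaming it over the target, might look roughly like the following sketch (illustrative only, not Terraform's implementation). The comment notes why it conflicts with flock: the rename replaces the target's inode, so a lock held on the old file no longer guards the new one.

```python
import os
import tempfile


def atomic_write(path, data):
    """Write data to path so readers never observe a partial file."""
    # Write to a temporary file in the same directory, then rename it
    # over the target. On POSIX, rename is atomic, so even if the
    # writer is killed mid-write the target is either the old complete
    # file or the new complete file, never empty or truncated.
    # Trade-off: the target's inode changes, which is what would
    # defeat an flock taken on the state file itself.
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise
```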
Since the details reported in terraform-providers/terraform-provider-google#982 seem to be different than this issue as reported, I'm going to close this one and see if the original reporter is willing to open a fresh issue in this repository where we can capture trace log output, etc that may shed some light on what's going on there.
Very understandable, @apparentlymart, and thanks so much for looking into this. We're using something very similar to the workaround you suggest. :-)
I'm going to lock this issue because it has been closed for _30 days_ ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.