Terraform: Can no longer reboot and continue.

Created on 12 Apr 2018 · 37 comments · Source: hashicorp/terraform

Hi there,
Thank you for opening an issue. Please note that we try to keep the Terraform issue tracker reserved for bug reports and feature requests. For general usage questions, please see: https://www.terraform.io/community.html.

I posted on the Google Group and did not get any response. The gitter chat is full of questions and no answers.

In the absence of other avenues to get a question answered, I'm posting it here.

How do I reboot-and-continue with Terraform? In version 0.11.3 it was possible to issue the reboot command in a shell provisioner, and when the machine came out of the reboot, the next provisioner in the file would reconnect and continue.

Since 0.11.4 this no longer works. When the machine goes to reboot, Terraform errors out and provisioning stops.

How is this supposed to work when set up correctly?

enhancement provisioner/remote-exec

Most helpful comment

I also encountered this problem when I wanted to trigger a reboot in a null_resource. It helped to just add &, so now it looks like this for me:

provisioner "remote-exec" {
  inline = [
    "sudo reboot &",
  ]
}

I haven't completely verified that it works all the time, but so far it has.

All 37 comments

Hi @AndrewSav!

In Terraform 0.11.4 there was a change to try to make Terraform detect and report certain error conditions, rather than retrying indefinitely. Unfortunately this change was found to be a little too sensitive, so e.g. if sshd starts up before the authorized_keys file has been populated by cloud-init then Terraform would fail with an authentication error, rather than retrying. I think this may be the root cause of your problem here.

In 0.11.6 (#17744) this behavior was refined to treat authentication errors as retryable to support situations where sshd is running before credentials are fully populated. Could you try this with version 0.11.6 or later and see if that fixes the problem for you?

I just encountered a similar problem. A reboot needs to be triggered during the initial setup of an EC2 instance. To do that I'm using remote-exec inside a null_resource:

resource "null_resource" "yum-update" {
  triggers {
    instance_id = "${aws_instance.webapp.id}"
  }

  connection {
    type         = "ssh"
    user         = "${var.ssh_user}"
    host         = "${aws_instance.webapp.private_ip}"
    private_key  = "${file(var.ssh_key_path)}"
    bastion_host = "${var.ssh_use_bastion == true ? var.ssh_bastion_host : ""}"
  }

  provisioner "remote-exec" {
    inline = [
      "sudo yum update -y",
      "sudo reboot",
    ]
  }

  depends_on = [
    "aws_volume_attachment.webapp-ebs-att",
  ]
}

Terraform fails with the following message:

Error: Error applying plan:

1 error(s) occurred:

* module.xyz.null_resource.yum-update: error executing "/tmp/terraform_1226926016.sh": wait: remote command exited without exit status or exit signal

It's definitely not related to the authorized_keys race condition, as yum update -y executed without issues. Exactly the same code worked just fine with previous Terraform versions.

Terraform version:

$ terraform -v
Terraform v0.11.7
+ provider.aws v1.14.1
+ provider.null v1.0.0
+ provider.template v1.0.0

It looks like in cfa299d2ee5e4b0cf868f9ac7e49d852c3d986d0 we upgraded our vendored version of the Go SSH library to a newer version that added that error message, but that went out in v0.8.5 (over a year ago) and so cannot be the culprit for a recently-introduced issue.

The error seems to indicate that the SSH server closed the connection without reporting the result of the command, as described in RFC 4254 section 6.10, which I suppose could make sense if the sshd process were killed before reboot returned. I assume that prior to Terraform v0.11.4 this error was still occurring but being silently ignored.

The tricky thing here is that arguably the new behavior is more correct, since the SSH execution _is_ failing (it's not completing fully) and therefore Terraform should not proceed and assume the instance is fully provisioned in this case... there are other reasons why the connection might be shut down where it would _not_ be safe to continue.

Perhaps we can make a compromise here and add an option to the provisioner to treat this particular situation as a success, for situations where either the SSH server is being restarted or the system itself is being shut down. I'm not sure what is the best way to describe that situation to make an intuitive option, though: allow_missing_exit_status is the most directly descriptive, but doesn't really get at the _intent_ so if we went with that option I suppose configuration authors would need to annotate it with a comment explaining why:

  provisioner "remote-exec" {
    inline = [
      "sudo yum update -y",
      "sudo reboot",
    ]

    # sshd may exit before "sudo reboot" completes, preventing it from
    # returning the script's exit status.
    allow_missing_exit_status = true
  }

Adding an allow_missing_exit_status = true feature would work for me. I'm perfectly prepared to admit that rebooting during provisioning is weird and to call it out with a flag and a comment. As it is now, I'm falling back to tf 0.11.3 to keep working, because some of my fleet depends on the reboot before the next provisioner can continue. Thanks for looking at it.

The Terraform team at HashiCorp won't be able to work on this in the near future due to our focus being elsewhere, but we'd be happy to review a pull request if someone has the time and motivation to implement it.

Otherwise, we should be able to take a look at it once we've completed some other work in progress on the configuration language, which is likely to be at least a few months away.

I'm sorry for this unintended change in behavior. As an alternative to staying on 0.11.3, it might be possible to arrange for a necessary reboot operation to happen asynchronously so that the provisioner is able to complete successfully before it begins. For example, perhaps using the shutdown command with a non-now time would do the trick. If there are other subsequent provisioning steps it may be necessary to take some additional steps to ensure that the next provisioner won't connect before the reboot begins, such as revoking the authorized SSH key with some mechanism to re-install it after the reboot has completed.
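
For illustration, a minimal sketch of that delayed-shutdown idea, assuming the connection details are configured on the enclosing resource (the one-minute delay is just the smallest value shutdown's +m syntax accepts):

  provisioner "remote-exec" {
    inline = [
      "sudo yum update -y",

      # shutdown returns immediately; the actual reboot happens one
      # minute later, after this script has already exited cleanly.
      "sudo shutdown -r +1",
    ]
  }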

@apparentlymart apologies, I'm on holiday until 26th of April and don't have access to the required infrastructure to test this until then. I'll make sure to test and report back when I've returned from holiday.

I also encountered this problem when I wanted to trigger a reboot in a null_resource. It helped to just add &, so now it looks like this for me:

provisioner "remote-exec" {
  inline = [
    "sudo reboot &",
  ]
}

I haven't completely verified that it works all the time, but so far it has.

@haxorof It probably depends on the flavor of Linux. For whatever reason it did not work from Terraform with RancherOS for me (it did not cause a reboot), although from the command line it of course works. So I still think it's affected by the Terraform interaction.

I think the & solution for backgrounding might be a little tricky, because the sudo process still remains attached to the shell while it's running, so sshd shutting down may also send a signal to sudo, and thus in turn to reboot, killing it before it gets a chance to complete.

My thought about using the shutdown command above is that it's implemented in a way where the actual shutdown is managed by a background process, and so the shutdown command completes immediately, allowing the shell to exit before the shutdown begins. In the case of a systemd system, for example, I believe (IIRC) that a timed shutdown is handled by sending a message to logind, which then itself coordinates the shutdown. Since logind is a system daemon, it is not connected to your SSH session.

Just a followup: we implemented the suggestion from @haxorof (reboot &) and it's worked perfectly on Ubuntu 16.04 so far. I was going to use shutdown -r +1 plus a local-exec sleep 60, but was bummed that I'd be adding a minute to every instance creation. If I could pass a sub-minute timeout to shutdown I'd have done that, but until then I'll stick with the backgrounded reboot until we run into issues with it.
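
For completeness, a sketch of the shutdown -r +1 plus sleep combination mentioned above, assuming the resource's connection is already configured and that Terraform itself runs on a system where sleep is available:

  provisioner "remote-exec" {
    inline = [
      "sudo shutdown -r +1",
    ]
  }

  # Wait out the shutdown delay so the following provisioner doesn't
  # connect while the instance is still about to go down.
  provisioner "local-exec" {
    command = "sleep 60"
  }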

@AndrewSav: Yes, you are right. I tested on Ubuntu 17.10 and have now tried it on FreeBSD. It seems that the reboot & workaround does not work with the FreeBSD version I tried.

Rather than using a time argument to shutdown, you could delay the reboot in a subshell.

(sleep 2 && reboot)&
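
In provisioner form, that subshell delay might look something like this (a sketch only; whether sudo is needed depends on the connection user):

  provisioner "remote-exec" {
    inline = [
      # The subshell detaches the reboot from the script, so the script
      # itself can return an exit status before the connection drops.
      "(sleep 2 && sudo reboot)&",
    ]
  }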

@apparentlymart do you think a "remote-reboot" provisioner is appropriate?

@jbardin - wow, thank you so much! That actually worked for me! I'm guessing that in the presence of a workable workaround this is less of an issue now.

Guys would you like me to close the issue?

I think since this seems to be a common enough issue for users, we should consider making it part of the provisioner itself. I don't think we need another provisioner altogether, since this is just a special case of remote-exec. Having a special field like shutdown_command would be fairly easy to add, and that command could just ignore a connection failure after execution.
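
To be clear, no such field exists in any released Terraform at this point; purely as a hypothetical illustration of the idea, it might look like:

  provisioner "remote-exec" {
    inline = [
      "sudo yum update -y",
    ]

    # Hypothetical field (not implemented): run this last and treat a
    # dropped connection or missing exit status afterwards as success.
    shutdown_command = "sudo reboot"
  }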

Hitting this issue as well. As a temporary workaround, this seems to work for me (as mentioned earlier by others):

(sleep 5 && reboot)&

The above background reboots don't appear to be working for me on Ubuntu 18.04.

Any news on this as a provisioner feature, similar to Packer's windows restart? https://www.packer.io/docs/provisioners/windows-restart.html

EDIT:

Using the following workaround (a local-exec provisioner)

  provisioner "local-exec" {
    command = "ssh -o 'StrictHostKeyChecking no' -i ${var.pem_file_path} root@${digitalocean_droplet.web.ipv4_address} '(sleep 2; reboot)&'"
  }

A similar issue exists on Windows with WinRM. A workaround that works for us is a remote-exec provisioner like this:

  provisioner "remote-exec" {
    inline = [
      "shutdown /r /t 5",
      "net stop WinRM",
    ]
    ...
  }

The first command schedules the reboot a few seconds later; the delay avoids the shutdown sometimes killing the net stop WinRM command. The second command makes sure that the next provisioner doesn't connect while the machine is shutting down and then fail; this can sometimes happen even without a shutdown delay (shutdown /r /t 0). A separate remote-exec provisioner ensures that the output of the previous remote-exec provisioner is flushed.
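
Putting that together, the split into separate provisioners might look like this (a sketch; connection settings omitted and the first block's contents are placeholders):

  # Do the real provisioning in its own provisioner so its output is
  # flushed before the reboot is scheduled.
  provisioner "remote-exec" {
    inline = [
      "echo actual provisioning steps go here",
    ]
  }

  # Schedule the reboot, then stop WinRM so the next provisioner can't
  # connect while the machine is shutting down.
  provisioner "remote-exec" {
    inline = [
      "shutdown /r /t 5",
      "net stop WinRM",
    ]
  }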

This did not work for me with remote-exec:
"(sleep 2 && sudo reboot)&",
It didn't cause an error but it also didn't actually do a reboot.

So instead I tried this and it is working fine and would of course work for any OS.

  provisioner "local-exec" {
    command = "aws ec2 reboot-instances --instance-ids ${self.id}"
  }

@chakatz Nice workaround, though Terraform should be able to handle a reboot in the middle of a run.
I am using v0.11.10 now and still see the same issue.

Terraform team, please provide a solution to this as soon as possible.

Alternative workaround: shutdown -r +0

@frafra did you try it yourself? Because that's exactly what's not working.

@AndrewSav yes, sure, but this is a different syntax, and it works just fine for me, while reboot, systemctl reboot, and (sleep 3 && reboot) & do not. shutdown -r +0 still exits before restarting, so Terraform does not halt.

Here is my script: https://github.com/frafra/fedora-atomic-hetzner/blob/master/fedora-atomic-hetzner.tf
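
For reference, a minimal sketch of that shutdown -r +0 variant inside remote-exec (connection details omitted; sudo assumed):

  provisioner "remote-exec" {
    inline = [
      # As described above, shutdown -r +0 returns before the reboot
      # actually begins, so the script exits with a normal status.
      "sudo shutdown -r +0",
    ]
  }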

Hi all,

after some frustration, it seems I'm able to run with Terraform 0.11.11, but it definitely feels hacky: one null_resource with 3 provisioners (FYI: Windows instance provisioning):

provisioner "chef"  {
  # handles pre-reboot config mngmt; completes cleanly; schedules a delayed reboot
}

# see https://github.com/hashicorp/terraform/issues/17844#issuecomment-422960337 (above)
# `[remote-exec]: error during provision, continue requested` (see "on_failure" below)
provisioner "remote-exec" {
  inline = [
    "shutdown /r /f /t 5 /c "forced reboot",
    "net stop WinRM"
  ]
  # Terraform > v0.11.3 will fail if the provisioner doesn't report the exit status, but here we'll explicitly allow failure
  on_failure = "continue"
}

provisioner "chef"  {
  # handles post-reboot config mngmt
}

More advanced testing still in progress, but initial tests seem fine...

I guess in an ideal scenario, I'd like the Chef run to exit with code 35 or 37, and then the Terraform Chef provisioner to allow that, reconnect, and then pick up and complete the provisioning.
Perhaps not dissimilar to kitchen, using retry_on_exit_code (an array of exit codes indicating that kitchen should retry the converge command) and max_retries (the number of times to retry the converge before passing along the failed status).

Happy to get stuck in with a few more pointers on the Terraform internals - thanks in advance for your feedback!

  provisioner "remote-exec" {
    when = "create"

    inline = [
      "sudo shutdown -r +60",
      "echo 0",
    ]
  }

If anyone is fighting with this on Linux (connection actively refused error) I've written a little PowerShell/Bash combo that should cover Terraform running on both Windows and Linux: https://gist.github.com/janoszen/9df88ba0b906af1c18c0812a7128af7a

@frafra hm... there is no mention of shutdown in that script you linked.

I moved the commands into a shell script that gets executed by TF; it is in the same repository :-)

Shameless plug here, but maybe it actually helps someone get a reasonable workaround for this issue. I created a TF provider which is able to execute a command but ignore the result for this purpose, and I don't have any problems with reboots now. The configuration is limited, but can be easily extended. Also, Windows is not supported.

https://github.com/invidian/terraform-provider-sshcommand

I have done a quick implementation of allow_missing_exit_status at https://github.com/hashicorp/terraform/pull/22180, as described by @apparentlymart, to handle this case, tested on both Linux and Windows systems.

I'm not totally sold on this versus something more general like "ignore_errors", which would allow more use cases and weird stuff.

Today I got a panic in Terraform on a VM reboot (Terraform version: 0.12.3):

2019-07-26T07:06:02.722+1200 [DEBUG] plugin.terraform.exe: remote-exec-provisioner (internal) 2019/07/26 07:06:02 [ERROR] scp stderr: "Sink: C0644 32 terraform_1671735816.sh\n"
2019-07-26T07:06:02.722+1200 [DEBUG] plugin.terraform.exe: remote-exec-provisioner (internal) 2019/07/26 07:06:02 [DEBUG] opening new ssh session
2019-07-26T07:06:02.725+1200 [DEBUG] plugin.terraform.exe: remote-exec-provisioner (internal) 2019/07/26 07:06:02 [DEBUG] starting remote command: chmod 0777 /tmp/terraform_1671735816.sh
2019-07-26T07:06:02.731+1200 [DEBUG] plugin.terraform.exe: remote-exec-provisioner (internal) 2019/07/26 07:06:02 [DEBUG] remote command exited with '0': chmod 0777 /tmp/terraform_1671735816.sh
2019-07-26T07:06:02.732+1200 [DEBUG] plugin.terraform.exe: remote-exec-provisioner (internal) 2019/07/26 07:06:02 [DEBUG] opening new ssh session
2019-07-26T07:06:02.734+1200 [DEBUG] plugin.terraform.exe: remote-exec-provisioner (internal) 2019/07/26 07:06:02 [DEBUG] starting remote command: /tmp/terraform_1671735816.sh
2019-07-26T07:06:02.759+1200 [DEBUG] plugin.terraform.exe: remote-exec-provisioner (internal) 2019/07/26 07:06:02 [DEBUG] remote command exited with '0': /tmp/terraform_1671735816.sh
2019-07-26T07:06:02.760+1200 [DEBUG] plugin.terraform.exe: remote-exec-provisioner (internal) 2019/07/26 07:06:02 [DEBUG] opening new ssh session
2019-07-26T07:06:02.760+1200 [DEBUG] plugin.terraform.exe: remote-exec-provisioner (internal) 2019/07/26 07:06:02 [DEBUG] Starting remote scp process:  scp -vt /tmp
2019-07-26T07:06:02.763+1200 [DEBUG] plugin.terraform.exe: remote-exec-provisioner (internal) 2019/07/26 07:06:02 [DEBUG] Started SCP session, beginning transfers...
2019-07-26T07:06:02.763+1200 [DEBUG] plugin.terraform.exe: remote-exec-provisioner (internal) 2019/07/26 07:06:02 [DEBUG] Copying input data into temporary file so we can read the length
2019-07-26T07:06:02.764+1200 [DEBUG] plugin.terraform.exe: remote-exec-provisioner (internal) 2019/07/26 07:06:02 [DEBUG] Beginning file upload...
2019-07-26T07:06:02.768+1200 [DEBUG] plugin.terraform.exe: remote-exec-provisioner (internal) 2019/07/26 07:06:02 [DEBUG] SCP session complete, closing stdin pipe.
2019-07-26T07:06:02.768+1200 [DEBUG] plugin.terraform.exe: remote-exec-provisioner (internal) 2019/07/26 07:06:02 [DEBUG] Waiting for SSH session to complete.
2019-07-26T07:06:02.769+1200 [DEBUG] plugin.terraform.exe: remote-exec-provisioner (internal) 2019/07/26 07:06:02 [ERROR] scp stderr: "Sink: C0644 0 terraform_1671735816.sh\n"
2019/07/26 07:06:02 [TRACE] EvalApplyProvisioners: provisioning module.node.vsphere_virtual_machine.machine with "remote-exec"
2019/07/26 07:06:02 [TRACE] GetResourceInstance: vsphere_virtual_machine.machine is a single instance
2019-07-26T07:06:02.771+1200 [DEBUG] plugin.terraform.exe: remote-exec-provisioner (internal) 2019/07/26 07:06:02 [DEBUG] connecting to TCP connection for SSH
2019-07-26T07:06:02.772+1200 [DEBUG] plugin.terraform.exe: remote-exec-provisioner (internal) 2019/07/26 07:06:02 [DEBUG] handshaking with SSH
2019-07-26T07:06:02.849+1200 [DEBUG] plugin.terraform.exe: remote-exec-provisioner (internal) 2019/07/26 07:06:02 [DEBUG] starting ssh KeepAlives
2019-07-26T07:06:02.849+1200 [DEBUG] plugin.terraform.exe: remote-exec-provisioner (internal) 2019/07/26 07:06:02 [DEBUG] opening new ssh session
2019-07-26T07:06:03.137+1200 [DEBUG] plugin.terraform.exe: remote-exec-provisioner (internal) 2019/07/26 07:06:03 [WARN] ssh session open error: 'ssh: unexpected packet in response to channel open: <nil>', attempting reconnect
2019-07-26T07:06:03.137+1200 [DEBUG] plugin.terraform.exe: remote-exec-provisioner (internal) 2019/07/26 07:06:03 [DEBUG] connecting to TCP connection for SSH
2019-07-26T07:06:04.853+1200 [DEBUG] plugin.terraform.exe: panic: runtime error: invalid memory address or nil pointer dereference
2019-07-26T07:06:04.853+1200 [DEBUG] plugin.terraform.exe: [signal 0xc0000005 code=0x0 addr=0x0 pc=0x17a8b7c]
2019-07-26T07:06:04.853+1200 [DEBUG] plugin.terraform.exe: 
2019-07-26T07:06:04.853+1200 [DEBUG] plugin.terraform.exe: goroutine 258 [running]:
2019-07-26T07:06:04.854+1200 [DEBUG] plugin.terraform.exe: github.com/hashicorp/terraform/communicator/ssh.(*Communicator).Connect.func1(0xc000180b40, 0x223fe40, 0xc000519300)
2019-07-26T07:06:04.854+1200 [DEBUG] plugin.terraform.exe:  /opt/teamcity-agent/work/9e329aa031982669/src/github.com/hashicorp/terraform/communicator/ssh/communicator.go:235 +0x12c
2019-07-26T07:06:04.854+1200 [DEBUG] plugin.terraform.exe: created by github.com/hashicorp/terraform/communicator/ssh.(*Communicator).Connect
2019-07-26T07:06:04.854+1200 [DEBUG] plugin.terraform.exe:  /opt/teamcity-agent/work/9e329aa031982669/src/github.com/hashicorp/terraform/communicator/ssh/communicator.go:227 +0x519
2019/07/26 07:06:04 [WARN] Errors while provisioning vsphere_virtual_machine.machine with "remote-exec", so aborting
2019/07/26 07:06:04 [TRACE] EvalApplyProvisioners: module.node.vsphere_virtual_machine.machine provisioning failed, but we will continue anyway at the caller's request
2019/07/26 07:06:04 [TRACE] module.node: eval: *terraform.EvalMaybeTainted
2019/07/26 07:06:04 [TRACE] EvalMaybeTainted: module.node.vsphere_virtual_machine.machine encountered an error during creation, so it is now marked as tainted
2019/07/26 07:06:04 [TRACE] module.node: eval: *terraform.EvalWriteState
2019/07/26 07:06:04 [TRACE] EvalWriteState: writing current state object for module.node.vsphere_virtual_machine.machine
2019/07/26 07:06:04 [TRACE] module.node: eval: *terraform.EvalIf
2019/07/26 07:06:04 [TRACE] module.node: eval: *terraform.EvalIf
2019/07/26 07:06:04 [TRACE] module.node: eval: *terraform.EvalWriteDiff
2019/07/26 07:06:04 [TRACE] module.node: eval: *terraform.EvalApplyPost
2019/07/26 07:06:04 [ERROR] module.node: eval: *terraform.EvalApplyPost, err: 1 error occurred:
    * rpc error: code = Unavailable desc = transport is closing

2019/07/26 07:06:04 [ERROR] module.node: eval: *terraform.EvalSequence, err: rpc error: code = Unavailable desc = transport is closing
2019/07/26 07:06:04 [TRACE] [walkApply] Exiting eval tree: module.node.vsphere_virtual_machine.machine
2019/07/26 07:06:04 [TRACE] vertex "module.node.vsphere_virtual_machine.machine": visit complete
2019/07/26 07:06:04 [TRACE] dag/walk: upstream of "provisioner.file (close)" errored, so skipping
2019/07/26 07:06:04 [TRACE] dag/walk: upstream of "meta.count-boundary (EachMode fixup)" errored, so skipping
2019/07/26 07:06:04 [TRACE] dag/walk: upstream of "provider.vsphere (close)" errored, so skipping
2019/07/26 07:06:04 [TRACE] dag/walk: upstream of "provisioner.remote-exec (close)" errored, so skipping
2019/07/26 07:06:04 [TRACE] dag/walk: upstream of "root" errored, so skipping
2019/07/26 07:06:04 [TRACE] statemgr.Filesystem: reading latest snapshot from terraform.tfstate
2019/07/26 07:06:04 [TRACE] statemgr.Filesystem: snapshot file has nil snapshot, but that's okay
2019/07/26 07:06:04 [TRACE] statemgr.Filesystem: read nil snapshot
2019/07/26 07:06:04 [TRACE] statemgr.Filesystem: no original state snapshot to back up
2019/07/26 07:06:04 [TRACE] statemgr.Filesystem: state has changed since last snapshot, so incrementing serial to 1
2019/07/26 07:06:04 [TRACE] statemgr.Filesystem: writing snapshot at terraform.tfstate
2019/07/26 07:06:04 [TRACE] statemgr.Filesystem: removing lock metadata file .terraform.tfstate.lock.info
2019/07/26 07:06:04 [TRACE] statemgr.Filesystem: unlocked by closing terraform.tfstate
2019-07-26T07:06:04.870+1200 [DEBUG] plugin: plugin process exited: path=C:\Users\asavinykh\scoop\apps\terraform\current\terraform.exe pid=20112 error="exit status 2"
2019-07-26T07:06:04.870+1200 [DEBUG] plugin: plugin exited
2019-07-26T07:06:04.887+1200 [DEBUG] plugin: plugin process exited: path=C:\Users\asavinykh\scoop\apps\terraform\current\terraform.exe pid=25320
2019-07-26T07:06:04.887+1200 [DEBUG] plugin: plugin process exited: path=C:\Users\asavinykh\scoop\apps\terraform\current\terraform.exe pid=24520
2019-07-26T07:06:04.887+1200 [DEBUG] plugin: plugin exited
2019-07-26T07:06:04.887+1200 [DEBUG] plugin: plugin exited
2019-07-26T07:06:04.889+1200 [DEBUG] plugin: plugin process exited: path=C:\Users\asavinykh\scoop\apps\terraform\current\terraform.exe pid=19932
2019-07-26T07:06:04.889+1200 [DEBUG] plugin: plugin exited
2019-07-26T07:06:04.891+1200 [DEBUG] plugin: plugin process exited: path=C:\Users\asavinykh\scoop\apps\terraform\current\terraform.exe pid=18572
2019-07-26T07:06:04.891+1200 [DEBUG] plugin: plugin exited
2019-07-26T07:06:04.892+1200 [DEBUG] plugin: plugin process exited: path=C:\Users\asavinykh\scoop\apps\terraform\current\terraform.exe pid=16888
2019-07-26T07:06:04.892+1200 [DEBUG] plugin: plugin exited
2019-07-26T07:06:04.893+1200 [DEBUG] plugin: plugin process exited: path=E:\Sources\docker_ops\terraform\instances\t-ap-test-01\.terraform\plugins\windows_amd64\terraform-provider-vsphere_v1.12.0_x4.exe pid=27036
2019-07-26T07:06:04.893+1200 [DEBUG] plugin: plugin exited

Today I got a panic in Terraform on a VM reboot (Terraform version: 0.12.3)

@AndrewSav it does not look related, but could you try the change in https://github.com/hashicorp/terraform/pull/22180?

New proposed solution:

  provisioner "remote-exec" {
    inline = [ "reboot" ]
    on_failure = "continue"
    connection { host = self.ipv4_address }
  }

@frafra for what it's worth, I'm still getting connection errors intermittently even with on_failure = "continue", with the next provisioner not being able to execute.

I found systemctl reboot to work fine, while reboot throws an error.

The problem is that it's a race: you change something, the timing shifts slightly, it works once, and you think you fixed it, but it keeps failing intermittently.

allow_missing_exit_status

Is this available in Terraform 0.12.24? I am running into this error: An argument named "allow_missing_exit_status" is not expected here. I am using the null provider 2.1.2.

@roshanp85 no.
