Terraform: null_resource with ssh connection connects but hangs on executing file and remote provisioners

Created on 10 Mar 2017 · 20 comments · Source: hashicorp/terraform

Hi there,

Thank you for opening an issue. Please note that we try to keep the Terraform issue tracker reserved for bug reports and feature requests. For general usage questions, please see: https://www.terraform.io/community.html.

Terraform Version

Run terraform -v to show the version. If you are not running the latest version of Terraform, please upgrade because your issue may have already been fixed.

Terraform v0.8.8

Affected Resource(s)

Please list the resources as a list:

  • null_resource
  • connection
  • remote-exec
  • file

If this issue appears to affect multiple resources, it may be an issue with Terraform's core, so please mention this.

I've...

  • tried using a single connection object on null_resource
  • also embedded connection on remote-exec and file respectively
  • tried using my ssh-agent, killing my ssh-agent, using a private_key and it makes no impact
  • tried combinations of specifying just user, private_key, host, and added type, agent, timeout
  • tried just file w/o remote-exec, and vice versa
  • let it run for as long as 30 minutes
  • switching to a local-exec with scp, and it works fine (see the sketch after this list)
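
For reference, the local-exec workaround mentioned in the last item could look roughly like this. This is only a sketch: it reuses the key path, user, and hostname from the configuration below, but the exact scp flags and command are assumptions, not the reporter's actual command.

resource "null_resource" "sync_docker_files_scp" {
  # hypothetical workaround resource, not part of the original configuration
  provisioner "local-exec" {
    # -o StrictHostKeyChecking=no avoids an interactive host-key prompt during apply
    command = "scp -o StrictHostKeyChecking=no -i ${path.module}/../services/containers/demo-bastion/conf/demo.pem ${path.module}/../docker-compose.yml core@${module.demo.instance_dns}:/mnt/"
  }
}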

Terraform Configuration Files

# Copy-paste your Terraform configurations here - for large Terraform configs,
# please use a service like Dropbox and share a link to the ZIP file. For
# security, you can also encrypt the files using our GPG public key.

single connection

resource "null_resource" "sync_docker_files" {
  depends_on = ["module.demo"]
  triggers = {
    instance_id = "${module.demo.instance_id}"
  }
  connection {
    type = "ssh"
    user = "core"
    private_key = "${file("${path.module}/../services/containers/demo-bastion/conf/demo.pem")}"
    host = "${module.demo.instance_dns}"
    agent = false
    timeout = "10s"
  }

  provisioner "remote-exec" {
    inline = [
      "/usr/bin/sudo /usr/bin/chown core:core /mnt"
    ]
  }

  provisioner "file" {
    source = "${path.module}/../docker-compose.yml"
    destination = "/mnt/"
  }

  provisioner "file" {
    source = "${path.module}/../services"
    destination = "/mnt"
  }

}

multiple connection

resource "null_resource" "sync_docker_files" {
  depends_on = ["module.demo"]
  triggers = {
    instance_id = "${module.demo.instance_id}"
  }
  provisioner "remote-exec" {
    inline = [
      "/usr/bin/sudo /usr/bin/chown core:core /mnt"
    ]
    connection {
      type = "ssh"
      user = "core"
      private_key = "${file("${path.module}/../services/containers/demo-bastion/conf/demo.pem")}"
      host = "${module.demo.instance_dns}"
      agent = false
      timeout = "10s"
    }

  }

  provisioner "file" {
    source = "${path.module}/../docker-compose.yml"
    destination = "/mnt/"
    connection {
      type = "ssh"
      user = "core"
      private_key = "${file("${path.module}/../services/containers/demo-bastion/conf/demo.pem")}"
      host = "${module.demo.instance_dns}"
      agent = false
      timeout = "10s"
    }
  }

  provisioner "file" {
    source = "${path.module}/../services"
    destination = "/mnt"
    connection {
      type = "ssh"
      user = "core"
      private_key = "${file("${path.module}/../services/containers/demo-bastion/conf/demo.pem")}"
      host = "${module.demo.instance_dns}"
      agent = false
      timeout = "10s"
    }
  }
}

Debug Output

Please provide a link to a GitHub Gist containing the complete debug output: https://www.terraform.io/docs/internals/debugging.html. Please do NOT paste the debug output in the issue; just paste a link to the Gist.

https://gist.github.com/johnt337/e5e6aa157728ef03afc68f2ab2684e9c

Panic Output

If Terraform produced a panic, please provide a link to a GitHub Gist containing the output of the crash.log.

Expected Behavior

What should have happened?

  • It should connect to the server
  • It should modify directory permissions for /mnt
  • It should copy files up to the server in that location.

Actual Behavior

What actually happened?

  • It connects to the server
  • It hangs on the file and remote-exec provisioners
  • Trying ctrl+c, it keeps waiting
$ make build-infra
Get: file:///dockerfiles/jvm-profiling-demo/infrastructure/modules/s3-bucket
Get: file:///dockerfiles/jvm-profiling-demo/infrastructure/modules/demo
Refreshing Terraform state in-memory prior to plan...
The refreshed state will be used to calculate this plan, but
will not be persisted to local or remote state storage.

data.template_file.ssh_private_key: Refreshing state...
data.template_file.ssh_public_key: Refreshing state...
aws_security_group.demo: Refreshing state... (ID: sg-xxxxxx)
aws_iam_role.demo-server: Refreshing state... (ID: demo-server)
aws_key_pair.authorized_key: Refreshing state... (ID: demo)
aws_iam_role_policy.demo-server-ec2-tag: Refreshing state... (ID: demo-server:demo-server-ec2-tag)
aws_iam_instance_profile.demo-server: Refreshing state... (ID: demo-server)
module.demo.data.template_file.user-data: Refreshing state...
module.s3.aws_s3_bucket.site_bucket: Refreshing state... (ID: demo-jvm-profiling-us-east-1-xxxxxx)
module.demo.aws_instance.demo: Refreshing state... (ID: i-xxxxxx)
module.demo.aws_route53_record.demo: Refreshing state... (ID: xxxxxx_jvm-profiling-demo.mydemo.com_A)
aws_iam_role_policy.demo-server-s3: Refreshing state... (ID: demo-server:demo-server-s3)

The Terraform execution plan has been generated and is shown below.
Resources are shown in alphabetical order for quick scanning. Green resources
will be created (or destroyed and then created if an existing resource
exists), yellow resources are being changed in-place, and red resources
will be destroyed. Cyan entries are data sources to be read.

Your plan was also saved to the path below. Call the "apply" subcommand
with this plan file and Terraform will exactly execute this execution
plan.

Path: infra.tfplan

+ null_resource.sync_docker_files
    triggers.%:           "1"
    triggers.instance_id: "i-xxxxxxxxx"


Plan: 1 to add, 0 to change, 0 to destroy.
null_resource.sync_docker_files: Creating...
  triggers.%:           "" => "1"
  triggers.instance_id: "" => "i-xxxxxxxxx"
null_resource.sync_docker_files: Provisioning with 'remote-exec'...
null_resource.sync_docker_files (remote-exec): Connecting to remote host via SSH...
null_resource.sync_docker_files (remote-exec):   Host: jvm-profiling-demo.mydemo.com
null_resource.sync_docker_files (remote-exec):   User: core
null_resource.sync_docker_files (remote-exec):   Password: false
null_resource.sync_docker_files (remote-exec):   Private key: true
null_resource.sync_docker_files (remote-exec):   SSH Agent: false
null_resource.sync_docker_files (remote-exec): Connected!
null_resource.sync_docker_files: Still creating... (10s elapsed)
null_resource.sync_docker_files: Still creating... (20s elapsed)
...
null_resource.sync_docker_files: Still creating... (34m31s elapsed)
null_resource.sync_docker_files: Still creating... (34m41s elapsed)
...

Looking at the trace output, it states it's waiting for it to finish.

Steps to Reproduce

Please list the steps required to reproduce the issue:

plan-infra: infrastructure/deployment.tf get-modules
  @cd infrastructure && terraform plan -out infra.tfplan

build-infra: plan-infra infrastructure/infra.tfplan
  @cd infrastructure && terraform apply infra.tfplan

  1. Switch to the terraform folder and run terraform plan -out infra.tfplan
  2. Switch to the terraform folder and run terraform apply infra.tfplan
  3. (optional) Hit ctrl+c once; it does not respond

Important Factoids

Is there anything atypical about your accounts that we should know? For example: Running in EC2 Classic? Custom version of OpenStack? Tight ACLs?

Plain VPC, running coreos, and using DNS for the host.
CoreOS AMIs:

{
  "variable": {
    "amis":{
      "type":"map",
      "default":{
        "us-east-1.coreos.1235.6.0":"ami-3b7f9e2d",
        "us-west-2.coreos.1235.6.0":"ami-12942672"
      }
    }
  }
}
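
For context, such a map would typically be consumed with lookup(); a minimal sketch of that usage, assuming a hypothetical var.region variable and instance settings that are not part of the original config:

# Hypothetical consumer of the AMI map above; var.region is an assumed variable.
resource "aws_instance" "demo" {
  ami           = "${lookup(var.amis, format("%s.coreos.1235.6.0", var.region))}"
  instance_type = "t2.medium"
}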

References

Are there any other GitHub issues (open or closed) or Pull Requests that should be linked here?

  • None that I could find matched this exact problem.
Labels: bug, provisioner/remote-exec, v0.9, v0.10, v0.11

All 20 comments

I've continued to let this last run go...

null_resource.sync_docker_files: Still creating... (1h38m41s elapsed)

It finally died.

null_resource.sync_docker_files: Still creating... (2h11m11s elapsed)
null_resource.sync_docker_files: Still creating... (2h11m21s elapsed)
Error applying plan:

1 error(s) occurred:

* Failed to upload script: dial tcp 54.236.xx.xx:22: getsockopt: operation timed out

Terraform does not automatically rollback in the face of errors.
Instead, your Terraform state file has been partially updated with
any resources that successfully completed. Please address the error
above and apply again to incrementally change your infrastructure.

I'm somewhat happy to see that I'm not the only one suffering from this issue and that it might eventually get some traction and get fixed.

I initially thought that this was due to connectivity issues with the remote hosts and/or some firewall tearing down idle connections during remote-execs (because I haven't seen it for 'file' provisioners).

So I filed https://github.com/hashicorp/terraform/issues/12139: Getting the remote-exec provisioner to detect SSH connectivity issues (with or without bastion_host).

I spent some time trying to enable an SSH keepalive to make sure long-running (and quiet) 'remote-exec' commands would not be terminated, but this didn't help. I can see that TCPKeepAlive is enabled in the Terraform SSH client (communicator/ssh/communicator.go), but I think this only applies to non-bastion connections. Anyway, I have no proof yet that the SSH connection is being torn down. I also made my long-running 'remote-exec' script quite verbose to make sure that, regardless of keepalive parameters, the connection would not stay idle, and the result is the same: the 'remote-exec' stops producing any output while the script is still running on the target, and the resource is "Still creating..." until Terraform is terminated with Ctrl+C (twice).

So, I'm still looking at this problem from the same angle, trying to see whether it shows up on SSH disconnections. I'll be watching this ticket and will update it if I find a workaround. Thanks for reporting this issue.

This seems to be working for me in 0.9.0. I would say give it a try and see if it works for you now.

Thanks @johnt337 .

I might have forgotten to mention it, but for me this error is intermittent, which is why I've been thinking it could be related to some sporadic networking issue. So I will see if I get a better success rate with 0.9.0 and will update the ticket.

No luck. I'm still getting the same issue.

The 'remote-exec' stops being updated (while the script is running on the target) and the resource creation hangs indefinitely.
module.a.null_resource.b.0 (remote-exec): Running ...
module.a.null_resource.b.0: Still creating... (1m40s elapsed)
module.a.null_resource.b.0 (remote-exec): Running ...
module.a.null_resource.b.0: Still creating... (1m50s elapsed)
module.a.null_resource.b.0: Still creating... (2m50s elapsed)

One thing I think might have an impact: concurrency. I don't think there was an issue before I started running these remote-execs concurrently on multiple instances. Have you tried running it simultaneously on multiple targets? I will try to reproduce it on a single target to see if it's less subject to this issue and will update this ticket.

UPDATE: I'm getting the same result when attempting the remote-exec on a single instance, so this does not seem to be related to the concurrency of remote-exec.

I am seeing the same behaviour. I am on 0.9.1 now but it's still the same; I saw it on 0.8.8 as well.
The difference between the two versions is that when you hit Ctrl+C on 0.8.8 to cancel the hang, the tasks that are already finished are saved in state, while on 0.9.1 nothing is saved when you hit Ctrl+C, and I have to go back and delete the created resources manually...
We really need a workaround/fix for this...

In my case it seems to hang on a chef provisioner running inside a null_resource.

My case:

I am seeing situations like this:

module.app.null_resource.chef_client.3: Still creating... (4m30s elapsed)
module.app.null_resource.chef_client.3: Still creating... (4m40s elapsed)
module.app.null_resource.chef_client.3: Still creating... (4m50s elapsed)
module.app.null_resource.chef_client.3: Still creating... (5m0s elapsed)
....
module.app.null_resource.chef_client.3: Still creating... (6m50s elapsed)
module.app.null_resource.chef_client.3: Still creating... (7m0s elapsed)
module.app.null_resource.chef_client.3: Still creating... (7m10s elapsed)
module.app.null_resource.chef_client.3: Still creating... (7m20s elapsed)

that go on endlessly until they are shut down with a double Ctrl+C (which discards the changes made) or some eventual timeout.

Resource that is running looks like:

resource "null_resource" "chef_client" {
  count = "${var.count}"

  # Liberate ssh key of root etc
  provisioner "remote-exec" {
    inline = [
      "# Upgrade/Install loads of packages ",
    ]

    connection {
      type        = "ssh"
      user        = "${var.ssh_user}"
      private_key = "${file(var.admin_private_key_path)}"
      host        = "${aws_instance.instance.*.private_ip[count.index]}"
    }
  }

  # Delete old node & client from chef server
  provisioner "remote-exec" {
    inline = [
      "# Do some other stuff",
    ]

    connection {
      type        = "ssh"
      user        = "root"
      private_key = "${file(var.admin_private_key_path)}"
      host        = "${var.chef_ip}"
    }
  }

  # Provision host with chef
  provisioner "chef" {
    attributes_json = <<EOF
{
"fqdn": "${var.name_prefix}${count.index + 1}${var.name_postfix}.${var.domain}"
}
EOF

    environment             = "${var.chef_env}"
    run_list                = ["${split(",", var.chef_default_run_list)}"]
    node_name               = "${var.name_prefix}${count.index + 1}${var.name_postfix}"
    secret_key              = "${file(var.chef_secret_key_path)}"
    server_url              = "${var.chef_server_url}"
    user_name               = "${var.chef_validation_client_name}"
    user_key                = "${file(var.chef_validation_key_path)}"
    version                 = "${var.chef_version}"
    fetch_chef_certificates = true

    connection {
      type        = "ssh"
      user        = "root"
      private_key = "${file(var.admin_private_key_path)}"
      host        = "${aws_instance.instance.*.private_ip[count.index]}"
    }
  }
}

When I look at the running processes I see:

ps -ef|grep terra
  502  8857 79870   0  1:27PM ttys003    0:01.82 terraform apply
  502  8858  8857   0  1:27PM ttys003    0:03.11 /usr/local/terraform-v0.9.1/terraform apply
  502  8862  8858   0  1:27PM ttys003    0:00.34 /usr/local/terraform-v0.9.1/terraform internal-plugin provider terraform
  502  8863  8858   0  1:27PM ttys003    0:00.13 /usr/local/terraform-v0.9.1/terraform internal-plugin provider template
  502  8864  8858   0  1:27PM ttys003    0:01.83 /usr/local/terraform-v0.9.1/terraform internal-plugin provider aws
  502  8865  8858   0  1:27PM ttys003    0:00.17 /usr/local/terraform-v0.9.1/terraform internal-plugin provider null
  502  8866  8858   0  1:27PM ttys003    0:00.78 /usr/local/terraform-v0.9.1/terraform internal-plugin provisioner remote-exec
  502  8867  8858   0  1:27PM ttys003    0:00.95 /usr/local/terraform-v0.9.1/terraform internal-plugin provisioner chef

And when I kill the running chef provisioner I get:

pgrep -f "provisioner chef"
8867

pkill -f "provisioner chef"

ps -ef|grep terra
  502  9390  5628   0  1:36PM ttys006    0:00.00 grep terra

# End of TF apply
Error applying plan:

1 error(s) occurred:

* module.app.null_resource.chef_client[3]: 1 error(s) occurred:

* unexpected EOF

Terraform does not automatically rollback in the face of errors.
Instead, your Terraform state file has been partially updated with
any resources that successfully completed. Please address the error
above and apply again to incrementally change your infrastructure.

And once I run terraform apply again, the execution continues with the previously hanging resource tainted, but this time it runs through correctly. That may be because the remaining work is not as time-consuming as the first run... the initial chef run is quite long because of the number of packages installed after bootstrap.

@kristjanelias, I've done a lot of tracing to try to identify the issue, and my prime suspect is the AWS network breaking the SSH connection between our workstation and the instance at some point after the instance was spawned (within 5 minutes or so). Even a separate shell connection to the target gets interrupted at that point and doesn't recover. On the target, the SSH shell is still active, trying to send data back to the client and eventually timing out after many minutes. I tried bypassing the bastion and using local-exec with the OpenSSH client explicitly configured to send keepalive packets, but the result is the same: at some point (well after cloud-init) the socket goes silent. I don't have a support agreement with Amazon to work out this issue. In my case the remote-exec is also rather long (4-5 minutes), so the issue is more likely to occur than with a quick command. If I wait a bit before launching it, the SSH session is not interrupted. That's a very ugly workaround.
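
For what it's worth, the local-exec variant with explicit OpenSSH keepalives described above could look roughly like this. This is a sketch under assumptions: the var.* references and the script path are placeholders rather than the commenter's real setup, while ServerAliveInterval/ServerAliveCountMax are standard OpenSSH client options.

resource "null_resource" "provision_with_keepalive" {
  provisioner "local-exec" {
    # send an SSH keepalive probe every 30s; give up after 5 missed replies
    command = "ssh -o ServerAliveInterval=30 -o ServerAliveCountMax=5 -i ${var.private_key_path} ${var.ssh_user}@${var.instance_ip} 'bash /tmp/long_running_provision.sh'"
  }
}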

Now I am seeing the exact same behaviour using the OpenStack provider.
I am creating instances and provisioning them with the chef provisioner.
But again, everything freezes after about 5 minutes of the chef provisioning run.
[screenshot: terraform apply output, 2017-03-31 16:02:35]

From the chef-client log it can be seen that a SIGTERM was received for some reason, but Terraform will not stop...

I'm seeing the exact same issue. I have a null_resource joined to some AWS instances, which runs a simple script on my Docker swarm if any node is rebuilt.

A terraform apply always fails on the first deployment, hanging on the null_resource. Running it again, the null_resource is created and the issue does not return, unless I destroy/rebuild the environment.

@isabellf This explanation definitely sounds plausible.

Same issue on Azure, v0.11.2: with a null_resource running remote-exec, everything hangs. The files, though, are provisioned. If I try the same thing through a different resource ("azure_virtual_machine"), remote-exec works without issues on the same machine.

I was under the impression that this issue started occurring because I changed the SSH port from the default 22 to a random one, but from the comments here it looks like it also reproduces on the default port.

One thing to note: if I cancel this midway after noticing the possible timeout and rerun terraform apply, the execution actually succeeds after a few attempts. This issue might have multiple possible causes.
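
For anyone else running SSH on a non-default port, the connection block does accept a port argument. A minimal sketch (the user, key path, and host reference below are placeholders):

connection {
  type        = "ssh"
  user        = "azureuser"                                # placeholder user
  port        = 2222                                       # non-default SSH port
  private_key = "${file("~/.ssh/id_rsa")}"                 # placeholder key path
  host        = "${azurerm_public_ip.example.ip_address}"  # placeholder host reference
  timeout     = "2m"
}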

1 error(s) occurred:

  • null_resource.puppet-mercury: 1 error(s) occurred:

  • ssh: rejected: administratively prohibited (open failed)

Terraform does not automatically rollback in the face of errors.
Instead, your Terraform state file has been partially updated with
any resources that successfully completed. Please address the error
above and apply again to incrementally change your infrastructure.

My null_resource

resource "null_resource" "puppet-mercury" { depends_on = ["aws_instance.puppet-mercury"] triggers { cluster_instance_ids = "${join(",", aws_instance.puppet-mercury.*.id)}" } provisioner "puppet" { connection { user = "user" //username for ssh connection type = "ssh" agent = "false" timeout = "1m" bastion_host = ["${aws_instance.nat01.public_ip}"] bastion_private_key = "${file("/Users/malipr/.ssh/<<KEY>>")}" host = "${element(aws_instance.puppet-mercury.*.private_ip, 0)}" private_key = "${file("/home/praveen/.ssh/<<KEY>>")}" } puppetmaster_ip = "${var.puppet_master_ip}" #ip of Puppet Master use_sudo = true }

Facing the same issue. Enabled debug logs and found:

2018-06-26T12:11:53.102Z [DEBUG] plugin.terraform: chef-provisioner (internal) 2018/06/26 12:11:53 [INFO] sleeping for 20s

I can see a retry with exponential backoff in communicator/communicator.go. Can I control the deadline from somewhere so that context.DeadlineExceeded is thrown and it exits?

Faced the same issue (Terraform v0.10.5).
I noticed that it happens when the provisioning instance sends no messages back to Terraform (in my case, the puppet agent sends no output) for more than roughly 2.5 minutes (maybe less, I didn't test). It seems Terraform loses the connection to the remote instance after this time and doesn't notice.

I tried changing the ssh timeouts; that didn't help.
The only working method I've found so far is to have the provisioning instance send a message every minute for the full duration of provisioning:

...
module.server.openstack_compute_instance_v2.cmgeneric.4: Still creating... (6m0s elapsed)
module.server.openstack_compute_instance_v2.cmgeneric.4: Still creating... (6m10s elapsed)
module.server.openstack_compute_instance_v2.cmgeneric[4] (remote-exec): Puppet still running (4 minutes)...
module.server.openstack_compute_instance_v2.cmgeneric.4: Still creating... (6m20s elapsed)
module.server.openstack_compute_instance_v2.cmgeneric.4: Still creating... (6m30s elapsed)
module.server.openstack_compute_instance_v2.cmgeneric.4: Still creating... (6m40s elapsed)
module.server.openstack_compute_instance_v2.cmgeneric.4: Still creating... (6m50s elapsed)
module.server.openstack_compute_instance_v2.cmgeneric.4: Still creating... (7m0s elapsed)
module.server.openstack_compute_instance_v2.cmgeneric.4: Still creating... (7m10s elapsed)
module.server.openstack_compute_instance_v2.cmgeneric[4] (remote-exec): Puppet still running (5 minutes)...
module.server.openstack_compute_instance_v2.cmgeneric.4: Still creating... (7m20s elapsed)
module.server.openstack_compute_instance_v2.cmgeneric.4: Still creating... (7m30s elapsed)
module.server.openstack_compute_instance_v2.cmgeneric.4: Still creating... (7m40s elapsed)
module.server.openstack_compute_instance_v2.cmgeneric.4: Still creating... (7m50s elapsed)
module.server.openstack_compute_instance_v2.cmgeneric.4: Still creating... (8m0s elapsed)
module.server.openstack_compute_instance_v2.cmgeneric.4: Still creating... (8m10s elapsed)
module.server.openstack_compute_instance_v2.cmgeneric[4] (remote-exec): Puppet still running (6 minutes)...
module.server.openstack_compute_instance_v2.cmgeneric.4: Still creating... (8m20s elapsed)
module.server.openstack_compute_instance_v2.cmgeneric.4: Still creating... (8m30s elapsed)
module.server.openstack_compute_instance_v2.cmgeneric[4] (remote-exec): Info: Creating state file /var/lib/puppet/state/state.yaml
...

To send a message every minute while the instance is provisioning, you can run (add to your script) something like this:

i=0; while [ $i -ne 30 ]; do (( i++ )); sleep 60; ps aux | grep "puppet agent --test" | grep -vw grep > /dev/null && echo "Puppet still running ($i minutes)..." || i=30; done &

This may help someone with a similar issue until it is fixed in Terraform.
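
Wired into a remote-exec provisioner, that heartbeat could look roughly like this (a sketch only; 'puppet agent --test' stands in for whatever long-running command you actually provision with):

provisioner "remote-exec" {
  inline = [
    # background heartbeat: print a line every minute while the long-running process is alive,
    # so the SSH session never sits idle for long stretches
    "i=0; while [ $i -ne 30 ]; do i=$((i+1)); sleep 60; ps aux | grep 'puppet agent --test' | grep -vw grep > /dev/null && echo \"Puppet still running ($i minutes)...\" || i=30; done &",
    "puppet agent --test",
  ]
}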

Confirmed this also happens with Terraform 0.11.13, using a null_resource with a remote-exec provisioner in it, to provision to Azure.

I'm experiencing ssh timeouts using a null_resource with a chef provisioner in it -> AWS EC2, on Terraform v0.11.13. The provisioner connects as expected when placed directly on the aws_instance, but times out on the null_resource.

On both 0.11.7 and 0.11.13, I'm experiencing the null_resource hanging issue with a remote-exec provisioner in it.

Hi Everyone,

While there are a few different symptoms on display here, I think we need to narrow this issue down to the behavior in the original post. In that case we have a successful connection, but the command or session never completes. Unfortunately, once Terraform hands off control to execute on a remote host, there's not much else it can do to ensure the command completes successfully.

Terraform 0.12 contains a couple of changes to help make the things Terraform can control more reliable. Connection blocks must now contain a host parameter, which ensures the connection is made to a host explicitly defined in the configuration rather than assuming it will be correctly supplied by the provider. The SSH connection itself also sends keepalive messages, allowing Terraform to disconnect from unresponsive hosts more quickly.
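
As an illustration of the 0.12 requirement described above, a connection block now needs an explicit host. A minimal sketch in 0.12 syntax (the resource names and key path are placeholders):

resource "null_resource" "example" {
  provisioner "remote-exec" {
    inline = ["echo connected"]

    connection {
      type        = "ssh"
      user        = "core"
      host        = aws_instance.example.public_ip # host must now be set explicitly
      private_key = file("~/.ssh/id_rsa")           # placeholder key path
    }
  }
}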

While these changes can't ensure that all hosts and networks are configured and behave correctly when the provisioner is executed, it should prevent the original issue of the provisioner hanging after connecting to the remote host.

We're going to close this particular issue out, as it has only been reported on versions that are no longer under active development. Any other issues regarding provisioners in 0.12 should be filed as new issues.

Thanks!

I'm going to lock this issue because it has been closed for _30 days_. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
