Hi there,
Thank you for opening an issue. Please note that we try to keep the Terraform issue tracker reserved for bug reports and feature requests. For general usage questions, please see: https://www.terraform.io/community.html.
Run terraform -v to show the version. If you are not running the latest version of Terraform, please upgrade because your issue may have already been fixed.
Terraform v0.8.8
Please list the resources as a list, for example:
- null_resource
- connection
- remote-exec
- file

If this issue appears to affect multiple resources, it may be an issue with Terraform's core, so please mention this.

I have tried this with the connection object on the null_resource, and with connection blocks on the remote-exec and file provisioners respectively. Copying the files manually with scp works fine.

# Copy-paste your Terraform configurations here - for large Terraform configs,
# please use a service like Dropbox and share a link to the ZIP file. For
# security, you can also encrypt the files using our GPG public key.
resource "null_resource" "sync_docker_files" {
  depends_on = ["module.demo"]

  triggers = {
    instance_id = "${module.demo.instance_id}"
  }

  connection {
    type        = "ssh"
    user        = "core"
    private_key = "${file("${path.module}/../services/containers/demo-bastion/conf/demo.pem")}"
    host        = "${module.demo.instance_dns}"
    agent       = false
    timeout     = "10s"
  }

  provisioner "remote-exec" {
    inline = [
      "/usr/bin/sudo /usr/bin/chown core:core /mnt"
    ]
  }

  provisioner "file" {
    source      = "${path.module}/../docker-compose.yml"
    destination = "/mnt/"
  }

  provisioner "file" {
    source      = "${path.module}/../services"
    destination = "/mnt"
  }
}
resource "null_resource" "sync_docker_files" {
  depends_on = ["module.demo"]

  triggers = {
    instance_id = "${module.demo.instance_id}"
  }

  provisioner "remote-exec" {
    inline = [
      "/usr/bin/sudo /usr/bin/chown core:core /mnt"
    ]

    connection {
      type        = "ssh"
      user        = "core"
      private_key = "${file("${path.module}/../services/containers/demo-bastion/conf/demo.pem")}"
      host        = "${module.demo.instance_dns}"
      agent       = false
      timeout     = "10s"
    }
  }

  provisioner "file" {
    source      = "${path.module}/../docker-compose.yml"
    destination = "/mnt/"

    connection {
      type        = "ssh"
      user        = "core"
      private_key = "${file("${path.module}/../services/containers/demo-bastion/conf/demo.pem")}"
      host        = "${module.demo.instance_dns}"
      agent       = false
      timeout     = "10s"
    }
  }

  provisioner "file" {
    source      = "${path.module}/../services"
    destination = "/mnt"

    connection {
      type        = "ssh"
      user        = "core"
      private_key = "${file("${path.module}/../services/containers/demo-bastion/conf/demo.pem")}"
      host        = "${module.demo.instance_dns}"
      agent       = false
      timeout     = "10s"
    }
  }
}
Please provide a link to a GitHub Gist containing the complete debug output: https://www.terraform.io/docs/internals/debugging.html. Please do NOT paste the debug output in the issue; just paste a link to the Gist.
https://gist.github.com/johnt337/e5e6aa157728ef03afc68f2ab2684e9c
If Terraform produced a panic, please provide a link to a GitHub Gist containing the output of the crash.log.
What should have happened?
The file and remote-exec provisioners should have run to completion.
What actually happened?
The apply hangs on the null_resource; pressing ctrl+c does nothing and it keeps waiting.
$ make build-infra
Get: file:///dockerfiles/jvm-profiling-demo/infrastructure/modules/s3-bucket
Get: file:///dockerfiles/jvm-profiling-demo/infrastructure/modules/demo
Refreshing Terraform state in-memory prior to plan...
The refreshed state will be used to calculate this plan, but
will not be persisted to local or remote state storage.
data.template_file.ssh_private_key: Refreshing state...
data.template_file.ssh_public_key: Refreshing state...
aws_security_group.demo: Refreshing state... (ID: sg-xxxxxx)
aws_iam_role.demo-server: Refreshing state... (ID: demo-server)
aws_key_pair.authorized_key: Refreshing state... (ID: demo)
aws_iam_role_policy.demo-server-ec2-tag: Refreshing state... (ID: demo-server:demo-server-ec2-tag)
aws_iam_instance_profile.demo-server: Refreshing state... (ID: demo-server)
module.demo.data.template_file.user-data: Refreshing state...
module.s3.aws_s3_bucket.site_bucket: Refreshing state... (ID: demo-jvm-profiling-us-east-1-xxxxxx)
module.demo.aws_instance.demo: Refreshing state... (ID: i-xxxxxx)
module.demo.aws_route53_record.demo: Refreshing state... (ID: xxxxxx_jvm-profiling-demo.mydemo.com_A)
aws_iam_role_policy.demo-server-s3: Refreshing state... (ID: demo-server:demo-server-s3)
The Terraform execution plan has been generated and is shown below.
Resources are shown in alphabetical order for quick scanning. Green resources
will be created (or destroyed and then created if an existing resource
exists), yellow resources are being changed in-place, and red resources
will be destroyed. Cyan entries are data sources to be read.
Your plan was also saved to the path below. Call the "apply" subcommand
with this plan file and Terraform will exactly execute this execution
plan.
Path: infra.tfplan
+ null_resource.sync_docker_files
triggers.%: "1"
triggers.instance_id: "i-xxxxxxxxx"
Plan: 1 to add, 0 to change, 0 to destroy.
null_resource.sync_docker_files: Creating...
triggers.%: "" => "1"
triggers.instance_id: "" => "i-xxxxxxxxx"
null_resource.sync_docker_files: Provisioning with 'remote-exec'...
null_resource.sync_docker_files (remote-exec): Connecting to remote host via SSH...
null_resource.sync_docker_files (remote-exec): Host: jvm-profiling-demo.mydemo.com
null_resource.sync_docker_files (remote-exec): User: core
null_resource.sync_docker_files (remote-exec): Password: false
null_resource.sync_docker_files (remote-exec): Private key: true
null_resource.sync_docker_files (remote-exec): SSH Agent: false
null_resource.sync_docker_files (remote-exec): Connected!
null_resource.sync_docker_files: Still creating... (10s elapsed)
null_resource.sync_docker_files: Still creating... (20s elapsed)
...
null_resource.sync_docker_files: Still creating... (34m31s elapsed)
null_resource.sync_docker_files: Still creating... (34m41s elapsed)
...
Looking at the trace output, it states that it is waiting for the provisioner to finish.
Please list the steps required to reproduce the issue:
plan-infra: infrastructure/deployment.tf get-modules
@cd infrastructure && terraform plan -out infra.tfplan
build-infra: plan-infra infrastructure/infra.tfplan
@cd infrastructure && terraform apply infra.tfplan
1. terraform plan -out infra.tfplan
2. terraform apply infra.tfplan
3. ctrl+c once and it does not respond

Is there anything atypical about your accounts that we should know? For example: Running in EC2 Classic? Custom version of OpenStack? Tight ACLs?
Plain VPC, running coreos, and using DNS for the host.
CoreOS AMIs:
{
  "variable": {
    "amis": {
      "type": "map",
      "default": {
        "us-east-1.coreos.1235.6.0": "ami-3b7f9e2d",
        "us-west-2.coreos.1235.6.0": "ami-12942672"
      }
    }
  }
}
Are there any other GitHub issues (open or closed) or Pull Requests that should be linked here?
I've continued to let this last run go...
null_resource.sync_docker_files: Still creating... (1h38m41s elapsed)
It finally died.
null_resource.sync_docker_files: Still creating... (2h11m11s elapsed)
null_resource.sync_docker_files: Still creating... (2h11m21s elapsed)
Error applying plan:
1 error(s) occurred:
* Failed to upload script: dial tcp 54.236.xx.xx:22: getsockopt: operation timed out
Terraform does not automatically rollback in the face of errors.
Instead, your Terraform state file has been partially updated with
any resources that successfully completed. Please address the error
above and apply again to incrementally change your infrastructure.
I'm somewhat happy to see that I'm not the only one suffering from this issue and that it might eventually get some traction and get fixed.
I initially thought that this was due to connectivity issues with the remote hosts and/or some firewall tearing down idle connections during remote-execs (because I haven't seen it for 'file' provisioners).
So I filed https://github.com/hashicorp/terraform/issues/12139: "Getting the remote-exec provisioner to detect SSH connectivity issues (with or without bastion_host)".
I spent some time trying to enable an SSH keepalive to make sure long-running (and quiet) 'remote-exec' sessions would not be terminated, but this didn't help. I can see that TCPKeepAlive is enabled in the Terraform SSH client (communicator/ssh/communicator.go), but I think this only applies to non-bastion connections. Anyway, I have no proof yet that the SSH connection is being torn down. I also made my long-running 'remote-exec' script quite verbose, to make sure that regardless of keepalive parameters the connection would not stay idle, and the result is the same: the 'remote-exec' stops producing any output while the script is still running on the target, and the resource is "Still creating..." until Terraform is terminated with CTRL-C (twice).
So, I'm still looking at this problem with the same angle, trying to see if the problem shows up on SSH disconnections. I'll be watching this ticket! I'll update it if I find some workaround. Thanks for reporting this issue.
This seems to be working for me in 0.9.0. I would say give it a try and see if it works for you now.
Thanks @johnt337 .
I might have forgotten to mention it, but for me this error is intermittent, which is why I've been thinking it could be related to some sporadic networking issue. So I will see if I get a better success rate with 0.9.0 and will update the ticket.
No luck. I'm still getting the same issue.
The 'remote-exec' output stops being updated (while the script is still running on the target) and the resource creation hangs indefinitely.
module.a.null_resource.b.0 (remote-exec): Running ...
module.a.null_resource.b.0: Still creating... (1m40s elapsed)
module.a.null_resource.b.0 (remote-exec): Running ...
module.a.null_resource.b.0: Still creating... (1m50s elapsed)
module.a.null_resource.b.0: Still creating... (2m50s elapsed)
One thing I think might have an impact: concurrency. I think there was no issue before I started running these remote-execs concurrently on multiple instances. Have you tried running it simultaneously on multiple targets? I will try to reproduce it on a single target to see if it's less subject to this issue, and will update this ticket.
UPDATE I'm getting the same result when attempting the remote-exec on a single instance, so this does not seem related to concurrency of remote-exec.
I am seeing the same behaviour. I am on 0.9.1 now but still the same. Saw the same also on 0.8.8.
The difference between the two versions is that on 0.8.8, when you hit Ctrl+C to cancel the hang, the tasks that have already finished are saved in state, while on 0.9.1, when you hit Ctrl+C, nothing is saved. I need to go back and manually delete the resources created...
Really need a workaround/fix for this...
In my case it seems to be hanging on a chef provisioner running inside a null_resource.
My case:
I am seeing situations like this:
module.app.null_resource.chef_client.3: Still creating... (4m30s elapsed)
module.app.null_resource.chef_client.3: Still creating... (4m40s elapsed)
module.app.null_resource.chef_client.3: Still creating... (4m50s elapsed)
module.app.null_resource.chef_client.3: Still creating... (5m0s elapsed)
....
module.app.null_resource.chef_client.3: Still creating... (6m50s elapsed)
module.app.null_resource.chef_client.3: Still creating... (7m0s elapsed)
module.app.null_resource.chef_client.3: Still creating... (7m10s elapsed)
module.app.null_resource.chef_client.3: Still creating... (7m20s elapsed)
that go on endlessly until they are shut down with a double Ctrl+C (which discards the changes made) or some eventual timeout.
Resource that is running looks like:
resource "null_resource" "chef_client" {
  count = "${var.count}"

  # Liberate ssh key of root etc
  provisioner "remote-exec" {
    inline = [
      "# Upgrade/Install loads of packages ",
    ]

    connection {
      type        = "ssh"
      user        = "${var.ssh_user}"
      private_key = "${file(var.admin_private_key_path)}"
      host        = "${aws_instance.instance.*.private_ip[count.index]}"
    }
  }

  # Delete old node & client from chef server
  provisioner "remote-exec" {
    inline = [
      "# Do some other stuff",
    ]

    connection {
      type        = "ssh"
      user        = "root"
      private_key = "${file(var.admin_private_key_path)}"
      host        = "${var.chef_ip}"
    }
  }

  # Provision host with chef
  provisioner "chef" {
    attributes_json = <<EOF
{
  "fqdn": "${var.name_prefix}${count.index + 1}${var.name_postfix}.${var.domain}"
}
EOF

    environment             = "${var.chef_env}"
    run_list                = ["${split(",", var.chef_default_run_list)}"]
    node_name               = "${var.name_prefix}${count.index + 1}${var.name_postfix}"
    secret_key              = "${file(var.chef_secret_key_path)}"
    server_url              = "${var.chef_server_url}"
    user_name               = "${var.chef_validation_client_name}"
    user_key                = "${file(var.chef_validation_key_path)}"
    version                 = "${var.chef_version}"
    fetch_chef_certificates = true

    connection {
      type        = "ssh"
      user        = "root"
      private_key = "${file(var.admin_private_key_path)}"
      host        = "${aws_instance.instance.*.private_ip[count.index]}"
    }
  }
}
When I look at the processes running I see:
ps -ef|grep terra
502 8857 79870 0 1:27PM ttys003 0:01.82 terraform apply
502 8858 8857 0 1:27PM ttys003 0:03.11 /usr/local/terraform-v0.9.1/terraform apply
502 8862 8858 0 1:27PM ttys003 0:00.34 /usr/local/terraform-v0.9.1/terraform internal-plugin provider terraform
502 8863 8858 0 1:27PM ttys003 0:00.13 /usr/local/terraform-v0.9.1/terraform internal-plugin provider template
502 8864 8858 0 1:27PM ttys003 0:01.83 /usr/local/terraform-v0.9.1/terraform internal-plugin provider aws
502 8865 8858 0 1:27PM ttys003 0:00.17 /usr/local/terraform-v0.9.1/terraform internal-plugin provider null
502 8866 8858 0 1:27PM ttys003 0:00.78 /usr/local/terraform-v0.9.1/terraform internal-plugin provisioner remote-exec
502 8867 8858 0 1:27PM ttys003 0:00.95 /usr/local/terraform-v0.9.1/terraform internal-plugin provisioner chef
And when I kill the chef provisioner that is running I get:
pgrep -f "provisioner chef"
8867
pkill -f "provisioner chef"
ps -ef|grep terra
502 9390 5628 0 1:36PM ttys006 0:00.00 grep terra
# End of TF apply
Error applying plan:
1 error(s) occurred:
* module.app.null_resource.chef_client[3]: 1 error(s) occurred:
* unexpected EOF
Terraform does not automatically rollback in the face of errors.
Instead, your Terraform state file has been partially updated with
any resources that successfully completed. Please address the error
above and apply again to incrementally change your infrastructure.
And once I run terraform apply again, the execution continues with the hung resource tainted, and this time it runs through correctly. This may be because the work left to do is not as time-consuming as the first run; the initial chef run is quite long because of the number of packages installed after bootstrap.
@kristjanelias, I've done a lot of tracing to try to identify the issue, and my prime suspect is the AWS network breaking the SSH connection between our workstation and the instance at some point after the instance was spawned (within 5 minutes or so). Even a separate shell connection to the target gets interrupted at this point and doesn't recover. On the target, the SSH shell is still active, trying to send data back to the client and eventually timing out after many minutes. I tried to bypass the bastion, and to use local-exec with the OpenSSH client explicitly configured to send keepalive packets, but the result is the same: at some point (well after cloud-init) the socket goes silent. I don't have a support agreement with Amazon to work out this issue. In my case the remote-exec is also rather long (4-5 minutes), so the issue is more likely to occur than with a quick command. If I wait a bit before launching it, the SSH session is not interrupted. That's a very ugly workaround.
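To test the keepalive theory outside Terraform's built-in SSH client, the local-exec approach mentioned above can be sketched as follows. This is only an illustration: the variable names (private_key_path, instance_dns) are hypothetical, while ServerAliveInterval and ServerAliveCountMax are standard OpenSSH client options that make the client send keepalive probes and give up after repeated failures.

```hcl
resource "null_resource" "remote_cmd_via_openssh" {
  provisioner "local-exec" {
    # Run the remote command with the system OpenSSH client instead of
    # Terraform's internal SSH implementation, with client keepalives on.
    command = <<EOF
ssh -o StrictHostKeyChecking=no \
    -o ServerAliveInterval=15 \
    -o ServerAliveCountMax=4 \
    -i ${var.private_key_path} \
    core@${var.instance_dns} '/usr/bin/sudo /usr/bin/chown core:core /mnt'
EOF
  }
}
```

With these options, OpenSSH aborts the session after roughly a minute of silence instead of hanging indefinitely, which at least surfaces the disconnection as an error.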
Now I am seeing the exact same behaviour using the OpenStack provider.
I am creating instances and provisioning them using the chef provisioner.
But again, it freezes after about 5 minutes of chef provisioning.

From the chef client log it can be seen that a SIGTERM was received for some reason. But Terraform will not stop...
I'm seeing the exact same issue. I have a null_resource joined to some AWS instances, which runs a simple script on my Docker swarm if any node is rebuilt.
Without fail, a "terraform apply" will always fail upon the first deployment, hanging on the null_resource. Running it again, the null_resource is created, and the issue does not return - unless I destroy/rebuild the environment.
@isabellf This explanation definitely sounds plausible.
Same issue on Azure, v0.11.2: a "null_resource" with "remote-exec" hangs, though the files are provisioned. If I run the same remote-exec through a different resource ("azure_virtual_machine") it works without issues for the same machine.
I was under the impression that this issue started occurring because I changed the SSH port from the default 22 to a random one, but from the comments here it looks like it reproduces on the default port too.
One thing to note: if I cancel midway after noticing the possible timeout and rerun terraform apply, after a few tries the execution actually succeeds. This issue might have multiple causes.
1 error(s) occurred:
null_resource.puppet-mercury: 1 error(s) occurred:
ssh: rejected: administratively prohibited (open failed)
Terraform does not automatically rollback in the face of errors.
Instead, your Terraform state file has been partially updated with
any resources that successfully completed. Please address the error
above and apply again to incrementally change your infrastructure.
My null_resource
resource "null_resource" "puppet-mercury" {
  depends_on = ["aws_instance.puppet-mercury"]

  triggers {
    cluster_instance_ids = "${join(",", aws_instance.puppet-mercury.*.id)}"
  }

  provisioner "puppet" {
    connection {
      user                = "user" // username for ssh connection
      type                = "ssh"
      agent               = "false"
      timeout             = "1m"
      bastion_host        = "${aws_instance.nat01.public_ip}"
      bastion_private_key = "${file("/Users/malipr/.ssh/<<KEY>>")}"
      host                = "${element(aws_instance.puppet-mercury.*.private_ip, 0)}"
      private_key         = "${file("/home/praveen/.ssh/<<KEY>>")}"
    }

    puppetmaster_ip = "${var.puppet_master_ip}" # ip of Puppet Master
    use_sudo        = true
  }
}
Facing the same issue. Enabled debug logs and found:
2018-06-26T12:11:53.102Z [DEBUG] plugin.terraform: chef-provisioner (internal) 2018/06/26 12:11:53 [INFO] sleeping for 20s
I can see a retry with exponential backoff in communicator/communicator.go. Can I control the deadline from somewhere so that context.DeadlineExceeded is thrown and it exits?
Faced the same issue (terraform v0.10.5).
I noticed that it happens when no messages are sent from the provisioned instance back to Terraform (in my case, the puppet agent doesn't send messages) for more than about 2.5 minutes (maybe less, I didn't test). It seems Terraform loses the connection to the remote instance after this time and doesn't notice.
I tried changing the ssh timeouts; it didn't help.
The only working method I've found so far is to have the provisioned instance send a message every minute for the full duration of the provision:
...
module.server.openstack_compute_instance_v2.cmgeneric.4: Still creating... (6m0s elapsed)
module.server.openstack_compute_instance_v2.cmgeneric.4: Still creating... (6m10s elapsed)
module.server.openstack_compute_instance_v2.cmgeneric[4] (remote-exec): Puppet still running (4 minutes)...
module.server.openstack_compute_instance_v2.cmgeneric.4: Still creating... (6m20s elapsed)
module.server.openstack_compute_instance_v2.cmgeneric.4: Still creating... (6m30s elapsed)
module.server.openstack_compute_instance_v2.cmgeneric.4: Still creating... (6m40s elapsed)
module.server.openstack_compute_instance_v2.cmgeneric.4: Still creating... (6m50s elapsed)
module.server.openstack_compute_instance_v2.cmgeneric.4: Still creating... (7m0s elapsed)
module.server.openstack_compute_instance_v2.cmgeneric.4: Still creating... (7m10s elapsed)
module.server.openstack_compute_instance_v2.cmgeneric[4] (remote-exec): Puppet still running (5 minutes)...
module.server.openstack_compute_instance_v2.cmgeneric.4: Still creating... (7m20s elapsed)
module.server.openstack_compute_instance_v2.cmgeneric.4: Still creating... (7m30s elapsed)
module.server.openstack_compute_instance_v2.cmgeneric.4: Still creating... (7m40s elapsed)
module.server.openstack_compute_instance_v2.cmgeneric.4: Still creating... (7m50s elapsed)
module.server.openstack_compute_instance_v2.cmgeneric.4: Still creating... (8m0s elapsed)
module.server.openstack_compute_instance_v2.cmgeneric.4: Still creating... (8m10s elapsed)
module.server.openstack_compute_instance_v2.cmgeneric[4] (remote-exec): Puppet still running (6 minutes)...
module.server.openstack_compute_instance_v2.cmgeneric.4: Still creating... (8m20s elapsed)
module.server.openstack_compute_instance_v2.cmgeneric.4: Still creating... (8m30s elapsed)
module.server.openstack_compute_instance_v2.cmgeneric[4] (remote-exec): Info: Creating state file /var/lib/puppet/state/state.yaml
...
To send a message every minute while the instance is provisioning, you can run (or add to your script) something like this:
i=0
while [ $i -ne 30 ]; do
  (( i++ ))
  sleep 60
  ps aux | grep "puppet agent --test" | grep -vw grep > /dev/null \
    && echo "Puppet still running ($i minutes)..." \
    || i=30
done &
This may help someone with a similar issue until it is fixed in Terraform.
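In provisioner form, the same heartbeat can be embedded directly in a remote-exec block. This is only a sketch; "puppet agent --test" stands in for whatever long-running command you provision with, and the 30-minute cap is arbitrary:

```hcl
provisioner "remote-exec" {
  inline = [
    # Hypothetical heartbeat: print a line every minute while the agent is
    # running (giving up after 30 minutes) so the SSH session never sits idle.
    "i=0; while [ $i -ne 30 ]; do i=$((i+1)); sleep 60; pgrep -f 'puppet agent --test' > /dev/null && echo \"Puppet still running ($i minutes)...\" || i=30; done &",
    "puppet agent --test",
  ]
}
```

The background loop starts before the long command and exits on its own once the command's process is no longer found.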
Confirmed this also happens with Terraform 0.11.13, using a null_resource with a remote-exec provisioner in it, to provision to Azure.
I'm experiencing ssh timeouts using a null_resource with a chef provisioner in it -> AWS EC2, on terraform v 0.11.13. The provisioner connects as expected when placed directly on the aws_instance, but times out on the null_resource
On both 0.11.7 and 0.11.13, I'm experiencing the null_resource hanging issue with a remote-exec provisioner in it.
Hi Everyone,
While there are a few different symptoms being reported here, I think we need to narrow this issue down to the behavior in the original post: a successful connection, but a command or session that never completes. Unfortunately, once Terraform hands off control to execute on a remote host, there's not much else it can do to ensure the command completes successfully.
Terraform 0.12 contains a couple of changes to make the things Terraform can control more reliable. Connection blocks must now contain a host parameter, which ensures that the connection is made to a host explicitly defined in the configuration, rather than assuming it will be correctly supplied by the provider. The ssh connection itself also sends keepalive messages, allowing Terraform to disconnect from unresponsive hosts more quickly.
While these changes can't ensure that all hosts and networks are configured and behave correctly when the provisioner is executed, it should prevent the original issue of the provisioner hanging after connecting to the remote host.
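For readers upgrading, a minimal 0.12-style provisioner with the now-required host parameter looks roughly like this. It is a sketch only: the resource arguments are elided, the key path is hypothetical, and self.public_ip assumes a resource that exports that attribute:

```hcl
resource "aws_instance" "demo" {
  # ... AMI, instance type, etc. elided ...

  provisioner "remote-exec" {
    inline = ["echo connected"]

    connection {
      type        = "ssh"
      host        = self.public_ip # must now be set explicitly
      user        = "core"
      private_key = file("${path.module}/demo.pem")
      timeout     = "2m"
    }
  }
}
```

Note that 0.12 syntax drops the "${...}" wrapper around standalone expressions such as the file() call and the self reference.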
We're going to close this particular issue out, as it has only been reported on versions that are no longer under active development. Any other issues regarding provisioners in 0.12 should be filed as new issues.
Thanks!
I'm going to lock this issue because it has been closed for _30 days_ ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.