I'm using the AWS provider and have reviewed every Terraform issue I could find related to SSH connectivity failures involving the remote-exec provisioner, and I've exhausted all the avenues suggested in those threads.
Here's the exception emitted by Terraform:
Error applying plan:
1 error(s) occurred:
- dial tcp 52.19.120.112:22: i/o timeout
...
As I mention in the title, the timeout occurs sporadically. To provide a bit of context, I'm using the latest released version of Terraform. We create machine (EC2) clusters of varying sizes. Each cluster is associated with the same, newly created, VPC. The VPC currently has a single NAT subnet that routes through an internet-accessible gateway. All of this is of course hosted in Amazon.
Here are the pseudo-steps:
Any help is greatly appreciated.
Perhaps I'm being cautiously optimistic but I consider this matter closed. I ended up using the aws_main_route_table_association resource to ensure default subnet connectivity between the public internet (me) and each node of the cluster.
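For reference, a minimal sketch of that association (the resource names here are hypothetical; adjust them to your own VPC and route table definitions):

```hcl
# Hypothetical resource names for illustration only.
resource "aws_route_table" "public" {
  vpc_id = "${aws_vpc.main.id}"

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = "${aws_internet_gateway.main.id}"
  }
}

# Make this the VPC's main route table so subnets without an explicit
# association route through the internet gateway by default.
resource "aws_main_route_table_association" "main" {
  vpc_id         = "${aws_vpc.main.id}"
  route_table_id = "${aws_route_table.public.id}"
}
```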
I spoke too soon. I'm still experiencing sporadic SSH connectivity failures using remote-exec provisioners with the AWS provider. Any and all suggestions are welcome. I wonder if this is a race condition with respect to AWS metadata propagation.
UPDATE: As an example, for a simple 2 node cluster, the remote-exec provisioner successfully connected to one of the nodes but failed to connect to the other one.
UPDATE: This behavior occurs nearly 100% of the time. I've executed scripts against nearly all AWS regions and all suffer from this same timeout.
Is anyone else experiencing issues with SSH connectivity to AWS? Please note the steps outlined in my initial comment.
I've experienced a similar issue. I'm provisioning AWS VPC instances that are in a public subnet with Chef. Security groups and ACLs only allow ssh access via a VPN connection.
The problem seems to stem from the fact that Terraform does not consistently use the private IP. With the same resource definition applied to two different instances, it will try to connect to one using the private IP and to the other using the public IP. Since SSH is not allowed on the public IP, any instance it tries to connect to that way times out.
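When SSH is only reachable over the VPN, one way to remove the ambiguity is to pin the connection to the private address explicitly. A sketch, using the interpolation syntax of that Terraform era (user and key path are placeholders):

```hcl
resource "aws_instance" "app" {
  # ...instance arguments elided...

  # Force Terraform to connect over the private address so it never
  # falls back to the public IP, which the security group blocks.
  connection {
    type        = "ssh"
    user        = "ec2-user"
    host        = "${self.private_ip}"
    private_key = "${file("~/.ssh/id_rsa")}"
  }
}
```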
Many thanks for the information. I'll try defining a connection within the remote-exec provisioner that explicitly sets the host attribute to the public IP address of the target node (self.public_ip) to see if that makes a difference.
As it turns out, it's not the remote-exec provisioner that's failing but the file provisioner attempting to connect to the target machine over SSH. My assumption, since I haven't looked at the source code yet, is that the file provisioner reuses the default connection defined within the resource. It's the connection to the file provisioner's destination that's failing.
I specified...
host = "${self.public_ip}"
...within the connection block of the related resource to no avail.
Just experienced the same thing, twice in a row (in two completely different AWS environments), so there seems to be some kind of pattern closely related to remote-exec.
Terraform version: v0.6.16
Plan:
+ aws_eip.xyz-dev-sftp-eip
+ aws_eip_association.xyz-dev-sftp-eip-assoc
+ aws_instance.xyz-dev-sftp
+ aws_route_table.xyz-dev-sftp-rt
+ aws_route_table_association.xyz-dev-sftp-rt-assoc
+ aws_subnet.xyz-dev-sftp-subnet
+ null_resource.xyz-dev-sftp-consul-ip-helper
Error:
aws_instance.xyz-dev-sftp: Creation complete
aws_eip_association.xyz-dev-sftp-eip-assoc: Creating...
allocation_id: "" => "eipalloc-6e77312b"
instance_id: "" => "i-2ecae3a4"
network_interface_id: "" => "<computed>"
private_ip_address: "" => "<computed>"
public_ip: "" => "<computed>"
aws_eip_association.xyz-dev-sftp-eip-assoc: Creation complete
Error applying plan:
1 error(s) occurred:
* dial tcp 52.50.181.123:22: i/o timeout
All resources except null_resource have been successfully provisioned. null_resource looks like this:
resource "null_resource" "xyz-dev-sftp-consul-ip-helper" {
  count = 1

  triggers {
    index  = "${var.xyz-dev-sftp-null-index}"
    a_addr = "${join(",", aws_instance.xyz-dev-a.*.private_ip)}"
    b_addr = "${join(",", aws_instance.xyz-dev-b.*.private_ip)}"
    c_addr = "${join(",", aws_instance.xyz-dev-c.*.private_ip)}"
  }

  connection {
    type        = "ssh"
    user        = "ec2-user"
    host        = "${element(aws_eip.xyz-dev-sftp-eip.*.public_ip, count.index)}"
    private_key = "${file("~/.ssh/dev-terraform.pem")}"
  }

  provisioner "remote-exec" {
    inline = [
      "sudo mkdir -p /etc/terraform && echo -e \"${join("\n", formatlist("%s,%s", aws_instance.xyz-dev-a.*.private_ip, aws_instance.xyz-dev-a.*.availability_zone))}\" | sudo tee /etc/terraform/a.addr",
      "sudo mkdir -p /etc/terraform && echo -e \"${join("\n", formatlist("%s,%s", aws_instance.xyz-dev-b.*.private_ip, aws_instance.xyz-dev-b.*.availability_zone))}\" | sudo tee /etc/terraform/b.addr",
      "sudo mkdir -p /etc/terraform && echo -e \"${join("\n", formatlist("%s,%s", aws_instance.xyz-dev-c.*.private_ip, aws_instance.xyz-dev-c.*.availability_zone))}\" | sudo tee /etc/terraform/c.addr",
    ]
  }
}
It's worth mentioning that a terraform apply executed right after this one completes successfully.
Based on a bit of research, my read is that you're encountering propagation delay of AWS metadata for the objects being created. To mitigate this, I ended up adding a local-exec provisioner that simply pauses long enough for the metadata to propagate.
# Allow AWS infrastructure metadata to propagate.
provisioner "local-exec" {
  command = "sleep 15"
}
It's less than ideal but I needed a timely work-around.
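An alternative to a fixed sleep is to raise the connection's own timeout, which keeps retrying the SSH dial until the deadline instead of pausing unconditionally (a sketch; it assumes a Terraform version whose connection block supports the timeout argument, and the user/key values are placeholders):

```hcl
connection {
  type        = "ssh"
  user        = "ec2-user"
  host        = "${self.public_ip}"
  private_key = "${file("~/.ssh/id_rsa")}"

  # Keep retrying the SSH connection for up to five minutes while
  # instance metadata and networking settle.
  timeout = "5m"
}
```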
Same error with definition:
resource "aws_instance" "my" {
  count                       = "${var.count}"
  instance_type               = "${var.instance_type}"
  ami                         = "${lookup(var.amis, var.region)}"
  key_name                    = "${var.tag}_ssh_key"
  subnet_id                   = "${aws_subnet.my.id}"
  private_ip                  = "${cidrhost("10.43.0.0/16", 10 + count.index)}"
  associate_public_ip_address = true
  vpc_security_group_ids      = ["${aws_security_group.my.id}"]

  root_block_device {
    volume_type           = "gp2"
    volume_size           = "${var.block_device_volume_size}"
    delete_on_termination = true
  }

  connection {
    user        = "${var.aws_instance_user}"
    private_key = "${base64decode(var.ssh_private_key)}"
    host        = "${self.public_ip}"
  }

  provisioner "file" {
    source      = "${path.module}/scripts"
    destination = "/tmp"
  }

  provisioner "remote-exec" {
    inline = [
      "chmod +x /tmp/scripts/bootstrap.sh",
      "/tmp/scripts/bootstrap.sh"
    ]
  }
}
output:
Error applying plan:
3 error(s) occurred:
* aws_instance.my[1]: 1 error(s) occurred:
* timeout
* aws_instance.my[0]: 1 error(s) occurred:
* dial tcp 13.57.18.229:22: i/o timeout
* aws_instance.my[2]: 1 error(s) occurred:
* dial tcp 52.53.198.237:22: i/o timeout
Terraform does not automatically rollback in the face of errors.
Instead, your Terraform state file has been partially updated with
any resources that successfully completed. Please address the error
above and apply again to incrementally change your infrastructure.
$ terraform -v
Terraform v0.10.2
Same issue here with vsphere provider:
Terraform v0.11.1
+ provider.vsphere v1.1.1
Worked fine on 0.4.2.
Any update on how to prevent this from occurring or what exactly is causing it?
It seems like a security group blocked access from where Terraform is running to the remote instance?
I have experienced a similar timeout. At my end, the issue was that the AMI ID was wrong for the given region. It was fixed after updating the AMI ID.
Hello! :robot:
This issue relates to an older version of Terraform that is no longer in active development, and because the area of Terraform it relates to has changed significantly since the issue was opened we suspect that the issue is either fixed or that the circumstances around it have changed enough that we'd need an updated issue report in order to reproduce and address it.
If you're still seeing this or a similar issue in the latest version of Terraform, please do feel free to open a new bug report! Please be sure to include all of the information requested in the template, even if it might seem redundant with the information already shared in _this_ issue, because the internal details relating to this problem are likely to be different in the current version of Terraform.
Thanks!
I'm going to lock this issue because it has been closed for _30 days_ ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.