I'm using the AWS provider and have reviewed every Terraform issue I could find related to SSH connectivity failures involving the remote-exec provisioner, and I've exhausted all the avenues suggested in those threads.
Here's the exception emitted by Terraform:
Error applying plan:
1 error(s) occurred:
- dial tcp 52.19.120.112:22: i/o timeout
...
As I mention in the title, the timeout occurs sporadically. To provide a bit of context, I'm using the latest released version of Terraform. We create machine (EC2) clusters of varying sizes. Each cluster is associated with the same, newly created, VPC. The VPC currently has a single NAT subnet that routes through an internet-accessible gateway. All of this is of course hosted in Amazon.
Here are the pseudo-steps:
Any help is greatly appreciated.
Perhaps I'm being cautiously optimistic but I consider this matter closed. I ended up using the aws_main_route_table_association resource to ensure default subnet connectivity between the public internet (me) and each node of the cluster.
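For reference, a minimal sketch of that association (the resource names here are hypothetical; adjust them to your own VPC and route table definitions):

```hcl
# Hypothetical resource names for illustration only.
resource "aws_route_table" "public" {
  vpc_id = "${aws_vpc.main.id}"

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = "${aws_internet_gateway.main.id}"
  }
}

# Make this the VPC's main route table so subnets without an explicit
# association route through the internet gateway by default.
resource "aws_main_route_table_association" "main" {
  vpc_id         = "${aws_vpc.main.id}"
  route_table_id = "${aws_route_table.public.id}"
}
```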
I spoke too soon. I'm still experiencing sporadic SSH connectivity failures using remote-exec provisioners with the AWS provider. Any and all suggestions are welcome. I wonder if this is a race condition with respect to AWS metadata propagation.
UPDATE: As an example, for a simple 2 node cluster, the remote-exec provisioner successfully connected to one of the nodes but failed to connect to the other one.
UPDATE: This behavior occurs nearly 100% of the time. I've executed scripts against nearly all AWS regions and all suffer from this same timeout.
Is anyone else experiencing issues with SSH connectivity to AWS? Please note the steps outlined in my initial comment.
I've experienced a similar issue. I'm provisioning AWS VPC instances that are in a public subnet with Chef. Security groups and ACLs only allow ssh access via a VPN connection.
The problem seems to stem from the fact that Terraform does not consistently use the private IP. With the same resource definition applied to two different instances, it will try to connect to one using the private IP and to the other using the public IP. Since SSH is not allowed on the public IP, any instance it tries to connect to that way times out.
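When SSH is only reachable over the VPN, one way to remove the ambiguity is to pin the connection to the private address explicitly. A sketch, using the interpolation syntax of that Terraform era (user and key path are placeholders):

```hcl
resource "aws_instance" "app" {
  # ...instance arguments elided...

  # Force Terraform to connect over the private address so it never
  # falls back to the public IP, which the security group blocks.
  connection {
    type        = "ssh"
    user        = "ec2-user"
    host        = "${self.private_ip}"
    private_key = "${file("~/.ssh/id_rsa")}"
  }
}
```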
Many thanks for the information. I'll try defining a connection within the remote-exec provisioner that explicitly sets the host attribute to the public IP address of the target node (self.public_ip) to see if that makes a difference.
As it turns out, it's not the remote-exec provisioner that's failing but the file provisioner attempting to connect to the target machine over SSH. My assumption, since I haven't looked at the source code yet, is that the file provisioner reuses the default connection defined within the resource. It's the connection to the file provisioner's destination that's failing.
I specified...
host = "${self.public_ip}"
...within the connection block of the related resource to no avail.
Just experienced the same thing, twice in a row (in two completely different AWS environments), so there seems to be some kind of pattern closely related to remote-exec.
Terraform version: v0.6.16
Plan:
+ aws_eip.xyz-dev-sftp-eip
+ aws_eip_association.xyz-dev-sftp-eip-assoc
+ aws_instance.xyz-dev-sftp
+ aws_route_table.xyz-dev-sftp-rt
+ aws_route_table_association.xyz-dev-sftp-rt-assoc
+ aws_subnet.xyz-dev-sftp-subnet
+ null_resource.xyz-dev-sftp-consul-ip-helper
Error:
aws_instance.xyz-dev-sftp: Creation complete
aws_eip_association.xyz-dev-sftp-eip-assoc: Creating...
allocation_id: "" => "eipalloc-6e77312b"
instance_id: "" => "i-2ecae3a4"
network_interface_id: "" => "<computed>"
private_ip_address: "" => "<computed>"
public_ip: "" => "<computed>"
aws_eip_association.xyz-dev-sftp-eip-assoc: Creation complete
Error applying plan:
1 error(s) occurred:
* dial tcp 52.50.181.123:22: i/o timeout
All resources except null_resource have been successfully provisioned. null_resource looks like this:
resource "null_resource" "xyz-dev-sftp-consul-ip-helper" {
  count = 1

  triggers {
    index  = "${var.xyz-dev-sftp-null-index}"
    a_addr = "${join(",", aws_instance.xyz-dev-a.*.private_ip)}"
    b_addr = "${join(",", aws_instance.xyz-dev-b.*.private_ip)}"
    c_addr = "${join(",", aws_instance.xyz-dev-c.*.private_ip)}"
  }

  connection {
    type        = "ssh"
    user        = "ec2-user"
    host        = "${element(aws_eip.xyz-dev-sftp-eip.*.public_ip, count.index)}"
    private_key = "${file("~/.ssh/dev-terraform.pem")}"
  }

  provisioner "remote-exec" {
    inline = [
      "sudo mkdir -p /etc/terraform && echo -e \"${join("\n", formatlist("%s,%s", aws_instance.xyz-dev-a.*.private_ip, aws_instance.xyz-dev-a.*.availability_zone))}\" | sudo tee /etc/terraform/a.addr",
      "sudo mkdir -p /etc/terraform && echo -e \"${join("\n", formatlist("%s,%s", aws_instance.xyz-dev-b.*.private_ip, aws_instance.xyz-dev-b.*.availability_zone))}\" | sudo tee /etc/terraform/b.addr",
      "sudo mkdir -p /etc/terraform && echo -e \"${join("\n", formatlist("%s,%s", aws_instance.xyz-dev-c.*.private_ip, aws_instance.xyz-dev-c.*.availability_zone))}\" | sudo tee /etc/terraform/c.addr",
    ]
  }
}
It's worth mentioning that a terraform apply executed right after this one completes successfully.
Based on a bit of research, my read is that you're encountering propagation delay of AWS metadata for the objects being created. To mitigate this, I ended up adding a local-exec provisioner that simply pauses long enough for the metadata to propagate.
# Allow AWS infrastructure metadata to propagate.
provisioner "local-exec" {
  command = "sleep 15"
}
It's less than ideal but I needed a timely work-around.
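An alternative to a fixed sleep is to raise the connection's own timeout, which keeps retrying the SSH dial until the deadline instead of pausing unconditionally (a sketch; it assumes a Terraform version whose connection block supports the timeout argument, and the user/key values are placeholders):

```hcl
connection {
  type        = "ssh"
  user        = "ec2-user"
  host        = "${self.public_ip}"
  private_key = "${file("~/.ssh/id_rsa")}"

  # Keep retrying the SSH connection for up to five minutes while
  # instance metadata and networking settle.
  timeout = "5m"
}
```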
Same error with definition:
resource "aws_instance" "my" {
  count                       = "${var.count}"
  instance_type               = "${var.instance_type}"
  ami                         = "${lookup(var.amis, var.region)}"
  key_name                    = "${var.tag}_ssh_key"
  subnet_id                   = "${aws_subnet.my.id}"
  private_ip                  = "${cidrhost("10.43.0.0/16", 10 + count.index)}"
  associate_public_ip_address = true
  vpc_security_group_ids      = ["${aws_security_group.my.id}"]

  root_block_device {
    volume_type           = "gp2"
    volume_size           = "${var.block_device_volume_size}"
    delete_on_termination = true
  }

  connection {
    user        = "${var.aws_instance_user}"
    private_key = "${base64decode(var.ssh_private_key)}"
    host        = "${self.public_ip}"
  }

  provisioner "file" {
    source      = "${path.module}/scripts"
    destination = "/tmp"
  }

  provisioner "remote-exec" {
    inline = [
      "chmod +x /tmp/scripts/bootstrap.sh",
      "/tmp/scripts/bootstrap.sh"
    ]
  }
}
output:
Error applying plan:
3 error(s) occurred:
* aws_instance.my[1]: 1 error(s) occurred:
* timeout
* aws_instance.my[0]: 1 error(s) occurred:
* dial tcp 13.57.18.229:22: i/o timeout
* aws_instance.my[2]: 1 error(s) occurred:
* dial tcp 52.53.198.237:22: i/o timeout
Terraform does not automatically rollback in the face of errors.
Instead, your Terraform state file has been partially updated with
any resources that successfully completed. Please address the error
above and apply again to incrementally change your infrastructure.
$ terraform -v
Terraform v0.10.2
Same issue here with vsphere provider:
Terraform v0.11.1
+ provider.vsphere v1.1.1
Worked fine on 0.4.2.
Any update on how to prevent this from occurring or what exactly is causing it?
It seems like a security group blocked access from where Terraform is running to the remote instance?
I have experienced a similar timeout. At my end, the issue was that the AMI ID was wrong for the given region. It was fixed after updating the AMI ID.
Hello! :robot:
This issue relates to an older version of Terraform that is no longer in active development, and because the area of Terraform it relates to has changed significantly since the issue was opened we suspect that the issue is either fixed or that the circumstances around it have changed enough that we'd need an updated issue report in order to reproduce and address it.
If you're still seeing this or a similar issue in the latest version of Terraform, please do feel free to open a new bug report! Please be sure to include all of the information requested in the template, even if it might seem redundant with the information already shared in _this_ issue, because the internal details relating to this problem are likely to be different in the current version of Terraform.
Thanks!
I'm going to lock this issue because it has been closed for _30 days_ ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.