Terraform: Remote Exec Failing Even After Successful Execution of the Script

Created on 23 Jul 2018 · 8 comments · Source: hashicorp/terraform

Terraform Version

Terraform v0.11.7

* provider.null: version = "~> 1.0"
* provider.template: version = "~> 1.0"

Terraform Configuration Files

resource "null_resource" "pr13_remote_exec_0" {
    count = "1"

    provisioner "file" {
        content      = "${element(data.template_file.pr13_template_0.*.rendered, count.index)}"
        destination  = "/tmp/remote-exec.sh"
        connection {
            type     = "ssh"
            user     = "povijayan"
            private_key = "${file("/Users/povijayan/.ssh/id_rsa")}"
            host     = "${element(var.hosts_0,count.index)}"
        }
    }

    provisioner "remote-exec" {
        inline = [
            "sudo mkdir -p /opt/test/remote-exec-scripts",
            "sudo cp -Rvf /tmp/remote-exec.sh /opt/test/remote-exec-scripts/",
            "sudo chmod +x /opt/test/remote-exec-scripts/remote-exec.sh",
            "sudo /opt/test/remote-exec-scripts/remote-exec.sh"
        ]
        connection {
            type     = "ssh"
            user     = "povijayan"
            private_key = "${file("/Users/povijayan/.ssh/id_rsa")}"
            host     = "${element(var.hosts_0,count.index)}"
            timeout = "30m"
        }
    }
}
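
The variable and data source referenced above are not included in the report. For completeness, a minimal, hypothetical pair of definitions that would make the snippet self-contained could look like this (the host address and template filename are placeholders, not taken from the original issue):

variable "hosts_0" {
    type    = "list"
    default = ["10.0.0.10"]  # placeholder host address
}

data "template_file" "pr13_template_0" {
    count    = "1"
    # Placeholder template file rendered into /tmp/remote-exec.sh above.
    template = "${file("${path.module}/remote-exec.sh.tpl")}"
}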

Debug Output

Error: Error applying plan:

1 error(s) occurred:

  • null_resource.pr13_remote_exec_0: error executing "/tmp/terraform_1857918131.sh": wait: remote command exited without exit status or exit signal

Expected Behavior

The remote-exec provisioner should complete without any errors, since the script being executed ends with a proper exit code.

Actual Behavior

The remote-exec script run via the above configuration fails with the error shown above.
When we check the script's execution log, the script ran fine and ended with a proper exit code, yet remote-exec still reports a failure.

We are able to reproduce this with a simple script like the one below (though not consistently). We cannot understand the reason for this error. Please help.

#! /bin/bash
exec > >(tee /var/log/test.log|logger -t test -s 2>/dev/console) 2>&1

sleep 10m

exit 0

Steps to Reproduce

  1. terraform init
  2. terraform apply
Labels: bug, provisioner/remote-exec, v0.11, waiting for reproduction

All 8 comments

This could be due to general SSH keepalive logic. (http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/overview.html)

If there is no response from your bash script within a certain time, the connection is broken by the keepalive logic. The script continues to run on the server, but Terraform can no longer track its result.

To solve it, you can change the sshd_config file (/etc/ssh/sshd_config) on your server using the ClientAliveInterval and ClientAliveCountMax parameters.

My provider, DigitalOcean, supports cloud-init (https://cloudinit.readthedocs.io/en/latest/). I send cloud-config data in the user_data parameter (which works on DigitalOcean droplets) to change the sshd_config file during cloud initialization. For example:

#cloud-config
write_files:
  - content: |
        ...
        ...
        ...
        ClientAliveInterval 120
        ClientAliveCountMax 720
    path: /etc/ssh/sshd_config

This keeps the connection alive for 120 × 720 seconds (one day) without any activity: the server sends up to 720 empty keepalive packets to the client, one every 120 seconds. I think cloud-init is the best way to solve this problem.

If your provider does not have this feature, you can solve it with provisioners. For example:

resource "<YOUR_PROVIDER_OR_NULL_RESOURCE>" "<RESOURCE_NAME>" {
    ...
    ...
    ...

    provisioner "file" {
        destination = "/etc/ssh/sshd_config"
        source      = "<YOUR_SSHD_CONFIG_FILE_PATH>"
    }

    provisioner "remote-exec" {
        inline = [
            "systemctl restart sshd", # This works on CentOS; if you use another OS, adjust this line.
        ]
    }
}

resource "null_resource" "<RESOURCE_NAME>" {
    connection {
        ...
        ...
        ...
    }

    provisioner "remote-exec" {
        inline = [
             # Test commands; replace them with your own.
             "sleep 10m",
             "echo COMPLETED"
        ]
    }
}

I hope this solves your problem.

So I'm assuming I'd need to make the change on the server in a separate, short remote-exec step in order to prepare for my complicated provisioning script, which would then run in the next remote-exec session. A sketch of that two-step setup follows.
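
A minimal sketch of that two-step approach, under some assumptions not in the thread: the prep resource name is made up for illustration, the keepalive values are appended with sudo tee -a rather than replacing the whole sshd_config, and the host is assumed to be systemd-based with passwordless sudo (the service is named sshd on CentOS/RHEL and ssh on Debian/Ubuntu):

resource "null_resource" "sshd_keepalive_prep" {
    count = "1"

    connection {
        type        = "ssh"
        user        = "povijayan"
        private_key = "${file("/Users/povijayan/.ssh/id_rsa")}"
        host        = "${element(var.hosts_0, count.index)}"
    }

    # Short remote-exec that only raises the keepalive limits and restarts sshd.
    provisioner "remote-exec" {
        inline = [
            "echo 'ClientAliveInterval 120' | sudo tee -a /etc/ssh/sshd_config",
            "echo 'ClientAliveCountMax 720' | sudo tee -a /etc/ssh/sshd_config",
            "sudo systemctl restart sshd"
        ]
    }
}

resource "null_resource" "pr13_remote_exec_0" {
    count      = "1"
    # Run the long provisioning only after the keepalive settings are in place.
    depends_on = ["null_resource.sshd_keepalive_prep"]

    # ... file and remote-exec provisioners as in the original configuration ...
}

On typical sshd service units, restarting the daemon does not drop already-established sessions, so the short prep remote-exec itself should complete normally.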

If I'm understanding the history on this correctly, this appears to be due to the lack of SSH keepalive in 0.11.x. Keepalive has since been added, and I don't think this should happen anymore. If anyone is still seeing this behavior, please leave a note here, ideally with a reproduction case on 0.13.x or the 0.14.0 pre-releases. Otherwise, I'll close this around the second 0.14 beta and consider it resolved.

I am still seeing this issue with RedHat 7.

I do. Ubuntu 20.04.1 LTS (Focal Fossa)

I too face the issue

Hello,

I am also facing the same issue on Ubuntu 18. Can you please look into it ASAP?

Also experiencing this issue on Ubuntu 20.04.
In my case it seems that a combination of docker run and docker exec as the last two commands in remote-exec is the cause. If I remove the docker exec command, the issue does not happen, which makes the keepalive behavior mentioned above sound like the culprit.
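
For reference, the pattern described is roughly the following; the image, container name, and script path are placeholders, not taken from the report:

  provisioner "remote-exec" {
    inline = [
      # Start a container, then run a command inside it as the final step.
      "sudo docker run -d --name app example/image:latest",
      "sudo docker exec app /opt/app/setup.sh"
    ]
  }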

In case anyone still has issues after the changes to keepalive: I noticed that explicitly exiting with status 0 at the end of the remote-exec block also works. I'm not sure whether it might cause false positives in some cases, though.

  provisioner "remote-exec" {
    inline = [
      ....
      "exit 0"
    ]
  }

EDIT: @ersoyfilinte's solution of setting ClientAliveInterval and ClientAliveCountMax through cloud-config worked for me!
