Terraform: Remote Exec Failing Even After Successful Execution of the Script

Created on 23 Jul 2018 · 8 comments · Source: hashicorp/terraform

Terraform Version

Terraform v0.11.7

* provider.null: version = "~> 1.0"
* provider.template: version = "~> 1.0"

Terraform Configuration Files

resource "null_resource" "pr13_remote_exec_0" {
    count = "1"

    provisioner "file" {
        content      = "${element(data.template_file.pr13_template_0.*.rendered, count.index)}"
        destination  = "/tmp/remote-exec.sh"
        connection {
            type     = "ssh"
            user     = "povijayan"
            private_key = "${file("/Users/povijayan/.ssh/id_rsa")}"
            host     = "${element(var.hosts_0,count.index)}"
        }
    }

    provisioner "remote-exec" {
        inline = [
            "sudo mkdir -p /opt/test/remote-exec-scripts",
            "sudo cp -Rvf /tmp/remote-exec.sh /opt/test/remote-exec-scripts/",
            "sudo chmod +x /opt/test/remote-exec-scripts/remote-exec.sh",
            "sudo /opt/test/remote-exec-scripts/remote-exec.sh"
        ]
        connection {
            type     = "ssh"
            user     = "povijayan"
            private_key = "${file("/Users/povijayan/.ssh/id_rsa")}"
            host     = "${element(var.hosts_0,count.index)}"
            timeout = "30m"
        }
    }
}
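
The variable and data source referenced above are not included in the report. For completeness, a minimal, hypothetical pair of definitions that would make the snippet self-contained could look like this (the host address and template filename are placeholders, not taken from the original issue):

variable "hosts_0" {
    type    = "list"
    default = ["10.0.0.10"]  # placeholder host address
}

data "template_file" "pr13_template_0" {
    count    = "1"
    # Placeholder template file rendered into /tmp/remote-exec.sh above.
    template = "${file("${path.module}/remote-exec.sh.tpl")}"
}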

Debug Output

Error: Error applying plan:

1 error(s) occurred:

  • null_resource.pr13_remote_exec_0: error executing "/tmp/terraform_1857918131.sh": wait: remote command exited without exit status or exit signal

Expected Behavior

The remote-exec provisioner should complete without any errors, since the script being executed ends with a proper exit code.

Actual Behavior

The remote-exec script run via the above configuration fails with the error shown above.
When we check the script's execution log, the script ran fine and ended with a proper exit code, yet remote-exec still reports a failure.

We are able to reproduce this with a simple script like the one below (though not consistently). We cannot understand the reason for this error. Please help.

#! /bin/bash
exec > >(tee /var/log/test.log|logger -t test -s 2>/dev/console) 2>&1

sleep 10m

exit 0

Steps to Reproduce

  1. terraform init
  2. terraform apply
Labels: bug, provisioner/remote-exec, v0.11, waiting for reproduction

All 8 comments

This could be due to general SSH keepalive logic. (http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/overview.html)

If there is no response from your bash script within a certain time, the connection is broken by the keepalive logic. The script continues to run on the server, but Terraform can no longer track its result.

To solve it, you can change the sshd_config file (/etc/ssh/sshd_config) on your server using the ClientAliveInterval and ClientAliveCountMax parameters.

My provider, DigitalOcean, supports cloud-init (https://cloudinit.readthedocs.io/en/latest/). I send cloud-config data in the user_data parameter (which works on DigitalOcean droplets) to change the sshd_config file during cloud initialization. For example:

#cloud-config
write_files:
  - content: |
        ...
        ...
        ...
        ClientAliveInterval 120
        ClientAliveCountMax 720
    path: /etc/ssh/sshd_config

This keeps the connection alive for 120 × 720 seconds (one day) without any activity: the server sends up to 720 empty keepalive packets to the client, one every 120 seconds. I think cloud-init is the best way to solve this problem.

If your provider does not have this feature, you can solve it with provisioners. For example:

resource "<YOUR_PROVIDER_OR_NULL_RESOURCE>" "<RESOURCE_NAME>" {
    ...
    ...
    ...

    provisioner "file" {
        destination = "/etc/ssh/sshd_config"
        source      = "<YOUR_SSHD_CONFIG_FILE_PATH>"
    }

    provisioner "remote-exec" {
        inline = [
            "systemctl restart sshd", # This works on CentOS; if you use another OS, adjust this line.
        ]
    }
}

resource "null_resource" "<RESOURCE_NAME>" {
    connection {
        ...
        ...
        ...
    }

    provisioner "remote-exec" {
        inline = [
             # Test commands; replace them with your own.
             "sleep 10m",
             "echo COMPLETED"
        ]
    }
}

I hope this solves your problem.

So I'm assuming I'd need to make the change on the server in a separate, short remote-exec step in order to prepare for my complicated provisioning script, which would then run in the next remote-exec session. A sketch of that two-step setup follows.
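
A minimal sketch of that two-step approach, under some assumptions not in the thread: the prep resource name is made up for illustration, the keepalive values are appended with sudo tee -a rather than replacing the whole sshd_config, and the host is assumed to be systemd-based with passwordless sudo (the service is named sshd on CentOS/RHEL and ssh on Debian/Ubuntu):

resource "null_resource" "sshd_keepalive_prep" {
    count = "1"

    connection {
        type        = "ssh"
        user        = "povijayan"
        private_key = "${file("/Users/povijayan/.ssh/id_rsa")}"
        host        = "${element(var.hosts_0, count.index)}"
    }

    # Short remote-exec that only raises the keepalive limits and restarts sshd.
    provisioner "remote-exec" {
        inline = [
            "echo 'ClientAliveInterval 120' | sudo tee -a /etc/ssh/sshd_config",
            "echo 'ClientAliveCountMax 720' | sudo tee -a /etc/ssh/sshd_config",
            "sudo systemctl restart sshd"
        ]
    }
}

resource "null_resource" "pr13_remote_exec_0" {
    count      = "1"
    # Run the long provisioning only after the keepalive settings are in place.
    depends_on = ["null_resource.sshd_keepalive_prep"]

    # ... file and remote-exec provisioners as in the original configuration ...
}

On typical sshd service units, restarting the daemon does not drop already-established sessions, so the short prep remote-exec itself should complete normally.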

If I'm understanding the history on this correctly, this appears to be due to the lack of SSH keepalive in 0.11.x. Keepalive has since been added, and I don't think this should happen anymore. If anyone is still seeing this behavior, please leave a note here, ideally with a reproduction case on 0.13.x or the 0.14.0 pre-releases. Otherwise, I'll close this around the second 0.14 beta and consider it resolved.

I am still seeing this issue with RedHat 7.

I do. Ubuntu 20.04.1 LTS (Focal Fossa)

I too face the issue

Hello,

I am also facing the same issue on Ubuntu 18. Can you please look into it ASAP?

Also experiencing this issue on Ubuntu 20.04.
In my case it seems that a combination of docker run and docker exec as the last two commands in remote-exec is the cause. If I remove the docker exec command, the issue does not happen, which makes the keepalive behavior mentioned above sound like the culprit.
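
For reference, the pattern described is roughly the following; the image, container name, and script path are placeholders, not taken from the report:

  provisioner "remote-exec" {
    inline = [
      # Start a container, then run a command inside it as the final step.
      "sudo docker run -d --name app example/image:latest",
      "sudo docker exec app /opt/app/setup.sh"
    ]
  }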

In case anyone still has issues after the changes to keepalive: I noticed that explicitly exiting with status 0 at the end of the remote-exec block also works. I'm not sure whether it might cause false positives in some cases, though.

  provisioner "remote-exec" {
    inline = [
      ....
      "exit 0"
    ]
  }

EDIT: @ersoyfilinte's solution of setting ClientAliveInterval and ClientAliveCountMax through cloud-config worked for me!
