When attempting to build an image on Triton using the ansible-local provisioner, any long-running operation, such as installing some packages or fetching a (kind of) large Git repository, eventually causes the build to fail with the following error:
==> Some builds didn't complete successfully and had errors:
--> triton: Error executing Ansible: Non-zero exit status: 2300218
==> Builds finished but no artifacts were created.
I'm able to consistently reproduce this issue across several of my Ansible roles, with various long-running operations causing the same behaviour. Adding debugging flags (PACKER_LOG for Packer, and -vvvv for Ansible via extra_arguments) doesn't reveal any helpful information at all.
There's nothing to indicate what is causing this behaviour.
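For reference, this is roughly how I'm wiring those flags in (the playbook path here is only illustrative):
PACKER_LOG=1 packer build template.json
{
  "type": "ansible-local",
  "playbook_file": "provision/playbook.yml",
  "extra_arguments": ["-vvvv"]
}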
If I use the -debug flag for Packer, so that there is a pause between operations, I'm able to SSH into the system I'm deploying and run the exact operation Ansible is performing without failure.
Furthermore, what's strange is that the operations actually seem to be successful. For instance: cloning a Git repository will cause a failure in Packer with exit code 2300218, but when inspecting the local repository contents before the machine is torn down, it looks like it has managed to fully clone the repo.
On top of this... I was actually able to get an image to build successfully for one of my Ansible roles, by simply re-executing the packer build in a loop, until it randomly decided to succeed.
From my cmacrae.sonarr role:

- name: Ensure Sonarr dependencies are installed
  package: name={{ sonarr_dependencies }} state=present

The variable sonarr_dependencies is equal to:
- mono-devel
Only one package, but lots of dependencies.
From my cmacrae.couchpotato role:

- name: Fetch CouchPotato source code
  git:
    repo: "{{ couchpotato_clone_uri }}"
    dest: "{{ couchpotato_user_home }}/src"
    update: false
    accept_hostkey: true
  become: true
  become_user: "{{ couchpotato_user_name }}"
The above variable values:
couchpotato_user_home: /var/lib/{{ couchpotato_user_name }}
couchpotato_clone_uri: 'git://github.com/RuudBurger/CouchPotatoServer'
couchpotato_user_name: couchpotato
Packer version: v0.12.3
Build command: PACKER_LOG=1 packer build template.json
The exit code is a magic number signifying that SSH was disconnected without returning an exit code. Do you have any firewall in between that disconnects idle TCP connections? Or does Joyent have such? I guess it works when you run in debug mode since you keep the connection alive.
@rickard-von-essen Ah, I see. Nope, no firewall, physical or virtual. The firewall on the Triton instances is turned off by default when provisioning with Packer. It's strange how it seems to be during network operations - d'you think that could perhaps be the cause, somehow? I don't think it's anywhere near enough traffic to saturate the link, though
Anything I can do to help get investigation moving on this? Debugging steps, etc?
@cmacrae Can you please try the patch here? https://github.com/hashicorp/packer/pull/4809
@rickard-von-essen I think we have to either allow ansible and all other provisioners to disconnect without error, or add expect_disconnect to all of them like we do with the shell provisioner.
I'm in favor of the first option because I don't want to directly support them, so I'd rather put the behavior back to the way it was before introducing the disconnect code.
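For context, here's roughly what that existing option looks like on a shell provisioner in a template (the inline command is only an example):
{
  "type": "shell",
  "inline": ["sudo shutdown -r now"],
  "expect_disconnect": true
}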
@mwhooker Yep, so, #4809 is just acting the same: the operations never complete, as described above. Except now, rather than simply exiting with an error, it echoes "Remote end disconnected while executing Ansible. Ignoring" then proceeds with packing up the image.
Not ideal, because now I'm left with half baked images that don't work because some operation failed to complete, but an image was produced anyway.
hmm, okay. I'm not sure that's a bug with packer... We can't pick up where a script left off if we get disconnected in the middle of executing it.
I wonder if this is an idle disconnect. Let me see if there's any configuration we expose. If you could get the remote command to write to stdout every 30s or so, I get the feeling that might fix it.
edit: Looks like we set a keepalive of 5 seconds
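Something along these lines is the shape of what I mean, purely as a sketch of the idea (I realize it's not obvious where you'd hook this into the ansible-local provisioner):
# hypothetical wrapper: emit a heartbeat on stdout every 30s while the
# long-running command does its work, so the SSH channel isn't silent
( while true; do echo 'still working...'; sleep 30; done ) &
heartbeat_pid=$!
ansible-playbook -i inventory playbook.yml   # stand-in for the long-running step
kill "$heartbeat_pid"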
@mwhooker Sure, absolutely. Just really have no idea why it's happening with Packer specifically, though. I'm able to launch a container with the exact same characteristics, provision the system in the exact same way, all without being disconnected. So, I figured that for some reason it was something Packer was doing.
Hmm, okay, not sure how I'll work something like that out. I'm using the ansible-local provisioner.
Is Packer capable of running parallel provisioners? If so, I could just chuck a while loop into a separate provisioner; otherwise, I'm not sure how I'd achieve that while the ansible-local provisioner is running.
Since this is ansible-local, I don't think it has anything to do with the network operations themselves. I rather suspect that there is a long-running operation which causes no network traffic, and ssh times out.
It would be interesting if we could reproduce that in an isolated case.
So, to take both your points above into account: something interesting I've observed is that if I put a little shell provisioner in before the ansible-local provisioner does its thing, and have it install a package that the Ansible role/playbook was going to install (and that was causing this issue), the package install works; then, since the package is already installed by the time Ansible gets to it, Ansible only takes a second to check before continuing.
A big difference here is that a package install from most distributions' package managers will print verbose feedback while the operation is taking place, whereas Ansible will simply echo "Doing X" and then just sit there until it's finished that operation.
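In template terms, the workaround I described above looks roughly like this (the package name is from my Sonarr example, and the package manager would need adjusting for the target OS):
{
  "provisioners": [
    {
      "type": "shell",
      "inline": ["sudo yum install -y mono-devel"]
    },
    {
      "type": "ansible-local",
      "playbook_file": "playbook.yml"
    }
  ]
}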
Is there any chance you could send me a tarball that I can extract and just run exactly what you have? I'd like to be able to reproduce this issue myself but I don't have any existing ansible playbooks
@mwhooker Sure! What d'you need in the tarball? Just the Ansible stuff?
everything I need so I can just run packer build. I don't need anything for the builder, I'll probably use my own. Everything required for the provisioners, though.
@mwhooker Chucked it into a gist. Really all you need here is the playbook and a look at the Packer manifest. I included the variables file for the sake of it, obviously with most of the information redacted.
Will this suffice? Or do you need anything else?
Unfortunately I'm not able to reproduce this issue on Amazon:
{
"builders": [
{
"access_key": "{{user `aws_access_key`}}",
"ami_name": "packer-qs-{{timestamp}}",
"instance_type": "t2.micro",
"region": "us-east-1",
"secret_key": "{{user `aws_secret_key`}}",
"source_ami": "ami-80861296",
"ssh_username": "ubuntu",
"type": "amazon-ebs"
}
],
"provisioners": [
{
"inline": [
"sudo apt-get install -y ansible"
],
"type": "shell"
},
{
"groups": "plex_servers",
"playbook_file": "4623.yml",
"type": "ansible"
}
],
"variables": {
"aws_access_key": "{{env `AWS_ACCESS_KEY_ID`}}",
"aws_secret_key": "{{env `AWS_SECRET_ACCESS_KEY`}}",
"plex_source_machine_img": "8879c758-c0da-11e6-9e4b-93e32a67e805",
"plex_source_machine_name": "packer-plex-provision-{{timestamp}}",
"plex_source_machine_pkg": "b7ea1559-b600-ef40-afd1-8e6b8375a0de",
"region": "{{env `AWS_DEFAULT_REGION`}}"
}
}
---
- hosts: plex_servers
vars:
plex_pkg_url: 'https://downloads.plex.tv/plex-media-server/1.4.4.3495-edef59192/plexmediaserver_1.4.4.3495-edef59192_amd64.deb'
plex_pkg_sha: ba05818febf0267c04151d9a243f2eff0823310a1c89b58d8be4608f4f7a7402
tasks:
- name: Fetch the Plex package file
get_url:
url: "{{ plex_pkg_url }}"
sha256sum: "{{ plex_pkg_sha }}"
dest: '/home/ubuntu/plexmediaserver.deb'
- name: Ensure Plex is installed
apt:
deb: /home/ubuntu/plexmediaserver.deb
state: present
register: plex_install
become: true
- name: Pause forever
pause: minutes=3
- name: Restart Plex
service:
name: plexmediaserver
state: restarted
when: plex_install|changed
become: true
- name: Ensure the Plex service is started/enabled
service:
name: plexmediaserver
state: started
enabled: true
become: true
@mwhooker Hmm, okay - so it works for you when using AWS? If so, I guess this issue should be closed and I should seek help from the Joyent community/engineers.
Let's leave this open for now, but it might be a good idea to do as you say. I'll see if I can create a joyent account in the meantime. It's still possible there's something we can do in packer to prevent this
So I came across the same issue with the chef-local provisioner - long-running processes would be interrupted by sshd disconnecting the client. What was somewhat frustrating was that test-kitchen and its EC2 plugin would be able to converge an AWS node just fine.
Ultimately, I was able to solve this problem by running a user-data script that changed the ClientAliveInterval, ClientAliveCountMax, and TCPKeepAlive settings in sshd_config and then restarted ssh before Chef took over the box. The same can probably be accomplished with the shell provisioner.
#!/bin/bash
# These SSH configuration values are set when the server comes up so that Packer can
# maintain a hanging, trafficless SSH connection. They're reverted by the ssh recipe.
#
sed -i -e '/Defaults requiretty/{ s/.*/# Defaults requiretty/ }' /etc/sudoers
sed -i -e '/ClientAliveInterval 300/{ s/.*/ClientAliveInterval 1000/ }' /etc/ssh/sshd_config
sed -i -e '/ClientAliveCountMax 0/{ s/.*/ClientAliveCountMax 3/ }' /etc/ssh/sshd_config
sed -i -e '/#TCPKeepAlive yes/{ s/.*/TCPKeepAlive yes/ }' /etc/ssh/sshd_config
service sshd restart
Part of the problem was that I was using a CIS hardened AMI from the marketplace that had a more tightly controlled ssh setup.
Closing this because it looks like the solution was to change the keepalive for sshd, rather than a problem within Packer.
I'm going to lock this issue because it has been closed for _30 days_ ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.