When attempting to build an image on Triton using the ansible-local provisioner, any long-running operation, such as installing some packages or fetching a (kind of) large Git repository, eventually causes the build to fail with the following error:
==> Some builds didn't complete successfully and had errors:
--> triton: Error executing Ansible: Non-zero exit status: 2300218
==> Builds finished but no artifacts were created.
I'm able to consistently reproduce this issue across several of my Ansible roles, with various long-running operations causing the same behaviour. Adding debugging flags (PACKER_LOG for Packer, and -vvvv for Ansible via extra_arguments) doesn't reveal any helpful information at all.
There's nothing to indicate what is causing this behaviour.
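For reference, this is roughly how I'm wiring those flags in (the playbook path here is only illustrative):
PACKER_LOG=1 packer build template.json
{
  "type": "ansible-local",
  "playbook_file": "provision/playbook.yml",
  "extra_arguments": ["-vvvv"]
}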
If I use the -debug flag for Packer, so that there is a pause between operations, I'm able to SSH into the system I'm deploying and run the exact operation Ansible is performing without failure.
Furthermore, what's strange is that the operations actually seem to be successful. For instance: cloning a Git repository will cause a failure in Packer with exit code 2300218, but when inspecting the local repository contents before the machine is torn down, it looks like it has managed to fully clone the repo.
On top of this... I was actually able to get an image to build successfully for one of my Ansible roles, by simply re-executing the packer build in a loop, until it randomly decided to succeed.
From my cmacrae.sonarr role:

- name: Ensure Sonarr dependencies are installed
  package: name={{ sonarr_dependencies }} state=present

The variable sonarr_dependencies is equal to:
- mono-devel
Only one package, but lots of dependencies.
From my cmacrae.couchpotato role:

- name: Fetch CouchPotato source code
  git:
    repo: "{{ couchpotato_clone_uri }}"
    dest: "{{ couchpotato_user_home }}/src"
    update: false
    accept_hostkey: true
  become: true
  become_user: "{{ couchpotato_user_name }}"
The above variable values:
couchpotato_user_home: /var/lib/{{ couchpotato_user_name }}
couchpotato_clone_uri: 'git://github.com/RuudBurger/CouchPotatoServer'
couchpotato_user_name: couchpotato
Packer version: v0.12.3
Build command: PACKER_LOG=1 packer build template.json
The exit code is a magic number signifying that SSH was disconnected without returning an exit code. Do you have any firewall in between that disconnects idle TCP connections? Or does Joyent have such? I guess it works when you run in debug mode since you keep the connection alive.
@rickard-von-essen Ah, I see. Nope, no firewall, physical or virtual. The firewall on the Triton instances is turned off by default when provisioning with Packer. It's strange how it seems to be during network operations - d'you think that could perhaps be the cause, somehow? I don't think it's anywhere near enough traffic to saturate the link, though
Anything I can do to help get investigation moving on this? Debugging steps, etc?
@cmacrae Can you please try the patch here? https://github.com/hashicorp/packer/pull/4809
@rickard-von-essen I think we have to either allow ansible and all other provisioners to disconnect without error, or add expect_disconnect to all of them like we do with the shell provisioner.
I'm in favor of the first option because I don't want to directly support them, so I'd rather put the behavior back to the way it was before introducing the disconnect code.
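For context, here's roughly what that existing option looks like on a shell provisioner in a template (the inline command is only an example):
{
  "type": "shell",
  "inline": ["sudo shutdown -r now"],
  "expect_disconnect": true
}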
@mwhooker Yep, so, #4809 is just acting the same: the operations never complete, as described above. Except now, rather than simply exiting with an error, it echoes "Remote end disconnected while executing Ansible. Ignoring" then proceeds with packing up the image.
Not ideal, because now I'm left with half baked images that don't work because some operation failed to complete, but an image was produced anyway.
hmm, okay. I'm not sure that's a bug with packer... We can't pick up where a script left off if we get disconnected in the middle of executing it.
I wonder if this is an idle disconnect. Let me see if there's any configuration we expose. If you could get the remote command to write to stdout every 30s or so, I get the feeling that might fix it.
edit: Looks like we set a keepalive of 5 seconds
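Something along these lines is the shape of what I mean, purely as a sketch of the idea (I realize it's not obvious where you'd hook this into the ansible-local provisioner):
# hypothetical wrapper: emit a heartbeat on stdout every 30s while the
# long-running command does its work, so the SSH channel isn't silent
( while true; do echo 'still working...'; sleep 30; done ) &
heartbeat_pid=$!
ansible-playbook -i inventory playbook.yml   # stand-in for the long-running step
kill "$heartbeat_pid"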
@mwhooker Sure, absolutely. Just really have no idea why it's happening with Packer specifically, though. I'm able to launch a container with the exact same characteristics, provision the system in the exact same way, all without being disconnected. So, I figured that for some reason it was something Packer was doing.
Hmm, okay, not sure how I'll work something like that out. I'm using the ansible-local provisioner.
Is Packer capable of running parallel provisioners? If so, I could just chuck a while loop into a separate provisioner; otherwise, I'm not sure how I'd achieve that while the ansible-local provisioner is running.
Since this is ansible-local, I don't think it has anything to do with the network operations themselves. I rather suspect that there is a long-running operation which causes no network traffic, and ssh times out.
It would be interesting if we could reproduce that in an isolated case.
So, to take both your points above into account: something interesting I've observed is that if I put a little shell provisioner in before the ansible-local provisioner does its thing, and have it install a package that the Ansible role/playbook was going to install (and that was causing this issue), the package install works; then, since the package is already installed by the time Ansible gets to it, Ansible only takes a second to check before continuing.
A big difference here is that a package install from most distributions' package managers will print verbose feedback while the operation is taking place, whereas Ansible will simply echo "Doing X" and then just sit there until it's finished that operation.
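In template terms, the workaround I described above looks roughly like this (the package name is from my Sonarr example, and the package manager would need adjusting for the target OS):
{
  "provisioners": [
    {
      "type": "shell",
      "inline": ["sudo yum install -y mono-devel"]
    },
    {
      "type": "ansible-local",
      "playbook_file": "playbook.yml"
    }
  ]
}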
Is there any chance you could send me a tarball that I can extract and just run exactly what you have? I'd like to be able to reproduce this issue myself but I don't have any existing ansible playbooks
@mwhooker Sure! What d'you need in the tarball? Just the Ansible stuff?
everything I need so I can just run packer build. I don't need anything for the builder, I'll probably use my own. Everything required for the provisioners, though.
@mwhooker Chucked it into a gist. Really all you need here is the playbook and a look at the Packer manifest. I included the variables file for the sake of it, obviously with most of the information redacted.
Will this suffice? Or do you need anything else?
Unfortunately I'm not able to reproduce this issue on Amazon:
{
"builders": [
{
"access_key": "{{user `aws_access_key`}}",
"ami_name": "packer-qs-{{timestamp}}",
"instance_type": "t2.micro",
"region": "us-east-1",
"secret_key": "{{user `aws_secret_key`}}",
"source_ami": "ami-80861296",
"ssh_username": "ubuntu",
"type": "amazon-ebs"
}
],
"provisioners": [
{
"inline": [
"sudo apt-get install -y ansible"
],
"type": "shell"
},
{
"groups": "plex_servers",
"playbook_file": "4623.yml",
"type": "ansible"
}
],
"variables": {
"aws_access_key": "{{env `AWS_ACCESS_KEY_ID`}}",
"aws_secret_key": "{{env `AWS_SECRET_ACCESS_KEY`}}",
"plex_source_machine_img": "8879c758-c0da-11e6-9e4b-93e32a67e805",
"plex_source_machine_name": "packer-plex-provision-{{timestamp}}",
"plex_source_machine_pkg": "b7ea1559-b600-ef40-afd1-8e6b8375a0de",
"region": "{{env `AWS_DEFAULT_REGION`}}"
}
}
---
- hosts: plex_servers
vars:
plex_pkg_url: 'https://downloads.plex.tv/plex-media-server/1.4.4.3495-edef59192/plexmediaserver_1.4.4.3495-edef59192_amd64.deb'
plex_pkg_sha: ba05818febf0267c04151d9a243f2eff0823310a1c89b58d8be4608f4f7a7402
tasks:
- name: Fetch the Plex package file
get_url:
url: "{{ plex_pkg_url }}"
sha256sum: "{{ plex_pkg_sha }}"
dest: '/home/ubuntu/plexmediaserver.deb'
- name: Ensure Plex is installed
apt:
deb: /home/ubuntu/plexmediaserver.deb
state: present
register: plex_install
become: true
- name: Pause forever
pause: minutes=3
- name: Restart Plex
service:
name: plexmediaserver
state: restarted
when: plex_install|changed
become: true
- name: Ensure the Plex service is started/enabled
service:
name: plexmediaserver
state: started
enabled: true
become: true
@mwhooker Hmm, okay - so it works for you when using AWS? If so, I guess this issue should be closed and I should seek help from the Joyent community/engineers.
Let's leave this open for now, but it might be a good idea to do as you say. I'll see if I can create a joyent account in the meantime. It's still possible there's something we can do in packer to prevent this
So I came across the same issue with the chef-local provisioner - long-running processes would be interrupted by sshd disconnecting the client. What was somewhat frustrating was that test-kitchen and its EC2 plugin would be able to converge an AWS node just fine.
Ultimately, I was able to solve this problem by running a user-data script that changed the ClientAliveInterval, ClientAliveCountMax, and TCPKeepAlive settings in sshd_config and then restarted ssh before Chef took over the box. The same can probably be accomplished with the shell provisioner.
#!/bin/bash
# These SSH configuration values are set when the server comes up so that Packer can
# maintain a hanging, trafficless SSH connection. They're reverted by the ssh recipe.
#
sed -i -e '/Defaults requiretty/{ s/.*/# Defaults requiretty/ }' /etc/sudoers
sed -i -e '/ClientAliveInterval 300/{ s/.*/ClientAliveInterval 1000/ }' /etc/ssh/sshd_config
sed -i -e '/ClientAliveCountMax 0/{ s/.*/ClientAliveCountMax 3/ }' /etc/ssh/sshd_config
sed -i -e '/#TCPKeepAlive yes/{ s/.*/TCPKeepAlive yes/ }' /etc/ssh/sshd_config
service sshd restart
Part of the problem was that I was using a CIS hardened AMI from the marketplace that had a more tightly controlled ssh setup.
Closing this because it looks like the solution was to change the keepalive for sshd, rather than a problem within Packer.
I'm going to lock this issue because it has been closed for _30 days_ ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.