After upgrading from Ansible 1.9.4 to 2.0.0.1, running a playbook intermittently fails with "failed to resolve remote temporary directory from".
$ ansible-playbook playlooks/playlook-filters.yml
PLAY ***************************************************************************
TASK [command] *****************************************************************
fatal: [YT_8_22]: FAILED! => {"failed": true, "msg": "ERROR! failed to resolve remote temporary directory from ansible-tmp-1452759681.1-95441304852350: `( umask 22 && mkdir -p \"$( echo /tmp/.ansible/tmp/ansible-tmp-1452759681.1-95441304852350 )\" && echo \"$( echo /tmp/.ansible/tmp/ansible-tmp-1452759681.1-95441304852350 )\" )` returned empty string"}
...ignoring
TASK [debug] *******************************************************************
ok: [YT_8_22] => {
"msg": "it failed"
}
PLAY RECAP *********************************************************************
YT_8_22 : ok=2 changed=0 unreachable=0 failed=0
This happens randomly for one of the tasks. The playbook is:
---
- hosts: YT_8_22
  tasks:
    - shell: /bin/true
      register: result
      ignore_errors: True
    - debug: msg="it failed"
      when: result|failed
On 1.9.4 I never saw this behaviour. Can anybody tell me what is going on?
I found another link to a similar problem: https://groups.google.com/forum/#!searchin/ansible-project/Intermittent$20error%7Csort:relevance/ansible-project/FyK6au2O9KY/tWuf31P9AQAJ
Hi there,
I also caught this with the -vvvv flag.
<45.56.96.138> ESTABLISH SSH CONNECTION FOR USER: root
<45.56.96.138> SSH: EXEC ssh -C -vvv -o ControlMaster=auto -o ControlPersist=60s -o 'IdentityFile="/home/riversy/.ssh/id_rsa"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o ControlPath=/home/riversy/.ansible/cp/ansible-ssh-%h-%p-%r -tt 45.56.96.138 'mkdir -p "$( echo $HOME/.ansible/tmp/ansible-tmp-1453707878.36-19081627291623 )" && echo "$( echo $HOME/.ansible/tmp/ansible-tmp-1453707878.36-19081627291623 )"'
fatal: [45.56.96.138]: FAILED! => {"failed": true, "msg": "ERROR! failed to resolve remote temporary directory from ansible-tmp-1453707878.36-19081627291623: `mkdir -p \"$( echo $HOME/.ansible/tmp/ansible-tmp-1453707878.36-19081627291623 )\" && echo \"$( echo $HOME/.ansible/tmp/ansible-tmp-1453707878.36-19081627291623 )\"` returned empty string"}
My version of ansible is:
ansible 2.0.0.2
config file = /etc/ansible/ansible.cfg
configured module search path = Default w/o overrides
I'd appreciate any advice.
Update: it sometimes happens even with a plain ansible all -i inventory/test -m ping command.
I have noticed that if I run a playbook back to back, I generally get success, i.e.:
Run #1:
fatal: [1.2.3.4]: FAILED! => {"failed": true, "msg": "failed to resolve remote temporary directory from ansible-tmp-1453754269.23-53039386122234: ( umask 22 && mkdir -p \"$( echo $HOME/.ansible/tmp/ansible-tmp-1453754269.23-53039386122234 )\" && echo \"$( echo $HOME/.ansible/tmp/ansible-tmp-1453754269.23-53039386122234 )\" ) returned empty string"}
Run #2: Success
However, after waiting about a minute the failure occurs again. Monitoring the remote machine, I do not see the ~/.ansible/tmp/ directory.
I also started experiencing the error after upgrading from Ansible 1.9.4 to 2.0.0.2.
Doubt it matters, but just in case: it's a CentOS 7.1 VPS hosted on Linode.
I encountered this on random shell/command tasks in a playbook against a CentOS VPS on Amazon. It's our only CentOS 7.2 machine; the 20+ other Amazon AMI servers haven't shown the issue so far.
I tried 2.0.0, 2.0.0.2, 2.1, pipelining on/off, ControlPersist on/off, and running from OS X / an Amazon AMI. I also tried removing the .ansible files from the target node, with no effect.
This might not be related, but I ended up resolving the issue for now by restarting the machine. I had noticed logs like the ones below mentioning a corrupted btmp log, and executed 'cat /dev/null > /var/log/btmp', followed by a restart. The btmp thing might not be related, but it was logged on every Ansible command/connection. I had moved my /var partition to a different disk the day before; it's possible it became corrupted from that.
sshd[24031]: pam_unix(sshd:session): session closed for user centos
sshd[24248]: Accepted publickey for centos from port 45694 ssh2:
sshd[24248]: pam_unix(sshd:session): session opened for user centos by (uid=0)
sshd[24248]: pam_lastlog(sshd:session): corruption detected in /var/log/btmp
sshd[24251]: Received disconnect from : 11: disconnected by user
@mgaley based on your feedback I looked into the ssh connection to my client and found something interesting.
I am using RHEL 6.6 for both source and destination machines and Ansible 2.0.0.2
Remoteserver#> tail -f /var/log/secure
Jan 27 07:47:48 Remoteserver sshd[19329]: Accepted publickey for jenkins from Ansiblehost port 41384 ssh2
Jan 27 07:47:48 Remoteserver sshd[19329]: pam_unix(sshd:session): session opened for user jenkins by (uid=0)
<< --- >>
Ansibleserver: ERROR! failed to resolve remote temporary directory
SSH Connection remains established by the Ansible PID (but not the original pid that opens it, I assume this is expected)
Running again within the 60s window where the connection remains established produced my above result of success
"ControlPersist=60s"
<<-->>
Retry playbook (previous Established SSH session is used)
Jan 27 07:48:11 Remoteserver sshd[19333]: subsystem request for sftp
Jan 27 07:48:12 Remoteserver sshd[19333]: subsystem request for sftp
Jan 27 07:49:13 Remoteserver sshd[19333]: Received disconnect from Ansiblehost: 11: disconnected by user
Jan 27 07:49:13 Remoteserver sshd[19329]: pam_unix(sshd:session): session closed for user jenkins
Playbook completes successfully.
I see the same issue as well running
ansible 2.1.0 (devel 4b1d621442) last updated 2016/01/28 19:20:32 (GMT -400)
I also had the same issue when using the same playbook against 5 amazon ami servers in 5 separate blocks of a play, only against the first node though. Didn't see anything related in the secure log this time.
For me it is only related to shell/command modules; running multiple playbooks at the same time locally, even against the same remote hosts, didn't make the error happen any more often.
We get this as well. Interesting to note https://groups.google.com/forum/#!msg/ansible-devel/5i6VDHKZ30I/0ksVpEEICwAJ the possibility of a race condition in dir creation?
I'm seeing this too. It appears to happen randomly on arbitrary tasks in a playbook, always regarding this tmp file stuff. I just re-run the playbook until it works. I am using a network filesystem (rackspace block storage).
Happens to me as well. Truly random it seems.
There are two issues here.
I suspect it has something to do with how the mkdtemp function in https://github.com/ansible/ansible/blob/42e312d3bd0516ceaf2b4533ac643bd9e05163cd/lib/ansible/plugins/shell/sh.py works.
Does anyone feel I'm completely on the wrong track?
+1 getting the same issue. Really random
I'm getting the same, using archlinux vagrant controller and archlinux vagrant targets.
All boxes are on ansible v 2.0.0.2
This is caused by a bug in the Linux kernel or in openssh, not sure which one is at fault.
https://bugzilla.mindrot.org/show_bug.cgi?id=2492
https://lkml.org/lkml/2015/12/9/775
One workaround is to downgrade the linux kernel on the server to a version that does not include commit 1a48632ffed61352a7810ce089dc5a8bcd505a60. For Ubuntu 14.04 this is e.g. 3.16.0-60, or 3.19.0-22 or earlier.
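For Ubuntu 14.04, that downgrade might look roughly like the following sketch (the exact package names depend on your HWE stack, so treat them as assumptions):
# Install and boot one of the pre-regression kernels mentioned above
# (Ubuntu package naming assumed; verify with "apt-cache search linux-image" first).
sudo apt-get install linux-image-3.16.0-60-generic
sudo reboot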
I am running kernel 2.6.32-504 (RHEL 6.6), and this bug only occurs with Ansible 2.x, not previous versions.
I'm using kernel 4.3.3-3-ARCH and am not interested in going back to 3.x kernels on these machines.
Another workaround would be using a transport other than OpenSSH (paramiko?).
I'm going to close this ticket as Ansible itself is not doing anything wrong here, this is just a bad interaction between tools we depend on.
Why did it only start appearing after folks switched from Ansible 1.9.x to Ansible 2.0.x?
Not disputing that it's an issue with the underlying tools, just genuinely curious why it wasn't an issue before.
I'm not sure how this is considered resolved... @rpsiv confirmed it happens with older kernels as well, so not a kernel bug. It didn't happen for anyone before Ansible 2.0.x and now it's widespread.
between tools we depend on
Which are these tools that you're referring to @bcoca ? I'd be happy to find the root cause and help it get fixed by working with other open source teams, however at this point it's not clear to me where should I (we) begin? Thanks.
@bcoca : ping?
Things work perfectly in 1.9.4.
If you do identify the source of the problem, please consider adding a workaround in Ansible 2.x.
The fact this happens only once on each server is a clue I think.
@khushil it has happened multiple times on some of our servers, even rebooting them etc. didn't help. Sometimes the issue goes away suddenly but sometimes no matter what we do it persists. It's really annoying :-/
+1000.
I get this error every time (on random server) only with parallel execution (5, default):
ansible-playbook -i hosts -f 5 base_packages.yml
but never with (-f 1):
ansible-playbook -i hosts -f 1 base_packages.yml
@TomasCrhonek I just tried with forks=1 and bumped into the same issue.
It happens also when limited (-l) to only one host (but forks= at default).
OK, after hundreds of runs I finally got this error with -f 1. It's very rare.
(Debian Testing up to date, ansible 2.0.0.2)
Anyone got a workaround? With docker module it happens all the time :(
I've rolled back to 1.9; I didn't find another way.
I had the same issue with RHEL 6.6 for both source and destination machines and Ansible 2.0.0.2.
But when I deactivate the SSH ControlMaster in ansible.cfg, it seems to resolve the issue.
[ssh_connection]
ssh_args = -o ControlMaster=no
Tell me if it works for you too...
I went back to ansible 2.0.0.2 and tried to reproduce the fix from @vincentdelaunay.
But I no longer see this error here. It looks like the issue was fixed by some other package update over time.
My system is Ubuntu 15.10
OpenSSH_6.9p1 Ubuntu-2ubuntu0.1, OpenSSL 1.0.2d 9 Jul 2015
Kernel is 4.2.0-27-generic
Hope it will help.
@vincentdelaunay I tried deactivating the SSH ControlMaster in ansible.cfg on ansible 2.0.0.2, and I no longer see this error either.
But in the same environment, ansible 1.9.4 does not need ControlMaster deactivated in ansible.cfg. I used -vvvv to compare the SSH parameters used during execution by 1.9.4 and 2.0.0.2 and found no differences.
Does anyone know the reason for the failure?
I see this error intermittently even with Paramiko. The behavior seems to be that each task is likely to fail from this error with small (~1-2%) probability, but for a playbook with hundreds of tasks it is certain to occur.
Setting ssh_args = -o ControlMaster=no didn't help.
Server linux kernel 4.4.0.
However, I don't believe it is really an Ansible bug. The following bash script, isolated from Ansible's access pattern, reliably reproduces for me:
#!/bin/bash
set -e
HOST=YOUR_REMOTE_HOST_HERE
for i in `seq 1 100`; do
  # Run the same forced-tty mkdir/echo command Ansible uses to create its remote temp dir.
  TEMP=`ssh -C -o ControlMaster=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -tt $HOST 'mkdir -p "$( echo $HOME/.ansible/tmp/ansible-tmp-1456012812.4-96487876295189 )" && echo "$( echo $HOME/.ansible/tmp/ansible-tmp-1456012812.4-96487876295189 )"'`
  WC=`echo "$TEMP" | wc -c | tr -d ' '`
  echo "$TEMP"
  echo "$WC"
  # 60 is the byte count of the expected path (plus line ending) for this host; adjust for your $HOME.
  if [ "$WC" != "60" ]; then
    echo "error in $TEMP"
    exit 1
  fi
done
For some reason, some of these SSH connections mysteriously don't have any output.
I have traced this down to the SSH flag -tt.
The following patch completely resolves the problem for me:
diff -r /Users/drew/Code/ansible-old/plugins/connection/ssh.py /Library/Python/2.7/site-packages/ansible/plugins/connection/ssh.py
572,575c572
<         if in_data:
<             cmd = self._build_command('ssh', self.host, cmd)
<         else:
<             cmd = self._build_command('ssh', '-tt', self.host, cmd)
---
>         cmd = self._build_command('ssh', self.host, cmd)
691c688
<         self._connected = False
\ No newline at end of file
---
>         self._connected = False
Reading through the history here, it seems this flag is used to work around some complex problem involving sudoer configuration on some hosts.
I don't suffer from whatever problem this flag is here to solve, but I do suffer from the problem it creates, so applying this patch is a very effective workaround for me.
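For a quick check before patching, here is a hedged variant of the loop above with the -tt flag dropped (the host and temp path are placeholders, not the exact command Ansible builds):
# Re-run the temp-dir command without forcing a pty; if the empty-output failures
# disappear, the forced tty allocation is the trigger. $HOST is a placeholder.
HOST=YOUR_REMOTE_HOST_HERE
for i in `seq 1 100`; do
  OUT=`ssh -C -o ControlMaster=no -o User=root -o ConnectTimeout=10 $HOST 'mkdir -p "$( echo $HOME/.ansible/tmp/ansible-tmp-test )" && echo "$( echo $HOME/.ansible/tmp/ansible-tmp-test )"'`
  if [ -z "$OUT" ]; then echo "empty output on run $i"; exit 1; fi
done
echo "no empty output in 100 runs"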
@drewcrawford big :+1: for going down the rabbit hole on this one. Curious to hear more about what problem the unpatched code is trying to work around.
After I closed a bunch of connections to the target, Ansible stopped giving me these errors when it ran. Neither -f1 nor the full set of --ssh-extra-args disabling Control (https://groups.google.com/d/msg/ansible-project/FyK6au2O9KY/PS4yH6Y2AwAJ) seemed to have much influence.
Having the same problem since I updated some servers from kernel 3.x to kernel 4.x. I can confirm that neither -f1 nor the full set of --ssh-extra-args disabling ControlMaster helped. I patched ansible like @drewcrawford and now it works.
@drewcrawford's fix works for me too. I have some playbooks with hundreds of tasks, which virtually guaranteed some arbitrary task would fail with this error. I've been using the fix for a few days now without issue.
@drewcrawford Nice work there. Incidentally, for anyone considering it, I would stay well away from going below the 3.19.0-50-generic kernel on Ubuntu with Docker, as you'll hit a lot of issues with an aufs bug.
The bug is still there in ansible 2.0.1.0 (stable-2.0 61e9841e08)
I can confirm that patch by @drewcrawford works for me.
I applied this patch from @bcoca:
https://github.com/ansible/ansible/issues/14377#issuecomment-181909899
which solves the issue for me and also does not break sudo.
Regarding the patch mentioned directly above: while it's likely workable for most folks, it's not perfect: https://github.com/ansible/ansible/issues/14377#issuecomment-190528876
There's a lot of conflicting information in this issue. There could be multiple issues being exposed but because it's intermittent it's hard to tell for sure. According to @drewcrawford 's research, this would seem to be in some way related to pseudo terminal (tty) allocation with ssh. That may make sense with the information earlier in the bug about some bad interaction between ssh and the linux kernel tty drivers.
If removing -tt is all that's needed then people should be able to work around the problem with a config file change on the ansible side and possibly a config file change to sudo on the remote hosts. The conditional he's found in the ssh connection plugin is triggered by the pipelining setting in ansible.cfg. http://docs.ansible.com/ansible/intro_configuration.html#pipelining
So set pipelining = True in ansible.cfg and if necessary remove requiretty from your /etc/sudoers config file so that sudo can work.
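A minimal sketch of that workaround, assuming default sudoers locations (the config fragment and paths describe a typical setup, not a guaranteed one):
# Control machine: enable pipelining in ansible.cfg, i.e.
#   [ssh_connection]
#   pipelining = True
# Managed hosts: check whether sudo still demands a tty, and remove it via visudo if present.
grep -rn requiretty /etc/sudoers /etc/sudoers.d/ 2>/dev/null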
Note that there was another comment even earlier in the issue that said turning pipelining on and off didn't affect the issue at all. That's why I wonder if there are multiple issues that share the same symptom. No way to find out for sure unless more people try turning pipelining on to see whether it affects how frequently they hit this issue in any way.
The conditional he's found in the ssh connection plugin is triggered by the pipelining setting in ansible.cfg
I have had the following in my ansible.cfg since install:
[ssh_connection]
pipelining=True
It did not solve the issue. The only reliable solution in my case was to disable -tt via the patch. There must be some other way to still end up in the -tt case even with the above config file.
There could be multiple issues being exposed but because it's intermittent it's hard to tell for sure.
The bash script in this comment is not intermittent. The intermittent part is that short Ansible playbooks may not stress-test as thoroughly as that bash script does. But the script itself should be able to reliably diagnose the particular problem that I suffer from.
If someone can report they suffer from the same error without failing that bash script, then we would know there are multiple issues.
Does pipelining = True yield a lower failure rate?
If you run with pipelining = True and -vvvv do you see that the last connection before the error uses -tt or not? Uses pipelining or not?
And you are correct that only modules go through pipelining so it's possible that we're failing when executing a remote shell command rather than a module... but I'd expect that we'd see some indication of that from the answers to my previous questions.
Ah -- and probably one more question -- do you have requiretty=True in your remote sudoers? I think I tested a few months ago and it was still unsafe to remove -tt if requiretty is on but a counter example would be wonderful.
I've worked around it by having an ssh wrapper in PATH on my host. I use a makefile to start ansible-playbook, and I prepend $PWD/.ssh-hack/ to the PATH before executing ansible-playbook.
.ssh-hack/ssh is the following script:
#!/usr/bin/perl
use strict;
use warnings;
my @command;
foreach my $component (@ARGV)
{
    # Drop the forced-pty flag that Ansible adds; pass everything else through unchanged.
    next if $component eq '-tt';
    push(@command,$component);
}
exec('/usr/bin/ssh',@command);
After adding this hack this issue has gone away.
@zerodogg Reopening on the chance that we can find a code workaround to this problem. Can't make promises about the viability of that possibility until we gather more information, though.
@abadger None of my hosts have requiretty set (default Debian and Arch sudoers). Perhaps, if removing -tt by default isn't feasible, it could be made into a host option along the lines of ansible_python_interpreter, e.g. ansible_no_force_tty=BOOL or something; it would at least allow the problem to be worked around without a hack.
@drewcrawford I'm wondering if this is related to some garbage input coming across the line and being echoed back. If you prepend echo -n '' | to your ssh command, does it still reliably fail for you?
Also, I'd really like someone who's able to reproduce this to run it with ANSIBLE_DEBUG=1 and post the full output (a gist would be preferable to pasting it here).
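For anyone who wants to try both suggestions, a rough sketch (the host, playbook name, and temp path are placeholders, not taken from any setup in this thread):
# Prepend an empty stdin to the forced-tty command to test the echoed-garbage theory.
echo -n '' | ssh -o ControlMaster=no -o User=root -tt YOUR_REMOTE_HOST 'mkdir -p "$( echo $HOME/.ansible/tmp/ansible-tmp-test )" && echo "$( echo $HOME/.ansible/tmp/ansible-tmp-test )"'
# Capture a full debug run of an affected playbook for posting as a gist.
ANSIBLE_DEBUG=1 ansible-playbook -vvvv your-playbook.yml 2>&1 | tee ansible-debug.log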
@drewcrawford btw, we keep asking you to test as we're still unable to reproduce even with your handy script :-(
Here's an equivalent script to @drewcrawford's, using the internal API:
from ansible.utils.display import Display
display = Display(verbosity=3)

from ansible.playbook.play_context import PlayContext
from ansible.plugins import connection_loader

p = PlayContext()
p.remote_addr = '<YOUR HOST HERE>'
p.ssh_args = '-o ControlMaster=no'
#p.verbosity = 4

ssh = connection_loader.get('ssh', p, None)

for i in range(1000):
    res = ssh.exec_command('mkdir -p "$( echo $HOME/.ansible/tmp/ansible-tmp-1456012812.4-96487876295189 )" && echo "$( echo $HOME/.ansible/tmp/ansible-tmp-1456012812.4-96487876295189 )"')
    if res[1].strip() != '/path/to/users/home/.ansible/tmp/ansible-tmp-1456012812.4-96487876295189':
        print("FAILED")
        break
    print(res)
    ssh.exec_command('rm -rf $HOME/.ansible/tmp/ansible-tmp-1456012812.4-96487876295189')
So, first off let me say that I completely believe the reporters of this bug, in that there is something going on here. There are just too many people reporting this to dismiss it. However, several of us have been trying to reproduce this for an hour on a variety of setups and have yet to see the reported failure.
So at this point, we really need a definitive way to reproduce this. If someone could please use my above script or the following variation on a set of EC2 instances and provide us the relevant AMI id's to help debug this, we'd really appreciate it.
import time

from ansible.utils.display import Display
display = Display(verbosity=3)

from ansible.playbook.play_context import PlayContext
from ansible.plugins import connection_loader

p = PlayContext()
p.remote_addr = '<YOUR HOST HERE>'
p.ssh_args = '-o ControlMaster=auto -o ControlPersist=3'
#p.verbosity = 4

ssh = connection_loader.get('ssh', p, None)

for i in range(1000):
    res = ssh.exec_command('mkdir -p "$( echo $HOME/.ansible/tmp/ansible-tmp-1456012812.4-96487876295189 )" && echo "$( echo $HOME/.ansible/tmp/ansible-tmp-1456012812.4-96487876295189 )"')
    if res[1].strip() != '/<YOUR REMOTE HOME HERE>/.ansible/tmp/ansible-tmp-1456012812.4-96487876295189':
        print("FAILED")
        break
    print(res)
    ssh.exec_command('rm -rf $HOME/.ansible/tmp/ansible-tmp-1456012812.4-96487876295189')
    time.sleep(5)
The main difference with this script is that it does use ControlPersist, but sets the timeout and sleeps such that a new persisted connection is created each time.
OK, I'm actually able to reproduce this consistently against the docker image chrismeyers/centos6, using the following script (which uses the action plugin methods more directly to create the remote temp):
#!/usr/bin/python
from ansible.utils.display import Display
display = Display(verbosity=3)

from ansible.playbook.play_context import PlayContext
from ansible.plugins import action_loader, connection_loader

p = PlayContext()
p.remote_addr = '172.17.0.2'
p.remote_user = 'root'
p.password = 'docker.io'
p.pipelining = True

ssh = connection_loader.get('ssh', p, None)
handler = action_loader.get('normal', None, ssh, p, None, None, None)

while True:
    tmp = handler._make_tmp_path()
    print(tmp)
    handler._remove_tmp_path(tmp)
woot!
So, some more fun with this bug: running sshd in the above container with -ddd to try to get some debugging info about the remote side does not fail. I've been running the above script against it for over 11 hours without triggering the bug.
My software stack:
This bug always comes up when starting heavy processes (Docker containers or Java) on AWS t2.micro instances, and goes away on t2.large.
It seems the failure becomes more likely as system load increases.
Hmm, buffer-flush race with the muxed-connection channel close? What if we hung a tiny sleep off the end of the commands?
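A rough sketch of that experiment (the host and path are placeholders; this is not the exact command Ansible builds):
# Append a short sleep so the ssh channel stays open long enough for buffered output to flush.
ssh -tt YOUR_REMOTE_HOST 'mkdir -p "$HOME/.ansible/tmp/ansible-tmp-test" && echo "$HOME/.ansible/tmp/ansible-tmp-test"; sleep 1'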
@ederrm - that's really helpful info- I think we're mostly doing no-op-ish things in our tests. I wonder if we can trigger this more often by kicking off a background spinwait or something...
Experiencing exactly the same behavior as mentioned above.
$ ansible --version
ansible 2.0.1.0
config file =
configured module search path = Default w/o overrides
[16:50:27][jenkins@master51_192.168.121.200:][~]
$ ansible -vvvv nat21.host.ee --private-key /var/lib/jenkins/.ssh/id_rsa -u root -m ping
No config file found; using defaults
Loaded callback minimal of type stdout, v2.0
<nat21.host.ee> ESTABLISH SSH CONNECTION FOR USER: root
<nat21.host.ee> SSH: EXEC ssh -C -vvv -o ControlMaster=auto -o ControlPersist=60s -o Port=22 -o 'IdentityFile="/var/lib/jenkins/.ssh/id_rsa"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o ControlPath=/var/lib/jenkins/.ansible/cp/ansible-ssh-%h-%p-%r -tt nat21.host.ee '/bin/sh -c '"'"'mkdir -p "` echo $HOME/.ansible/tmp/ansible-tmp-1459864230.34-240286631163976 `" && echo "` echo $HOME/.ansible/tmp/ansible-tmp-1459864230.34-240286631163976 `"'"'"''
nat21.host.ee | FAILED! => {
"failed": true,
"msg": "failed to resolve remote temporary directory from ansible-tmp-1459864230.34-240286631163976: `mkdir -p \"` echo $HOME/.ansible/tmp/ansible-tmp-1459864230.34-240286631163976 `\" && echo \"` echo $HOME/.ansible/tmp/ansible-tmp-1459864230.34-240286631163976 `\"` returned empty string"
}
[16:50:42][jenkins@master51_192.168.121.200:][~]
$ ansible -vvvv nat21.host.ee --private-key /var/lib/jenkins/.ssh/id_rsa -u root -m ping
No config file found; using defaults
Loaded callback minimal of type stdout, v2.0
<nat21.host.ee> ESTABLISH SSH CONNECTION FOR USER: root
<nat21.host.ee> SSH: EXEC ssh -C -vvv -o ControlMaster=auto -o ControlPersist=60s -o Port=22 -o 'IdentityFile="/var/lib/jenkins/.ssh/id_rsa"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o ControlPath=/var/lib/jenkins/.ansible/cp/ansible-ssh-%h-%p-%r -tt nat21.host.ee '/bin/sh -c '"'"'mkdir -p "` echo $HOME/.ansible/tmp/ansible-tmp-1459864243.6-2450988655105 `" && echo "` echo $HOME/.ansible/tmp/ansible-tmp-1459864243.6-2450988655105 `"'"'"''
<nat21.host.ee> PUT /tmp/tmprOjXcP TO /root/.ansible/tmp/ansible-tmp-1459864243.6-2450988655105/ping
<nat21.host.ee> SSH: EXEC sftp -b - -C -vvv -o ControlMaster=auto -o ControlPersist=60s -o Port=22 -o 'IdentityFile="/var/lib/jenkins/.ssh/id_rsa"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o ControlPath=/var/lib/jenkins/.ansible/cp/ansible-ssh-%h-%p-%r '[nat21.host.ee]'
<nat21.host.ee> ESTABLISH SSH CONNECTION FOR USER: root
<nat21.host.ee> SSH: EXEC ssh -C -vvv -o ControlMaster=auto -o ControlPersist=60s -o Port=22 -o 'IdentityFile="/var/lib/jenkins/.ssh/id_rsa"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o ControlPath=/var/lib/jenkins/.ansible/cp/ansible-ssh-%h-%p-%r -tt nat21.host.ee '/bin/sh -c '"'"'LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 /usr/bin/python /root/.ansible/tmp/ansible-tmp-1459864243.6-2450988655105/ping; rm -rf "/root/.ansible/tmp/ansible-tmp-1459864243.6-2450988655105/" > /dev/null 2>&1'"'"''
nat21.host.ee | SUCCESS => {
"changed": false,
"invocation": {
"module_args": {
"data": null
},
"module_name": "ping"
},
"ping": "pong"
}
As you can see, the ping failed on the first attempt, while running exactly the same command a second time worked fine. Does it have something to do with the SSH persistent connection?
edit:
The issue was resolved by adding:
...
[ssh_connection]
ssh_args = -o ControlMaster=no
...
into ansible.cfg.
I can barely run my playbook all the way through; it often fails in "setup" too. Also, handlers won't be run if a change was made and Ansible then aborted.
grep ControlMaster etc/ansible.cfg
ssh_args = -o ControlMaster=no -o ControlPersist=60s
ansible-playbook buildbot-slaves.yml --limit=juschinka.home
fatal: [juschinka.home]: FAILED! => {"failed": true, "msg": "failed to resolve remote temporary directory from ansible-tmp-1460312056.02-264065629966800: `mkdir -p \"` echo $HOME/.ansible/tmp/ansible-tmp-1460312056.02-264065629966800 `\" && echo \"` echo $HOME/.ansible/tmp/ansible-tmp-1460312056.02-264065629966800 `\"` returned empty string"}
I don't have any better luck without OpenSSH either; paramiko seems not to work too:
ansible-playbook -c paramiko buildbot-slaves.yml --limit=juschinka.home
fatal: [juschinka.home]: FAILED! => {"failed": true, "msg": "failed to resolve remote temporary directory from ansible-tmp-1460312119.01-155638274935319: `mkdir -p \"` echo $HOME/.ansible/tmp/ansible-tmp-1460312119.01-155638274935319 `\" && echo \"` echo $HOME/.ansible/tmp/ansible-tmp-1460312119.01-155638274935319 `\"` returned empty string"}
I first noticed this once I started working on my faster desktop machine; the laptop (with the Debian-packaged ansible) didn't do it. What's interesting is that this only started today. I worked all week in this setup without an issue, and only one machine is badly affected and another less often, but both are Debian Stretch (Testing). I run ansible-playbook from both machines; it does not matter which one I run the playbook from, local and remote, both can fail. Two Debian stable machines are practically never affected.
What I didn't get from the discussion is whether "sudo" is used even when I am using remote_user = root. I haven't yet cleaned up the sudo configuration, so that could definitely be different.
I have been using ansible without running into this problem for quite a while. Earlier today, I began a large file upload on another device (unrelated), and I started consistently running into this problem -- typically in the first 3 or 4 stages of my playbook. In my case, setting pipelining=True prevented the issue from recurring as often, but it still happened on a command which did use -tt:
ssh -C -vvv -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o Port=22 -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=ubuntu -o ConnectTimeout=10 -o ControlPath=/Users/crschmidt/.ansible/cp/ansible-ssh-%h-%p-%r -tt 66.228.35.122 '( umask 22 && mkdir -p "$( echo $HOME/.ansible/tmp/ansible-tmp-1460329511.62-77852023686217 )" && echo "$( echo $HOME/.ansible/tmp/ansible-tmp-1460329511.62-77852023686217 )" )'
I think this supports the idea that this is a race condition; in my case, possibly made more common due to upstream bandwidth constraints lowering my effective bandwidth for ansible.
Thought I would mention this in case this additional data helped.
crschmidt-macbookair:~ crschmidt$ ansible-playbook --version
ansible-playbook 2.0.0.2
I would suggest more people confirm whether the changes that @drewcrawford posted earlier work for them or not; in my case they have allowed me to reliably avoid this issue.
If more people can confirm whether that change fixes their issues, it might be easier to get something accepted upstream that introduces it as an option or in some otherwise appropriate form.
@cewood good idea.
I can confirm this patch seems to fix the issue.
In my case, Playbooks hardly went all the way through using only one fork (and it got worse with multiple forks). With the patch applied, it works with as many forks as (virtual) hosts (on a shared node). Thanks!
fixed the issue for me, thanks @drewcrawford
I also confirm the patch from @drewcrawford fix the issue here.
I had applied @drewcrawford's change manually (added an "or True" to the condition), and it fixes the issue entirely for me. I think that also confirms his patch.
So I think we should probably restrict the removal of -tt to situations where we're running raw shell commands, for instance the mkdir command. The sudoable option would, I think, encompass this, as it is set to False in this situation:
cmd = self._connection._shell.mkdtemp(basefile, use_system_tmp, tmp_mode)
result = self._low_level_execute_command(cmd, sudoable=False)
So @drewcrawford's patch would become:
diff --git a/lib/ansible/plugins/connection/ssh.py b/lib/ansible/plugins/connection/ssh.py
index dcea2e5..b03b15f 100644
--- a/lib/ansible/plugins/connection/ssh.py
+++ b/lib/ansible/plugins/connection/ssh.py
@@ -559,11 +559,12 @@ class Connection(ConnectionBase):
         # python interactive-mode but the modules are not compatible with the
         # interactive-mode ("unexpected indent" mainly because of empty lines)
-        if in_data:
-            cmd = self._build_command('ssh', self.host, cmd)
+        if not in_data and sudoable:
+            args = ('ssh', '-tt', self.host, cmd)
         else:
-            cmd = self._build_command('ssh', '-tt', self.host, cmd)
+            args = ('ssh', self.host, cmd)
+        cmd = self._build_command(*args)
         (returncode, stdout, stderr) = self._run(cmd, in_data, sudoable=sudoable)
         return (returncode, stdout, stderr)
And a sample test:
# ANSIBLE_SSH_PIPELINING=0 ansible -m ping awxlocal -vvv -c ssh
Using /etc/ansible/ansible.cfg as config file
<192.168.122.100> ESTABLISH SSH CONNECTION FOR USER: testing
<192.168.122.100> SSH: EXEC ssh -C -q -o ControlMaster=auto -o ControlPersist=60s -o ControlPath=/tmp/ansible-ssh-%h-%p-%r -o Port=22 -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=testing -o ConnectTimeout=10 192.168.122.100 '/bin/sh -c '"'"'( umask 77 && mkdir -p "` echo $HOME/.ansible/tmp/ansible-tmp-1460397005.14-70049559304095 `" && echo "` echo $HOME/.ansible/tmp/ansible-tmp-1460397005.14-70049559304095 `" )'"'"''
<192.168.122.100> PUT /tmp/tmp2GYPZG TO /var/lib/testing home/.ansible/tmp/ansible-tmp-1460397005.14-70049559304095/ping
<192.168.122.100> SSH: EXEC scp -C -o ControlMaster=auto -o ControlPersist=60s -o ControlPath=/tmp/ansible-ssh-%h-%p-%r -o Port=22 -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=testing -o ConnectTimeout=10 /tmp/tmp2GYPZG '[192.168.122.100]:'"'"'/var/lib/testing home/.ansible/tmp/ansible-tmp-1460397005.14-70049559304095/ping'"'"''
<192.168.122.100> ESTABLISH SSH CONNECTION FOR USER: testing
<192.168.122.100> SSH: EXEC ssh -C -q -o ControlMaster=auto -o ControlPersist=60s -o ControlPath=/tmp/ansible-ssh-%h-%p-%r -o Port=22 -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=testing -o ConnectTimeout=10 -tt 192.168.122.100 '/bin/sh -c '"'"'LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 /usr/bin/python '"'"'"'"'"'"'"'"'/var/lib/testing home/.ansible/tmp/ansible-tmp-1460397005.14-70049559304095/ping'"'"'"'"'"'"'"'"'; rm -rf "/var/lib/testing home/.ansible/tmp/ansible-tmp-1460397005.14-70049559304095/" > /dev/null 2>&1'"'"''
awxlocal | SUCCESS => {
    "changed": false,
    "invocation": {
        "module_args": {
            "data": null
        },
        "module_name": "ping"
    },
    "ping": "pong"
}
^ You can see in the above that the first EXEC line does not use -tt, however the next two do.
Can someone who's having this issue try the above patch to make sure it resolves the issue in the same way as Drew's original?
@cewood The patch from @drewcrawford worked for me. I've been running this for a week without error.
FWIW, my patch above seems to resolve the issue in my reproducing setup, if anyone else wants to verify:
1) docker run -d chrismeyers/centos6
2) run my python script above (running multiple instances of it may speed up a failure)
With the above, I usually see a failure in < 3 minutes. With my patch, the scripts ran for 30+ minutes without a failure. The problem I have here is that it doesn't always fail, the bug is so intermittent that I have trouble saying it is definitively fixed despite my test script running for that long.
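For reference, those reproduction steps might look roughly like this on the command line (the docker inspect call and script name are assumptions, not taken from the thread):
# Start the reproduction container and find its IP for the script's remote_addr.
docker run -d chrismeyers/centos6
docker inspect -f '{{ .NetworkSettings.IPAddress }}' <container-id>
# Run one or more copies of the Python reproduction script shown earlier.
python repro.py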
I've gone ahead and merged in the patch I posted above. This will also be included in the next release.
If you continue seeing any problems related to this issue, or if you have any further questions, please let us know by stopping by one of the two mailing lists, as appropriate:
Because this project is very active, we're unlikely to see comments made on closed tickets, but the mailing list is a great way to ask questions, or post if you don't think this particular issue is resolved.
Thank you!
To add one more data point confirming that there was a bug specific to the -tt option: I experienced these same symptoms in Ansible 1.5.4. I was able to reproduce the error effectively 100% of the time in at least one task of a large playbook, and the perl script hack that @zerodogg posted worked around the problem immediately.
@jimi-c Do you have a release date for the patch? I'm experiencing this issue a bit too often for my taste, and it's causing problems with our deploys.
@macnibblet this will be in 2.0.2 and 2.1.
2.0.2 should be out very soon (today-ish), 2.1 will most likely be out in early May at this point.
Re-opening, as we still feel removing -tt is a workaround rather than the final answer to this bug.
I have also had this issue especially when running playbooks with many tasks. This also caused get_url (or wget with the command module) to fail when downloading larger files. It says that the download is complete, but the entire file has not been downloaded. I upgraded to 2.0.2 and the issue seems to be gone. Thanks! :)
After upgrading 2.0.1 -> 2.0.2 this issue appeared (it wasn't present in .1).
Steps to reproduce:
ansible -vvvv all -i mongo, -c docker -m ping
Response in .2:
Using …/ansible.cfg as config file
Loaded callback minimal of type stdout, v2.0
ESTABLISH DOCKER CONNECTION FOR USER: None
mongo | FAILED! => {
"changed": false,
"failed": true,
"invocation": {
"module_name": "ping"
},
"module_stderr": "",
"module_stdout": "",
"msg": "MODULE FAILURE",
"parsed": false
}
Definitely still in 2.0.2. I'm running Packer to create an Ubuntu 14.04 VM from scratch (not ideal, but it's what we use) and use the ansible provisioner upon installation of the OS to run my 200+ task playbook. I thought it was just me because of the odd set up and Packer itself being beta-ish (version 0.10.0 is the latest). But I have definitely seen this in ansible 2.0.1 and 2.0.2.
I haven't tried any of the scripts or workarounds myself though, so nothing to add in that regard. Sorry!
Yeah, I have this problem as well (with 2.0.2)...
Experienced this with Ansible 2.0.1 connecting to an Ubuntu 16.04 host (but not when connecting to 14.04 hosts). Upgraded to 2.0.2 and it has not reoccurred as yet.
Experienced this issue with ansible 2.0.2. I'm running ansible against a chef/centos-6.6 VM spun up through vagrant. Haven't tried the workaround as I'm also not sure if it's a valid fix.
EDIT: Tried the workaround and it still gives the same error.
2.0.2.0 fixed this for me
Still seeing this in 2.0.2.0 when hitting Docker for Mac Beta containers.
Setting ansible_user=root in the inventory has worked around this issue for me.
Haven't seen this issue with 2.0.2.0 at all. Previously it was very frequent.
I used to see this isssue a lot before upgrading to 2.0.2.0, it seems to be fixed for me after the upgrade.
I've created an environment to reproduce the issue @ https://github.com/dregin/ansible-docker-connector
Can others check to see if they see the same behaviour?
Still see this after an upgrade from 1.9.x to 2.0.2.
In ansible.cfg there was gathering = smart; disabling this setting fixed it for me. http://docs.ansible.com/ansible/intro_configuration.html#sts=gathering%C2%B6
Target: Debian 7
Host: CentOS 6
Seeing the error with ansible 2.0.2.0 with Docker containers (target=ubuntu 14.04 && host=ubuntu 14.04)
@dregin's workaround, setting ansible_user=root seems to change the issue from "always" to "intermittently".
So still having problems :(
This issue goes away for me if I set the following in ansible.cfg:
[ssh_connection]
pipelining = True
Host: Ubuntu 16.04 i386 (ansible OS package version 2.0.0.2-2)
Target: Various Debian (7/8) / Ubuntu (mainly 16.04).
@tim-seoss that is due to the fact that pipelining mode disables the use of -tt as well (since the Python interpreter goes into interactive mode if there is a pty).
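A rough illustration of that interaction (the host name is a placeholder and this is not the exact command Ansible runs):
# Module code piped over stdin runs non-interactively when no pty is allocated:
echo 'print("hello")' | ssh target-host python
# With a forced pty, python sees a tty on stdin and drops into interactive mode,
# which breaks piped module code (e.g. "unexpected indent" on blank lines):
echo 'print("hello")' | ssh -tt target-host python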
Just here to say +1
From my Ubuntu 16.04 MaaS server, with Ansible from Ubuntu repos, I'm seeing this problem randomly, while running Ansible against the MaaS nodes (just 2 for now)... :-/
Looks like this:
[ssh_connection]
ssh_args = -o ControlMaster=no
...helps... I'll try more...
EDITED (June 13): it didn't work... :(
Same problem here.
Bazooka solution: on the remote host rm -rf /tmp/* ~/.ansible/*
rm -rf /tmp/* is generally a bad idea if your remote does anything other than running Ansible tasks (which is very likely).
rm -rf ~/.ansible/tmp on the remote machine (as root) did the job for me (ansible server: Ubuntu 16.10, ansible 2.0.0.2).
Not sure if it's just a coincidence, but I was seeing this issue heavily on 16.04, and almost always (maybe even only?) when using the command module. When I switched to an equivalent file module version of mkdir -p, I stopped seeing the issue. FWIW, I wiped out all .ansible/ directories (on the target hosts), but that didn't solve it for me.
I encounter the same on Ubuntu 16.04, but adding --ssh-extra-args="-o ControlMaster=no -o ControlPath=none -o ControlPersist=no" seems to help.
@mspanc: be careful: if you're using ssh_extra_args (and not ssh_args), you're mixing your ControlPersist args with the default generated ones, so the OpenSSH behavior might be, shall we say, "undefined"?
This only appeared to fix the issue, anyway. I had to use another Ubuntu 14.04 machine to successfully run my ansible scripts.
Same here with ansible 2.1.2.0 (stable-2.1 559fcbe531), target RHEL 7.2.
The funny thing is that when it does not work, it runs fine with "-vvv"; and a little later, as soon as it stops working with "-vvv", it runs normally again without verbose output.
The Openssh bugzilla mentioned above ( https://bugzilla.mindrot.org/show_bug.cgi?id=2492 ) has this comment:
The fix has now been incorporated into the following Linux mainline kernels: 4.1.26, 4.4.12, 4.5.6, 4.6.1 and the newly released 4.7.
and links to this commit: http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=0f40fbbcc34e093255a2b2d70b6b0fb48c3f39aa
Many long-term-support Linux distros (RHEL, Ubuntu LTS, etc.) are running older kernels. You'd have to audit the patches they applied to the kernel to find out whether they've backported the fix to their packages.
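One way to check might look like the following sketch (the changelog locations are assumptions about typical RHEL/CentOS and Debian/Ubuntu packaging, not guaranteed paths):
# What kernel is the managed node actually running?
uname -r
# RHEL/CentOS: search the kernel package changelog for the backported tty fix.
rpm -q --changelog kernel | grep -i tty | head
# Debian/Ubuntu: search the installed kernel image's changelog.
zgrep -i tty /usr/share/doc/linux-image-$(uname -r)/changelog.Debian.gz | head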
Does anyone have a clue about this problem? We tried all of the aforementioned solutions (except updating the kernel itself), but none of them worked. We built a nice installation infrastructure using Ansible, but this small issue is killing us. Whenever we execute the ansible-playbook command, the first run throws this failure message; usually the second run works. We also see the same problem at random points during the installation.
Host & targets are CentOS 6.6 and we are using Ansible 2.1.2.
Going back to 1.9.4 solves the problem, but we lose lots of nice features that we have already implemented.
Any help or idea would be appreciated.
Thanks
We tried all of the aforementioned solutions (except updating the kernel itself).
Any reason why you can't update the kernel?
It is very difficult to persuade people to upgrade the kernel, since the kernel is baselined for this system. (The only reasonable excuse would be the Dirty COW bug.)
Hi,
Just for the record, I had this problem with a controller running CentOS 6.6 (that minor version). Updating ssh on this server made it work again. I found this solution easier than updating the kernel ;-)
Regards,
JYL
Hi,
Thanks for the reply. Even though I have already updated my ansible server to CentOS 6.8, I will try that on another ansible server.
Serdar
Hi,
I'm getting the same error with CentOS7.
Ansible version
ansible 2.2.1.0
@chinatree Greetings! Thanks for taking the time to open this issue. In order for the community to handle your issue effectively, we need a bit more information.
Here are the items we could not find in your description:
Please set the description of this issue with this template:
https://raw.githubusercontent.com/ansible/ansible/devel/.github/ISSUE_TEMPLATE.md
I am also still getting this issue regularly, mostly when setting up packer boxes. I am using Ansible 2.3.0 on Ubuntu 16.10.
@JoelFeiner Just out of interest, what OS / Linux Kernel Version is your managed node running?
@joernheissler: output of uname -a on the machine running Ansible is below:
Linux ********* 4.8.0-52-generic #55-Ubuntu SMP Fri Apr 28 13:28:50 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
The VMs are running CentOS 7, and uname -a on one of them looks like this:
Linux localhost.localdomain 3.10.0-514.16.1.el7.x86_64 #1 SMP Wed Apr 12 15:04:24 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
@JoelFeiner So you're running a 4-year-old kernel which doesn't include the fix (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0f40fbbcc34e093255a2b2d70b6b0fb48c3f39aa).
Btw, I haven't seen this bug myself for a long time now, since I upgraded kernels on my servers.
Why is this bug not closed anyway? It's not an ansible problem.
I've gotten the go-ahead to use a newer kernel from elrepo. Hopefully that will fix the issue then.
LOL: I've stumbled on a sleep implementation that fails on sleep 0, as in:
$ sleep 0; echo $?
1
$ sleep 1; echo $?
0
Hacking plugins/action/__init__.py to use "{ sleep 0 || :; }" instead of "sleep 0" does help for now (I'm only using the Bourne shell here; not sure how that could work for csh).
However, there is another sleep implementation available on that machine, but specifying an explicit /path/to/sleep seems to be unsupported as of ansible-2.2.2.0...
For the record: this is /usr/bin/sleep on Interix (a POSIX layer for Windows, since abandoned, but still part of my infrastructure here).
@joernheissler: unfortunately, this happened to me again on kernel 4.4.69. This kernel contains the commit with the fix.
Are you able to put together a complete test case so that others can reproduce it?
It's a very intermittent error that can happen with any task, so I don't think that is feasible.
JoelFeiner, try upgrading ssh as written in the thread...
@jylenhofgfi I don't see anything in this thread about upgrading SSH, only the kernel. What version are you using? On the remote hosts or local?
Mostly, setting ControlMaster=no should fix this issue. You should also check your OpenSSH version: OpenSSH started to support multiplexing in version 5.1, but it only works really well from version 5.6. So I suggest upgrading your OpenSSH server to 5.6 or higher.
Release Notes: https://www.openssh.com/txt/release-5.1
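If you're unsure which version a managed node runs, one quick check is to read the version string the server announces (a sketch; the host name is a placeholder):
# ssh -v prints the server's version, e.g. "remote software version OpenSSH_7.4".
ssh -v target-host exit 2>&1 | grep -i 'remote software version'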
@chinatree You have not responded to information requests in this issue so we will assume it no longer affects you. If you are still interested in this, please create a new issue with the requested information.
reopening till https://github.com/ansible/ansible/issues/31022 is merged.
!needs_info
resolved_by_pr https://github.com/ansible/ansible/issues/31022
bot_skip
For anyone trying to experiment with this problem in devel, we now have a use_tty config option that can stop the connection plugin from adding "-tt" without having to hack the code.
https://github.com/ansible/ansible/commit/218987eac1eb9a5671f996d2887dcca349199e51
The original problem described in this issue with temp dir resolution should have been resolved by https://github.com/ansible/ansible/pull/31677.