Description
Running /usr/bin/salt-call state.apply
on Xubuntu 20.04 minion finishes but never completes.
Summary for local
--------------
Succeeded: 153
Failed: 0
--------------
Total states run: 153
Total run time: 16.030 s
But command never returns back to terminal.
ps -ef |grep salt-call
root 2928868 2914767 0 08:12 pts/28 00:00:00 sudo /usr/bin/salt-call state.apply
root 2928869 2928868 0 08:12 pts/28 00:00:05 /usr/bin/python3 /usr/bin/salt-call state.apply
root 3026694 3026554 0 09:37 pts/27 00:00:00 grep --color=auto salt-call
Notice the command has been running over an hour but only took 16 seconds for the states to be applied.
Setup
Steps to Reproduce the behavior
Xubuntu 20.04 install with ZFS root
more /etc/apt/sources.list.d/saltstack.list
deb [arch=amd64] http://repo.saltstack.com/py3/ubuntu/20.04/amd64/latest focal main
Salt master is Ubuntu 18.04 version 3001 recently upgraded from nigrogen.
All other minions on network are 18.04 and working as expected.
Expected behavior
salt-call state.apply
to finish and return to terminal.
Screenshots
Versions Report
salt --versions-report
(Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)
Master:
salt --versions-report
Salt Version:
Salt: 3001
Dependency Versions:
cffi: Not Installed
cherrypy: unknown
dateutil: 2.6.1
docker-py: Not Installed
gitdb: 2.0.3
gitpython: 2.1.8
Jinja2: 2.10
libgit2: 0.26.0
M2Crypto: Not Installed
Mako: Not Installed
msgpack-pure: Not Installed
msgpack-python: 0.5.6
mysql-python: Not Installed
pycparser: Not Installed
pycrypto: Not Installed
pycryptodome: 3.4.7
pygit2: 0.26.2
Python: 3.6.9 (default, Apr 18 2020, 01:56:04)
python-gnupg: 0.4.1
PyYAML: 3.12
PyZMQ: 17.1.2
smmap: 2.0.3
timelib: Not Installed
Tornado: 4.5.3
ZMQ: 4.2.5
System Versions:
dist: ubuntu 18.04 Bionic Beaver
locale: UTF-8
machine: x86_64
release: 4.15.0-101-generic
system: Linux
version: Ubuntu 18.04 Bionic Beaver
Minion With issue:
Salt Version:
Salt: 3001
Dependency Versions:
cffi: Not Installed
cherrypy: Not Installed
dateutil: 2.7.3
docker-py: 4.1.0
gitdb: Not Installed
gitpython: Not Installed
Jinja2: 2.10.1
libgit2: Not Installed
M2Crypto: Not Installed
Mako: 1.1.0
msgpack-pure: Not Installed
msgpack-python: 0.6.2
mysql-python: Not Installed
pycparser: Not Installed
pycrypto: 2.6.1
pycryptodome: 3.6.1
pygit2: Not Installed
Python: 3.8.2 (default, Apr 27 2020, 15:53:34)
python-gnupg: 0.4.5
PyYAML: 5.3.1
PyZMQ: 18.1.1
smmap: Not Installed
timelib: Not Installed
Tornado: 4.5.3
ZMQ: 4.3.2
System Versions:
dist: ubuntu 20.04 focal
locale: utf-8
machine: x86_64
release: 5.4.0-39-generic
system: Linux
version: Ubuntu 20.04 focal
Working Minon:
salt-call --versions-report
Salt Version:
Salt: 2017.7.4
Dependency Versions:
cffi: Not Installed
cherrypy: Not Installed
dateutil: 2.6.1
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
ioflo: Not Installed
Jinja2: 2.10
libgit2: Not Installed
libnacl: Not Installed
M2Crypto: Not Installed
Mako: 1.0.7
msgpack-pure: Not Installed
msgpack-python: 0.5.6
mysql-python: Not Installed
pycparser: Not Installed
pycrypto: 2.6.1
pycryptodome: Not Installed
pygit2: Not Installed
Python: 3.6.9 (default, Apr 18 2020, 01:56:04)
python-gnupg: Not Installed
PyYAML: 3.12
PyZMQ: 16.0.2
RAET: Not Installed
smmap: Not Installed
timelib: Not Installed
Tornado: 4.5.3
ZMQ: 4.2.5
System Versions:
dist: Ubuntu 18.04 bionic
locale: UTF-8
machine: x86_64
release: 4.15.0-101-generic
system: Linux
version: Ubuntu 18.04 bionic
Additional context
Add any other context about the problem here.
Could be duplicate of https://github.com/saltstack/salt/issues/57456
I've got the same issue under Ubuntu 20.04 as well. If run from the master the state succeed and exit right after like normal. When executed directly on minion it hangs at the end until CTRL + C is used.
```# salt-call --versions-report
Salt Version:
Salt: 3001
Dependency Versions:
cffi: Not Installed
cherrypy: Not Installed
dateutil: 2.7.3
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
Jinja2: 2.10.1
libgit2: Not Installed
M2Crypto: Not Installed
Mako: Not Installed
msgpack-pure: Not Installed
msgpack-python: 0.6.2
mysql-python: Not Installed
pycparser: Not Installed
pycrypto: Not Installed
pycryptodome: 3.6.1
pygit2: Not Installed
Python: 3.8.2 (default, Apr 27 2020, 15:53:34)
python-gnupg: 0.4.5
PyYAML: 5.3.1
PyZMQ: 18.1.1
smmap: Not Installed
timelib: Not Installed
Tornado: 4.5.3
ZMQ: 4.3.2
System Versions:
dist: ubuntu 20.04 focal
locale: utf-8
machine: x86_64
release: 5.4.0-1018-aws
system: Linux
version: Ubuntu 20.04 focal
```
I don't think its a duplicate of #57456 because this issue is when an SLS render to nothing and the executed state from the master will exit but with and error status code and its not that case here.
I'm hesitant on #57356 as well as I keep most of my states without jinja references and filter via grains in the top.sls file.
It also fails if I run just a single state file. For example, my vscode state:
salt-call state.apply vscode
cat vscode/init.sls
vscode_repo:
pkgrepo.managed:
- name: deb [arch=amd64] https://packages.microsoft.com/repos/vscode stable main
- file: /etc/apt/sources.list.d/vscode.list
- key_url: https://packages.microsoft.com/keys/microsoft.asc
- refresh_db: true
vscode:
pkg.installed:
- name: code
- require:
- pkgrepo: vscode_repo
Output is similar to other fails:
local:
----------
ID: vscode_repo
Function: pkgrepo.managed
Name: deb [arch=amd64] https://packages.microsoft.com/repos/vscode stable main
Result: True
Comment: Package repo 'deb [arch=amd64] https://packages.microsoft.com/repos/vscode stable main' already configured
Started: 13:02:08.425043
Duration: 107.504 ms
Changes:
----------
ID: vscode
Function: pkg.installed
Name: code
Result: True
Comment: All specified packages are already installed
Started: 13:02:09.400992
Duration: 73.534 ms
Changes:
Summary for local
------------
Succeeded: 2
Failed: 0
------------
Total states run: 2
Total run time: 181.038 ms
^C
CTRL-C after waiting a couple of minutes
This is either not a duplicate of #57456, or that issue's title is misleading. As far as I can tell, applying a highstate via salt-call which contains pkg.installed
will prevent salt-call from shutting down cleanly. Here is a vagrant file with the minimum steps required to reproduce the bug:
Vagrantfile
Vagrant.configure("2") do |config|
config.vm.box = "bento/ubuntu-20.04"
config.vm.provision "shell", inline: <<-SHELL
wget -qO - https://repo.saltstack.com/py3/ubuntu/20.04/amd64/latest/SALTSTACK-GPG-KEY.pub | apt-key add -
echo 'deb http://repo.saltstack.com/py3/ubuntu/20.04/amd64/latest focal main' > /etc/apt/sources.list.d/saltstack.list
echo '127.0.1.1 salt' >> /etc/hosts
apt-get update
apt-get install -y salt-master
echo 'autosign_file: /etc/salt/autosign.conf' > /etc/salt/master
echo '*' > /etc/salt/autosign.conf
chmod 600 /etc/salt/autosign.conf
systemctl restart salt-master
apt-get install -y salt-minion
mkdir /srv/salt
cat << EOT > /srv/salt/top.sls
base:
"*":
- bug
EOT
cat << EOT > /srv/salt/bug.sls
this will block salt-call from exiting:
pkg.installed:
- name: cowsay
EOT
salt-call state.apply -l debug
SHELL
end
Expected behavior:
Running vagrant up
should install salt, highstate, install the cowsay package, then exit.
Actual behavior:
Package is installed as expected, but salt-call never exits, causing vagrant up
to also hang.
that issue's title is misleading
The title describes the issue in 3000.1 (empty sls / fileserver? hang) that shifted into this one (pkg state? hang) in 3001.
Sorry, I just meant that it doesn't appear to be related to an empty formula. Applying a highstate with an empty formula works just fine in my tests.
@mgdotson thanks for the bug report. I also got this on a 20.04 box as well.
salt-call --local state.apply
does not hang!
salt-call state.apply
hangs!
I am seeing this on fedora 32 as well.
Reproduced in 20.04 Docker container. Working on that.
Got a lot of "Message timed out" on 3000.3 & 3001.
using --async help.
@jpmenil I don't think it could be related to this one. Please try to find a similar issue or create a new one with detailed description.
Confirmed this is related to pkg.installed
, and also confirmed if you use --local
against the same set of states it doesn't hang. Running 20.04 with current pkgs as of today.
Anyone running anything other than python3-apt
2.0.0
?
Edit: Note that highstate/states executed via the salt-minion
schedule do not appear affected, this is only salt-call
when --local
is not used.
All the procs left are ZMQ showing an epoll_wait
on an fd. So likely related to a pipe or local socket used by ZMQ.
---/usr/bin/python(146114)-+-{ZMQbg/IO/0}(146194)
|-{ZMQbg/IO/0}(146425)
|-{ZMQbg/Reaper}(146193)
`-{ZMQbg/Reaper}(146424)
epoll_wait(14,
epoll_wait(35,
epoll_wait(12,
epoll_wait(33,
# ls -l /proc/147046/fd/14 /proc/147279/fd/35 /proc/147045/fd/12 /proc/147278/fd/33
lrwx------ 1 root root 64 Jul 20 22:34 /proc/147046/fd/14 -> 'anon_inode:[eventpoll]'
lrwx------ 1 root root 64 Jul 20 22:34 /proc/147279/fd/35 -> 'anon_inode:[eventpoll]'
lrwx------ 1 root root 64 Jul 20 22:35 /proc/147045/fd/12 -> 'anon_inode:[eventpoll]'
lrwx------ 1 root root 64 Jul 20 22:35 /proc/147278/fd/33 -> 'anon_inode:[eventpoll]'
ii libzmq5:amd64 4.3.2-2ubuntu1
ii python3-zmq 18.1.1-3
@JaseFace yep and I'm working in this direction. Thank you for analyze.
Does transport: tcp
help? The issue seems to be related to ZMQ: https://github.com/saltstack/salt/issues/57456#issuecomment-663153437
yes, returns instantly with transport set to tcp.
we did not get this fixed for the point release, other priorities came up, but still in progress and will go for Magnesium
this means salt isn't usable for ubuntu 20.04? As you cannot run mixed clients tcp and not tcp you cannot even upgrade your infra
@disaster123 salt is still mostly usable and I've personally started upgrading infrastructure. It appears the bug only affects CLI use of salt-call
. Both salt
on the master and scheduled states on the minion still work, so when salt does hang, you can just Ctrl-C once it's finished to exit.
That being said, the bug is quite annoying, and there are some legitimately broken workflows, such as scripted use of salt-call
. It's disappointing that we didn't get a fix for the 3001 version release.
Very disappointing...!
@sagetherage how to solve this?
With transport: tcp salt-call itself is working BUT now i'm running into the next issue
[ERROR ] Exception parsing response
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/salt/transport/tcp.py", line 1214, in _stream_return
for framed_msg in unpacker:
File "msgpack/_unpacker.pyx", line 562, in msgpack._cmsgpack.Unpacker.__next__
File "msgpack/_unpacker.pyx", line 493, in msgpack._cmsgpack.Unpacker._unpack
ValueError: 1048720 exceeds max_str_len(1048576)
it seems unpossible to run salt on ubuntu currently. tcp isn't working either.
I tried to work around this by downgrading zmq + pyzmq but it still does not work see: https://github.com/saltstack/salt/issues/57456#issuecomment-672827943
Was anybody able to fix this? Even downgrading zmq did not work for me.
@disaster123 both @cmcmarrow and @krionbsd are working on these issues that are not exactly the same but seem related or similar. Not sure they have a fix for all, but I will let them answer on specifics.
Hi, I am having the same behaviour on CentOS 8.2 where every state hangs at closing MQ. States run perfectly fine in a CentOS 7 based environment with an older version of PyZMQ. I am on Salt 2019.2.5 and gather between this bug report and #57456 the fixes are reduce security or downgrade PyZMQ to 18.0.0 - are these the only options until an official fix is released?
[root@controller ~]# salt-call --versions-report
Salt Version:
Salt: 2019.2.5
Dependency Versions:
cffi: 1.11.5
cherrypy: Not Installed
dateutil: 2.6.1
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
ioflo: Not Installed
Jinja2: 2.10.1
libgit2: Not Installed
libnacl: Not Installed
M2Crypto: 0.35.2
Mako: Not Installed
msgpack-pure: Not Installed
msgpack-python: 0.6.2
mysql-python: Not Installed
pycparser: 2.14
pycrypto: Not Installed
pycryptodome: Not Installed
pygit2: Not Installed
Python: 3.6.8 (default, Apr 16 2020, 01:36:27)
python-gnupg: Not Installed
PyYAML: 3.12
PyZMQ: 19.0.0
RAET: Not Installed
smmap: Not Installed
timelib: Not Installed
Tornado: 4.5.2
ZMQ: 4.3.2
System Versions:
dist: centos 8.2.2004 Core
locale: UTF-8
machine: x86_64
release: 4.18.0-193.14.2.el8_2.x86_64
system: Linux
version: CentOS Linux 8.2.2004 Core
local:
Name: vm-pkgs - Function: pkg.latest - Result: Clean Started: - 13:08:17.543186 Duration: 1730.06 ms
Name: mgmt-pkgs - Function: pkg.latest - Result: Clean Started: - 13:08:19.273469 Duration: 392.715 ms
Summary for local
------------
Succeeded: 2
Failed: 0
------------
Total states run: 2
Total run time: 2.123 s
[DEBUG ] Closing AsyncZeroMQReqChannel instance
Regards,
Matt.
@ocfmatt we are working on a long term solution, but since we do not maintain this library it is taking some time. It was one suggestion to try to downgrade, but we realize that isn't really a good idea, either, but looking for a real fix to this, still.
Sorry for the lack of communication. The zmq hang should be merged soon. The reason why salt-call hanged was because we left the clean up to garbage collector on exit. The zmq library sometimes closes its objects out of order when left to its own. But if closed by salt this can be worked around. I tried to patch this bug for zmq but I could not figure out a good way to do it. So we just put in a patch for salt.
I would be grateful if you guys check if my fix zmq hang branch stops your environment from hanging. I would hate to see you guys wait for another release because I missed something.
Hey @cmcmarrow, thanks for this patch. I've tested it on ~70 machines that exhibited the ZMQ hang, and it has resolved the problem. I'll be leaving them with the patch to burn in further, but it's pretty straightforward, I doubt anything will come up.
Hi @cmcmarrow, will these commits work in 2019 varients or are they just for 300X versions?
Most helpful comment
@ocfmatt we are working on a long term solution, but since we do not maintain this library it is taking some time. It was one suggestion to try to downgrade, but we realize that isn't really a good idea, either, but looking for a real fix to this, still.