Salt: [BUG] /usr/bin/salt-call state.apply hangs at end on Xubuntu 20.04 salt 3001

Created on 2 Jul 2020  路  31Comments  路  Source: saltstack/salt

Description
Running /usr/bin/salt-call state.apply on Xubuntu 20.04 minion finishes but never completes.

Summary for local
--------------
Succeeded: 153
Failed:      0
--------------
Total states run:     153
Total run time:    16.030 s

But command never returns back to terminal.

ps -ef |grep salt-call
root     2928868 2914767  0 08:12 pts/28   00:00:00 sudo /usr/bin/salt-call state.apply
root     2928869 2928868  0 08:12 pts/28   00:00:05 /usr/bin/python3 /usr/bin/salt-call state.apply
root     3026694 3026554  0 09:37 pts/27   00:00:00 grep --color=auto salt-call

Notice the command has been running over an hour but only took 16 seconds for the states to be applied.

Setup

Steps to Reproduce the behavior
Xubuntu 20.04 install with ZFS root

more /etc/apt/sources.list.d/saltstack.list
deb [arch=amd64] http://repo.saltstack.com/py3/ubuntu/20.04/amd64/latest focal main

Salt master is Ubuntu 18.04 version 3001 recently upgraded from nigrogen.
All other minions on network are 18.04 and working as expected.

Expected behavior
salt-call state.apply to finish and return to terminal.

Screenshots

Versions Report

salt --versions-report
(Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)

Master:
salt --versions-report
Salt Version:
           Salt: 3001

Dependency Versions:
           cffi: Not Installed
       cherrypy: unknown
       dateutil: 2.6.1
      docker-py: Not Installed
          gitdb: 2.0.3
      gitpython: 2.1.8
         Jinja2: 2.10
        libgit2: 0.26.0
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.5.6
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: Not Installed
   pycryptodome: 3.4.7
         pygit2: 0.26.2
         Python: 3.6.9 (default, Apr 18 2020, 01:56:04)
   python-gnupg: 0.4.1
         PyYAML: 3.12
          PyZMQ: 17.1.2
          smmap: 2.0.3
        timelib: Not Installed
        Tornado: 4.5.3
            ZMQ: 4.2.5

System Versions:
           dist: ubuntu 18.04 Bionic Beaver
         locale: UTF-8
        machine: x86_64
        release: 4.15.0-101-generic
         system: Linux
        version: Ubuntu 18.04 Bionic Beaver

Minion With issue:
Salt Version:
           Salt: 3001

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 2.7.3
      docker-py: 4.1.0
          gitdb: Not Installed
      gitpython: Not Installed
         Jinja2: 2.10.1
        libgit2: Not Installed
       M2Crypto: Not Installed
           Mako: 1.1.0
   msgpack-pure: Not Installed
 msgpack-python: 0.6.2
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: 3.6.1
         pygit2: Not Installed
         Python: 3.8.2 (default, Apr 27 2020, 15:53:34)
   python-gnupg: 0.4.5
         PyYAML: 5.3.1
          PyZMQ: 18.1.1
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.5.3
            ZMQ: 4.3.2

System Versions:
           dist: ubuntu 20.04 focal
         locale: utf-8
        machine: x86_64
        release: 5.4.0-39-generic
         system: Linux
        version: Ubuntu 20.04 focal


Working Minon:

salt-call --versions-report
Salt Version:
           Salt: 2017.7.4

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 2.6.1
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.10
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: 1.0.7
   msgpack-pure: Not Installed
 msgpack-python: 0.5.6
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 3.6.9 (default, Apr 18 2020, 01:56:04)
   python-gnupg: Not Installed
         PyYAML: 3.12
          PyZMQ: 16.0.2
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.5.3
            ZMQ: 4.2.5

System Versions:
           dist: Ubuntu 18.04 bionic
         locale: UTF-8
        machine: x86_64
        release: 4.15.0-101-generic
         system: Linux
        version: Ubuntu 18.04 bionic

Additional context
Add any other context about the problem here.

Bug Confirmed Duplicate Magnesium ZMQ fixed-pending-your-verification salt-call severity-high

Most helpful comment

@ocfmatt we are working on a long term solution, but since we do not maintain this library it is taking some time. It was one suggestion to try to downgrade, but we realize that isn't really a good idea, either, but looking for a real fix to this, still.

All 31 comments

I've got the same issue under Ubuntu 20.04 as well. If run from the master the state succeed and exit right after like normal. When executed directly on minion it hangs at the end until CTRL + C is used.

```# salt-call --versions-report
Salt Version:
Salt: 3001

Dependency Versions:
cffi: Not Installed
cherrypy: Not Installed
dateutil: 2.7.3
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
Jinja2: 2.10.1
libgit2: Not Installed
M2Crypto: Not Installed
Mako: Not Installed
msgpack-pure: Not Installed
msgpack-python: 0.6.2
mysql-python: Not Installed
pycparser: Not Installed
pycrypto: Not Installed
pycryptodome: 3.6.1
pygit2: Not Installed
Python: 3.8.2 (default, Apr 27 2020, 15:53:34)
python-gnupg: 0.4.5
PyYAML: 5.3.1
PyZMQ: 18.1.1
smmap: Not Installed
timelib: Not Installed
Tornado: 4.5.3
ZMQ: 4.3.2

System Versions:
dist: ubuntu 20.04 focal
locale: utf-8
machine: x86_64
release: 5.4.0-1018-aws
system: Linux
version: Ubuntu 20.04 focal
```

I don't think its a duplicate of #57456 because this issue is when an SLS render to nothing and the executed state from the master will exit but with and error status code and its not that case here.

I'm hesitant on #57356 as well as I keep most of my states without jinja references and filter via grains in the top.sls file.

It also fails if I run just a single state file. For example, my vscode state:

salt-call state.apply vscode

cat vscode/init.sls 
vscode_repo:
  pkgrepo.managed:
    - name: deb [arch=amd64] https://packages.microsoft.com/repos/vscode stable main
    - file: /etc/apt/sources.list.d/vscode.list
    - key_url: https://packages.microsoft.com/keys/microsoft.asc
    - refresh_db: true

vscode:
  pkg.installed:
    - name: code
    - require:
        - pkgrepo: vscode_repo

Output is similar to other fails:

local:
----------
          ID: vscode_repo
    Function: pkgrepo.managed
        Name: deb [arch=amd64] https://packages.microsoft.com/repos/vscode stable main
      Result: True
     Comment: Package repo 'deb [arch=amd64] https://packages.microsoft.com/repos/vscode stable main' already configured
     Started: 13:02:08.425043
    Duration: 107.504 ms
     Changes:   
----------
          ID: vscode
    Function: pkg.installed
        Name: code
      Result: True
     Comment: All specified packages are already installed
     Started: 13:02:09.400992
    Duration: 73.534 ms
     Changes:   

Summary for local
------------
Succeeded: 2
Failed:    0
------------
Total states run:     2
Total run time: 181.038 ms
^C

CTRL-C after waiting a couple of minutes

This is either not a duplicate of #57456, or that issue's title is misleading. As far as I can tell, applying a highstate via salt-call which contains pkg.installed will prevent salt-call from shutting down cleanly. Here is a vagrant file with the minimum steps required to reproduce the bug:

Vagrantfile

Vagrant.configure("2") do |config|
  config.vm.box = "bento/ubuntu-20.04"
  config.vm.provision "shell", inline: <<-SHELL

wget -qO - https://repo.saltstack.com/py3/ubuntu/20.04/amd64/latest/SALTSTACK-GPG-KEY.pub | apt-key add -
echo 'deb http://repo.saltstack.com/py3/ubuntu/20.04/amd64/latest focal main' > /etc/apt/sources.list.d/saltstack.list
echo '127.0.1.1 salt' >> /etc/hosts
apt-get update
apt-get install -y salt-master
echo 'autosign_file: /etc/salt/autosign.conf' > /etc/salt/master
echo '*' > /etc/salt/autosign.conf
chmod 600 /etc/salt/autosign.conf
systemctl restart salt-master
apt-get install -y salt-minion
mkdir /srv/salt
cat << EOT > /srv/salt/top.sls
base:
  "*":
    - bug
EOT
     cat << EOT > /srv/salt/bug.sls
this will block salt-call from exiting:
  pkg.installed:
    - name: cowsay
EOT
salt-call state.apply -l debug

  SHELL
end

Expected behavior:

Running vagrant up should install salt, highstate, install the cowsay package, then exit.

Actual behavior:

Package is installed as expected, but salt-call never exits, causing vagrant up to also hang.

that issue's title is misleading

The title describes the issue in 3000.1 (empty sls / fileserver? hang) that shifted into this one (pkg state? hang) in 3001.

Sorry, I just meant that it doesn't appear to be related to an empty formula. Applying a highstate with an empty formula works just fine in my tests.

@mgdotson thanks for the bug report. I also got this on a 20.04 box as well.

salt-call --local state.apply does not hang!
salt-call state.apply hangs!

I am seeing this on fedora 32 as well.

Reproduced in 20.04 Docker container. Working on that.

Got a lot of "Message timed out" on 3000.3 & 3001.
using --async help.

@jpmenil I don't think it could be related to this one. Please try to find a similar issue or create a new one with detailed description.

Confirmed this is related to pkg.installed, and also confirmed if you use --local against the same set of states it doesn't hang. Running 20.04 with current pkgs as of today.

Anyone running anything other than python3-apt 2.0.0?

Edit: Note that highstate/states executed via the salt-minion schedule do not appear affected, this is only salt-call when --local is not used.

All the procs left are ZMQ showing an epoll_wait on an fd. So likely related to a pipe or local socket used by ZMQ.

---/usr/bin/python(146114)-+-{ZMQbg/IO/0}(146194)
                           |-{ZMQbg/IO/0}(146425)
                           |-{ZMQbg/Reaper}(146193)
                           `-{ZMQbg/Reaper}(146424)

epoll_wait(14,
epoll_wait(35,
epoll_wait(12,
epoll_wait(33,

# ls -l /proc/147046/fd/14 /proc/147279/fd/35 /proc/147045/fd/12 /proc/147278/fd/33
lrwx------ 1 root root 64 Jul 20 22:34 /proc/147046/fd/14 -> 'anon_inode:[eventpoll]'
lrwx------ 1 root root 64 Jul 20 22:34 /proc/147279/fd/35 -> 'anon_inode:[eventpoll]'
lrwx------ 1 root root 64 Jul 20 22:35 /proc/147045/fd/12 -> 'anon_inode:[eventpoll]'
lrwx------ 1 root root 64 Jul 20 22:35 /proc/147278/fd/33 -> 'anon_inode:[eventpoll]'
ii  libzmq5:amd64           4.3.2-2ubuntu1
ii  python3-zmq             18.1.1-3

@JaseFace yep and I'm working in this direction. Thank you for analyze.

Does transport: tcp help? The issue seems to be related to ZMQ: https://github.com/saltstack/salt/issues/57456#issuecomment-663153437

yes, returns instantly with transport set to tcp.

we did not get this fixed for the point release, other priorities came up, but still in progress and will go for Magnesium

this means salt isn't usable for ubuntu 20.04? As you cannot run mixed clients tcp and not tcp you cannot even upgrade your infra

@disaster123 salt is still mostly usable and I've personally started upgrading infrastructure. It appears the bug only affects CLI use of salt-call. Both salt on the master and scheduled states on the minion still work, so when salt does hang, you can just Ctrl-C once it's finished to exit.

That being said, the bug is quite annoying, and there are some legitimately broken workflows, such as scripted use of salt-call. It's disappointing that we didn't get a fix for the 3001 version release.

Very disappointing...!

@sagetherage how to solve this?

With transport: tcp salt-call itself is working BUT now i'm running into the next issue

    [ERROR   ] Exception parsing response
    Traceback (most recent call last):
      File "/usr/lib/python3/dist-packages/salt/transport/tcp.py", line 1214, in _stream_return
        for framed_msg in unpacker:
      File "msgpack/_unpacker.pyx", line 562, in msgpack._cmsgpack.Unpacker.__next__
      File "msgpack/_unpacker.pyx", line 493, in msgpack._cmsgpack.Unpacker._unpack
    ValueError: 1048720 exceeds max_str_len(1048576)

it seems unpossible to run salt on ubuntu currently. tcp isn't working either.

I tried to work around this by downgrading zmq + pyzmq but it still does not work see: https://github.com/saltstack/salt/issues/57456#issuecomment-672827943

Was anybody able to fix this? Even downgrading zmq did not work for me.

@disaster123 both @cmcmarrow and @krionbsd are working on these issues that are not exactly the same but seem related or similar. Not sure they have a fix for all, but I will let them answer on specifics.

Hi, I am having the same behaviour on CentOS 8.2 where every state hangs at closing MQ. States run perfectly fine in a CentOS 7 based environment with an older version of PyZMQ. I am on Salt 2019.2.5 and gather between this bug report and #57456 the fixes are reduce security or downgrade PyZMQ to 18.0.0 - are these the only options until an official fix is released?

[root@controller ~]# salt-call --versions-report
Salt Version:
           Salt: 2019.2.5

Dependency Versions:
           cffi: 1.11.5
       cherrypy: Not Installed
       dateutil: 2.6.1
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.10.1
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: 0.35.2
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.6.2
   mysql-python: Not Installed
      pycparser: 2.14
       pycrypto: Not Installed
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 3.6.8 (default, Apr 16 2020, 01:36:27)
   python-gnupg: Not Installed
         PyYAML: 3.12
          PyZMQ: 19.0.0
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.5.2
            ZMQ: 4.3.2

System Versions:
           dist: centos 8.2.2004 Core
         locale: UTF-8
        machine: x86_64
        release: 4.18.0-193.14.2.el8_2.x86_64
         system: Linux
        version: CentOS Linux 8.2.2004 Core
local:
  Name: vm-pkgs - Function: pkg.latest - Result: Clean Started: - 13:08:17.543186 Duration: 1730.06 ms
  Name: mgmt-pkgs - Function: pkg.latest - Result: Clean Started: - 13:08:19.273469 Duration: 392.715 ms

Summary for local
------------
Succeeded: 2
Failed:    0
------------
Total states run:     2
Total run time:   2.123 s
[DEBUG   ] Closing AsyncZeroMQReqChannel instance

Regards,
Matt.

@ocfmatt we are working on a long term solution, but since we do not maintain this library it is taking some time. It was one suggestion to try to downgrade, but we realize that isn't really a good idea, either, but looking for a real fix to this, still.

Sorry for the lack of communication. The zmq hang should be merged soon. The reason why salt-call hanged was because we left the clean up to garbage collector on exit. The zmq library sometimes closes its objects out of order when left to its own. But if closed by salt this can be worked around. I tried to patch this bug for zmq but I could not figure out a good way to do it. So we just put in a patch for salt.

I would be grateful if you guys check if my fix zmq hang branch stops your environment from hanging. I would hate to see you guys wait for another release because I missed something.

Hey @cmcmarrow, thanks for this patch. I've tested it on ~70 machines that exhibited the ZMQ hang, and it has resolved the problem. I'll be leaving them with the patch to burn in further, but it's pretty straightforward, I doubt anything will come up.

Hi @cmcmarrow, will these commits work in 2019 varients or are they just for 300X versions?

Was this page helpful?
0 / 5 - 0 ratings