Salt: [BUG] Empty formula makes `salt-call` hang forever.

Created on 25 May 2020 · 48 Comments · Source: saltstack/salt

Description

An empty formula makes salt-call hang forever.

Setup

We have a simple formula that adds some custom states during the highstate, based on the minion's FQDN:
custom_minion_state/init.sls:

{%- set include_dir = grains['fqdn'] | replace('.', '_') %}
{#- test: slspath + '/' + include_dir #}
{%- if salt['cp.cache_dir']("salt://{0}/{1}".format(slspath, include_dir)) %}
include:
  - .{{ include_dir }}
{%- endif %}

In 99% of cases this renders to an empty SLS file, as no customization is necessary.

Steps to Reproduce the behavior

Applying that formula to a host which doesn't require any customization produces:

# salt-call state.apply custom_minion_state
local:

Summary for local
-----------
Succeeded: 0
Failed:   0
-----------
Total states run:    0
Total run time:  0.000 ms

With Salt 2018.3.3, salt-call immediately returns to the shell. But with Salt 3000.3 it hangs after this output forever(?). Well, long enough to lose patience and kill it with Ctrl-C.

Expected behavior
Application of the empty state should exit immediately after the state has been evaluated and applied.

Versions Report

salt --versions-report
Salt Version:
Salt: 3000.3

Dependency Versions:
cffi: 1.11.5
cherrypy: Not Installed
dateutil: 2.6.1
docker-py: 4.2.0
gitdb: Not Installed
gitpython: Not Installed
Jinja2: 2.10.1
libgit2: Not Installed
M2Crypto: 0.35.2
Mako: Not Installed
msgpack-pure: Not Installed
msgpack-python: 0.6.2
mysql-python: Not Installed
pycparser: 2.14
pycrypto: Not Installed
pycryptodome: 3.9.7
pygit2: Not Installed
Python: 3.6.8 (default, Nov 21 2019, 19:31:34)
python-gnupg: Not Installed
PyYAML: 3.12
PyZMQ: 19.0.0
smmap: Not Installed
timelib: Not Installed
Tornado: 4.5.3
ZMQ: 4.3.2

System Versions:
dist: centos 8.1.1911 Core
locale: UTF-8
machine: x86_64
release: 4.18.0-147.8.1.el8_1.x86_64
system: Linux
version: CentOS Linux 8.1.1911 Core

Additional context
Running the same command with -l debug gives a bit more information:

...
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'prometheus', 'tcp://[2001::13b7]:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'prometheus', 'tcp://[2001::13b7]:4506')
[DEBUG   ] Connecting the Minion to the Master URI (for the return server): tcp://[2001::13b7]:4506
[DEBUG   ] Trying to connect to: tcp://[2001::13b7]:4506
[DEBUG   ] Gathering pillar data for state run
[DEBUG   ] Finished gathering pillar data for state run
[INFO    ] Loading fresh modules for state activity
[DEBUG   ] LazyLoaded jinja.render
[DEBUG   ] LazyLoaded yaml.render
[DEBUG   ] Could not find file 'salt://custom_minion_state.sls' in saltenv 'base'
[DEBUG   ] In saltenv 'base', looking at rel_path 'custom_minion_state/init.sls' to resolve 'salt://custom_minion_state/init.sls'
[DEBUG   ] In saltenv 'base', ** considering ** path '/var/cache/salt/minion/files/base/custom_minion_state/init.sls' to resolve 'salt://custom_minion_state/init.sls'
[DEBUG   ] compile template: /var/cache/salt/minion/files/base/custom_minion_state/init.sls
[DEBUG   ] Jinja search path: ['/var/cache/salt/minion/files/base']
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'prometheus', 'tcp://[2001::13b7]:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'prometheus', 'tcp://[2001::13b7]:4506')
[DEBUG   ] Connecting the Minion to the Master URI (for the return server): tcp://[2001::13b7]:4506
[DEBUG   ] Trying to connect to: tcp://[2001::13b7]:4506
[DEBUG   ] LazyLoaded cp.cache_dir
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'prometheus', 'tcp://[2001::13b7]:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'prometheus', 'tcp://[2001::13b7]:4506')
[DEBUG   ] Connecting the Minion to the Master URI (for the return server): tcp://[2001::13b7]:4506
[DEBUG   ] Trying to connect to: tcp://[2001:67c:2e8:11::c10:13b7]:4506
[INFO    ] Caching directory 'custom_minion_state/prometheus/' for environment 'base'
[PROFILE ] Time (in seconds) to render '/var/cache/salt/minion/files/base/custom_minion_state/init.sls' using 'jinja' renderer: 0.04874920845031738
[DEBUG   ] Rendered data from file: /var/cache/salt/minion/files/base/custom_minion_state/init.sls:
# vim: sts=2 ts=2 sw=2 et ai ft=sls
#
# This state is loaded from top.sls
#

[DEBUG   ] Results of YAML rendering:
{}
[PROFILE ] Time (in seconds) to render '/var/cache/salt/minion/files/base/custom_minion_state/init.sls' using 'yaml' renderer: 0.00016641616821289062
[DEBUG   ] File /var/cache/salt/minion/accumulator/140045102329928 does not exist, no need to cleanup
[DEBUG   ] LazyLoaded state.check_result
[DEBUG   ] Closing AsyncZeroMQReqChannel instance
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'prometheus', 'tcp://[2001::13b7]:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'prometheus', 'tcp://[2001::13b7]:4506')
[DEBUG   ] Connecting the Minion to the Master URI (for the return server): tcp://[2001::13b7]:4506
[DEBUG   ] Trying to connect to: tcp://[2001::13b7]:4506
[DEBUG   ] Closing AsyncZeroMQReqChannel instance
[DEBUG   ] LazyLoaded highstate.output
local:

Summary for local
-----------
Succeeded: 0
Failed:   0
-----------
Total states run:    0
Total run time:  0.000 ms
[DEBUG   ] Closing AsyncZeroMQReqChannel instance

Running through strace shows:

getpid()                                = 100123
getpid()                                = 100123
write(1, "\33[0;32mlocal:\33[0;0m\n\33[0;36m\nSumm"..., 201) = 201
write(1, "\n", 1)                       = 1
getpid()                                = 100123
epoll_ctl(20, EPOLL_CTL_DEL, 28, 0x7ffda432bbcc) = 0
getpid()                                = 100123
getpid()                                = 100123
write(24, "\1\0\0\0\0\0\0\0", 8)        = 8
getpid()                                = 100123
getpid()                                = 100123
getpid()                                = 100123
poll([{fd=23, events=POLLIN}], 1, -1)   = 1 ([{fd=23, revents=POLLIN}])
getpid()                                = 100123
read(23, "\1\0\0\0\0\0\0\0", 8)         = 8
getpid()                                = 100123
write(26, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x7f89908ed9d0, FUTEX_WAIT, 100201, NULL) = 0
close(27)                               = 0
close(26)                               = 0
close(25)                               = 0
close(24)                               = 0
close(23)                               = 0
epoll_ctl(20, EPOLL_CTL_DEL, 21, 0x7ffda432c0dc) = 0
close(21)                               = 0
close(22)                               = 0
close(20)                               = 0
close(5)                                = 0
rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f89a266fdc0}, {sa_handler=0x7f89a296d610, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f89a266fdc0}, 8) = 0
munmap(0x7f8992c15000, 262144)          = 0
munmap(0x7f8992c55000, 262144)          = 0
close(8)                                = 0
close(9)                                = 0
close(31)                               = 0
close(32)                               = 0
close(7)                                = 0
munmap(0x7f8993522000, 262144)          = 0
munmap(0x7f89934e2000, 262144)          = 0
close(30)                               = 0
getpid()                                = 100123
getpid()                                = 100123
getpid()                                = 100123
write(38, "\1\0\0\0\0\0\0\0", 8)        = 8
getpid()                                = 100123
poll([{fd=33, events=POLLIN}], 1, -1)   = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---

Applying the same state from the master completes immediately, but with a note that the state completed with an ERROR.

master# salt 'prometheus*' state.apply custom_minion_state -l debug
[DEBUG   ] Missing configuration file: /root/.saltrc
[DEBUG   ] MasterEvent PUB socket URI: /var/run/salt/master/master_event_pub.ipc
[DEBUG   ] MasterEvent PULL socket URI: /var/run/salt/master/master_event_pull.ipc
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/master', 'master', 'tcp://[::]:4506', 'clear')
[DEBUG   ] Connecting the Minion to the Master URI (for the return server): tcp://[::]:4506
[DEBUG   ] Trying to connect to: tcp://[::]:4506
[DEBUG   ] Closing AsyncZeroMQReqChannel instance
[DEBUG   ] LazyLoaded local_cache.get_load
[DEBUG   ] Reading minion list from /var/cache/salt/master/jobs/02/a26ac9e846136c8148eb8e74c4e1507421289c141e915aba2f429e146a8134/.minions.p
[DEBUG   ] get_iter_returns for jid 20200525163928599898 sent to {'prometheus'} will timeout at 18:39:33.605811
[DEBUG   ] jid 20200525163928599898 return from prometheus
[DEBUG   ] return event: {'prometheus': {'ret': {}, 'out': 'highstate', 'retcode': 2, 'jid': '20200525163928599898'}}
[DEBUG   ] LazyLoaded highstate.output
prometheus:

Summary for prometheus
-----------
Succeeded: 0
Failed:   0
-----------
Total states run:    0
Total run time:  0.000 ms
[DEBUG   ] jid 20200525163928599898 found all minions {'prometheus'}
[DEBUG   ] Closing IPCMessageSubscriber instance
ERROR: Minions returned with non-zero exit code
Labels: Magnesium, Regression, State Compiler, Upstream Bug, ZMQ, salt-call, severity-high

Most helpful comment

@waynew @max-arnold Works for me after downgrading pyzmq to 18.0.1; it starts hanging with pyzmq>=18.1.0.

All 48 comments

I ran into this as well. Indeed, adding any bit of random data to the SLS outside of the if statement, so that it is never empty, makes it work.

Apologies! I can't seem to get the labels correct today; my mistake in removing P3. I merely wanted to add the severity with the definition that includes "hanging", which ATM is Critical.

I am seeing the same thing on Ubuntu 20.04. This also smells like #56332.

It seems anything that touches salt:// hangs.

Hangs, where testfile is an empty file:

 test_file:
   file.managed:
     - name: /tmp/testfile
     - source: salt://test/files/testfile

Doesn't hang:

volume_dir:
  file.directory:
    - name: /opt
    - user: root
    - group: root
    - mode: 0755

Doesn't hang:

 test_file:
   file.managed:
     - name: /tmp/testfile
     - contents: ''

Doesn't hang:

{%- set include_dir = grains['fqdn'] | replace('.', '_') %}

Hangs:

{%- set include_dir = grains['fqdn'] | replace('.', '_') %}
{#- test: slspath + '/' + include_dir #}
{%- if salt['cp.cache_dir']("salt://{0}/{1}".format(slspath, include_dir)) %}
include:
  - .{{ include_dir }}
{%- endif %}

Hangs:

{%- if salt['cp.cache_dir']("salt://") %}
{%- endif %}

Doesn't hang:

{%- if '' == '' %}
{%- endif %}

Hangs and salt://testtest doesn't exist:

{%- if salt['cp.cache_dir']("salt://testtest") %}
{%- endif %}

Hangs and salt://saltstack does exist:

{%- if salt['cp.cache_dir']("salt://saltstack") %}
{%- endif %}

Salt Version:
           Salt: 3000.3

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 2.7.3
      docker-py: 4.1.0
          gitdb: Not Installed
      gitpython: Not Installed
         Jinja2: 2.10.1
        libgit2: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.6.2
   mysql-python: 1.4.4
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: 3.6.1
         pygit2: Not Installed
         Python: 3.8.2 (default, Apr 27 2020, 15:53:34)
   python-gnupg: 0.4.5
         PyYAML: 5.3.1
          PyZMQ: 18.1.1
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.5.3
            ZMQ: 4.3.2

System Versions:
           dist: ubuntu 20.04 focal
         locale: utf-8
        machine: x86_64
        release: 5.4.0-1009-aws
         system: Linux
        version: Ubuntu 20.04 focal

I thought this might relate to gitfs, but a simple git clone with file_roots does the same thing. I also tried stripping it down. No pillars, no custom grains, a super basic top.sls, and an example state from above. Even the most basic of setups hangs with 3000.3, py3, and salt-call.

Ubuntu 18.04, salt 3000.3 - no hang.
Ubuntu 20.04 salt 3001rc1 - hangs.

It's only state.apply that does this; other functions I've used have been fine.

Ubuntu 18.04, salt 3000.3 - no hang.
Ubuntu 20.04 salt 3001rc1 - hangs.

Can you provide, at minimum, verbose versions output for both platforms? I feel that a PY2 vs PY3 difference is what makes the difference here.

I thought this might relate to gitfs, but a simple git clone with file_roots does the same thing. I also tried stripping it down. No pillars, no custom grains, a super basic top.sls, and an example state from above. Even the most basic of setups hangs with 3000.3, py3, and salt-call.

I feel that it's somewhere at a lower level, in the way the event loop is handled.

It looks like the final poll() call doesn't have any timeout set and only waits for activity on the fds. As there is nothing to send back due to the empty state, it hangs forever.

Indeed:
int poll(struct pollfd *fds, nfds_t nfds, int timeout);

poll([{fd=33, events=POLLIN}], 1, -1) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)

The timeout is set to -1, which means infinite, which is what we observe in this case.

The next question is where that poll() happens in the Python code...
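
As an aside, here is a minimal Python sketch (illustrative only, not Salt code) of what a -1 timeout means for poll():

import os
import select

# Toy pipe that nothing will ever write to.
r, w = os.pipe()
poller = select.poll()
poller.register(r, select.POLLIN)

# A finite timeout (in milliseconds) returns an empty list once it expires.
print(poller.poll(1000))   # -> [] after roughly one second

# A timeout of -1 means "wait until an fd becomes ready", i.e. forever here --
# the same shape as poll([{fd=33, events=POLLIN}], 1, -1) in the strace above.
# poller.poll(-1)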

PY2 vs PY3 difference

Nope


Works fine

Salt Version:
           Salt: 3000.3

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 2.6.1
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
         Jinja2: 2.10
        libgit2: 0.26.0
       M2Crypto: 0.32.0
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.5.6
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: 0.26.2
         Python: 3.6.9 (default, Apr 18 2020, 01:56:04)
   python-gnupg: 0.4.1
         PyYAML: 3.12
          PyZMQ: 16.0.2
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.5.3
            ZMQ: 4.2.5

System Versions:
           dist: Ubuntu 18.04 bionic
         locale: ISO-8859-1
        machine: x86_64
        release: 4.15.0-101-generic
         system: Linux
        version: Ubuntu 18.04 bionic



Hangs

Salt Version:
           Salt: 3001rc1

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 2.7.3
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
         Jinja2: 2.10.1
        libgit2: 0.28.3
       M2Crypto: 0.35.2
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.6.2
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: Not Installed
   pycryptodome: 3.6.1
         pygit2: 1.0.3
         Python: 3.8.2 (default, Apr 27 2020, 15:53:34)
   python-gnupg: 0.4.5
         PyYAML: 5.3.1
          PyZMQ: 18.1.1
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.5.3
            ZMQ: 4.3.2

System Versions:
           dist: ubuntu 20.04 focal
         locale: iso8859-1
        machine: x86_64
        release: 5.4.0-33-generic
         system: Linux
        version: Ubuntu 20.04 focal

In case it's relevant, with debug logging the end is

Total states run:     140
Total run time:   107.656 s
[DEBUG] Closing AsyncZeroMQReqChannel instance

Is this still a thing?

I tried to reproduce with code from 2020-06-09 git and it looks OK:

root@marvin:/srv/salt/empty_formula# cat init.sls 
{%- set include_dir = grains['fqdn'] | replace('.', '_') %}
{#- test: slspath + '/' + include_dir #}
{%- if salt['cp.cache_dir']("salt://{0}/{1}".format(slspath, include_dir)) %}
include:
  - .{{ include_dir }}
{%- endif %}
root@marvin:/srv/salt/empty_formula# salt-call state.apply empty_formula
local:

Summary for local                                                                                                                   
-----------                                                                                                                         
Succeeded: 0
Failed:   0
-----------
Total states run:    0                                                                                                              
Total run time:  0.000 ms

The versions being used are:

salt --versions-report
Salt Version:
           Salt: 3001

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: Not Installed
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
         Jinja2: 2.11.2
        libgit2: Not Installed
       M2Crypto: 0.35.2
           Mako: 1.1.3
   msgpack-pure: Not Installed
 msgpack-python: 0.5.6
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: 3.9.7
         pygit2: Not Installed
         Python: 3.8.3 (default, May 15 2020, 05:51:00)
   python-gnupg: Not Installed
         PyYAML: 3.13
          PyZMQ: 18.1.1
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.5.3
            ZMQ: 4.3.2

System Versions:
           dist: slackware 14.2 current
         locale: utf-8
        machine: i686
        release: 5.4.45
         system: Linux
        version: Slackware 14.2 current

I can't reproduce it on current master, @b-a-t please confirm

Testing 3001 release on Ubuntu 20.04.
It still hangs after every state.apply.

I've tried this a couple of times and haven't been able to repro. I didn't try on Ubuntu 20.04, but I'll give that a shot tomorrow and see if I can repro.

Tested with current master on a freshly installed and updated Ubuntu 20.04 and can't reproduce this issue:

# salt-call state.apply lala
local:

Summary for local                                                                                                                   
-----------                                                                                                                         
Succeeded: 0
Failed:   0
-----------
Total states run:    0
Total run time:  0.000 ms

The lala/init.sls state:

# cat lala/init.sls
{%- set include_dir = grains['fqdn'] | replace('.', '_') %}
{#- test: slspath + '/' + include_dir #}
{%- if salt['cp.cache_dir']("salt://{0}/{1}".format(slspath, include_dir)) %}
include:
  - .{{ include_dir }}
{%- endif %}

salt --versions-report

Salt Version:
           Salt: 3001

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: Not Installed
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
         Jinja2: 2.11.2
        libgit2: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.6.2
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: 3.9.7
         pygit2: Not Installed
         Python: 3.8.2 (default, Apr 27 2020, 15:53:34)
   python-gnupg: Not Installed
         PyYAML: 5.3.1
          PyZMQ: 19.0.1
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.5.3
            ZMQ: 4.3.2

System Versions:
           dist: ubuntu 20.04 focal
         locale: utf-8
        machine: x86_64
        release: 5.4.0-26-generic
         system: Linux
        version: Ubuntu 20.04 focal

I installed most Python modules using PIP and not the Ubuntu packaged ones.

Same here - trying with Ubuntu 20.04 Docker container and everything works. I cannot reproduce this on 3001.

For anyone who was able to reproduce this, was this under Docker? Can you repro under docker? Or vagrant?

I'm simply not able to reproduce this - I'm not sure if I'm just missing some setting in master/minion config, mine are very sparse, just enough to connect. I added gitfs to see if that would let me repro, but still nothing.

Hmm, I've installed 3001 on 8 separate 20.04 server VMs (kvm) and they all do it.
All have the same config, but the highstate is complex and differs between them in various ways.
Shell is bash, connected via SSH.

Master config

state_top_saltenv: base
top_file_merging_strategy: same

fileserver_backend:
  - gitfs
gitfs_privkey: /etc/salt/master.d/salt-git-rsa
gitfs_pubkey: /etc/salt/master.d/salt-git-rsa.pub
gitfs_refspecs:
  - '+refs/heads/*:refs/remotes/origin/*'
  - '+refs/tags/*:refs/tags/*'
  - '+refs/pull/*/head:refs/remotes/origin/pr/*'
  - '+refs/pull/*/merge:refs/remotes/origin/merge/*'
gitfs_remotes:
  - https://github.com/saltstack-formulas/apt-formula:
    - all_saltenvs: v0.10.1
  - https://github.com/saltstack-formulas/elasticsearch-formula:
    - all_saltenvs: f70e9da
  - https://github.com/saltstack-formulas/kibana-formula:
    - all_saltenvs: merge/22
  - https://github.com/saltstack-formulas/letsencrypt-formula:
    - all_saltenvs: 2081658
  - https://github.com/saltstack-formulas/logstash-formula:
    - all_saltenvs: bab7586
  - https://github.com/saltstack-formulas/openssh-formula:
    - all_saltenvs: v0.43.3
  - https://github.com/saltstack-formulas/ufw-formula:
    - all_saltenvs: v0.5.4
  - https://github.com/saltstack-formulas/zabbix-formula:
    - all_saltenvs: v1.0.2
  - [email protected]:example/saltstack.git:
    - root: salt

pillarenv_from_saltenv: true
git_pillar_privkey: /etc/salt/master.d/salt-git-rsa
git_pillar_pubkey: /etc/salt/master.d/salt-git-rsa.pub
git_pillar_update_interval: 300
ext_pillar:
  - git:
    - master [email protected]:example/saltstack.git:
      - root: pillar
      - env: base
    - __env__ [email protected]:example/saltstack.git:
      - root: pillar

state_output: mixed_id
state_verbose: false

log_fmt_console: '[%(levelname)s] %(message)s %(jid)s'
log_fmt_logfile: '%(asctime)s,%(msecs)03d [%(name)-17s][%(levelname)-8s] %(message)s %(jid)s'
logstash_udp_handler:
  host: log-server
  port: 10514
  version: 1
  msg_type: saltstack

peer:
  .*:
    - x509.sign_remote_certificate

Minion config

saltenv: base
pillarenv_from_saltenv: true

state_output: mixed_id
state_verbose: false

log_fmt_console: '[%(levelname)s] %(message)s %(jid)s'
log_fmt_logfile: '%(asctime)s,%(msecs)03d [%(name)-17s][%(levelname)-8s] %(message)s %(jid)s'
logstash_udp_handler:
  host: log-server
  port: 10514
  version: 1
  msg_type: saltstack

enable_legacy_startup_events: false
grains_cache: true
use_superseded:
  - module.run

If I have time and luck and find a minimal reproduction I'll share it.

I tested the hang on Ubuntu 20.04 with Salt 3001, and I don't see it anymore on state.apply calls that touch salt://. @OrangeDog I suspect you have a different bug.

It certainly looks that way.

I'm seeing this kind of hang as well on a full run on the master with salt-call state.apply test=True. This is on a fresh 20.04 install:
Salt Version:
Salt: 3001

Dependency Versions:
cffi: Not Installed
cherrypy: Not Installed
dateutil: 2.7.3
docker-py: Not Installed
gitdb: 2.0.6
gitpython: 3.0.7
Jinja2: 2.10.1
libgit2: Not Installed
M2Crypto: Not Installed
Mako: Not Installed
msgpack-pure: Not Installed
msgpack-python: 0.6.2
mysql-python: Not Installed
pycparser: Not Installed
pycrypto: Not Installed
pycryptodome: 3.6.1
pygit2: Not Installed
Python: 3.8.2 (default, Apr 27 2020, 15:53:34)
python-gnupg: 0.4.5
PyYAML: 5.3.1
PyZMQ: 18.1.1
smmap: 2.0.5
timelib: Not Installed
Tornado: 4.5.3
ZMQ: 4.3.2

System Versions:
dist: ubuntu 20.04 focal
locale: utf-8
machine: x86_64
release: 5.4.0-37-generic
system: Linux
version: Ubuntu 20.04 focal


Actually I've been working to narrow down the issue and I can run some state.sls_id and state.apply calls on the master without hanging. But one minimal example that does hang is a state in my salt-minion formula:

salt_repo_required_packages:
  pkg.installed:
    - pkgs:
      - python-tornado
      - python-pycurl
      - python3-tornado
      - python3-pycurl
      - python3-pip

where output from salt-call state.sls_id salt_repo_required_packages common.ubuntu.salt-minion test=True shows the same zmq error:

[DEBUG ] Closing AsyncZeroMQReqChannel instance

Also got hit by this on Ubuntu 20.04 with the following state (applied through salt-call state.apply mystate test=true):

Build-Depends packages:
  pkg.installed:
    - names:
        - debhelper
        - cpio
        - libassuan-dev
        - libgpgme-dev
        - libseccomp-dev
        - libsystemd-dev
        - libbtrfs-dev
        - btrfs-progs
        - dh-golang
        - golang
        - libglib2.0-dev
        - go-md2man
        - git
        - libapparmor-dev
        - libdevmapper-dev

Doesn't happen on 18.04

Further testing shows this only seems to happen with pkg states when applied from within SLS files, i.e. salt-call pkg.install foo doesn't hang while an equivalent SLS state does. Also, this only appears to happen when run against a master; using --local doesn't hang.

salt-call state.low '{"state":"pkg","fun":"installed","name":"foo","__id__":"foo"}' doesn't hang either.

Not sure I'm having the exact same issue, but I'm experiencing something very similar on Debian 10 with 3001.

Salt Version:
           Salt: 3001

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 2.8.1
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
         Jinja2: 2.11.2
        libgit2: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.6.2
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: Not Installed
   pycryptodome: 3.9.7
         pygit2: Not Installed
         Python: 3.8.3 (default, May 14 2020, 11:03:12)
   python-gnupg: Not Installed
         PyYAML: 5.3.1
          PyZMQ: 18.1.1
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.5.3
            ZMQ: 4.3.2

System Versions:
           dist: debian testing bullseye
         locale: utf-8
        machine: x86_64
        release: 5.7.0-1-amd64
         system: Linux
        version: Debian GNU/Linux testing bullseye

Running highstate -l debug on the minion freezes on

[DEBUG   ] Closing AsyncZeroMQReqChannel instance

strace'd:

poll([{fd=30, events=POLLIN}], 1, -1

Looks like we didn't get this fixed yet, so moving it to commit-to-fix in the Magnesium release.

IMO, the issue is pretty serious. Could it be considered for 3001.2? Waiting till 3002 (October?) is too much...

Did a couple of tests with different python versions (using pyenv and pip-installed Salt 3001) on Ubuntu 20.04:

  • 3.6.11 - ok
  • 3.7.8 - ok
  • 3.8.0, 3.8.2 (default on 20.04), 3.8.4 - hangs!
  • 3.9.0b4 - fails with AttributeError: _scope_id inside of ipaddress.py

When I place the ipdb breakpoint inside of salt-call:

if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw?|\.exe)?$', '', sys.argv[0])
    import ipdb; ipdb.set_trace()
    sys.exit(
        load_entry_point('salt==3001', 'console_scripts', 'salt-call')()
    )

and step over the load_entry_point with n a couple of times, it hangs here:

/usr/lib/python3.8/threading.py:_shutdown()

    while True:
        with _shutdown_locks_lock:
            locks = list(_shutdown_locks)   # <---
            _shutdown_locks.clear()

but it is a debugger-induced problem and it hangs somewhere else without it. The interesting thing is that print(len(_shutdown_locks)) keeps growing (to ~1000) while the program is stopped in the debugger session.
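
For reference, a small snippet that can be run from such a breakpoint to watch that set; _shutdown_locks is a CPython implementation detail (present in the 3.8 interpreters where the hang occurs), not a public API:

import threading

# threading._shutdown() (quoted above) joins every non-daemon thread at
# interpreter exit; _shutdown_locks holds the tstate lock of each such thread
# that has started but has not yet been joined, so a steadily growing set
# means threads keep being spawned and are never joined.
print(len(threading._shutdown_locks))
for lock in list(threading._shutdown_locks):
    print(lock)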

The hanging issue could be related to https://bugs.python.org/issue6721 (a pure guess).

Just found this one while digging into a similar issue (Ubuntu 20.04, salt 3001). Also checked https://github.com/saltstack/salt/issues/53319 and https://github.com/saltstack/salt/issues/57954, but this one seems to be the most relevant.

My state is pretty short:

unbound package:
  pkg.latest:
    - name: unbound

unbound service:
  service.running:
    - name: unbound
    - enable: True
    - full_restart: True
    - reload_grains: True
  require:
    - pkg: unbound package

Minion output:

[INFO    ] Loading fresh modules for state activity
[INFO    ] Running state [unbound] at time 16:25:02.058393
[INFO    ] Executing state pkg.latest for [unbound]
[INFO    ] Executing command ['dpkg-query', '--showformat', '${Status} ${Package} ${Version} ${Architecture}', '-W'] in directory '/root'
[INFO    ] Executing command ['apt-get', '-q', 'update'] in directory '/root'
[INFO    ] Executing command ['apt-cache', '-q', 'policy', 'unbound'] in directory '/root'
[INFO    ] Package unbound is already up-to-date
[INFO    ] Completed state [unbound] at time 16:25:04.260411 (duration_in_ms=2202.018)
[INFO    ] Running state [unbound] at time 16:25:04.262321
[INFO    ] Executing state service.running for [unbound]
[INFO    ] Executing command ['systemctl', 'status', 'unbound.service', '-n', '0'] in directory '/root'
[INFO    ] Executing command ['systemctl', 'is-active', 'unbound.service'] in directory '/root'
[INFO    ] Executing command ['systemctl', 'is-enabled', 'unbound.service'] in directory '/root'
[INFO    ] The service unbound is already running
[INFO    ] Completed state [unbound] at time 16:25:04.510928 (duration_in_ms=248.606)
local:

Summary for local
------------
Succeeded: 2
Failed:    0
------------
Total states run:     2
Total run time:   2.451 s
^C

Last line in debug:
2020-07-22 16:25:04,526 [salt.transport.zeromq:285 ][DEBUG   ][168205] Closing AsyncZeroMQReqChannel instance

This is 100% reproducible in my current setup.

Thank you @defanator, this is a critical issue and we attempted to fix it in the 3001.1 point release, but other competing priorities, and some core team members not being able to reproduce it, caused us to delay.

The issue and the fix are moved to be committed in the Magnesium release. The engineer assigned has reproduced it and will provide a fix. We tried here and didn't succeed.

@sagetherage thanks!

@cmcmarrow please let me know if there's anything I could provide in order to help address this one as soon as possible. I'd also be happy to run a few tests once a fix is available. TIA!

It hangs at the garbage collection stage:

(gdb) bt
#0  0x00007ffff7ee596f in __GI___poll (fds=0x7fffffffde60, nfds=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007ffff661756b in ?? () from /lib/x86_64-linux-gnu/libzmq.so.5
#2  0x00007ffff65f22be in ?? () from /lib/x86_64-linux-gnu/libzmq.so.5
#3  0x00007ffff65e015c in ?? () from /lib/x86_64-linux-gnu/libzmq.so.5
#4  0x00007ffff66387ce in zmq_ctx_term () from /lib/x86_64-linux-gnu/libzmq.so.5
#5  0x00007ffff600b36b in ?? () from /usr/lib/python3/dist-packages/zmq/backend/cython/context.cpython-38-x86_64-linux-gnu.so
#6  0x00000000005a59b9 in subtype_dealloc (self=<optimized out>) at ../Objects/typeobject.c:1289
#7  0x00000000005cb374 in _Py_Dealloc (op=<optimized out>) at ../Objects/object.c:2215
#8  _Py_DECREF (filename=<synthetic pointer>, lineno=541, op=<optimized out>) at ../Include/object.h:478
#9  _Py_XDECREF (op=<optimized out>) at ../Include/object.h:541
#10 dict_dealloc (mp=0x7ffff4acf640) at ../Objects/dictobject.c:1990
#11 0x00000000005a1bae in _Py_Dealloc (op=<optimized out>) at ../Objects/object.c:2215
#12 _Py_DECREF (filename=0x87a8e0 "../Objects/typeobject.c", lineno=1140, op=<optimized out>) at ../Include/object.h:478
#13 subtype_clear (self=<AsyncReqMessageClient at remote 0x7ffff4ab6b20>) at ../Objects/typeobject.c:1140
#14 0x00000000004ec8b9 in delete_garbage (state=<optimized out>, state=<optimized out>, old=0x934840 <_PyRuntime+416>, collectable=0x7fffffffe1c0) at ../Modules/gcmodule.c:948
#15 collect.constprop.0 (generation=generation@entry=2, n_collected=n_collected@entry=0x0, n_uncollectable=n_uncollectable@entry=0x0, nofail=nofail@entry=1,
    state=0x9347f8 <_PyRuntime+344>) at ../Modules/gcmodule.c:1123
#16 0x000000000066babb in _PyGC_CollectNoFail () at ../Modules/gcmodule.c:1866
#17 0x000000000067dd60 in PyImport_Cleanup () at ../Python/import.c:598
#18 0x0000000000678450 in Py_FinalizeEx () at ../Python/pylifecycle.c:1240
#19 0x0000000000677138 in Py_Exit (sts=0) at ../Python/pylifecycle.c:2295
#20 0x000000000067716b in handle_system_exit () at ../Python/pythonrun.c:658
#21 0x0000000000677392 in _PyErr_PrintEx (set_sys_last_vars=1, tstate=0x95a270) at ../Python/pythonrun.c:763
#22 PyErr_PrintEx (set_sys_last_vars=1) at ../Python/pythonrun.c:763
#23 0x00000000006775b4 in PyErr_Print () at ../Python/pythonrun.c:769
#24 PyRun_SimpleFileExFlags (fp=<optimized out>, filename=<optimized out>, closeit=<optimized out>, flags=0x7fffffffe3e8) at ../Python/pythonrun.c:434
#25 0x00000000006ae99e in pymain_run_file (cf=0x7fffffffe3e8, config=0x959620) at ../Modules/main.c:381
#26 pymain_run_python (exitcode=0x7fffffffe3e0) at ../Modules/main.c:565
#27 Py_RunMain () at ../Modules/main.c:644
#28 0x00000000006aed29 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at ../Modules/main.c:698
#29 0x00007ffff7df70b3 in __libc_start_main (main=0x4ebd20 <main>, argc=4, argv=0x7fffffffe5c8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>,
    stack_end=0x7fffffffe5b8) at ../csu/libc-start.c:308
#30 0x00000000005f62ee in _start () at ../Objects/unicodeobject.c:5388

Potential references (links are intentionally broken to avoid spamming upstream issues):

https://github.com/zeromq/ pyzmq/issues/1167
https://github.com/zeromq/ pyzmq/issues/1003

Workaround: set transport: tcp in both master and minion config files.

@max-arnold yeah, same here:

Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fefa16cf96f in __GI___poll (fds=fds@entry=0x7fffa1c23160, nfds=nfds@entry=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
29  ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
(gdb) bt
#0  0x00007fefa16cf96f in __GI___poll (fds=fds@entry=0x7fffa1c23160, nfds=nfds@entry=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007fefa004a56b in poll (__timeout=<optimized out>, __nfds=1, __fds=0x7fffa1c23160) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
#2  zmq::signaler_t::wait (this=this@entry=0x2534658, timeout_=<optimized out>) at src/signaler.cpp:242
#3  0x00007fefa00252be in zmq::mailbox_t::recv (this=this@entry=0x25345f0, cmd_=cmd_@entry=0x7fffa1c23240, timeout_=timeout_@entry=-1) at src/mailbox.cpp:81
#4  0x00007fefa001315c in zmq::ctx_t::terminate (this=this@entry=0x25344c0) at src/ctx.cpp:181
#5  0x00007fefa006b7ce in zmq_ctx_term (ctx_=0x25344c0) at src/zmq.cpp:156
#6  0x00007fef9fa3e36b in ?? () from /usr/lib/python3/dist-packages/zmq/backend/cython/context.cpython-38-x86_64-linux-gnu.so
#7  0x00000000005a59b9 in subtype_dealloc (self=<optimized out>) at ../Objects/typeobject.c:1289
#8  0x00000000005cb374 in _Py_Dealloc (op=<optimized out>) at ../Objects/object.c:2215
#9  _Py_DECREF (filename=<synthetic pointer>, lineno=541, op=<optimized out>) at ../Include/object.h:478
#10 _Py_XDECREF (op=<optimized out>) at ../Include/object.h:541
#11 dict_dealloc (mp=0x7fef9c4e9c00) at ../Objects/dictobject.c:1990
#12 0x00000000005a1bae in _Py_Dealloc (op=<optimized out>) at ../Objects/object.c:2215
#13 _Py_DECREF (filename=0x87a8e0 "../Objects/typeobject.c", lineno=1140, op=<optimized out>) at ../Include/object.h:478
#14 subtype_clear (self=<AsyncReqMessageClient at remote 0x7fef9c4d1880>) at ../Objects/typeobject.c:1140
#15 0x00000000004ec8b9 in delete_garbage (state=<optimized out>, state=<optimized out>, old=0x934840 <_PyRuntime+416>, collectable=0x7fffa1c234d0) at ../Modules/gcmodule.c:948
#16 collect.constprop.0 (generation=generation@entry=2, n_collected=n_collected@entry=0x0, n_uncollectable=n_uncollectable@entry=0x0, nofail=nofail@entry=1, state=0x9347f8 <_PyRuntime+344>)
    at ../Modules/gcmodule.c:1123
#17 0x000000000066babb in _PyGC_CollectNoFail () at ../Modules/gcmodule.c:1866
#18 0x000000000067dd60 in PyImport_Cleanup () at ../Python/import.c:598
#19 0x0000000000678450 in Py_FinalizeEx () at ../Python/pylifecycle.c:1240
#20 0x0000000000677138 in Py_Exit (sts=0) at ../Python/pylifecycle.c:2295
#21 0x000000000067716b in handle_system_exit () at ../Python/pythonrun.c:658
#22 0x0000000000677392 in _PyErr_PrintEx (set_sys_last_vars=1, tstate=0x1a77200) at ../Python/pythonrun.c:763
#23 PyErr_PrintEx (set_sys_last_vars=1) at ../Python/pythonrun.c:763
#24 0x00000000006775b4 in PyErr_Print () at ../Python/pythonrun.c:769
#25 PyRun_SimpleFileExFlags (fp=<optimized out>, filename=<optimized out>, closeit=<optimized out>, flags=0x7fffa1c236f8) at ../Python/pythonrun.c:434
#26 0x00000000006ae99e in pymain_run_file (cf=0x7fffa1c236f8, config=0x1a765b0) at ../Modules/main.c:381
#27 pymain_run_python (exitcode=0x7fffa1c236f0) at ../Modules/main.c:565
#28 Py_RunMain () at ../Modules/main.c:644
#29 0x00000000006aed29 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at ../Modules/main.c:698
#30 0x00007fefa15e10b3 in __libc_start_main (main=0x4ebd20 <main>, argc=4, argv=0x7fffa1c238d8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, 
    stack_end=0x7fffa1c238c8) at ../csu/libc-start.c:308
#31 0x00000000005f62ee in _start () at ../Objects/unicodeobject.c:5388
(gdb) 

zmq is waiting for something. forever.
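
As a minimal illustration of what it is waiting for (stock pyzmq; the endpoint below is arbitrary and nothing listens on it): zmq_ctx_term() blocks until every socket created from the context has been closed and its pending messages have either been delivered or discarded.

import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.REQ)
sock.connect("tcp://127.0.0.1:5555")  # nothing is listening here
sock.send(b"ping")                    # queued by libzmq, never delivered

# With the default LINGER of -1 and an undeliverable message still queued,
# ctx.term() would block forever, exactly like the zmq_ctx_term() frame in
# the backtraces above. Closing with linger=0 discards the pending message
# and lets term() return.
sock.close(linger=0)
ctx.term()
print("context terminated cleanly")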

I'm not sure about this particular case, but I've been hunting down the one where an incorrect state is provided, e.g.

do something:
    test.ping

vs

do something:
    test.nop

The first one stalls out at a poll(10<eventfd. The second works fine. I was able to track things down into the bowels of the loader: if, within _inner_load, something checks for a non-existent key, it will trigger the failure case. In any module. So it's not necessarily that test.ping doesn't exist; you can change the code around to stop failing on that issue and it will "fail" properly. But _inner_load will try to load every Salt module looking for a matching state name, and if any of the __virtual__ functions do 'something.nonexistent' in __salt__, it will then stall at exit on the same poll call.

I'm pretty sure that it is something within zeromq that's calling it, because if I simply manually do an os.close(10) (or whatever the file descriptor is) I get a segfault within ZMQ code, heh.

I don't know if the loader is a red herring or not, though.
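
To make the __virtual__ pattern described above concrete, here is a hedged sketch of a hypothetical execution module whose __virtual__ probes __salt__ for a function that no module provides; this is the kind of membership check that sends the loader scanning:

# hypothetical_mod.py -- illustrative only, not an actual Salt module
def __virtual__():
    # Looking up a key that does not exist forces the LazyLoader to walk
    # every remaining module trying to resolve it.
    if "something.nonexistent" in __salt__:
        return True
    return (False, "required function 'something.nonexistent' is not available")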

I was also able to work around it at the end of def sls by just doing a st_.client.destroy. That's probably not an ideal workaround, but it did stop the freezing.

There are a lot of things that I know at this point, but I think there are actually more things that I don't know than when I started.

@waynew @max-arnold Works for me after downgrading pyzmq to 18.0.1; it starts hanging with pyzmq>=18.1.0.

@DmitryKuzmenko @waynew @max-arnold @cmcmarrow unfortunately I can't invest much time into further debugging of this one, but a number of quick experiments show that downgrading pyzmq to 18.0.1 also resolves the issue for us (Ubuntu 20.04 / Python 3.8.2 / libzmq 4.3.2). We've switched to TCP transport until there's a clean fix.

same here

[Thread 0x7fffeb7fe700 (LWP 22345) exited]
[Thread 0x7fffeafbd700 (LWP 22346) exited]
^C--Type <RET> for more, q to quit, c to continue without paging--Quit
(gdb) info threads
  Id   Target Id                                   Frame
* 1    Thread 0x7ffff7bf9740 (LWP 22285) "python3" 0x00007ffff7eda96f in __GI___poll (fds=0x7fffffffd930, nfds=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
  20   Thread 0x7fffcaffd700 (LWP 22336) "python3" 0x00007ffff7ee743e in epoll_wait (epfd=13, events=0x7fffcaffc1d0, maxevents=256, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  21   Thread 0x7ffff0f06700 (LWP 22337) "python3" 0x00007ffff7ee743e in epoll_wait (epfd=16, events=0x7ffff0f051d0, maxevents=256, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
(gdb) bt
#0  0x00007ffff7eda96f in __GI___poll (fds=0x7fffffffd930, nfds=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007ffff5c79945 in ?? () from /usr/local/lib/python3.8/dist-packages/zmq/backend/cython/../../../pyzmq.libs/libzmq-1358af2c.so.5.2.2
#2  0x00007ffff5c4fc21 in ?? () from /usr/local/lib/python3.8/dist-packages/zmq/backend/cython/../../../pyzmq.libs/libzmq-1358af2c.so.5.2.2
#3  0x00007ffff5c2f163 in ?? () from /usr/local/lib/python3.8/dist-packages/zmq/backend/cython/../../../pyzmq.libs/libzmq-1358af2c.so.5.2.2
#4  0x00007ffff5ca5cfa in zmq_ctx_term () from /usr/local/lib/python3.8/dist-packages/zmq/backend/cython/../../../pyzmq.libs/libzmq-1358af2c.so.5.2.2
#5  0x00007ffff5ca5eb7 in zmq_ctx_destroy () from /usr/local/lib/python3.8/dist-packages/zmq/backend/cython/../../../pyzmq.libs/libzmq-1358af2c.so.5.2.2
#6  0x00007ffff507d0fb in ?? () from /usr/local/lib/python3.8/dist-packages/zmq/backend/cython/context.cpython-38-x86_64-linux-gnu.so
#7  0x00000000005a59b9 in ?? ()
#8  0x00000000005cb374 in ?? ()
#9  0x00000000005a1bae in ?? ()
#10 0x00000000004ec8b9 in ?? ()
#11 0x000000000066babb in _PyGC_CollectNoFail ()
#12 0x000000000067dd60 in PyImport_Cleanup ()
#13 0x0000000000678450 in Py_FinalizeEx ()
#14 0x00000000006aeab9 in Py_RunMain ()
#15 0x00000000006aed29 in Py_BytesMain ()
#16 0x00007ffff7dec0b3 in __libc_start_main (main=0x4ebd20 <main>, argc=4, argv=0x7fffffffe018, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe008)
    at ../csu/libc-start.c:308
#17 0x00000000005f62ee in _start ()
(gdb)

This hack will stop the hang but will skip cleanup and may leak resources.
salt/scripts/salt-call.py

from salt.scripts import salt_call
from os import _exit as e
if __name__ == "__main__":
    salt_call()
    e(0)
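
For context on why that helps (a standalone illustration, not Salt code): os._exit() terminates the process immediately and skips the interpreter shutdown whose final GC pass ends up in zmq_ctx_term() in the backtraces above, whereas a normal exit runs the full cleanup.

import atexit
import os
import sys

# Runs only on the normal shutdown path (Py_FinalizeEx and friends).
atexit.register(lambda: print("interpreter shutdown ran"))

if "--hard-exit" in sys.argv:
    os._exit(0)  # prints nothing: no atexit handlers, no module teardown, no final GC
sys.exit(0)      # prints "interpreter shutdown ran" on the way out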

The bug does not sit in pyzmq but rather in libzmq. I built pyzmq 18.0.0 with zmq 4.3.2 and got the hang! The break happened when they went from zmq 4.3.1 to 4.3.2. Note: when I ran the libzmq tests I got no failures.

Even though this is an upstream bug, we are working on a patch and will open a pull request with libzmq.

The bug does not sit in pyzmq but rather in libzmq. I built pyzmq 18.0.0 with zmq 4.3.2 and got the hang! The break happened when they went from zmq 4.3.1 to 4.3.2. Note: when I ran the libzmq tests I got no failures.

I'm sorry, but I downgraded from libzmq 4.3.2 to libzmq 4.3.1 and it still hangs.

@disaster123 I think you need to rebuild pyzmq against the downgraded version.

Maybe #57856 is still a different bug.

I did:

  • downgrade libzmq5
  • downgrade libzmq3-dev
  • rebuild pyzmq
  • verified python3 reports ZMQ 40301 as zmq version

but salt still hangs at the end of execution.

OK, the working fix is just to install pyzmq==18.0.1; no matter which version of libzmq is installed, it works with 4.3.2 and 4.3.1 for me. The important part is that you use:

pip3 install --no-binary=:all: pyzmq==18.0.1

@disaster123 thanks for the info. What state are you using to test this? I want to see if your state causes the hang on my box.

@cmcmarrow I'm coming from https://github.com/saltstack/salt/issues/57856 where salt-call hangs on EVERY state. So for me it didn't matter what kind of state it was.

@cmcmarrow where is the upstream issue being tracked?

* 3.9.0b4 - fails with `AttributeError: _scope_id` inside of `ipaddress.py`

I solved this issue (https://github.com/fepitre/salt/commit/4c5e18bfd092d9003d12c89131f787a57d54cf38) by defining the attribute for compatibility with newer Python 3.9: https://github.com/python/cpython/commit/21da76d1f1b527d62b2e9ef79dd9aa514d996341.
