If you create a beacon that fetches any file from the master, any scheduled job that uses the fileserver will hang at 100% CPU forever.
/srv/salt/_beacons/hangtest.py
import logging

log = logging.getLogger(__name__)


def beacon(config):
    log.info('Running hangtest beacon')
    __salt__['cp.cache_file']('salt://_beacons/hangtest.py')
    return []
/srv/pillar/top.sls
base:
  '*':
    - hangtest
/srv/pillar/hangtest.sls
beacons:
  hangtest:
    interval: 1

schedule:
  cache_file_for_hang:
    function: cp.cache_file
    seconds: 10
    args:
      - salt://_beacons/hangtest.py
    return_job: False
Setup:
# salt '*' saltutil.sync_all
# salt '*' saltutil.refresh_pillar
# service salt-minion restart
The second run of the scheduled job (about 10 seconds after minion start) should hang forever at 100% CPU. Note that service salt-minion stop will not stop that process either.
Removing the cp.cache_file call from the beacon fixes the issue. Scheduled jobs that don't use the fileserver (such as test.ping) will not hang even with fileserver operations in the beacon.
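For reference, a minimal sketch of a beacon that avoids the problem: it is just the repro beacon above with the cp.cache_file (fileserver) call removed, so scheduled jobs that hit the fileserver keep working normally.

import logging

log = logging.getLogger(__name__)


def beacon(config):
    # No cp.cache_file or other fileserver call here, so the scheduler's
    # fileserver-backed jobs are not affected.
    log.info('Running hangtest beacon')
    return []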
Salt Version:
Salt: 2016.3.3
Dependency Versions:
cffi: Not Installed
cherrypy: Not Installed
dateutil: Not Installed
gitdb: Not Installed
gitpython: Not Installed
ioflo: Not Installed
Jinja2: 2.7.2
libgit2: Not Installed
libnacl: Not Installed
M2Crypto: Not Installed
Mako: Not Installed
msgpack-pure: Not Installed
msgpack-python: 0.4.8
mysql-python: Not Installed
pycparser: Not Installed
pycrypto: 2.6.1
pygit2: Not Installed
Python: 2.7.5 (default, Jun 17 2014, 18:11:42)
python-gnupg: Not Installed
PyYAML: 3.11
PyZMQ: 15.3.0
RAET: Not Installed
smmap: Not Installed
timelib: Not Installed
Tornado: 4.2.1
ZMQ: 4.1.4
System Versions:
dist: centos 7.1.1503 Core
machine: x86_64
release: 3.10.0-229.el7.x86_64
system: Linux
version: CentOS Linux 7.1.1503 Core
(same master/minion -- it's a single box)
@basepi whoa nice find! I am able to replicate this and have provided a docker container below for anyone that wants to quickly replicate the issue:
docker run -it -v /home/ch3ll/git/salt/:/testing/ ch3ll/issues:37059 /bin/bash
(where /home/ch3ll/git/salt is a local git clone of salt)
salt-master -d; salt-minion -d
top
and you will see the salt-minion process at 100%:
%Cpu(s): 13.0 us, 0.7 sy, 0.0 ni, 86.2 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 16333752 total, 2899316 free, 3958776 used, 9475660 buff/cache
KiB Swap: 8191996 total, 8191520 free, 476 used. 10989368 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
304 root 20 0 512056 34560 3080 R 100.0 0.2 0:04.37 salt-minion
40 root 20 0 657888 33432 3628 S 0.3 0.2 0:00.01 salt-master
48 root 20 0 370908 42188 6324 S 0.3 0.3 0:01.27 salt-master
169 root 20 0 512056 38116 6632 S 0.3 0.2 0:00.31 salt-minion
1 root 20 0 11784 2844 2460 S 0.0 0.0 0:00.02 bash
31 root 20 0 354764 36064 6316 S 0.0 0.2 0:00.03 salt-master
32 root 20 0 335704 30836 2388 S 0.0 0.2 0:00.00 salt-master
33 root 20 0 362164 32828 3628 S 0.0 0.2 0:00.00 salt-master
36 root 20 0 357620 32496 3024 S 0.0 0.2 0:00.00 salt-master
38 root 20 0 357968 36512 3652 S 0.0 0.2 0:00.32 salt-master
39 root 20 0 354764 32240 2488 S 0.0 0.2 0:00.01 salt-master
47 root 20 0 371072 42640 5728 S 0.0 0.3 0:01.13 salt-master
49 root 20 0 370908 42136 6324 S 0.0 0.3 0:01.10 salt-master
50 root 20 0 370936 42432 6552 S 0.0 0.3 0:00.94 salt-master
51 root 20 0 369872 41236 5500 S 0.0 0.3 0:01.25 salt-master
170 root 20 0 421856 28096 2760 S 0.0 0.2 0:00.00 salt-minion
301 root 20 0 51884 3756 3208 R 0.0 0.0 0:00.00 top
👍
Caught! Working on a fix.
Awesome @DmitryKuzmenko! :)
I've described the issue in PR #37899.
I've been thinking a lot about the best way to fix this... It's not obvious because there is no centralized place that controls the ZeroMQ sockets the minion creates, so we can't just close everything on every fork.
But recreating the functions in the schedule looks like a safe and correct solution for now.
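For anyone following along, here is a minimal sketch (not the actual patch in #37899) of what "recreating the functions in the schedule" means: when the scheduler forks a process to run a job, rebuild the execution and returner function dictionaries with salt.loader in the child instead of reusing the parent's, so the child opens fresh transport channels (and ZeroMQ sockets) rather than inheriting the ones the beacon's fileserver call was using. The helper name and call site below are illustrative, not the real Schedule internals:

import salt.loader


def recreate_funcs_in_forked_job(opts):
    # Hypothetical helper, called in the child process right after the
    # scheduler forks a job. Rebuilding the modules and returners from
    # scratch forces new transport channels to be created in this process
    # instead of sharing the parent's ZeroMQ sockets.
    functions = salt.loader.minion_mods(opts)
    returners = salt.loader.returners(opts, functions)
    return functions, returners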