If you create a beacon that fetches any file from the master, any scheduled job that uses the fileserver will hang at 100% CPU forever.
/srv/salt/_beacons/hangtest.py
import logging

log = logging.getLogger(__name__)


def beacon(config):
    log.info('Running hangtest beacon')
    __salt__['cp.cache_file']('salt://_beacons/hangtest.py')
    return []
/srv/pillar/top.sls
base:
  '*':
    - hangtest
/srv/pillar/hangtest.sls
beacons:
  hangtest:
    interval: 1

schedule:
  cache_file_for_hang:
    function: cp.cache_file
    seconds: 10
    args:
      - salt://_beacons/hangtest.py
    return_job: False
Setup:
# salt '*' saltutil.sync_all
# salt '*' saltutil.refresh_pillar
# service salt-minion restart
The second run of the scheduled job (about 10 seconds after minion start) should hang forever at 100% CPU. Note that service salt-minion stop will not stop that process either.
Removing the cp.cache_file call from the beacon fixes the issue. Scheduled jobs that don't use the fileserver (such as test.ping) will not hang even with fileserver operations in the beacon.
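For reference, a minimal sketch of a beacon that avoids the problem: it is just the repro beacon above with the cp.cache_file (fileserver) call removed, so scheduled jobs that hit the fileserver keep working normally.

import logging

log = logging.getLogger(__name__)


def beacon(config):
    # No cp.cache_file or other fileserver call here, so the scheduler's
    # fileserver-backed jobs are not affected.
    log.info('Running hangtest beacon')
    return []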
Salt Version:
Salt: 2016.3.3
Dependency Versions:
cffi: Not Installed
cherrypy: Not Installed
dateutil: Not Installed
gitdb: Not Installed
gitpython: Not Installed
ioflo: Not Installed
Jinja2: 2.7.2
libgit2: Not Installed
libnacl: Not Installed
M2Crypto: Not Installed
Mako: Not Installed
msgpack-pure: Not Installed
msgpack-python: 0.4.8
mysql-python: Not Installed
pycparser: Not Installed
pycrypto: 2.6.1
pygit2: Not Installed
Python: 2.7.5 (default, Jun 17 2014, 18:11:42)
python-gnupg: Not Installed
PyYAML: 3.11
PyZMQ: 15.3.0
RAET: Not Installed
smmap: Not Installed
timelib: Not Installed
Tornado: 4.2.1
ZMQ: 4.1.4
System Versions:
dist: centos 7.1.1503 Core
machine: x86_64
release: 3.10.0-229.el7.x86_64
system: Linux
version: CentOS Linux 7.1.1503 Core
(same master/minion -- it's a single box)
@basepi whoa nice find! I am able to replicate this and have provided a docker container below for anyone that wants to quickly replicate the issue:
docker run -it -v /home/ch3ll/git/salt/:/testing/ ch3ll/issues:37059 /bin/bash
(where /home/ch3ll/git/salt is a local git clone of salt)
salt-master -d; salt-minion -d
top
and you will see the salt-minion process at 100%:
%Cpu(s): 13.0 us, 0.7 sy, 0.0 ni, 86.2 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 16333752 total, 2899316 free, 3958776 used, 9475660 buff/cache
KiB Swap: 8191996 total, 8191520 free, 476 used. 10989368 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
304 root 20 0 512056 34560 3080 R 100.0 0.2 0:04.37 salt-minion
40 root 20 0 657888 33432 3628 S 0.3 0.2 0:00.01 salt-master
48 root 20 0 370908 42188 6324 S 0.3 0.3 0:01.27 salt-master
169 root 20 0 512056 38116 6632 S 0.3 0.2 0:00.31 salt-minion
1 root 20 0 11784 2844 2460 S 0.0 0.0 0:00.02 bash
31 root 20 0 354764 36064 6316 S 0.0 0.2 0:00.03 salt-master
32 root 20 0 335704 30836 2388 S 0.0 0.2 0:00.00 salt-master
33 root 20 0 362164 32828 3628 S 0.0 0.2 0:00.00 salt-master
36 root 20 0 357620 32496 3024 S 0.0 0.2 0:00.00 salt-master
38 root 20 0 357968 36512 3652 S 0.0 0.2 0:00.32 salt-master
39 root 20 0 354764 32240 2488 S 0.0 0.2 0:00.01 salt-master
47 root 20 0 371072 42640 5728 S 0.0 0.3 0:01.13 salt-master
49 root 20 0 370908 42136 6324 S 0.0 0.3 0:01.10 salt-master
50 root 20 0 370936 42432 6552 S 0.0 0.3 0:00.94 salt-master
51 root 20 0 369872 41236 5500 S 0.0 0.3 0:01.25 salt-master
170 root 20 0 421856 28096 2760 S 0.0 0.2 0:00.00 salt-minion
301 root 20 0 51884 3756 3208 R 0.0 0.0 0:00.00 top
👍
Caught! Working on a fix.
Awesome @DmitryKuzmenko! :)
I've described the issue in PR #37899.
I've been thinking a lot about the best way to fix this... It's not obvious because there is no centralized place that controls the ZeroMQ sockets the minion creates, so we can't just close everything on every fork.
But recreating the functions in the schedule looks like a safe and correct solution for now.
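For anyone following along, here is a minimal sketch (not the actual patch in #37899) of what "recreating the functions in the schedule" means: when the scheduler forks a process to run a job, rebuild the execution and returner function dictionaries with salt.loader in the child instead of reusing the parent's, so the child opens fresh transport channels (and ZeroMQ sockets) rather than inheriting the ones the beacon's fileserver call was using. The helper name and call site below are illustrative, not the real Schedule internals:

import salt.loader


def recreate_funcs_in_forked_job(opts):
    # Hypothetical helper, called in the child process right after the
    # scheduler forks a job. Rebuilding the modules and returners from
    # scratch forces new transport channels to be created in this process
    # instead of sharing the parent's ZeroMQ sockets.
    functions = salt.loader.minion_mods(opts)
    returners = salt.loader.returners(opts, functions)
    return functions, returners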