Description
Sometimes, when stopping the salt-minion with systemd, not all processes exit. This will prevent the minion from starting again.
Setup
1 salt master, 1 salt minion, same machine
Steps to Reproduce the behavior
This bug does not happen every time, only sometimes, so I created a bash script to repeatedly stop / start minion while checking extraneous processes with ps in between. This is the file I used:
# stray_process.sh
while true
do
echo systemctl stop salt-minion
out=$(systemctl stop salt-minion)
echo $out
sleep 1
echo ps -aux | grep salt-minion | grep -v grep
processes=$(ps -aux | grep salt-minion | grep -v grep | tr -d '\n' | xargs)
echo $processes
sleep 2
if [[ -n $processes ]]; then
echo "stray processes found $(echo $processes | awk '{print $2}') break"
break
fi
echo systemctl start salt-minion
out=$(systemctl start salt-minion)
sleep 1
done
Run script
# ./stray_process.sh
Example of stray process found by running script:
...
systemctl start salt-minion
systemctl stop salt-minion
root 45290 0.0 2.6 124284 26884 ? S 21:23 0:00 /usr/bin/python3 /usr/bin/salt-minion
stray processes found 45290 break
root@salt-test:~#
Expected behavior
When using systemctl stop, all minion process should exit. When issuing systemctl start the minion should be allowed to start (instead it will exit or fail due to presence of stray process(es))
Versions Report
root@salt-test:~# salt --versions-report
Salt Version:
Salt: 3002.2
Dependency Versions:
cffi: Not Installed
cherrypy: Not Installed
dateutil: 2.7.3
docker-py: Not Installed
gitdb: 2.0.6
gitpython: 3.0.7
Jinja2: 2.10.1
libgit2: Not Installed
M2Crypto: Not Installed
Mako: Not Installed
msgpack: 0.6.2
msgpack-pure: Not Installed
mysql-python: Not Installed
pycparser: Not Installed
pycrypto: Not Installed
pycryptodome: 3.6.1
pygit2: Not Installed
Python: 3.8.5 (default, Jul 28 2020, 12:59:40)
python-gnupg: 0.4.5
PyYAML: 5.3.1
PyZMQ: 18.1.1
smmap: 2.0.5
timelib: Not Installed
Tornado: 4.5.3
ZMQ: 4.3.2
System Versions:
dist: ubuntu 20.04 focal
locale: utf-8
machine: x86_64
release: 5.4.0-51-generic
system: Linux
version: Ubuntu 20.04 focal
Some trial and error reveals that adding this to the systemd salt-minion.service and salt-master.service files allows them to be restarted an arbitrary number of times without a stray process:
TimeoutSec=5
ExecStopPost=pkill -fc -9 salt-minion
The above is added to the bottom of the [Service] section of the unit.
Further investigation revealed that the stray process is (almost?) always
/usr/bin/python3 /usr/bin/salt-minion KeepAlive MultiprocessingLoggingQueue
It's been a long time since I've been in that code, I do recall that the MP logger is somewhat "special". I am wondering if when the service cycles too fast something about it makes it think it's not supposed to die.
In any case, the unit file change seems to clear this up for this particular customer. Anyone else on the core team want to comment on whether this represents something that needs a deeper look?
Maybe @s0undt3ch ?
Related to this is the fact, that the salt-minion.service unit uses KillMode=process, which is not quite optimal, as it allows left-over processes to linger around after the service has been stopped.
The default and proper value for KillMode= is control-group, which simply advises the Kernel to tear down the whole service's cgroup, so there's basically no way for any process to be around after the service terminated.
There must be a reason why someone set it to process, but this should also be considered to be just a workaround to be reverted at some point and in a perfect world, there should be no processes around after MainPID terminated.
EDIT: Just found the PR which introduced KillMode=process: #44427
See also the upstream documentation:
Specifies how processes of this unit shall be killed. One of control-group, mixed, process, none.
If set to control-group, all remaining processes in the control group of this unit will be killed on unit stop (for services: after the stop command is executed, as configured with ExecStop=). If set to mixed, the SIGTERM signal (see below) is sent to the main process while the subsequent SIGKILL signal (see below) is sent to all remaining processes of the unit's control group. If set to process, only the main process itself is killed (not recommended!). If set to none, no process is killed (strongly recommended against!). In this case, only the stop command will be executed on unit stop, but no process will be killed otherwise. Processes remaining alive after stop are left in their control group and the control group continues to exist after stop unless empty.
Note that it is not recommended to set KillMode= to process or even none, as this allows processes to escape the service manager's lifecycle and resource management, and to remain running even while their service is considered stopped and is assumed to not consume any resources.
Processes will first be terminated via SIGTERM (unless the signal to send is changed via KillSignal= or RestartKillSignal=). Optionally, this is immediately followed by a SIGHUP (if enabled with SendSIGHUP=). If processes still remain after the main process of a unit has exited or the delay configured via the TimeoutStopSec= has passed, the termination request is repeated with the SIGKILL signal or the signal specified via FinalKillSignal= (unless this is disabled via the SendSIGKILL= option). See kill(2) for more information.
Defaults to control-group.
this may be related to #57934