Salt: [BUG] Salt sometimes does not exit gracefully with systemd and will prevent restart

Created on 8 Dec 2020  路  4Comments  路  Source: saltstack/salt

Description
Sometimes, when stopping the salt-minion with systemd, not all processes exit. This will prevent the minion from starting again.

Setup
1 salt master, 1 salt minion, same machine

Steps to Reproduce the behavior
This bug does not happen every time, only sometimes, so I created a bash script to repeatedly stop / start minion while checking extraneous processes with ps in between. This is the file I used:

# stray_process.sh
while true
do
  echo systemctl stop salt-minion
  out=$(systemctl stop salt-minion)
  echo $out
  sleep 1
  echo ps -aux | grep salt-minion | grep -v grep
  processes=$(ps -aux | grep salt-minion | grep -v grep | tr -d '\n' | xargs)
  echo $processes
  sleep 2
  if [[ -n $processes ]]; then
    echo "stray processes found $(echo $processes | awk '{print $2}') break"
    break
  fi
  echo systemctl start salt-minion
  out=$(systemctl start salt-minion)
  sleep 1
done

Run script

# ./stray_process.sh

Example of stray process found by running script:

...
systemctl start salt-minion                                                                                                                                                              
systemctl stop salt-minion                                                                                                                                                               

root 45290 0.0 2.6 124284 26884 ? S 21:23 0:00 /usr/bin/python3 /usr/bin/salt-minion                                                                                                     
stray processes found 45290 break                                                                                                                                                        
root@salt-test:~#                 

Expected behavior
When using systemctl stop, all minion process should exit. When issuing systemctl start the minion should be allowed to start (instead it will exit or fail due to presence of stray process(es))

Versions Report

root@salt-test:~# salt --versions-report
Salt Version:
          Salt: 3002.2

Dependency Versions:
          cffi: Not Installed
      cherrypy: Not Installed
      dateutil: 2.7.3
     docker-py: Not Installed
         gitdb: 2.0.6
     gitpython: 3.0.7
        Jinja2: 2.10.1
       libgit2: Not Installed
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 0.6.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     pycparser: Not Installed
      pycrypto: Not Installed
  pycryptodome: 3.6.1
        pygit2: Not Installed
        Python: 3.8.5 (default, Jul 28 2020, 12:59:40)
  python-gnupg: 0.4.5
        PyYAML: 5.3.1
         PyZMQ: 18.1.1
         smmap: 2.0.5
       timelib: Not Installed
       Tornado: 4.5.3
           ZMQ: 4.3.2

System Versions:
          dist: ubuntu 20.04 focal
        locale: utf-8
       machine: x86_64
       release: 5.4.0-51-generic
        system: Linux
       version: Ubuntu 20.04 focal
Aluminium Bug phase-plan severity-medium status-to-do

All 4 comments

Some trial and error reveals that adding this to the systemd salt-minion.service and salt-master.service files allows them to be restarted an arbitrary number of times without a stray process:

TimeoutSec=5
ExecStopPost=pkill -fc -9 salt-minion

The above is added to the bottom of the [Service] section of the unit.

Further investigation revealed that the stray process is (almost?) always

/usr/bin/python3 /usr/bin/salt-minion KeepAlive MultiprocessingLoggingQueue

It's been a long time since I've been in that code, I do recall that the MP logger is somewhat "special". I am wondering if when the service cycles too fast something about it makes it think it's not supposed to die.

In any case, the unit file change seems to clear this up for this particular customer. Anyone else on the core team want to comment on whether this represents something that needs a deeper look?

Maybe @s0undt3ch ?

Related to this is the fact, that the salt-minion.service unit uses KillMode=process, which is not quite optimal, as it allows left-over processes to linger around after the service has been stopped.

The default and proper value for KillMode= is control-group, which simply advises the Kernel to tear down the whole service's cgroup, so there's basically no way for any process to be around after the service terminated.

There must be a reason why someone set it to process, but this should also be considered to be just a workaround to be reverted at some point and in a perfect world, there should be no processes around after MainPID terminated.

EDIT: Just found the PR which introduced KillMode=process: #44427

See also the upstream documentation:

Specifies how processes of this unit shall be killed. One of control-group, mixed, process, none.

If set to control-group, all remaining processes in the control group of this unit will be killed on unit stop (for services: after the stop command is executed, as configured with ExecStop=). If set to mixed, the SIGTERM signal (see below) is sent to the main process while the subsequent SIGKILL signal (see below) is sent to all remaining processes of the unit's control group. If set to process, only the main process itself is killed (not recommended!). If set to none, no process is killed (strongly recommended against!). In this case, only the stop command will be executed on unit stop, but no process will be killed otherwise. Processes remaining alive after stop are left in their control group and the control group continues to exist after stop unless empty.

Note that it is not recommended to set KillMode= to process or even none, as this allows processes to escape the service manager's lifecycle and resource management, and to remain running even while their service is considered stopped and is assumed to not consume any resources.

Processes will first be terminated via SIGTERM (unless the signal to send is changed via KillSignal= or RestartKillSignal=). Optionally, this is immediately followed by a SIGHUP (if enabled with SendSIGHUP=). If processes still remain after the main process of a unit has exited or the delay configured via the TimeoutStopSec= has passed, the termination request is repeated with the SIGKILL signal or the signal specified via FinalKillSignal= (unless this is disabled via the SendSIGKILL= option). See kill(2) for more information.

Defaults to control-group.

this may be related to #57934

Was this page helpful?
0 / 5 - 0 ratings

Related issues

twangboy picture twangboy  路  3Comments

golmaal picture golmaal  路  3Comments

sfozz picture sfozz  路  3Comments

Oloremo picture Oloremo  路  3Comments

sagetherage picture sagetherage  路  3Comments