Salt: minion not returning, apt-get defunct

Created on 14 Jan 2014 · 32 Comments · Source: saltstack/salt

I am using SaltStack 0.17.4 on Debian 7.3 for master and minion with great success, but have run into a problem.

Running "salt-call -l debug state.highstate" on the minion succeeds, but running "salt 'host' state.highstate" from the master, or using the daemon with "startup_states: highstate" on the minion, causes apt-get to become defunct. I suspect one of the package scripts might be causing this.

I am trying to install Proxmox VE as per http://pve.proxmox.com/wiki/Install_Proxmox_VE_on_Debian_Wheezy and have traced the problem to the following section in my state file:

pve:
  pkg:
    - installed
    - pkgs:
      - proxmox-ve-2.6.32
      - lvm2
      - postfix
      - ksm-control-daemon
      - vzprocps
      - open-iscsi
      - bootlogd

salt --versions-report:
Salt: 0.17.4
Python: 2.7.3 (default, Jan  2 2013, 13:56:14)
Jinja2: 2.6
M2Crypto: 0.21.1
msgpack-python: 0.1.10
msgpack-pure: Not Installed
pycrypto: 2.6
PyYAML: 3.10
PyZMQ: 13.1.0
ZMQ: 3.2.3

Any thoughts?

Bug P4 Platform State Module severity-medium stale

All 32 comments

Can you show us the error messages plus minion debug log?

Unfortunately I could not find any error messages per se, since the minion never returns, but the output of "salt-call -l debug state.highstate" (which works) is available at https://gist.github.com/jakwas/8450721

In an effort to narrow it down, I ran state.highstate without the suspect packages (as listed above), then enabled "log_level_logfile: debug" on the minion, re-enabled the suspect packages, and ran "salt '*' state.highstate" from the master again. Output of /var/log/salt/minion: https://gist.github.com/jakwas/8453431

The process that becomes defunct is "apt-get -q -y -o DPkg::Options::=--force-confold -o DPkg::Options::=--force-confdef install ksm-control-daemon lvm2 postfix vzprocps bootlogd open-iscsi proxmox-ve-2.6.32".

Reverting to the previous state and running "apt-get -q -y -o DPkg::Options::=--force-confold -o DPkg::Options::=--force-confdef install ksm-control-daemon lvm2 postfix vzprocps bootlogd open-iscsi proxmox-ve-2.6.32" on the console, works fine : https://gist.github.com/jakwas/8454471

Is one of these packages not properly obeying the "no user input" rules we've defined? Does that command work when you run it on the shell yourself, with no intervention on your part?

Running the command myself works fine, and the only package that prompts for input is postfix. Installing postfix manually before running state.highstate does not help though.

I have traced the problem to the "ksm-control-daemon" package. Running "apt-get -q -y -o DPkg::Options::=--force-confold -o DPkg::Options::=--force-confdef install ksm-control-daemon" yields the following result : https://gist.github.com/jakwas/8473294

I'm struggling to see the problem in that gist. It may just be the fact that I'm exhausted, however.

Please let me know if I can assist in any way.

Can you point out what I'm missing in that last gist? You set it up as "this will show you why it's the ksm-control-daemon that is the problem", but I'm not seeing anything in there that points to a failure.

Sorry, what I meant to say is that installing the "ksm-control-daemon" package using "apt-get -q -y -o DPkg::Options::=--force-confold -o DPkg::Options::=--force-confdef install ksm-control-daemon" manually on the console works just fine, but when trying to install the same package via Salt, the same apt-get command becomes defunct.

Ah, thank you for clarifying. I assume you're running salt as root?

Yes, that is correct.

I realize this is a corner case and not exactly critical, but I still have this problem with Salt 2014.1.1

Interestingly enough, if I stop the "ksmtuned" daemon after apt-get goes defunct, the minion returns normally.

Thanks for the update! I don't think anyone's been able to spend any time looking into this one yet. Side effect of a long backlog of bugs to fix. Sorry for the inconvenience this is causing you. It's still on our list!

I'm getting this using a 2014.7 fork (contains some fixes that have been pulled in) and in the daily package when installing bind via saltstack-formulas/bind-formula.

I've tracked it down to the Popen.communicate() call in timed_subprocess.py:34.

I have some code to replicate it on Ubuntu 14.04.2:

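import subprocess

# Minimal reproduction: run the same apt-get command and wait for it to finish.
print('RUNNING')
process = subprocess.Popen(['apt-get',
                            '-q',
                            '-y',
                            '-o',
                            'DPkg::Options::=--force-confold',
                            '-o',
                            'DPkg::Options::=--force-confdef',
                            'install',
                            'bind9',
                            'dnssec-tools',
                            'bind9utils'],
                           env={'DEBIAN_FRONTEND': 'noninteractive',
                                'UCF_FORCE_CONFFOLD': '1',
                                'APT_LISTBUGS_FRONTEND': 'none',
                                'APT_LISTCHANGES_FRONTEND': 'none',
                                'PATH': '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin'},
                           stdout=subprocess.PIPE,    # same value as the literal -1
                           stderr=subprocess.STDOUT,  # same value as the literal -2; merges stderr into stdout
                           close_fds=True)

# communicate() reads until EOF on the pipe; this is the call that never returns.
print('COMMUNICATING')
stdout, stderr = process.communicate()
print(stdout)
print(stderr)  # always None here, because stderr is merged into stdout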

Not sure what should be done here. Going to do some reading up on subprocess as this is really frustrating.

Seems like dnssec-tools is the poorly behaved package for me. Don't see a way to change the way Popen.communicate() is used in the pkg.install case. Guess I'll hunt down what's different about dnssec-tools.

Somehow salt has a pipe to the rollerd daemon (started by an init script via postinst) in my case. Salt continues if I kill rollerd.
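One way to check which process still holds the other end of such a pipe is to compare file descriptor targets under /proc. The following is only a minimal sketch, assuming Linux with /proc mounted and enough privileges to read other processes' fd directories; `pipe_holders` is a hypothetical helper and not part of Salt.

```python
import os


def pipe_holders(pipe_target):
    """Return {pid: [fd, ...]} for every process that has a descriptor whose
    /proc symlink equals pipe_target (a string such as 'pipe:[12345]')."""
    holders = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a process directory
        fd_dir = os.path.join("/proc", entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we may not inspect it
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor was closed while we were looking
            if target == pipe_target:
                holders.setdefault(int(entry), []).append(int(fd))
    return holders
```

The pipe target can be read from the minion's own descriptors (for example with `ls -l /proc/<minion pid>/fd`). Any PID in the result other than the minion itself is a process that inherited the pipe, which in the case above would be the daemon started by the postinst script.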

At this point I've been out of my element for quite some time. I'm going to try mucking around with the dnssec-tools package to see if I can make it behave, but if anyone has ideas/points, I'd love to hear them.

I have a similar issue when installing mysql-server. I am currently running a masterless setup (salt-minion 2015.5.3 (Lithium)).

When apt-get goes defunct, I can just do a service mysql restart in another terminal and Salt continues normally.

I think this worked before with mysql-server. Maybe it broke with MySQL 5.6 or so.

@tsia for mysql 5.7 we were able to fix that by changing init script a bit, it's around line 162 on ubuntu

             su - mysql -s /bin/bash -c "(mysqld_safe 2>&1)> /dev/null &"

I noticed something interesting with MySQL 5.7 and the mysql-formula:
https://github.com/saltstack-formulas/mysql-formula/issues/107

The init script looks for the pid file by searching for "pid-file" in the output of my_print_defaults.
But the mysql-formula doesn't set that option. It sets "pid_file", and by default it sets that option only for the [mysqld] section and not for [mysqld_safe]. mysqld doesn't care, but the init script waits forever. The fallback to $MYSQLDATA/$(hostname).pid doesn't work either, because the pid file is located at /var/run/mysqld/mysqld.pid.
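The mismatch described above can be illustrated with a small lookup that treats "-" and "_" in option names as equivalent. This is only a sketch: `find_option` is a hypothetical helper, and the sample lines are an assumed shape for the option list, not captured from a real my_print_defaults run.

```python
def find_option(lines, name):
    """Return the value of the first '--name=value' line, treating '-' and
    '_' in option names as the same character. Returns None if not found."""
    wanted = name.replace("_", "-")
    for line in lines:
        if not line.startswith("--") or "=" not in line:
            continue
        key, value = line[2:].split("=", 1)
        if key.replace("_", "-") == wanted:
            return value
    return None


# Assumed sample: the option is spelled with an underscore, as the formula sets it.
sample_output = ["--user=mysql", "--pid_file=/var/run/mysqld/mysqld.pid"]

# A literal search for the dashed spelling finds nothing.
strict_matches = [l for l in sample_output if l.startswith("--pid-file=")]

# The tolerant lookup finds the path under either spelling.
pid_file = find_option(sample_output, "pid-file")
```

With the assumed input, the strict search comes back empty while the tolerant lookup returns the pid file path, which is the difference between a script that waits forever and one that finds the file.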

This also seems to affect services that are either Java-based (exhibit a: Rundeck) or use simplistic, user-written init.d scripts for startup (exhibit b: Oracle DB startup scripts are usually scooped from the top posts on Google). In my experience the only Java-based software that uses init scripts correctly is Jenkins, which can more or less reload and understands kill -3.
Considering that startup scripts differ between RHEL and Debian, and I would suppose other Linux flavours as well (Arch, SUSE, Gentoo), maintaining all of them by educating users about the startup differences of various systems looks like an impossible task. They would also need to maintain broken scripts from other people, which adds complexity.
Oh, and there is also Windows: while not affected by this bug, it would mean that we cannot reuse the existing approach, unless we write a module to handle the service module.

With systemd units everything works fine, but as soon as a service script is init.d based there are various issues: services not starting (while appearing started if executed via a state), etc.
Switching to systemd is not always an option. While it is really easy to just add another unit, some systems do not support it (exhibit c/d: RHEL6/CentOS6).

I think this needs to be fixed - using something like su - mysql -s /bin/bash -c "(mysqld_safe 2>&1)> /dev/null &" means that we are not using the built-in service command, which breaks compatibility. And also breaks apt-get. Rebooting the minion to start the service is not the best idea in my opinion.
Or you can switch to MariaDB - but some software stops working (exhibit e: SonarQube that really wants only pure MySQL).

In the end, to me this is all about maintainability between different systems without going into the specifics of each one. Using service.running should work on Windows, RHEL and Debian, or it defeats the purpose of having a service state in the first place.

Personally I would be fine even with having an additional parameter, like force_detach or legacy or really_bad_service - as long as I can point people to documentation and say, "hey, look here, try that parameter".

> Personally I would be fine even with having an additional parameter, like force_detach or legacy or really_bad_service - as long as I can point people to documentation and say, "hey, look here, try that parameter".

It sounds wrong to me to add another workaround and let people investigate which misbehaving startup script exhibits this behavior and needs the workaround, instead of fixing the problem where it lies, i.e. in the salt minion, which does not properly catch the SIGCHLD signal and close all pipes passed to the exited subprocess.

I'll repeat what I wrote in https://github.com/saltstack/salt/issues/33442#issuecomment-261711016

To make myself explicit: services which behave as described above (a process terminates but passes its descriptors to other processes before exiting) may indeed not be behaving as nice citizens, but the bug lies entirely in SaltStack.

When SaltStack spawns some process (i.e. runs a command) and creates a pipe to its stdout and stderr, SaltStack must establish a SIGCHLD handler. When the spawned process terminates, any I/O on such pipes must be terminated (even if the pipe is still open on the other side). Promptly reclaiming (waitpid) the exit status of a terminated child process also avoids having zombies lying around.
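The idea above (stop reading once the direct child has terminated, and reap it promptly) can be sketched in a few lines. Note the substitution: this sketch polls the child with select() and Popen.poll() instead of installing a real SIGCHLD handler, and it is not how Salt's timed_subprocess is written. `run_until_child_exits` is a hypothetical name, and the approach assumes a POSIX system.

```python
import os
import select
import subprocess


def run_until_child_exits(argv, poll_interval=0.1):
    """Run argv and collect its combined stdout/stderr. Stop reading as soon
    as the direct child has exited and the pipe is idle, even if another
    process still holds the write end open."""
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, close_fds=True)
    fd = proc.stdout.fileno()
    chunks = []
    child_done = False
    while True:
        readable, _, _ = select.select([fd], [], [], poll_interval)
        if readable:
            data = os.read(fd, 65536)
            if not data:
                break  # real EOF: every writer has closed the pipe
            chunks.append(data)
            continue
        # The pipe was idle for poll_interval seconds.
        if child_done:
            break  # child is gone and nothing is left to drain
        if proc.poll() is not None:  # poll() also reaps the child (waitpid)
            child_done = True  # loop once more to drain pending output
    proc.stdout.close()
    proc.wait()
    return proc.returncode, b"".join(chunks)
```

For example, `run_until_child_exits(["sh", "-c", "echo started; sleep 30 &"])` should return shortly after the shell exits instead of waiting for the background sleep. The trade-off is that output written by the leftover process after the child has exited is discarded.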

I agree with @MarkMartinec and vote for adding SIGCHLD handler to salt-minion.

@MarkMartinec I agree and am all for fixing the true issue.

@mbochenk
So how do I fix this on CentOS 6? I have the same problem on CentOS 6 and I don't get it.

@fanne there is no fix available, but there are some workarounds suggested. For example, see #33442 or check mysql commit which broke compatibility: https://github.com/mysql/mysql-server/commit/684a165f28b3718160a3e4c5ebd18a465d85e97c#diff-26228c60f7753d111a177fa638fb5838R283 (>/dev/null 2>&1 & replaced by >/dev/null & )
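The difference between the two redirections can be reproduced without MySQL. The sketch below is an illustration under simple assumptions: `sleep 1` stands in for a daemon started by an init script, and `seconds_until_communicate_returns` is a hypothetical helper that times Popen.communicate() the same way the reproduction script earlier in the thread calls it.

```python
import subprocess
import time


def seconds_until_communicate_returns(shell_command):
    """Run shell_command via sh -c with stdout and stderr on one pipe and
    return how long communicate() takes to see EOF."""
    start = time.time()
    proc = subprocess.Popen(["sh", "-c", shell_command],
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT,
                            close_fds=True)
    proc.communicate()
    return time.time() - start


# Background job inherits both stdout and stderr: the caller waits for it.
both_inherited = seconds_until_communicate_returns("sleep 1 &")

# Only stdout is redirected: stderr still points at the pipe, so the caller
# still waits.
stdout_only = seconds_until_communicate_returns("sleep 1 >/dev/null &")

# Both streams redirected: the pipe closes as soon as the shell exits.
fully_detached = seconds_until_communicate_returns("sleep 1 >/dev/null 2>&1 &")
```

The first two calls should take about as long as the background job lives, while the third returns almost immediately. That matches the observation in this thread that killing or restarting the leftover daemon lets Salt continue.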

@mbochenk
My situation is like this: I use yum to install mysql-community from the official yum source.
the mysql official repo
And this is the Salt file I use to install mysql-community-server-5.6:
salt file
Before starting mysqld, I replaced the mysqld file (see salt replace mysqld) with a copy modified as suggested (>/dev/null 2>&1 & replaced by >/dev/null &); see this mysqld modified.
The mysqld file is the one generated by yum install mysql-community-server, which I then modified.
Finally, I executed the following command on the master:

[root@salt_master files]# salt 'mysqlTest' state.sls lnmp_yum.mysql.mysql_installed
mysqlTest:
    Minion did not return. [No response]

The command executed successfully, but nothing was returned.
I don't know what to do now.
Can you help me? Thank you.

@mbochenk
Hello, I have already solved the problem.
I used the KVM virtual machine, which allocated 512m of memory per virtual machine.
The problem was solved when I raised the memory.
Thank you for your help. Thank you.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.

I guess I should have let StaleBot close this, but I certainly cannot imagine this still being an issue in modern versions of salt. If anyone has this issue still, please comment and we can re-open.
