Description
On masters that have been running for a few weeks, salt commands fail with "Minion did not return. [No response]" on seemingly random minions. The minions that do not respond are not the same when the salt command is rerun.
I can reproduce the issue by targeting a single minion with a test.ping command.
Restarting the master service solves the issue, but it appears again after some days/weeks if the service is not restarted.
Setup
One master with a few minions registered.
Steps to Reproduce the behavior
The issue is not easy to reproduce: the master service must have been running for several days/weeks, and it appears randomly.
I reproduced the issue with a single minion and a test.ping command.
Some responses seem to be dropped by the master event bus.
master logs
# salt -l debug myserver test.ping
[DEBUG ] Reading configuration from /etc/salt/master
[DEBUG ] MasterEvent PUB socket URI: /var/run/salt/master/master_event_pub.ipc
[DEBUG ] MasterEvent PULL socket URI: /var/run/salt/master/master_event_pull.ipc
[DEBUG ] Initializing new AsyncZeroMQReqChannel for (u'/etc/salt/pki/master', u'myserver.dsone.3ds.com_master', u'tcp://127.0.0.1:4506', u'clear')
[DEBUG ] Connecting the Minion to the Master URI (for the return server): tcp://127.0.0.1:4506
[DEBUG ] Trying to connect to: tcp://127.0.0.1:4506
[DEBUG ] Closing AsyncZeroMQReqChannel instance
[DEBUG ] LazyLoaded local_cache.get_load
[DEBUG ] Reading minion list from /var/cache/salt/master/jobs/20/1aa1dda811f2bdb606bb78bba4ff9f3da4f8ad23a7da18d121a25ee34fb5b7/.minions.p
[DEBUG ] get_iter_returns for jid 20200618175516640328 sent to set(['myserver']) will timeout at 17:55:21.656635
[DEBUG ] Checking whether jid 20200618175516640328 is still running
[DEBUG ] Initializing new AsyncZeroMQReqChannel for (u'/etc/salt/pki/master', u'myserver.dsone.3ds.com_master', u'tcp://127.0.0.1:4506', u'clear')
[DEBUG ] Connecting the Minion to the Master URI (for the return server): tcp://127.0.0.1:4506
[DEBUG ] Trying to connect to: tcp://127.0.0.1:4506
[DEBUG ] Closing AsyncZeroMQReqChannel instance
[DEBUG ] Passing on saltutil error. Key 'u'retcode' missing from client return. This may be an error in the client.
[DEBUG ] return event: {'myserver': {u'failed': True}}
myserver:
Minion did not return. [No response]
[DEBUG ] Closing IPCMessageSubscriber instance
ERROR: Minions returned with non-zero exit code
master event bus
# salt-run state.event pretty=True
20200618175516640328 {
"_stamp": "2020-06-18T15:55:16.640731",
"minions": [
"myserver"
]
}
salt/job/20200618175516640328/new {
"_stamp": "2020-06-18T15:55:16.641871",
"arg": [],
"fun": "test.ping",
"jid": "20200618175516640328",
"minions": [
"myserver"
],
"missing": [],
"tgt": "myserver",
"tgt_type": "glob",
"user": "root"
}
20200618175521762170 {
"_stamp": "2020-06-18T15:55:21.762531",
"minions": [
"myserver"
]
}
salt/job/20200618175521762170/new {
"_stamp": "2020-06-18T15:55:21.763894",
"arg": [
"20200618175516640328"
],
"fun": "saltutil.find_job",
"jid": "20200618175521762170",
"minions": [
"myserver"
],
"missing": [],
"tgt": [
"myserver"
],
"tgt_type": "list",
"user": "root"
}
salt/job/20200618175521762170/ret/myserver {
"_stamp": "2020-06-18T15:55:21.861260",
"cmd": "_return",
"fun": "saltutil.find_job",
"fun_args": [
"20200618175516640328"
],
"id": "myserver",
"jid": "20200618175521762170",
"master_id": "myserver",
"retcode": 0,
"return": {},
"success": true
}
job lookup
[root@myserver ~]# salt-run -l info jobs.lookup_jid 20200618175516640328
myserver:
True
[INFO ] Runner completed: 20200619091453151012
minion logs
2020-06-18 17:55:16,652 [salt.minion :1482][INFO ][21477] User root Executing command test.ping with jid 20200618175516640328
2020-06-18 17:55:16,653 [salt.minion :1489][DEBUG ][21477] Command details {u'tgt_type': u'glob', u'jid': u'20200618175516640328', u'tgt': u'myserver', u'ret': u'', u'user': u'root', u'arg': [], u'fun': u'test.ping', u'master_id': u'myserver'}
2020-06-18 17:55:16,657 [salt.utils.process:860 ][DEBUG ][21477] Subprocess SignalHandlingMultiprocessingProcess-1:8-Job-20200618175516640328 added
2020-06-18 17:55:16,716 [salt.utils.lazy :104 ][DEBUG ][22892] LazyLoaded jinja.render
2020-06-18 17:55:16,719 [salt.utils.lazy :104 ][DEBUG ][22892] LazyLoaded yaml.render
2020-06-18 17:55:16,721 [salt.minion :1609][INFO ][22892] Starting a new job 20200618175516640328 with PID 22892
2020-06-18 17:55:16,724 [salt.utils.lazy :107 ][DEBUG ][22892] Could not LazyLoad {0}.allow_missing_func: '{0}.allow_missing_func' is not available.
2020-06-18 17:55:16,742 [salt.utils.lazy :104 ][DEBUG ][22892] LazyLoaded test.ping
2020-06-18 17:55:16,743 [salt.loaded.int.module.test:124 ][DEBUG ][22892] test.ping received for minion 'myserver'
2020-06-18 17:55:16,743 [salt.minion :807 ][DEBUG ][22892] Minion return retry timer set to 10 seconds (randomized)
2020-06-18 17:55:16,744 [salt.minion :1937][INFO ][22892] Returning information for job: 20200618175516640328
2020-06-18 17:55:16,745 [salt.transport.zeromq:138 ][DEBUG ][22892] Initializing new AsyncZeroMQReqChannel for (u'/etc/salt/pki/minion', u'myserver', u'tcp://10.81.105.213:4506', u'aes')
2020-06-18 17:55:16,746 [salt.crypt :464 ][DEBUG ][22892] Initializing new AsyncAuth for (u'/etc/salt/pki/minion', u'myserver', u'tcp://10.81.105.213:4506')
2020-06-18 17:55:16,747 [salt.transport.zeromq:209 ][DEBUG ][22892] Connecting the Minion to the Master URI (for the return server): tcp://10.81.105.213:4506
2020-06-18 17:55:16,748 [salt.transport.zeromq:1189][DEBUG ][22892] Trying to connect to: tcp://10.81.105.213:4506
2020-06-18 17:55:16,756 [salt.transport.zeromq:233 ][DEBUG ][22892] Closing AsyncZeroMQReqChannel instance
2020-06-18 17:55:16,758 [salt.minion :1787][DEBUG ][22892] minion return: {u'fun_args': [], u'jid': u'20200618175516640328', u'return': True, u'retcode': 0, u'success': True, u'fun': u'test.ping', u'master_id': u'myserver'}
2020-06-18 17:55:17,717 [salt.utils.process:869 ][DEBUG ][21477] Subprocess SignalHandlingMultiprocessingProcess-1:8-Job-20200618175516640328 cleaned up
2020-06-18 17:55:21,775 [salt.minion :1482][INFO ][21477] User root Executing command saltutil.find_job with jid 20200618175521762170
2020-06-18 17:55:21,776 [salt.minion :1489][DEBUG ][21477] Command details {u'tgt_type': u'list', u'jid': u'20200618175521762170', u'tgt': [u'myserver'], u'ret': u'', u'user': u'root', u'arg': [u'20200618175516640328'], u'fun': u'saltutil.find_job', u'master_id': u'myserver'}
2020-06-18 17:55:21,779 [salt.utils.process:860 ][DEBUG ][21477] Subprocess SignalHandlingMultiprocessingProcess-1:9-Job-20200618175521762170 added
2020-06-18 17:55:21,838 [salt.utils.lazy :104 ][DEBUG ][22904] LazyLoaded jinja.render
2020-06-18 17:55:21,841 [salt.utils.lazy :104 ][DEBUG ][22904] LazyLoaded yaml.render
2020-06-18 17:55:21,844 [salt.minion :1609][INFO ][22904] Starting a new job 20200618175521762170 with PID 22904
2020-06-18 17:55:21,847 [salt.utils.lazy :107 ][DEBUG ][22904] Could not LazyLoad {0}.allow_missing_func: '{0}.allow_missing_func' is not available.
2020-06-18 17:55:21,850 [salt.utils.lazy :104 ][DEBUG ][22904] LazyLoaded saltutil.find_job
2020-06-18 17:55:21,852 [salt.minion :807 ][DEBUG ][22904] Minion return retry timer set to 6 seconds (randomized)
2020-06-18 17:55:21,852 [salt.minion :1937][INFO ][22904] Returning information for job: 20200618175521762170
2020-06-18 17:55:21,853 [salt.transport.zeromq:138 ][DEBUG ][22904] Initializing new AsyncZeroMQReqChannel for (u'/etc/salt/pki/minion', u'myserver', u'tcp://10.81.105.213:4506', u'aes')
2020-06-18 17:55:21,854 [salt.crypt :464 ][DEBUG ][22904] Initializing new AsyncAuth for (u'/etc/salt/pki/minion', u'myserver', u'tcp://10.81.105.213:4506')
2020-06-18 17:55:21,856 [salt.transport.zeromq:209 ][DEBUG ][22904] Connecting the Minion to the Master URI (for the return server): tcp://10.81.105.213:4506
2020-06-18 17:55:21,857 [salt.transport.zeromq:1189][DEBUG ][22904] Trying to connect to: tcp://10.81.105.213:4506
2020-06-18 17:55:21,865 [salt.transport.zeromq:233 ][DEBUG ][22904] Closing AsyncZeroMQReqChannel instance
2020-06-18 17:55:21,866 [salt.minion :1787][DEBUG ][22904] minion return: {u'fun_args': [u'20200618175516640328'], u'jid': u'20200618175521762170', u'return': {}, u'retcode': 0, u'success': True, u'fun': u'saltutil.find_job', u'master_id': u'myserver'}
2020-06-18 17:55:22,717 [salt.utils.process:869 ][DEBUG ][21477] Subprocess SignalHandlingMultiprocessingProcess-1:9-Job-20200618175521762170 cleaned up
Expected behavior
Responses sent by the minions should be returned by the salt command line.
Versions Report
salt --versions-report
Salt Version:
Salt: 2019.2.4
Dependency Versions:
cffi: 1.6.0
cherrypy: Not Installed
dateutil: 1.5
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
ioflo: Not Installed
Jinja2: 2.7.2
libgit2: 0.26.3
libnacl: Not Installed
M2Crypto: 0.21.1
Mako: Not Installed
msgpack-pure: Not Installed
msgpack-python: 0.6.2
mysql-python: Not Installed
pycparser: 2.19
pycrypto: 2.6.1
pycryptodome: 3.9.7
pygit2: 0.26.4
Python: 2.7.5 (default, Jun 11 2019, 14:33:56)
python-gnupg: 0.4.4
PyYAML: 3.10
PyZMQ: 15.3.0
RAET: Not Installed
smmap: Not Installed
timelib: Not Installed
Tornado: 4.2.1
ZMQ: 4.1.4
System Versions:
dist: redhat 7.5 Maipo
locale: UTF-8
machine: x86_64
release: 3.10.0-1127.8.2.el7.x86_64
system: Linux
version: Red Hat Enterprise Linux Server 7.5 Maipo
@malletp Thanks for the report. A couple of questions about the setup here: are the salt master and the minions on the same network, or is this going out over the Internet? There are a number of keepalive settings that you can add to the minion configuration to ensure it is maintaining its connection to the master, https://docs.saltstack.com/en/latest/ref/configuration/minion.html#keepalive-settings. Finally, it looks like you're running 2019.2.4. We highly recommend that you upgrade to 2019.2.5, which has a major CVE fix in it.
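For reference, those keepalive options go in the minion configuration; a minimal sketch with illustrative values follows (the linked documentation lists the actual defaults and meaning of each option):
# /etc/salt/minion -- keepalive settings sketch, values are illustrative only
tcp_keepalive: True
tcp_keepalive_idle: 300
tcp_keepalive_cnt: -1
tcp_keepalive_intvl: -1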
@garethgreenaway Thanks for your response.
Master and minions are on the same network; nothing goes out over the Internet.
Regarding the keepalive settings, I configured a minion and tested again, but I still reproduce the issue.
I don't think this is a connection issue. Minions properly receive jobs from the master and properly send results back to the master. The job results are also properly stored in the job cache.
So for me the issue is more between the salt command and the master service: the salt command doesn't receive the result of the job.
I thought the two major CVEs were fixed in 2019.2.4, but yes, we will upgrade to 2019.2.5.
@garethgreenaway I am having just about the exact same issue as this.
We upgraded our Master to v3000.3-1 (RHEL6) on May 20 and have had intermittent issues ever since.
This specific issue is the closest to what we are experiencing that I could find, out of the "Minion did not return" issues.
On the Master I can run "salt minion test.ping" and I receive the error (paraphrased):
minion:
Minion did not return. [No response]
ERROR: Minions returned with non-zero exit code
[root@master ~]#
If I look at the job results at the Master I see that there is completion data. If I look at the logs on the Minion I see that it sent the completion data to the Master.
Earlier today the Master seemed to be working just fine; however, after about an hour, Minions suddenly stopped returning data to command-line instances. Once this occurs, it seems the only way to clear the issue is to restart the Master process and wait a little while. Looking carefully through the logs and doing manual tests, it seems that the disconnect really is between the command-line interface and the job results.
Everything else appears to work just fine; however, when the Master is in this state, command-line results from Minions do not come back. You must instead rely on querying the job data out-of-band to determine any outcome. The Master will seemingly remain in this state until you restart it.
Update: we reverted our Master back to v3000.2-1 (RHEL6) on June 25 and have not had any issues since (as of July 20).
I may be encountering the same issue, but I'm not sure. A clue might be in the response on the event bus. Note the "return": {}, in the OP:
salt/job/20200618175521762170/ret/myserver {
"_stamp": "2020-06-18T15:55:21.861260",
"cmd": "_return",
"fun": "saltutil.find_job",
"fun_args": [
"20200618175516640328"
],
"id": "myserver",
"jid": "20200618175521762170",
"master_id": "myserver",
"retcode": 0,
"return": {},
"success": true
}
I get the same "return": {}, on a test.ping when I see this issue, but when the test.ping works, I get "return": true,. Note that I get a response from the minion each time, just with a different return value: an empty dict or something else. The same thing happens with test.version. I can rerun ping/version repeatedly against the same minion, and it seems to succeed about half the time and hit this issue the other half. "return": {}, is always on the event bus on a failure. This issue even affects the minion that lives on my master. It also seems that when this happens, it can happen early.
As mentioned in the other comment, when I restart the master service, the problem seems to go away. I'm not sure how to re-trigger it. The return code seeming to matter makes me suspect a minion problem, but other signs point to the master. Perhaps there are a couple of things at play here.
I'm also facing the same issue, but I don't have any more information to add here ;)
@nixjdm In fact, the "return": {}, is not the return of the test.ping. It's the return of the saltutil.find_job job, and the minion is saying that the original job has already finished. If you look at the minion's logs at debug level, you can see that the minion responds to both jobs, but when we query the event bus from the salt command line, we only see the second response.
2020-06-18 17:55:16,758 [salt.minion :1787][DEBUG ][22892] minion return: {u'fun_args': [], u'jid': u'20200618175516640328', u'return': True, u'retcode': 0, u'success': True, u'fun': u'test.ping', u'master_id': u'myserver'}
...
2020-06-18 17:55:21,866 [salt.minion :1787][DEBUG ][22904] minion return: {u'fun_args': [u'20200618175516640328'], u'jid': u'20200618175521762170', u'return': {}, u'retcode': 0, u'success': True, u'fun': u'saltutil.find_job', u'master_id': u'myserver'}
Is there a way to add some debug logging to the event bus, to see which events are received on the master and dispatched to the command line?
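One way to watch what the master publishes on its event bus, independently of the salt CLI, is a small listener built on salt.utils.event (the same mechanism salt-run state.event uses). This is only a minimal sketch, assuming it runs as root on the master; salt.utils.event is an internal API and its behaviour may differ across Salt versions:
#!/usr/bin/env python
# Minimal master event-bus listener sketch (run on the master as root).
import salt.config
import salt.utils.event

# Load the master configuration so we connect to the right sock_dir/transport.
opts = salt.config.client_config('/etc/salt/master')

event_bus = salt.utils.event.get_event(
    'master',
    sock_dir=opts['sock_dir'],
    transport=opts['transport'],
    opts=opts,
    listen=True,
)

# Print every job event (new jobs and returns) as the master publishes it.
for evt in event_bus.iter_events(tag='salt/job', full=True):
    print(evt['tag'])
    print(evt['data'])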
@malletp I see. I was misreading the bus. Thanks for the correction.
We have nearly the same situation, with some details:
Windows firewall and Defender are both disabled.
The salt minion is 3000.3.
multiprocessing: false does not help.
Resetting the network adapter while the problem is occurring does nothing; only a minion restart helps.
The minion continues executing commands, but the master does not receive any response.
Querying the job status works fine.
Is everyone still seeing this on 3001.1?
I found out that in my case it's a bug in the win_status module (the connection was not detected when it was using IPv6, so the connection check marked the minion as disconnected), and it's still there. I made a fixed version of this module and placed it on the master.
I can confirm that we also see it in our 3001.1 master environment - same picture as already perfectly described above.
Just to cross reference, I believe the fix @trootnev referred to is in https://github.com/saltstack/salt/pull/58095.
Yup. This is my PR, but I don't have a test environment at hand to satisfy the test requirements. The fix is very simple, so I just use a custom module and am happy with it.
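For anyone wanting to do the same while waiting for the PR, the usual pattern is to drop the patched module into the master's file_roots and sync it to the affected minions. A rough sketch, assuming the default /srv/salt file_roots (paths and targeting may differ in your setup):
# Copy the patched module into the master's file_roots so it overrides the
# built-in win_status module on the next sync.
mkdir -p /srv/salt/_modules
cp win_status.py /srv/salt/_modules/win_status.py

# Push the custom module to the Windows minions.
salt -G 'os:Windows' saltutil.sync_modules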
This has been in progress, but we will not get the fix in for the 3002.2 release. We will continue to work on it and are moving it to the next release cycle, Aluminium.
Thanks! No problem. We use a custom module with our fix for this.
I can confirm that we also see it in 3002.2. Hopefully the next release will fix it.