Running salt -b 50 '*' state.highstate on the Salt master-of-masters produces an exception:
[ERROR   ] An un-handled exception was caught by salt's global exception handler:
KeyError: u'ret'
Traceback (most recent call last):
  File "/bin/salt", line 10, in <module>
    salt_main()
  File "/usr/lib/python2.7/site-packages/salt/scripts.py", line 485, in salt_main
    client.run()
  File "/usr/lib/python2.7/site-packages/salt/cli/salt.py", line 65, in run
    self._run_batch()
  File "/usr/lib/python2.7/site-packages/salt/cli/salt.py", line 283, in _run_batch
    for res in batch.run():
  File "/usr/lib/python2.7/site-packages/salt/cli/batch.py", line 262, in run
    ret[minion] = data['ret']
KeyError: u'ret'
I added a logging statement after the line for minion, data in six.iteritems(parts): in /usr/lib/python2.7/site-packages/salt/cli/batch.py to print each minion and its data value. When the exception happens, data is always {'failed': True}.
The minion it fails on is random, too.
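For reference, a minimal standalone sketch of that loop and the added logging; the parts dict below is fabricated sample data (not real minion output) and the log wording is approximate:

import logging
import six  # salt's batch.py uses salt.ext.six; plain six behaves the same here

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger(__name__)

# Fabricated sample of what Batch.run() iterates over in salt/cli/batch.py:
# one healthy return plus the {'failed': True} payload seen when the crash occurs.
parts = {
    'minion1': {'ret': {'retcode': 0}, 'retcode': 0},
    'minion2': {'failed': True},
}

ret = {}
for minion, data in six.iteritems(parts):
    log.error('batch debug: minion=%s data=%r', minion, data)  # the added logging statement
    ret[minion] = data['ret']  # equivalent of batch.py line 262 -> KeyError: 'ret' for minion2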
Topology: Master of Masters (1) -> Syndics (3) -> Minions (800)
Minions are configured with 3 masters and master_type: "failover", so they are only connected to a single syndic at a time.
Exception occurs regardless of running either of these:
(Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)
master# salt --versions-report
Salt Version:
Salt: 2018.3.2
Dependency Versions:
cffi: 1.6.0
cherrypy: Not Installed
dateutil: 1.5
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
ioflo: Not Installed
Jinja2: 2.8
libgit2: 0.26.3
libnacl: 1.6.1
M2Crypto: 0.28.2
Mako: 0.8.1
msgpack-pure: Not Installed
msgpack-python: 0.5.1
mysql-python: Not Installed
pycparser: 2.14
pycrypto: 2.6.1
pycryptodome: Not Installed
pygit2: 0.26.4
Python: 2.7.5 (default, May 31 2018, 09:41:32)
python-gnupg: Not Installed
PyYAML: 3.11
PyZMQ: 15.3.0
RAET: Not Installed
smmap: Not Installed
timelib: Not Installed
Tornado: 4.2.1
ZMQ: 4.1.4
System Versions:
dist: redhat 7.5 Maipo
locale: UTF-8
machine: x86_64
release: 3.10.0-862.6.3.el7.x86_64
system: Linux
version: Red Hat Enterprise Linux Server 7.5 Maipo
syndic# salt --versions-report
Salt Version:
Salt: 2018.3.2
Dependency Versions:
cffi: 1.6.0
cherrypy: Not Installed
dateutil: 1.5
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
ioflo: Not Installed
Jinja2: 2.8
libgit2: 0.26.3
libnacl: 1.6.1
M2Crypto: 0.28.2
Mako: 0.8.1
msgpack-pure: Not Installed
msgpack-python: 0.5.6
mysql-python: Not Installed
pycparser: 2.14
pycrypto: 2.6.1
pycryptodome: Not Installed
pygit2: 0.26.4
Python: 2.7.5 (default, May 31 2018, 09:41:32)
python-gnupg: Not Installed
PyYAML: 3.11
PyZMQ: 15.3.0
RAET: Not Installed
smmap: Not Installed
timelib: Not Installed
Tornado: 4.2.1
ZMQ: 4.1.4
System Versions:
dist: redhat 7.5 Maipo
locale: UTF-8
machine: x86_64
release: 3.10.0-862.14.4.el7.x86_64
system: Linux
version: Red Hat Enterprise Linux Server 7.5 Maipo
minion# salt-call --versions-report
Salt Version:
Salt: 2018.3.2
Dependency Versions:
cffi: 1.6.0
cherrypy: Not Installed
dateutil: 1.5
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
ioflo: Not Installed
Jinja2: 2.7.2
libgit2: Not Installed
libnacl: Not Installed
M2Crypto: 0.28.2
Mako: 0.8.1
msgpack-pure: Not Installed
msgpack-python: 0.5.6
mysql-python: Not Installed
pycparser: 2.14
pycrypto: 2.6.1
pycryptodome: Not Installed
pygit2: Not Installed
Python: 2.7.5 (default, May 31 2018, 09:41:32)
python-gnupg: Not Installed
PyYAML: 3.11
PyZMQ: 15.3.0
RAET: Not Installed
smmap: Not Installed
timelib: Not Installed
Tornado: 4.2.1
ZMQ: 4.1.4
System Versions:
dist: redhat 7.5 Maipo
locale: UTF-8
machine: x86_64
release: 3.10.0-862.14.4.el7.x86_64
system: Linux
version: Red Hat Enterprise Linux Server 7.5 Maipo
Unsure whether this is relevant, but the Salt master-of-masters is configured with a master_job_cache using pgjsonb. The syndics are configured with both master_job_cache and returner.pgjsonb entries in their master configs.
I am seeing errors in the PostgreSQL logs about duplicate keys in the jids table:
< 2018-10-25 16:18:55.848 CDT > ERROR: duplicate key value violates unique constraint "jids_pkey"
< 2018-10-25 16:18:55.848 CDT > DETAIL: Key (jid)=(20181025161855212684) already exists.
< 2018-10-25 16:18:55.848 CDT > STATEMENT: INSERT INTO jids
(jid, load)
VALUES ('20181025161855212684', '{"fun_args": null, "jid": "20181025161855212684", "return": {"tgt_type": "list", "jid": "20181025161641102114", "tgt": ["minion1", "minion2", "minion3", "minion4", "minion5", "minion6", "minion7", "minion8"], "pid": 11919, "ret": "", "user": "root", "arg": [{"test": true, "__kwarg__": true}], "fun": "state.highstate", "master_id": "salt-master-of-masters"}, "retcode": 0, "success": true, "_stamp": "2018-10-25T21:18:55.839286", "fun": "saltutil.find_job", "master_id": "salt-master-of-masters", "id": "minion1"}')
The existing load value for that jid:
{
  "to": 10,
  "arg": [
    "20181025161641102114"
  ],
  "cmd": "publish",
  "fun": "saltutil.find_job",
  "jid": "20181025161855212684",
  "ret": "",
  "tgt": [
    "minion1",
    "minion2",
    "minion3",
    "minion4",
    "minion5",
    "minion6",
    "minion7",
    "minion8"
  ],
  "user": "root",
  "kwargs": {
    "raw": false,
    "gather_job_timeout": 10
  },
  "tgt_type": "list"
}
This is just one example, but I am also seeing duplicate key errors for some of the state.highstate returns too.
Upgraded master, syndics, and minions to 2018.3.3 and the problem still exists.
Can you please include a sanitized version of your configuration files?
Also, to confirm: if you remove the postgres returner, does it work?
Does it occur if running just a simple salt '*' test.ping as well, or just a highstate?
Can you please include a sanitized version of your configuration files?
We used the saltstack-formulas/salt-formula to configure the master, syndics, and minions. Here is the pillar data:
master-of-masters role:
salt:
  master:
    order_masters: True
salt-master role:
salt:
  state_verbose: False
  master:
    fileserver_backend:
      - git
      - roots
    master_sign_pubkey: True
    master_use_pubkey_signature: True
    log_level_logfile: info
    file_recv: True
    top_file_merging_strategy: same
    gitfs_pubkey: << snipped >>
    gitfs_privkey: << snipped >>
    gitfs_remotes: << snipped >>
    gitfs_saltenv_whitelist:
      - base
      - eng
      - dev
      - qa
      - pre
      - prod
    env_order:
      - base
      - prod
      - pre
      - qa
      - dev
      - eng
    file_roots:
      base:
        - /srv/salt/base
    pillarenv_from_saltenv: True
    ext_pillar:
      - makostack: /srv/pillar/makostack/makostack.cfg
    master_job_cache: pgjsonb
    master_job_cache.pgjsonb.host: << snipped >>
    master_job_cache.pgjsonb.user: << snipped >>
    master_job_cache.pgjsonb.passwd: << snipped >>
    master_job_cache.pgjsonb.db: << snipped >>
    master_job_cache.pgjsonb.port: << snipped >>
    autosign_file: /etc/salt/autosign.conf
    file_ignore_regex:
      - '/\.svn($|/)'
      - '/\.git($|/)'
    file_ignore_glob:
      - '*.pyc'
      - '*/somefolder/*.bak'
      - '*.swp'
    key_cache: 'sched'
    unique_jid: True
    nacl.config:
      sk_file: /etc/salt/pki/master/nacl
      pk_file: /etc/salt/pki/master/nacl.pub
    reactors:
      - 'minion_start':
        - /srv/reactor/sync_grains.sls
      - 'salt/key':
        - /srv/reactor/salt_key.sls
salt_formulas:
  git_opts:
    default:
      update: True
      options:
        rev: master
        identity: << snipped >>
  basedir_opts:
    makedirs: True
  list:
    base:
      - salt-formula
      - systemd-formula
      - postfix-formula
      - chrony-formula
      - samba-formula
      - openssh-formula
salt-syndic role:
salt:
  state_verbose: False
  master:
    syndic_master: salt-master-of-masters
    master_id: salt-master-of-masters
    returner.pgjsonb.host: salt-master-of-masters
    returner.pgjsonb.user: <<snipped>>
    returner.pgjsonb.passwd: <<snipped>>
    returner.pgjsonb.db: <<snipped>>
    returner.pgjsonb.port: <<snipped>>
salt-minion role:
salt:
  minion:
    master:
      - __: overwrite
      - syndic1
      - syndic2
      - syndic3
    random_master: True
    master_shuffle: True
    master_type: failover
The syndics and the master-of-masters also overwrite salt:minion:master to contain only the salt-master-of-masters.
Also, to confirm: if you remove the postgres returner, does it work?
From where? The MasterOfMasters, the Syndics, or from all of them?
Does it occur if running just a simple salt '*' test.ping as well, or just a highstate?
We've seen the KeyError: u'ret' when running cmd.run, cmd.script, state.sls, state.apply, and state.highstate. I haven't seen it from test.ping.
Sadly, this exception also occurs without any master_job_cache or returners defined to send through pgjsonb.
Okay, so you are certain this is not related to the pgjsonb returner now? Did you just recently start seeing this behavior after an upgrade?
Yes, certain it's not related to the pgjsonb returner.
We did an upgrade/migration of Salt from 2016.11 to 2018.3. Minions were previously connecting to a single master; now they connect to multiple salt-syndics in failover. The previous salt-master is untouched.
OLD: Master(1) -> Minions(X)
NEW: Master(1) -> Syndic(3) -> Minions(X)
If I run the salt batch command on a syndic, I do not see this exception.
Thanks for all the follow-up information. One more question and then I will try to replicate the issue. Can you supply either a state file or the cmd.run command you are running when you see this?
salt -b 40 '*' state.apply salt.minion
Where salt.minion is from https://github.com/saltstack-formulas/salt-formula at commit ed9a1b1814b4b217e7e91f70a016768e75ec47eb.
Two other data points:
- The exception occurs on the master-of-masters regardless of whether all minions are connected to the syndic or not.
- The syndic has ~1000 minions connected to it.
Thanks for the further details. It looks like I'm able to replicate this issue, but I was only able to replicate it on the first run of the states, never after. You said you can replicate it with cmd.run; can you share the command you are running when you run into this error?
I am working with roskens on this and was able to reproduce the problem with salt -b 60 '*' cmd.run 'uptime ; df -h /'
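For reference, the same batching code path can also be driven from the Python API via LocalClient.cmd_batch; a sketch mirroring the command above (assumes it is run as root on the master-of-masters, with the batch size and command simply copied from the CLI example):

# Sketch only: mirrors salt -b 60 '*' cmd.run 'uptime ; df -h /' via the Python API.
import salt.client

local = salt.client.LocalClient()
for result in local.cmd_batch('*', 'cmd.run', arg=['uptime ; df -h /'], batch='60'):
    print(result)  # one dict per returning minion, yielded batch by batch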
Thanks. Ping @DmitryKuzmenko, can you put this on your backlog?
This little patch prevents the KeyError: u'ret' from being raised and causing the batch to completely fail. I have not tested it with raw output.
$ git diff
diff --git a/salt/cli/batch.py b/salt/cli/batch.py
index e3a7bf9..65ba130 100644
--- a/salt/cli/batch.py
+++ b/salt/cli/batch.py
@@ -255,6 +255,10 @@ class Batch(object):
                 if self.opts.get('failhard') and data['ret']['retcode'] > 0:
                     failhard = True
 
+                if data.get('failed') is True:
+                    log.debug('Minion %s failed to respond: data=%s', minion, data)
+                    data = {'ret': 'Minion did not return. [Failed]'}
+
                 if self.opts.get('raw'):
                     ret[minion] = data
                     yield data
Connected minions still randomly fail to return results, and that's probably a separate issue to look into and address.
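For illustration, a minimal standalone check of what the guard in the patch above does with the observed payload (the sample dicts are fabricated, mirroring the data == {'failed': True} case from the logs):

def munge(data):
    # Same logic as the guard added in the patch above.
    if data.get('failed') is True:
        data = {'ret': 'Minion did not return. [Failed]'}
    return data

print(munge({'failed': True}))                        # {'ret': 'Minion did not return. [Failed]'}
print(munge({'ret': {'retcode': 0}, 'retcode': 0}))   # healthy return passes through untouched

A non-responding minion then shows up in the batch output as a failed entry instead of aborting the whole run.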
Hi guys,
Do we have any updates on this issue? We're facing exactly the same problem as described above.
Thanks.
This looks like https://github.com/saltstack/salt/issues/46876
Another bump as well. This has been coming up more and more frequently on our deployment.
Just came up in mine - hoping to not see it again.
It happened on my system as well, but with this batch config in the master:
batch_safe_limit: 50
batch_safe_size: 10
Command: salt * saltutil.sync_grains (target about 4k minions)
Salt master version 2019.2.1
I have a PR open that should fix this, https://github.com/saltstack/salt/pull/54923, but no movement on it yet.