Salt: KeyError: u'ret' when running salt -b x ... on salt master-of-masters.

Created on 25 Oct 2018 · 23 comments · Source: saltstack/salt

Description of Issue/Question

Running salt -b 50 '*' state.highstate on the salt master-of-masters raises an exception:

[ERROR   ] An un-handled exception was caught by salt's global exception handler:
KeyError: u'ret'
Traceback (most recent call last):
  File "/bin/salt", line 10, in <module>
    salt_main()
  File "/usr/lib/python2.7/site-packages/salt/scripts.py", line 485, in salt_main
    client.run()
  File "/usr/lib/python2.7/site-packages/salt/cli/salt.py", line 65, in run
    self._run_batch()
  File "/usr/lib/python2.7/site-packages/salt/cli/salt.py", line 283, in _run_batch
    for res in batch.run():
  File "/usr/lib/python2.7/site-packages/salt/cli/batch.py", line 262, in run
    ret[minion] = data['ret']
KeyError: u'ret'
Traceback (most recent call last):
  File "/bin/salt", line 10, in <module>
    salt_main()
  File "/usr/lib/python2.7/site-packages/salt/scripts.py", line 485, in salt_main
    client.run()
  File "/usr/lib/python2.7/site-packages/salt/cli/salt.py", line 65, in run
    self._run_batch()
  File "/usr/lib/python2.7/site-packages/salt/cli/salt.py", line 283, in _run_batch
    for res in batch.run():
  File "/usr/lib/python2.7/site-packages/salt/cli/batch.py", line 262, in run
    ret[minion] = data['ret']
KeyError: u'ret'

I added a logging statement after for minion, data in six.iteritems(parts): in /usr/lib/python2.7/site-packages/salt/cli/batch.py to print the minion and its data value. When this exception happens, data is always {'failed': True}.

The minion it fails on is also random.
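
For reference, here is a minimal, standalone sketch of why an entry of {'failed': True} triggers the KeyError; the dict shapes are assumptions based on the logging output described above, not actual Salt internals:

# Minimal standalone reproduction of the failure mode (not Salt code).
# Assumption: a minion that never responded shows up in the batch results
# as {'failed': True}, with no 'ret' key.
parts = {
    'minion1': {'ret': 'uptime output here', 'retcode': 0},
    'minion2': {'failed': True},  # unresponsive minion
}

ret = {}
for minion, data in parts.items():
    ret[minion] = data['ret']  # raises KeyError: 'ret' for minion2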

Setup

Master-of-Masters (1) -> Syndics (3) -> Minions (800)

Minions are configured with 3 masters and master_type: "failover", so they are only connected to a single syndic at a time.

Steps to Reproduce Issue

The exception occurs when running either of these:

salt -b 50 '*' state.highstate

salt -b 50 '*' state.highstate --static

Versions Report

(Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)

master# salt --versions-report
Salt Version:
Salt: 2018.3.2

Dependency Versions:
cffi: 1.6.0
cherrypy: Not Installed
dateutil: 1.5
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
ioflo: Not Installed
Jinja2: 2.8
libgit2: 0.26.3
libnacl: 1.6.1
M2Crypto: 0.28.2
Mako: 0.8.1
msgpack-pure: Not Installed
msgpack-python: 0.5.1
mysql-python: Not Installed
pycparser: 2.14
pycrypto: 2.6.1
pycryptodome: Not Installed
pygit2: 0.26.4
Python: 2.7.5 (default, May 31 2018, 09:41:32)
python-gnupg: Not Installed
PyYAML: 3.11
PyZMQ: 15.3.0
RAET: Not Installed
smmap: Not Installed
timelib: Not Installed
Tornado: 4.2.1
ZMQ: 4.1.4

System Versions:
dist: redhat 7.5 Maipo
locale: UTF-8
machine: x86_64
release: 3.10.0-862.6.3.el7.x86_64
system: Linux
version: Red Hat Enterprise Linux Server 7.5 Maipo

syndic# salt --versions-report
Salt Version:
Salt: 2018.3.2

Dependency Versions:
cffi: 1.6.0
cherrypy: Not Installed
dateutil: 1.5
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
ioflo: Not Installed
Jinja2: 2.8
libgit2: 0.26.3
libnacl: 1.6.1
M2Crypto: 0.28.2
Mako: 0.8.1
msgpack-pure: Not Installed
msgpack-python: 0.5.6
mysql-python: Not Installed
pycparser: 2.14
pycrypto: 2.6.1
pycryptodome: Not Installed
pygit2: 0.26.4
Python: 2.7.5 (default, May 31 2018, 09:41:32)
python-gnupg: Not Installed
PyYAML: 3.11
PyZMQ: 15.3.0
RAET: Not Installed
smmap: Not Installed
timelib: Not Installed
Tornado: 4.2.1
ZMQ: 4.1.4

System Versions:
dist: redhat 7.5 Maipo
locale: UTF-8
machine: x86_64
release: 3.10.0-862.14.4.el7.x86_64
system: Linux
version: Red Hat Enterprise Linux Server 7.5 Maipo

minion# salt-call --versions-report
Salt Version:
Salt: 2018.3.2

Dependency Versions:
cffi: 1.6.0
cherrypy: Not Installed
dateutil: 1.5
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
ioflo: Not Installed
Jinja2: 2.7.2
libgit2: Not Installed
libnacl: Not Installed
M2Crypto: 0.28.2
Mako: 0.8.1
msgpack-pure: Not Installed
msgpack-python: 0.5.6
mysql-python: Not Installed
pycparser: 2.14
pycrypto: 2.6.1
pycryptodome: Not Installed
pygit2: Not Installed
Python: 2.7.5 (default, May 31 2018, 09:41:32)
python-gnupg: Not Installed
PyYAML: 3.11
PyZMQ: 15.3.0
RAET: Not Installed
smmap: Not Installed
timelib: Not Installed
Tornado: 4.2.1
ZMQ: 4.1.4

System Versions:
dist: redhat 7.5 Maipo
locale: UTF-8
machine: x86_64
release: 3.10.0-862.14.4.el7.x86_64
system: Linux
version: Red Hat Enterprise Linux Server 7.5 Maipo

Labels: Bug, Confirmed, info-needed, severity-medium

All 23 comments

Unsure whether this is relevant, but the salt master-of-masters is configured with a master_job_cache using pgjsonb. The syndics are configured with both master_job_cache and returner.pgjsonb entries in their master configs.

I am seeing errors in the postgresql logs about duplicate keys in the jids table:

< 2018-10-25 16:18:55.848 CDT > ERROR:  duplicate key value violates unique constraint "jids_pkey"
< 2018-10-25 16:18:55.848 CDT > DETAIL:  Key (jid)=(20181025161855212684) already exists.
< 2018-10-25 16:18:55.848 CDT > STATEMENT:  INSERT INTO jids
                       (jid, load)
                        VALUES ('20181025161855212684', '{"fun_args": null, "jid": "20181025161855212684", "return": {"tgt_type": "list", "jid": "20181025161641102114", "tgt": ["minion1", "minion2", "minion3", "minion4", "minion5", "minion6", "minion7", "minion8"], "pid": 11919, "ret": "", "user": "root", "arg": [{"test": true, "__kwarg__": true}], "fun": "state.highstate", "master_id": "salt-master-of-masters"}, "retcode": 0, "success": true, "_stamp": "2018-10-25T21:18:55.839286", "fun": "saltutil.find_job", "master_id": "salt-master-of-masters", "id": "minion1"}')

The existing load value for that jid:

{
  "to": 10,
  "arg": [
    "20181025161641102114"
  ],
  "cmd": "publish",
  "fun": "saltutil.find_job",
  "jid": "20181025161855212684",
  "ret": "",
  "tgt": [
    "minion1",
    "minion2",
    "minion3",
    "minion4",
    "minion5",
    "minion6",
    "minion7",
    "minion8"
  ],
  "user": "root",
  "kwargs": {
    "raw": false,
    "gather_job_timeout": 10
  },
  "tgt_type": "list"
}

This is just one example; I am also seeing duplicate key errors for some of the state.highstate returns.

I upgraded the master, syndics, and minions to 2018.3.3 and the problem still exists.

Can you please include a sanitized version of your configuration files?

Also, to confirm: if you remove the postgres returner, does it work?

Does it occur when running just a simple salt '*' test.ping as well, or just a highstate?

Can you please include a sanitized version of your configuration files?
We used the saltstack-formulas/salt-formula to configure the master, syndics, and minions. Here is the pillar data:

master-of-masters role:

salt:
  master:
    order_masters: True

salt-master role:

salt:
  state_verbose: False
  master:
    fileserver_backend:
      - git
      - roots

    master_sign_pubkey: True
    master_use_pubkey_signature: True

    log_level_logfile: info

    file_recv: True

    top_file_merging_strategy: same

    gitfs_pubkey: << snipped >>
    gitfs_privkey: << snipped >>
    gitfs_remotes: << snipped >>
    gitfs_saltenv_whitelist:
      - base
      - eng
      - dev
      - qa
      - pre
      - prod

    env_order:
      - base
      - prod
      - pre
      - qa
      - dev
      - eng

    file_roots:
      base:
        - /srv/salt/base
    pillarenv_from_saltenv: True
    ext_pillar:
      - makostack: /srv/pillar/makostack/makostack.cfg

    master_job_cache: pgjsonb
    master_job_cache.pgjsonb.host: << snipped >>
    master_job_cache.pgjsonb.user: << snipped >>
    master_job_cache.pgjsonb.passwd: << snipped >>
    master_job_cache.pgjsonb.db: << snipped >>
    master_job_cache.pgjsonb.port: << snipped >>

    autosign_file: /etc/salt/autosign.conf

    file_ignore_regex:
      - '/\.svn($|/)'
      - '/\.git($|/)'

    file_ignore_glob:
      - '*.pyc'
      - '*/somefolder/*.bak'
      - '*.swp'

    key_cache: 'sched'
    unique_jid: True

    nacl.config:
      sk_file: /etc/salt/pki/master/nacl
      pk_file: /etc/salt/pki/master/nacl.pub

    reactors:
      - 'minion_start':
        - /srv/reactor/sync_grains.sls
      - 'salt/key':
        - /srv/reactor/salt_key.sls

salt_formulas:
  git_opts:
    default:
      update: True
      options:
        rev: master
        identity: << snipped >>
  basedir_opts:
    makedirs: True
  list:
    base:
      - salt-formula
      - systemd-formula
      - postfix-formula
      - chrony-formula
      - samba-formula
      - openssh-formula

salt-syndic role:

salt:
  state_verbose: False
  master:
    syndic_master: salt-master-of-masters
    master_id: salt-master-of-masters
    returner.pgjsonb.host: salt-master-of-masters
    returner.pgjsonb.user: <<snipped>>
    returner.pgjsonb.passwd: <<snipped>>
    returner.pgjsonb.db: <<snipped>>
    returner.pgjsonb.port: <<snipped>>

salt-minion role:

salt:
  minion:
    master:
      - __: overwrite
      - syndic1
      - syndic2
      - syndic3
    random_master: True
    master_shuffle: True
    master_type: failover

The syndics and the master-of-masters also overwrite salt:minion:master to point only at the salt-master-of-masters.

Also, to confirm: if you remove the postgres returner, does it work?

From where? The master-of-masters, the syndics, or all of them?

Does it occur when running just a simple salt '*' test.ping as well, or just a highstate?

We've seen the KeyError: u'ret' when running cmd.run, cmd.script, state.sls, state.apply, and state.highstate. I haven't seen it from test.ping.

Sadly, this exception also occurs without any master_job_cache or returners defined to send through pgjsonb.

Okay, so you are certain this is not related to the pgjsonb returner now? Did you just recently start seeing this behavior after an upgrade?

Yes, I'm certain it's not related to the pgjsonb returner.

We did an upgrade/migration of Salt from 2016.11 to 2018.3. Minions were previously connecting to a single master; now they connect to multiple salt-syndics in failover. The previous salt-master is untouched.

OLD: Master(1) -> Minions(X)

NEW: Master(1) -> Syndic(3) -> Minions(X)

If I run the salt batch command on a syndic, I do not see this exception.

Thanks for all the follow-up information. One more question, and then I will try to replicate the issue. Can you supply either a state file or the cmd.run command you are running when you see this?

salt -b 40 '*' state.apply salt.minion

Where salt.minion is from https://github.com/saltstack-formulas/salt-formula at commit ed9a1b1814b4b217e7e91f70a016768e75ec47eb.

Two other data points:

  1. The exception occurs on the master-of-masters regardless of whether all minions are connected to the syndic or not.

  2. The syndic has ~1000 minions connected to it.

Thanks for the further details. It looks like I'm able to replicate this issue, but only on the first run of the states, never after. You said you can replicate it with cmd.run; can you share the command you are running when you run into this error?

I am working with roskens on this and was able to reproduce the problem with salt -b 60 '*' cmd.run 'uptime ; df -h /'

Thanks. Ping @DmitryKuzmenko, can you put this on your backlog?

This little patch prevents the KeyError u'ret' from being raised and causing the batch to fail completely. I have not tested it with raw output.

$ git diff
diff --git a/salt/cli/batch.py b/salt/cli/batch.py
index e3a7bf9..65ba130 100644
--- a/salt/cli/batch.py
+++ b/salt/cli/batch.py
@@ -255,6 +255,10 @@ class Batch(object):
                     if self.opts.get('failhard') and data['ret']['retcode'] > 0:
                         failhard = True

+                if data.get('failed') is True:
+                    log.debug('Minion %s failed to respond: data=%s', minion, data)
+                    data = {'ret': 'Minion did not return. [Failed]'}
+
                 if self.opts.get('raw'):
                     ret[minion] = data
                     yield data

Connected minions still randomly fail to return results, and that's probably a separate issue to look into and address.
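
To illustrate the effect of that guard outside the Salt code base, a standalone sketch (again assuming an unresponsive minion comes back in the results as {'failed': True}):

# Standalone sketch of the guard added in the patch above (not Salt code).
parts = {
    'minion1': {'ret': 'uptime output here', 'retcode': 0},
    'minion2': {'failed': True},  # unresponsive minion
}

ret = {}
for minion, data in parts.items():
    if data.get('failed') is True:
        # Substitute a placeholder instead of letting data['ret'] raise KeyError.
        data = {'ret': 'Minion did not return. [Failed]'}
    ret[minion] = data['ret']

print(ret)
# {'minion1': 'uptime output here', 'minion2': 'Minion did not return. [Failed]'}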

Hi guys,

Do we have any updates on this issue? We're facing exactly the same problem as described above.

Thanks.

Another bump as well. This has been coming up more and more frequently on our deployment.

Just came up in mine - hoping not to see it again.

It happened on my system as well, but with this batch config in the master:

batch_safe_limit: 50
batch_safe_size: 10

Command: salt * saltutil.sync_grains (targeting about 4k minions)

Salt master version 2019.2.1
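
(For context: batch_safe_limit makes the master fall back to batch execution whenever a job targets more minions than that limit, with batch_safe_size as the batch size used in that case, so even a plain salt '*' call ends up in the same batch code path that raises this KeyError.)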

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.

Thank you for updating this issue. It is no longer marked as stale.

I have a PR open that should fix this, https://github.com/saltstack/salt/pull/54923, but no movement on it yet.
