Salt: Maximum recursion limit hit related to requisites

Created on 22 Aug 2014 · 16 comments · Source: saltstack/salt

I recently upgraded the master and minion from 2014.1.0 to 2014.1.10, and now I'm hitting:

2014-08-22 11:11:25,697 [salt.minion      ][WARNING ] The minion function caused an exception
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.7/salt/minion.py", line 797, in _thread_return
    return_data = func(*args, **kwargs)
  File "/usr/lib/pymodules/python2.7/salt/modules/state.py", line 394, in sls
    ret = st_.state.call_high(high_)
  File "/usr/lib/pymodules/python2.7/salt/state.py", line 1715, in call_high
    ret = self.call_chunks(chunks)
  File "/usr/lib/pymodules/python2.7/salt/state.py", line 1437, in call_chunks
    running = self.call_chunk(low, running, chunks)
  File "/usr/lib/pymodules/python2.7/salt/state.py", line 1652, in call_chunk
    running = self.call_chunk(low, running, chunks)
  File "/usr/lib/pymodules/python2.7/salt/state.py", line 1652, in call_chunk
...
  File "/usr/lib/pymodules/python2.7/salt/state.py", line 1652, in call_chunk
    running = self.call_chunk(low, running, chunks)
  File "/usr/lib/pymodules/python2.7/salt/state.py", line 1562, in call_chunk
    status = self.check_requisite(low, running, chunks, True)
  File "/usr/lib/pymodules/python2.7/salt/state.py", line 1482, in check_requisite
    req = trim_req(req)
  File "/usr/lib/pymodules/python2.7/salt/state.py", line 94, in trim_req
    reqfirst = next(iter(req))
RuntimeError: maximum recursion depth exceeded while calling a Python object

I've reduced this down to:

my-application:
  file.touch:
    - name: /tmp/file-c

graceful-shutdown:
  file.touch:
    - name: /tmp/file-a
    - prereq:
      - test: configuration-file

configuration-file:
  test.touch:
    - name: /tmp/file-b
    - require:
      - test: my-application

Executing it using the state.sls module triggers the exception.

I know there have been other instances of prereq issues (#8210, #8785), but I don't think either of them covers this case, which I believe is a higher-priority bug than those (since they define an explicit recursion, and I don't).

In production I seem to be hitting this when a state fails. I have one state that intermittently fails; when it succeeds, I don't get this stack trace. I can't really explain why it only triggers on failure, but I'm noting it here, maybe more for my own sake.

Master version:

$ salt --versions-report
           Salt: 2014.1.10
         Python: 2.7.3 (default, Feb 27 2014, 19:58:35)
         Jinja2: 2.6
       M2Crypto: 0.21.1
 msgpack-python: 0.1.10
   msgpack-pure: Not Installed
       pycrypto: 2.4.1
         PyYAML: 3.10
          PyZMQ: 13.0.0
            ZMQ: 3.2.2

Minion version:

$ salt-minion --versions-report
           Salt: 2014.1.10
         Python: 2.7.3 (default, Feb 27 2014, 19:58:35)
         Jinja2: 2.6
       M2Crypto: 0.21.1
 msgpack-python: 0.1.10
   msgpack-pure: Not Installed
       pycrypto: 2.4.1
         PyYAML: 3.10
          PyZMQ: 13.0.0
            ZMQ: 3.2.2
Labels: Bug, Core, P2, severity-medium

Most helpful comment

FYI, I just open-sourced a small utility for finding circular dependencies between states: https://github.com/JensRantil/salt-circular. It might help someone else too.

All 16 comments

Very interesting. I thought we had stamped out most of those but this looks like it might be another one. Thanks for the very clean test case. That should help make it much easier to find and fix this one!

Verified this problem persists in 2014.7.

@cachedout We recently upgraded to 2015.5.0 and are still seeing these recursions happen. It's very annoying and doesn't foster a culture of "things working" within our company. :-) I tested my minimal test case from the original issue description and can recreate the recursion limit error every time using state.sls with test=True.

I just hit this (or something like #20143) again and spent three days of my life trying various combinations of state configurations to avoid the infinite recursion. In my case I had quite a few newly added states, so it was really hard to drill down to what triggered it.

Out of every bug I've hit in Salt (and I've hit a few), this is definitely the most frustrating, as it's really, really hard to debug. Running the salt-minion in non-daemon mode with the DEBUG log level enabled did not help at all. I think this really could be split into two different steps:

  1. Allow a developer to debug what induces this.
  2. Fix the issue altogether.

# salt-minion --versions-report
Salt Version:
           Salt: 2015.8.8.2

Dependency Versions:
         Jinja2: 2.6
       M2Crypto: 0.21.1
           Mako: Not Installed
         PyYAML: 3.10
          PyZMQ: 14.0.1
         Python: 2.7.3 (default, Jun 22 2015, 19:33:41)
           RAET: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.4
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 1.5
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
        libgit2: Not Installed
        libnacl: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.3.0
   mysql-python: 1.2.3
      pycparser: Not Installed
       pycrypto: 2.6.1
         pygit2: Not Installed
   python-gnupg: Not Installed
          smmap: Not Installed
        timelib: Not Installed

System Versions:
           dist: Ubuntu 12.04 precise
        machine: x86_64
        release: 3.2.0-99-virtual
         system: Ubuntu 12.04 precise

Some general remarks regarding my issue:

  • Running state.highstate a second time makes my execution pass.
  • My current plan is to add a bunch of logging to the chunk method to figure out which state is bringing it down. Possibly an assertion could be added to make sure the next call to call_chunk isn't made with identical parameters (see the sketch after this list).
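
The assertion suggested in the second bullet could look something like the sketch below. It is purely illustrative, not actual Salt code (ChunkRunner and _chunk_tag are made up, and call_chunk's real logic is elided): keep a stack of the low chunks currently being resolved, and raise a readable error as soon as the same chunk re-enters that stack, instead of recursing until Python's interpreter limit is hit.

def _chunk_tag(low):
    # A low chunk is identified by its state module, function, id and name.
    return (low.get('state'), low.get('fun'), low.get('__id__'), low.get('name'))

class ChunkRunner(object):
    def __init__(self):
        self._active = []  # chunk tags on the current resolution path

    def call_chunk(self, low, running, chunks):
        tag = _chunk_tag(low)
        if tag in self._active:
            # The same chunk is already being resolved further up the call
            # stack: a requisite cycle. Report it instead of recursing.
            raise RuntimeError('Recursive requisite detected: {0}'.format(
                ' -> '.join(repr(t) for t in self._active + [tag])))
        self._active.append(tag)
        try:
            pass  # ... resolve requisites, possibly re-entering call_chunk ...
        finally:
            self._active.pop()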

Let me know if you have any input on how I can debug this further.

> Let me know if you have any input on how I can debug this further.

Can I make logging even more verbose than DEBUG?

Workaround: Alright, I figured out what went wrong on my end. Here's what I did: I patched the line above the recursive call in /usr/lib/pymodules/python2.7/salt/state.py (call_chunk) to log the arguments of the call_chunk invocation it was about to make:

log.trace("low=%s running=%s chunks=%s", low, running, chunks)

I then changed my minion configuration (both log_level and log_level_filename) to trace. I copied the output for low (I think!), started IPython, and ran a = <pasted copied text>; import pprint; pprint.pprint(a). In the output I could see that I had a state with a requisite that didn't exist. I corrected the requisite and voilà, things are working.
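
For anyone repeating this, the inspection step could be scripted roughly as follows (a sketch; it assumes low is the dict copied from the trace log and that requisites sit in top-level keys such as require and prereq):

import pprint

low = {}  # paste the trace-logged `low` dict here

pprint.pprint(low)

# List the requisites so a reference to a nonexistent state stands out.
for req_type in ('require', 'watch', 'prereq', 'onchanges', 'onfail'):
    for req in low.get(req_type, []):
        print('{0} -> {1}'.format(req_type, req))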

I hope I'll never have to do this again!

I've hit this on 2016.11.3 with a very simple state. This is nginx/init.sls:

nginx_install:
  pkg.installed:
    - name: nginx

nginx_service:
  service.running:
    - name: nginx
    - prereq:
      - nginx_install

#nginx_config:
#  file.managed:
#    - name: /etc/nginx/nginx.conf
#    - mode: 

This is my topfile:

base:
 davinci-staging:
   - nginx

The traceback below results from salt davinci-staging state.apply test=True, salt davinci-staging state.apply nginx test=True, and salt davinci-staging state.apply nginx.
Setting test=True does not matter in my case; the tracebacks are identical. Also, running the highstate several times did not make a difference.

davinci-staging:
    The minion function caused an exception: Traceback (most recent call last):
      File "/usr/lib/python2.7/dist-packages/salt/minion.py", line 1420, in _thread_return
        return_data = executor.execute()
      File "/usr/lib/python2.7/dist-packages/salt/executors/direct_call.py", line 28, in execute
        return self.func(*self.args, **self.kwargs)
      File "/usr/lib/python2.7/dist-packages/salt/modules/state.py", line 543, in apply_
        return sls(mods, **kwargs)
      File "/usr/lib/python2.7/dist-packages/salt/modules/state.py", line 1039, in sls
        ret = st_.state.call_high(high_, orchestration_jid)
      File "/usr/lib/python2.7/dist-packages/salt/state.py", line 2299, in call_high
        ret = dict(list(disabled.items()) + list(self.call_chunks(chunks).items()))
      File "/usr/lib/python2.7/dist-packages/salt/state.py", line 1815, in call_chunks
        running = self.call_chunk(low, running, chunks)
      File "/usr/lib/python2.7/dist-packages/salt/state.py", line 2100, in call_chunk
        running = self.call_chunk(low, running, chunks)

<SNIP>

      File "/usr/lib/python2.7/dist-packages/salt/state.py", line 2100, in call_chunk
        running = self.call_chunk(low, running, chunks)
      File "/usr/lib/python2.7/dist-packages/salt/state.py", line 2100, in call_chunk
        running = self.call_chunk(low, running, chunks)
      File "/usr/lib/python2.7/dist-packages/salt/state.py", line 1997, in call_chunk
        low = self._mod_aggregate(low, running, chunks)
      File "/usr/lib/python2.7/dist-packages/salt/state.py", line 762, in _mod_aggregate
        agg_opt = self.functions['config.option']('state_aggregate')
      File "/usr/lib/python2.7/dist-packages/salt/loader.py", line 1086, in __getitem__
        func = super(LazyLoader, self).__getitem__(item)
      File "/usr/lib/python2.7/dist-packages/salt/utils/lazy.py", line 91, in __getitem__
        if self._missing(key):
RuntimeError: maximum recursion depth exceeded

Versions:

root@davinci-staging:~# salt-minion --versions-report
Salt Version:
           Salt: 2016.11.3

Dependency Versions:
           cffi: 0.8.6
       cherrypy: Not Installed
       dateutil: 2.2
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.9.4
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.2
   mysql-python: 1.2.3
      pycparser: 2.10
       pycrypto: 2.6.1
         pygit2: Not Installed
         Python: 2.7.9 (default, Jun 29 2016, 13:08:31)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 14.4.0
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.5

System Versions:
           dist: debian 8.7 
        machine: x86_64
        release: 3.16.0-4-amd64
         system: Linux
        version: debian 8.7 

I was able to work around my issue by using prereq_in. This state works fine:

nginx_install:
  pkg.installed:
    - name: nginx
    - prereq_in:
      - nginx_service

nginx_service:
  service.running:
    - name: nginx

#nginx_config:
#  file.managed:
#    - name: /etc/nginx/nginx.conf
#    - mode:

@cdunklau The problem with your sls is here:

    - prereq:
      - nginx_install

AFAIK that is invalid. It should be written as:

    - prereq:
      - pkg: nginx_install

Do that, and it works for me.

I've spent a couple of days pulling my hair out over this bug. Following @JensRantil's lead, I dumped the low, running, and chunks variables. In my case I couldn't see any broken references, but there was so much data (just one line from the log was 67 KB) that I was unable to verify everything manually. Eventually I wrote a quick, hacky script to see where the problem might be.

#!/usr/bin/env python
"""Scan dumped state chunks for requisites referencing nonexistent states."""

import sys

import yaml

# Paste the logged `chunks` value here (a list of low-chunk dicts).
chunks = [...]

REQUISITE_TYPES = ('require', 'watch', 'prereq', 'use', 'onchanges', 'onfail')


def check_exists(state, name):
    """Return True if a chunk with this state module and id/name exists."""
    for block in chunks:
        if block['state'] == state and name in (block['__id__'], block.get('name')):
            return True
    return False


def print_yaml(chunks):
    print(yaml.dump(chunks, default_flow_style=False, explicit_start=True))


def main(chunks):
    try:
        if sys.argv[1] == 'yaml':
            print_yaml(chunks)
            sys.exit(0)
    except IndexError:
        pass

    # Check every requisite type, in both direct and _in form.
    for block in chunks:
        for requisite_type in REQUISITE_TYPES:
            for requisite in (requisite_type, requisite_type + '_in'):
                for req in block.get(requisite, []):
                    state, name = next(iter(req.items()))
                    if not check_exists(state, name):
                        print("MISSING! '{0}: {1}'".format(state, name))


if __name__ == "__main__":
    main(chunks)
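
To use it, paste the logged chunks value into the chunks assignment near the top, then run the script with no argument to report missing requisites, or with yaml to pretty-print the chunks (the filename here is arbitrary):

python find_missing_reqs.py
python find_missing_reqs.py yaml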

Initially it pointed out a large number of sls: somefile references, which I hadn't handled in the script. I got rid of those just to be sure they weren't somehow related to the recursion. There were a few other minor things it picked up on too, but fixing them all proved to be in vain. Nothing seemed to help me avoid the infinite recursion of doom.

At this point, I decided to start deleting everything bit by bit in our staging environment until I got something working. This proved difficult: various things were generated automatically via pillar, lots of files included lots of other files that were required to get things working, etc., and each test took some time to run. However, after a lot of effort, I got everything reduced down to the minimum.

Here is my minimal example (valid SLS, to the best of my knowledge anyway) that triggers the infinite recursion exception:

Install blog:
  pkg.installed:
    - name: blog

Catch-all for blog configuration settings:
  test.succeed_without_changes:
    - require:
      - pkg: Install blog

{% if True %}
Some specific blog configuration setting:
  test.succeed_without_changes:
    - require_in:
      - test: Catch-all for blog configuration settings
    - require:
      - pkg: Install blog
{% endif %}

Do this first:
  test.succeed_without_changes:
    - prereq:
      - pkg: Install blog
      - test: Catch-all for blog configuration settings

Do this last:
  test.succeed_without_changes:
    - onchanges:
      - test: Do this first
    - require:
      - pkg: Install blog
      - test: Catch-all for blog configuration settings

This somewhat describes the core of one of our environments. We have a package or some other deployment mechanism (the package blog in this example) which is installed or upgraded.

We then have a bunch of configuration blocks in our sls file to verify permissions and ownership, update configuration files shipped with the blog package with the correct settings, etc. In this example, I have included one such block, called Some specific blog configuration setting. Some of these settings will only render under specific conditions, hence the {% if True %} Jinja2 conditional representing the possibility that the block may not exist.

When a block somewhere needs to depend on the blog package being completely installed, we depend on the Catch-all for blog configuration settings block (as opposed to the entire sls: state file, since there could be other things in there for other packages). All configuration blocks are set to require_in the Catch-all; we can't have the Catch-all require them directly, since those blocks may not render (and we certainly don't want to duplicate all of our other Jinja2 conditions just for this block), so we have to use require_in instead.

So just these three blocks work all well and good, but then we decide that we need to put our application behind a load balancer. We don't want to serve traffic from the blog application on a host that is in the middle of an upgrade, so we have a custom execution module and a module.run call to publish.publish in a prereq block to remove the instance from the load balancer during the upgrade, and finally a similar block, using onchanges against the former, to add it back. The Do this first and Do this last blocks represent these steps.

So to clarify:

  1. Do this first has a prereq on the installation of blog, as well as on any configuration changes (although the latter won't actually work here, because we're just using test.succeed_without_changes... but it's just an example).
  2. Do this last has an onchanges against Do this first, so if the host was removed from the load balancer prior to an upgrade or configuration change, this ensures it is re-added. This block requires Catch-all for blog configuration settings to have completed first, since we don't want it running before then. It also depends on blog just to be sure (in case blog is ever removed from the Catch-all block), but we hit the infinite recursion here regardless.

So what is the workaround? There isn't a good one that I can see; there is only a bad option and an ugly option.

  1. Remove the test: Catch-all for blog configuration settings requisite from the Do this first block. I guess this will cause any configuration changes to be made without triggering the Do this first/Do this last blocks. Not ideal, but better than nothing...

  2. Remove the requisite pkg: Install blog from the Some specific blog configuration setting block - but this is very dangerous! In my testing, this will cause the block to always trigger before the blog package is installed. Even adding - order: 1 to the Install blog block and - order: last to the Some specific blog configuration setting block doesn't seem to actually change the order.

I'm seeing this in 2016.11.4.

Closed via #42985

@cachedout Is https://github.com/saltstack/salt/issues/15171#issuecomment-323872701 also supposed to close this issue? It's still open...

Eh? This is #15171. Which comment are you linking to?

@cachedout The one that states "Closed via #42985".

FYI, I just open-sourced a small utility for finding circular dependencies between states: https://github.com/JensRantil/salt-circular. It might help someone else too.
