Salt: Concurrency issues in reactor & orchestration with long-running orchestrations simultaneously targeting different minions

Created on 8 Mar 2018 · 23 comments · Source: saltstack/salt

Description of Issue/Question

It seems that there is some kind of limit on the number of concurrent reactors or orchestrations, regardless of the number of worker threads configured in the salt master configuration file.

Setup

I am showing the config used to reproduce the issue.
Reactor configured in the master:


reactor:
  - 'debug/minion-created':
    - /srv/salt/reactor/debug_minion_creation.sls

# cat /srv/salt/reactor/debug_minion_creation.sls
debug_orchestration:
  runner.state.orchestrate:
    - args:
      - mods: orch.debug
      - pillar:
          event_tag: {{ tag }}
          event_data: {{ data|json }}

###############################
# cat /srv/salt/orch/debug.sls 
{% import "orch/map.jinja" as v with context %}

debug_event_task_on_master:
  salt.function:
    - name: cmd.run
    - tgt: "saltmaster*"
    - ret: rawfile_json
    - arg:
      - "echo $HOSTNAME ORCHESTRATION minion={{v.minion_id}} >>/tmp/debug_orchestration.log"

debug_event_task_on_target:
  salt.function:
    - name: cmd.run
    - tgt: {{ v.minion_id }}
    - ret: rawfile_json
    - arg:
      - "echo $HOSTNAME ORCHESTRATION minion={{v.minion_id}} >>/tmp/debug_orchestration.log"



state_append_date:
  salt.state:
    - sls: append_date
    - tgt: {{ v.minion_id }}
    - ret: rawfile_json

# cat /srv/salt/append_date.sls 

debug_command_start:
  cmd.run:
    - name: "echo ORCH_STATE_START Minion $(date) >>/tmp/debug_orchestration.log"

just_wait:
  cmd.run:
    - name: sleep 1800

debug_command_end:
  cmd.run:
    - name: "echo ORCH_STATE_END Minion $(date) >>/tmp/debug_orchestration.log"

Steps to Reproduce Issue

Generate the event on all minions

salt "*"  cmd.run "salt-call  event.send --out json 'debug/minion-created'"

At this point, usually a maximum of 10 minions will have their orchestrations running. This can be verified either by checking the /tmp/debug_orchestration.log files on all the minions involved, or by counting the lines in /tmp/debug_orchestration.log on the salt master itself (note that the master appends a line to that file on every orchestration run, whichever minion triggered it).
In this simplified use case, once the first batch finishes another batch does eventually start, but in our more complex real-life setup we often end up with orchestrations that never start at all; I am not yet sure how to reproduce that part either. Perhaps an even longer state triggers some timeout in the master that discards the event and the queue of orchestrations to run? In any case, while the salt master has activity, no further orchestrations start. This is a significant issue for us because I cannot find which parameter controls this concurrency bottleneck, and it is not even clear to me whether this is a reactor limit or an orchestration limit. The fact that in our scenario some orchestrations never start is even more problematic, but let's not focus on that part right now, since I cannot reliably reproduce the total loss of an orchestration; I just hope that once I can adjust the concurrency, the apparent event loss will also go away.

Versions Report

salt-master-2017.7.4-1.el7.noarch

salt --versions-report

Salt Version:
           Salt: 2017.7.4

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: Not Installed
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.7.2
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.8
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: 3.4.3
         pygit2: Not Installed
         Python: 2.7.5 (default, Nov  6 2016, 00:28:07)
   python-gnupg: 0.3.8
         PyYAML: 3.12
          PyZMQ: 15.3.0
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.1.4

System Versions:
           dist: centos 7.3.1611 Core
         locale: UTF-8
        machine: x86_64
        release: 3.10.0-514.16.1.el7.x86_64
         system: Linux
        version: CentOS Linux 7.3.1611 Core


All 23 comments

Could you take a look at this?

https://docs.saltstack.com/en/latest/ref/runners/all/salt.runners.queue.html#queued-runners

I wrote this originally to allow for queuing up jobs so that they aren't missed and so that you could control the number of runners that are run at once.

Basically, in your reactor, you would just insert the runners into a pgjsonb queue, and it would then pop them out and run the job when the scheduler is set to process it. That way, you make sure you don't lose jobs if too many orchestration states are running at once.

Thanks,
Daniel
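
A rough sketch of the pattern Daniel describes, assuming the insert_runner/process_runner interface from the queued-runners docs linked above. The queue name, scheduling interval, and exact argument layout are assumptions to verify against the docs for your Salt version, and the pgjsonb backend needs its own Postgres connection configured separately:

# /srv/salt/reactor/debug_minion_creation.sls -- enqueue the orchestration instead of running it directly
queue_debug_orchestration:
  runner.queue.insert_runner:
    - args:
      - fun: state.orchestrate
      - kwargs:
          mods: orch.debug
          pillar:
            event_tag: {{ tag }}
            event_data: {{ data|json }}

# /etc/salt/master -- let the master's scheduler drain the queue at a controlled rate
schedule:
  drain_runner_queue:
    function: queue.process_runner
    minutes: 1
    kwargs:
      quantity: 2    # how many queued runners to pop and execute per interval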

That looks like something worth looking into, definitely - I will test it as soon as possible.
However:

  • I still have no idea where the current concurrency limit comes from; no parameter seems to correspond to that value. We need transparency about the current limits, as well as a way to tune them.
  • We need more visibility into runners that do not run, and why. Is it a reactor limit? A runner limit? Is this concurrency affected by other states running on other machines? In other words, is the concurrency limit determined by the number of orchestrations plus the number of states, or only by the number of orchestrations?
  • I see it as a major deal breaker that there are no documented limits on reactors/runners, nor any documentation of the service guarantees the reactor offers. Can I rely on a reaction ALWAYS happening once the event is triggered? Otherwise, the current reactor system is a rather hit-and-miss experience.

The only concurrency issue I am aware of is that there used to be a problem with running more than one orchestrate at once, because orchestration used the HighState class, which can only run one command per Salt instance: the class checks whether another state run is already in progress and refuses to start any more. That behavior is usually useful when minions are targeted with multiple state runs; it returns something like

highstate is already running as 2838810(jid) since xxxxxx

But I think that has since been fixed for orchestration.

@saltstack/team-core are there any other limitations to the concurrency of the reactor/runner setup or any of the other things mentioned in the last update?

Thanks,
Daniel

You are showing an example of trying to apply a highstate twice on the same minion. Different story.

Running multiple highstates is probably limited to one per minion, not one per salt master. A single highstate per salt master - I highly doubt it; that would be unusable for any larger installation.
In my case, it is not about multiple orchestrations on the same minion, but rather multiple orchestrations, each targeting a different minion.

My example does not even use a highstate, btw. It just launches a simple orchestration on a bunch of different minions.

That is the way it used to be: you could only run one orchestrate state per master, because it uses the same highstate system that the minions use, and it checked whether the master was already running another orchestration state.

  • That would totally not scale; I doubt this was ever implemented, and I see no documentation for it. Is this limit documented somewhere?
    Besides, in my case I always see exactly 10 at a time, never more. Where does 10 come from? No idea, probably some hidden and undocumented limitation.
    In your example, are you sure you are not targeting the same minion?
    I would add that an explicit limit of one (ONE) orchestration at a time per salt master would make any significantly large deployment unusable for anything event-driven. If you add 10 resources to an infrastructure at the same time and rely on a reactor-triggered orchestration to do some complex setup for each of them as its minion comes alive, then what you are saying is that the orchestration would have to run once per minion, one at a time. Given that orchestration was designed exactly as a tool to facilitate very long-running operations, limiting it to one run at a time runs contrary to the spirit of what reactors and orchestrations were supposed to achieve. I just cannot see this as ever having been seriously considered :)

As I said in a previous comment, this was an old limitation that was supposed to have been fixed. But I don't know when it was fixed, or who fixed it, and I am not aware of any current limitation, which is why I pinged the rest of the core team for input.

Oh. Missed the "old" part, sorry for that! Let's wait for their reaction, then.

Should the minion not look after itself and queue up work if need be?
If two highstates are called one after the other, should the second one not just report that it is running (or respond with a new "waiting to run" code) until it can actually run? For example, on most platforms only one package install can run at once; I think yum waits and Windows errors out.

There is a queue option on the state.highstate module:

queue : False

    Instead of failing immediately when another state run is in progress, queue the new state run to begin running once the other has finished.

    This option starts a new thread for each queued state run, so use this option sparingly.

This is NOT an issue about how to run the highstate concurrently on the same minion.

  • The states in the orchestration target different minions, although they are triggered concurrently by an event and a reactor.
  • Also, this can be reproduced without even running highstates. An orchestration containing only a simple long-running state will trigger this behavior.
  • The current limit is 10, although I cannot find any documented limit or parameter matching this number.

@GreatSnoopy I understand where you are coming from; however, are you 100% sure that your orchestration does not hit the same minion twice at any stage, get an error, and stop? The issue could be a combination of things.

I think the code samples that can be used to reproduce the issue are pretty clear.
And just to confirm: yes, the orchestrations are started by triggering an event when the minion comes alive. The event triggers a reactor that launches an orchestration, which in turn runs a long-running state on the minion ID that generated the event.

Let's go with your scenario nevertheless, but referring to my example, where all the orchestration actually does is run a long sleep 1800 command:

  • We have 50 freshly created minions, and each sends a "minion-created" event.
  • Minion1...minion10 come alive and their respective orchestrations start. Each is now running an "echo >> file" for debugging, then "sleep 1800".
  • Minions 11 to 50 are doing basically nothing, even though they generated the event.
  • You are implying that out of an arbitrary number of minions, exactly 10 will somehow end up not overlapping while the others overlap? If overlap were the cause, it would probably not always stop at exactly 10; sometimes the overlap would happen sooner, sometimes later, so the first conflict would most likely occur at different concurrency levels.

Found the answer in the sources. The limit is not in the orchestration engine, but rather in the reactor engine.

self.opts['reactor_worker_threads'],  # number of workers for runner/wheel

It seems I was searching the docs in vain for "orch", hoping to find parameters for orchestration, when in fact the limit is not in the orchestration engine but in the reactor engine. I'm glad I looked directly at the sources. The parameter is reactor_worker_threads, which has a default of 10 - surprise, surprise.
I will increase that and see what happens on the real-life deployment, which is more complex.
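
For anyone who lands here later: raising that limit is a single setting in the master config (the value below is only illustrative, not a recommendation, and the master needs a restart to pick it up):

# /etc/salt/master
# Number of reactor worker threads pulling reactions off the internal queue.
# The default of 10 matches the concurrency ceiling observed above.
reactor_worker_threads: 50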

For the moment, after beefing up reactor_worker_threads, it seems I no longer encounter the issue. Are there any best practices for this parameter? What about reactor_worker_hwm - how does its size impact the reactor? I do not understand exactly what queue it controls, or what happens (or does not happen) if it is too small.

Hello there. Let me do some explaining here. :)

@GreatSnoopy has more or less sussed out the correct assessment of this issue. What's likely happening is that the queue which the master uses to track reactor jobs prior to execution is reaching capacity and silently dropping jobs. It's pretty rare for this to happen under normal operation but in cases where the reactions are orchestrations which are potentially long-running and you're dealing with a lot of traffic to the reactor, this becomes much more likely.

The tl;dr to this is to turn reactor_worker_hwm way up. To ensure that the queue limit is never reached, set this value to 0 in the master and the queue will only be bound by available memory. (I can't guarantee this won't produce leaks, though.) The default at present is 10,000. Doing so will allocate slightly more memory on reactor startup but obviously, the queue depth will be increased and the chance of a job being silently dropped will decrease.
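
As a concrete illustration of the advice above (values are illustrative only), the two ways to relax the queue bound would look like this in the master config:

# /etc/salt/master
# Unbounded reactor job queue: nothing is dropped, but keep an eye on memory.
reactor_worker_hwm: 0

# ...or keep a bound, just deeper than the default of 10000:
# reactor_worker_hwm: 100000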

The longer explanation is that the reactor works by using a thread pool implementation contained inside salt.utils.process.ThreadPool. It uses Python's stdlib Queue to enqueue jobs, which are picked up by standard Python threads. If the queue is full, False is simply returned by the firing method on the thread pool. However, it appears that this condition is not handled or logged. (Obviously, this is a problem.)

As such, there are a few things to say about the selection of proper values for the master reactor.

1) If you are in a situation where you expect many long-running jobs to be executed by the master, reactor_worker_hwm should be increased or even set to 0 to bound it only by available memory. If set to zero, a close eye should be kept on memory consumption.

2) If you are in a situation where you expect many long-running jobs and execution concurrency and performance are a concern, you may also increase the value of reactor_worker_threads. This controls the number of concurrent threads which are pulling jobs from the queue and executing them. Obviously, this bears a relationship to the speed at which the queue itself will fill up. The price to pay for increasing this value is that each thread will contain a copy of the Salt code needed to perform the requested action. (We call these "clients".) You may also get into a situation where the Python GIL begins to show its concurrency weaknesses, so it's perhaps not advisable to blindly turn this up too far.

I'm not going to get too far into the math that one would use to make an informed decision about how to tune one of those knobs in relation to the other, though I can recommend a great book by Neil Gunther on systems capacity planning which provides a good primer on problems like these. ;) In short, though, you need to either turn up the number of workers consuming from the queue so that it does not reach capacity -- which is probably what you want if your jobs don't run for a very long time and you have a high volume of reactions -- or turn up the queue depth and let a relatively smaller number of workers churn through it. (Of course you can do both, with lots of memory and some bravery!)

Now, since tuning is hard and sort of tricky to do, especially given how things are now, what can we do to improve this situation:

1) I propose that we change the reactor to detect conditions where firing an async job returned False, which indicates that the queue was full. We should log this at log-level error.

2) We should better document what each tuning parameter does in the documentation so that people can make better decisions.

3) We should consider the addition of a dedicated thread which does nothing but report on the size of the queue. We should fire master events from this queue monitor so that it can be picked up by Grafana. A logging/graphing engine that can be easily enabled should be added to achieve this.

OK, I think that's enough to say about this. Let me know if you have questions. :)

@cachedout PR for proposal 1 just went out.

As for 3, can you point me to a pattern to follow in code somewhere? I'd be happy to implement this as well.

Hi @geekinutah! Thanks for the PR!

On point number three, there really isn't an example in place anywhere in the current codebase. Basically, what I'm imagining here is a dedicated thread that checks the queue size and depth every N seconds and then fires events onto the bus. It would be a bit similar to master_stats; if you grep for that string you should see some references to other master-side metrics being collected and fired.
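
For context, the existing master_stats collection referred to here is switched on in the master config; roughly like the sketch below (option names as I understand them from the master config docs -- verify against your Salt version):

# /etc/salt/master
# Periodically fire master performance statistics onto the event bus.
master_stats: True
master_stats_event_iter: 60    # interval, in seconds, between stats events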

Cool, that's what I need. Thanks @cachedout .

@cachedout great update, thank you; it makes sense to me.
I would say that, in general, no. 3 seems like a good option for monitoring all HWMs 👍

@cachedout, re your explanation above: reactor workers are only responsible for rendering runner.state.orchestrate / event.master_fire calls for the master to then run the orchestration, right? If that is the case, I would have thought that setting even a moderately high worker queue/HWM on the reactor would saturate master resources before the reactor side starts failing (though I guess that really boils down to how much volume you are pushing through the reactor in terms of reactions).
Re your master_stats comment: from what I can find, ZeroMQ exposes no such introspection stats on a socket for queue depth/HWM usage/etc. (from trying to look at and monitor similar issues in the past). Did you have something specific in mind?

@cachedout et al., https://github.com/saltstack/salt/pull/48874 addresses proposal 3. Feedback on the PR would be appreciated.

Hi folks!
Does reactor_worker_threads work for a minion in masterless mode with the reactor enabled?
Thanks.
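
For what it's worth, on a masterless minion the reactor runs as an engine, and the engine appears to take its own worker settings rather than the master options. A minimal minion config sketch, with option names taken from my reading of the salt.engines.reactor docs (treat them as assumptions and verify for your version):

# /etc/salt/minion (masterless)
engines:
  - reactor:
      refresh_interval: 60     # see the engine docs for the exact meaning
      worker_threads: 10       # engine-side counterpart of reactor_worker_threads
      worker_hwm: 10000        # engine-side counterpart of reactor_worker_hwm

reactor:
  - 'debug/minion-created':
    - /srv/salt/reactor/debug_minion_creation.sls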
