There seems to be some kind of limit on the number of concurrent reactors or orchestrations, regardless of the number of worker threads configured in the salt-master configuration file.
Below is the config used to reproduce the issue.
Reactor configured in the master:
reactor: [ {"debug/minion-created": ["/srv/salt/reactor/debug_minion_creation.sls"]} ]
# cat /srv/salt/reactor/debug_minion_creation.sls
debug_orchestration:
  runner.state.orchestrate:
    - args:
      - mods: orch.debug
      - pillar:
          event_tag: {{ tag }}
          event_data: {{ data|json }}
###############################
# cat /srv/salt/orch/debug.sls
{% import "orch/map.jinja" as v with context %}

debug_event_task_on_master:
  salt.function:
    - name: cmd.run
    - tgt: "saltmaster*"
    - ret: rawfile_json
    - arg:
      - "echo $HOSTNAME ORCHESTRATION minion={{v.minion_id}} >>/tmp/debug_orchestration.log"

debug_event_task_on_target:
  salt.function:
    - name: cmd.run
    - tgt: {{ v.minion_id }}
    - ret: rawfile_json
    - arg:
      - "echo $HOSTNAME ORCHESTRATION minion={{v.minion_id}} >>/tmp/debug_orchestration.log"

state_append_date:
  salt.state:
    - sls: append_date
    - tgt: {{ v.minion_id }}
    - ret: rawfile_json
# cat /srv/salt/append_date.sls
debug_command_start:
  cmd.run:
    - name: "echo ORCH_STATE_START Minion $(date) >>/tmp/debug_orchestration.log"

just_wait:
  cmd.run:
    - name: sleep 1800

debug_command_end:
  cmd.run:
    - name: "echo ORCH_STATE_END Minion $(date) >>/tmp/debug_orchestration.log"
Generate the event on all minions
salt "*" cmd.run "salt-call event.send --out json 'debug/minion-created'"
At this point, usually a maximum of 10 minions will have their orchestrations running. This can be verified either by checking the /tmp/debug_orchestration.log files on all the minions involved, or by counting the lines in /tmp/debug_orchestration.log on the salt-master itself (note that the orchestration appends a line to that file on the salt-master for every minion it runs against).
In this simplified use case, once the first batch finishes another batch will eventually start, but in our more complex real-life setup we often end up with orchestrations that never start at all; I am not yet sure how to reproduce that part either. Maybe an even longer-running state triggers some timeout in the master that discards the event and the queue of orchestrations to run? In any case, once the salt-master has activity, the orchestrations no longer start. This is a significant issue for us because I cannot find which parameter controls this concurrency bottleneck, and it is not even clear to me whether the limit is reactor-related or orchestration-related. The fact that in our scenario some orchestrations never start is even more problematic (but let's not focus on that for now, since I cannot reliably reproduce the total loss of an orchestration; I hope that once I can tune the concurrency, the apparent event loss will also go away).
salt-master-2017.7.4-1.el7.noarch
Salt Version:
    Salt: 2017.7.4
Dependency Versions:
    cffi: Not Installed
    cherrypy: Not Installed
    dateutil: Not Installed
    docker-py: Not Installed
    gitdb: Not Installed
    gitpython: Not Installed
    ioflo: Not Installed
    Jinja2: 2.7.2
    libgit2: Not Installed
    libnacl: Not Installed
    M2Crypto: Not Installed
    Mako: Not Installed
    msgpack-pure: Not Installed
    msgpack-python: 0.4.8
    mysql-python: Not Installed
    pycparser: Not Installed
    pycrypto: 2.6.1
    pycryptodome: 3.4.3
    pygit2: Not Installed
    Python: 2.7.5 (default, Nov 6 2016, 00:28:07)
    python-gnupg: 0.3.8
    PyYAML: 3.12
    PyZMQ: 15.3.0
    RAET: Not Installed
    smmap: Not Installed
    timelib: Not Installed
    Tornado: 4.2.1
    ZMQ: 4.1.4
System Versions:
    dist: centos 7.3.1611 Core
    locale: UTF-8
    machine: x86_64
    release: 3.10.0-514.16.1.el7.x86_64
    system: Linux
    version: CentOS Linux 7.3.1611 Core
Could you take a look at this?
https://docs.saltstack.com/en/latest/ref/runners/all/salt.runners.queue.html#queued-runners
I wrote this originally to allow for queuing up jobs so that they aren't missed and so that you could control the number of runners that are run at once.
Basically, in your reactor, you would just insert the runners into a pgjsonb queue, and it would then pop them out and run the job when the scheduler is set to process it. That way, you make sure you don't lose jobs if too many orchestration states are running at once.
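For context, a reaction along these lines is roughly what that looks like - an untested sketch based on the queued-runners docs linked above. The file name, queue name, schedule entry and quantity are placeholders made up for illustration, the pgjsonb backend also needs its own connection settings in the master config (not shown), and the exact argument layout should be double-checked against the queue runner docs:

# hypothetical reactor SLS - enqueue the orchestration instead of running it directly
queue_debug_orchestration:
  runner.queue.insert_runner:
    - args:
      - fun: state.orchestrate
      - kwargs:
          mods: orch.debug
          pillar:
            event_tag: {{ tag }}
            event_data: {{ data|json }}
      - queue: orch_queue
      - backend: pgjsonb

# in the master config - a schedule entry that pops and runs queued runners periodically
schedule:
  process_orch_queue:
    function: queue.process_runner
    minutes: 1
    kwargs:
      queue: orch_queue
      backend: pgjsonb
      quantity: 2

With something like this in place, bursts of events only grow the persistent queue instead of piling work directly onto the reactor, and the schedule's quantity controls how many runners are started per pass.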
Thanks,
Daniel
That definitely looks like something worth looking into - I will test it as soon as possible.
However:
The only concurrency issue I am aware of is that there used to be a problem with running more than one orchestrate at once, because orchestration uses the HighState class, which only allows one state run per Salt instance: the class checks whether another state run is in progress and refuses to start a new one (usually useful when minions are targeted with multiple state runs). In that case it returns something like:
highstate is already running as 2838810(jid) since xxxxxx
But I think that has since been fixed for orchestration.
@saltstack/team-core are there any other limitations to the concurrency of the reactor/runner setup or any of the other things mentioned in the last update?
Thanks,
Daniel
You are showing an example of trying to apply a highstate twice on the same minion. Different story.
Running multiple highstates is probably limited to one per minion, not one per salt-master. A single highstate per salt-master - I highly doubt it; that would be unusable for any larger installation.
In my case, it is not about multiple orchestrations on the same minion - rather about multiple orchestrations, each targeting a different minion.
My example does not even use a highstate, btw. It just launches a simple orchestration on a bunch of different minions.
That is the way it used to be: you could only run one orchestrate state per master, because orchestration uses the same highstate system that the minions use, and it checked whether the master was already running another orchestration state.
As I said in a previous comment, this was an old limitation that was supposed to be fixed. But I don't know when it was fixed, or who fixed it, and I am not aware of any current limitation, which is why I pinged the rest of the core team for input.
Oh. Missed the "old" part, sorry for that! Let's wait for their reaction, then.
Shouldn't the minion look after itself and queue up work if need be?
If two highstates are called one after the other, shouldn't the second one just respond with "running" until it can actually run (or respond with a new "waiting to run" code)? For example, on most platforms only one package install can run at once - I think yum waits and Windows errors out.
There is a queue option on the state.highstate module:

queue : False
    Instead of failing immediately when another state run is in progress, queue the new state run to begin running once the other has finished.
    This option starts a new thread for each queued state run, so use this option sparingly.
This is NOT an issue about how to run the highstate concurrently on the same minion.
@GreatSnoopy I understand where you are coming from; however, are you 100% sure that your orchestration does not hit the same minion twice at any stage, get an error, and stop? The issue could be a combination of things.
I think the code samples that can be used to reproduce the issue are pretty clear.
And just to confirm, yes. The orchestrations are started by triggering an event when the minion comes alive. The event triggers a reactor that launches an orchestration which in turn runs a long running state on the minion id that generated the event.
Let's go with your scenario nevertheless - but referring to my example, where the whole orchestration is essentially a long-running sleep 1800 command.
Found the answer in the sources. The limit is not in the orchestration engine, but rather in the reactor engine:
self.opts['reactor_worker_threads'], # number of workers for runner/wheel
It seems I was searching the docs in vain for "orch", hoping to find parameters for orchestration, when in fact the limit is not in the orchestration engine but in the reactor engine. I'm glad I looked directly at the sources. The parameter is reactor_worker_threads - which has a default of 10, surprise, surprise.
I will increase that and see what happens on the real-life deployment which is more complex.
For the moment, after increasing reactor_worker_threads, I no longer seem to encounter the issue. Are there any best practices for this parameter? What about reactor_worker_hwm - how can its size impact the reactor? I do not understand exactly which queue it controls, or what happens (or does not happen) if it is too small.
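For completeness, the change itself is just one entry in the master config; the value below is only an illustrative example, not a recommendation:

# /etc/salt/master
# default is 10 worker threads in the reactor's thread pool (illustrative value)
reactor_worker_threads: 50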
Hello there. Let me do some explaining here. :)
@GreatSnoopy has more or less sussed out the correct assessment of this issue. What's likely happening is that the queue which the master uses to track reactor jobs prior to execution is reaching capacity and silently dropping jobs. It's pretty rare for this to happen under normal operation but in cases where the reactions are orchestrations which are potentially long-running and you're dealing with a lot of traffic to the reactor, this becomes much more likely.
The tl;dr to this is to turn reactor_worker_hwm way up. To ensure that the queue limit is never reached, set this value to 0 in the master and the queue will only be bound by available memory. (I can't guarantee this won't produce leaks, though.) The default at present is 10,000. Doing so will allocate slightly more memory on reactor startup but obviously, the queue depth will be increased and the chance of a job being silently dropped will decrease.
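In config terms that's simply the following (0 shown here as the unbounded example - keep an eye on memory if you use it):

# /etc/salt/master
# default is 10000; 0 removes the cap so the reactor queue is bounded only by available memory
reactor_worker_hwm: 0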
The longer explanation is that the reactor uses a thread pool implementation contained inside salt.utils.process.ThreadPool. It uses Python's stdlib Queue to enqueue jobs, which are picked up by standard Python threads. If the queue is full, False is simply returned by the firing method on the thread pool. However, it appears that this condition is not handled or logged. (Obviously, this is a problem.)
As such, there are a few things to say about the selection of proper values for the master reactor.
1) If you are in a situation where you expect many long-running jobs to be executed by the master, reactor_worker_hwm should be increased or even set to 0 to bound it only by available memory. If set to zero, a close eye should be kept on memory consumption.
2) If you are in a situation where you expect many long-running jobs and execution concurrency and performance are a concern, you may also increase the value for reactor_worker_threads. This controls the number of concurrent threads pulling jobs from the queue and executing them. Obviously, this bears a relationship to the speed at which the queue itself will fill up. The price to pay for this value is that each thread will contain a copy of the Salt code needed to perform the requested action. (We call these "clients".) You may also get into a situation where the Python GIL begins to show its concurrency weaknesses, so it's perhaps not advisable to blindly turn this up too far.
I'm not going to get too far into the math one would use to make an informed decision about how to tune one of those knobs in relation to the other, though I can recommend a great book by Neil Gunther on systems capacity planning which provides a good primer on problems like these. ;) In short, though, you need to ensure that you either turn up the number of workers consuming from the queue so that it does not reach capacity -- which is probably what you want if your jobs don't run for a very long time and you have a high volume of reactions -- or you turn up the queue depth and let a relatively smaller number of workers just churn through it. (Of course you can do both with lots of memory and some bravery!)
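To make that concrete, a back-of-the-envelope example using the reproduction case from this issue (the 200-minion figure is made up purely for illustration):

200 events fired at once / 10 reactor_worker_threads = 20 batches
each reaction blocks on an orchestration that runs sleep 1800 (~30 min)
    -> the last batch starts roughly 19 x 30 min = 9.5 hours after the first
sustained arrival rate above ~10 events per 30 minutes
    -> the backlog keeps growing until reactor_worker_hwm is reached, after which new jobs are dropped silently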
Now, since tuning is hard and sort of tricky to do, especially given how things are now, what can we do to improve this situation:
1) I propose that we change the reactor to detect conditions where firing an async job returned False, which indicates that the queue was full. We should log this at log-level error.
2) We should better document what each tuning parameter does in the documentation so that people can make better decisions.
3) We should consider the addition of a dedicated thread which does nothing but reports on the size of the queue. We should fire master events for this queue monitor so that it can be picked up by Grafana. A logging/graphing engine to achieve this should be added that can be easily enabled.
OK, I think that's enough to say about this. Let me know if you have questions. :)
@cachedout PR for proposal 1 just went out.
As for 3, can you point me to a pattern to follow in code somewhere? I'd be happy to implement this as well.
Hi @geekinutah! Thanks for the PR!
On point number three, there really isn't an example in place anywhere in the current codebase. Basically, what I'm imagining here is a dedicated thread that checks the queue size and depth every N seconds and then fires events onto the bus. It would be a bit similar to master_stats, and if you grep for that string you should see some references to other master-side metrics being collected and fired.
Cool, that's what I need. Thanks @cachedout .
@cachedout great update - thank you, and it makes sense to me.
I would say that, in general, no. 3 seems like a good option for monitoring all the HWMs 👍
@cachedout re: your explanation above - reactor workers are only responsible for rendering runner.state.orchestrate / event.master_fire for the master to then run said orch, right? If that is the case, I would have thought that setting even a moderately high worker queue/hwm on the reactor would saturate master resources before failures show up on the reactor side (though I guess that really boils down to how much volume you're pushing through the reactor in terms of reactions).
Re: your master_stats comment - from what I can find, ZeroMQ exposes no such introspection stats on a socket for queue depth/HWM usage/etc. (from trying to look at and monitor similar issues in the past). Did you have something specific in mind?
@cachedout et al., https://github.com/saltstack/salt/pull/48874 addresses proposal 3. Would appreciate feedback on the PR.
Hi folks!
Does reactor_worker_threads work for a minion in masterless mode with the reactor enabled?
Thanks.