Salt: reactor race condition, found in 2019.2.0, seen in other releases

Created on 29 Jul 2019 · 24 comments · Source: saltstack/salt

Description of Issue

Looks like we have some kind of race condition in the reactor system when passing pillar data to orchestration.

Setup

The setup I used to test this was based on salt-api webhooks: data is sent from a webhook into the reactor, which carries it into an orchestration that runs a saltify cloud.profile.

The reactors all rendered 100% correctly, but the orchestrations were all over the place: some repeated pillar information, and in one case no pillar information was rendered at all.

The orchestration that rendered without any pillar data came from an attempt to mitigate the issue by sleeping 1 second between sending the webhooks.

Steps to Reproduce Issue

config files

/etc/salt/cloud.providers.d/saltify.conf

salty:
  driver: saltify

/etc/salt/cloud.profiles.d/saltify_host.conf

saltify_host:
  provider: salty
  force_minion_config: true

/etc/salt/master.d/*

rest_cherrypy:
  port: 8000
  webhook_disable_auth: True
  disable_ssl: True
log_level_logfile: all
reactor:
  - 'salt/netapi/hook/deploy_minion':
    - /srv/salt/reactor/deploy_minion.sls

/srv/salt/reactor/deploy_minion.sls


# {{data}}


{% set post_data = data.get('post', {}) %}

#{{post_data}}

{% set host = post_data.get('host', '') %}

#{{host}}

{#
Initiate the Orchestration State
#}
deploy_minion_runner:
  runner.state.orch:
    - mods: orch.minion.deploy
    - pillar:
        params:
          host: {{ host }}
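
For reference, each reaction above boils down to a single runner call that could also be issued by hand; a rough CLI equivalent (a sketch, assuming the webhook posted host=test-1) would be:

salt-run state.orch orch.minion.deploy pillar='{"params": {"host": "test-1"}}'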

/srv/salt/orch/minion/deploy.sls

{% set errors = [] %}
{% set params = salt['pillar.get']('params', {}) %}

{% if 'host' not in params %}
{% do errors.append('Please provide a host') %}
{% set host = '' %}
{% else %}
{% set host = params.get('host') %}
{% endif %}

{% if errors|length > 0 %}
preflight_check:
  test.configurable_test_state:
    - name: Additional information required
    - changes: False
    - result: False
    - comment: '{{ errors|join(", ") }}'
{% else %}
preflight_check:
  test.succeed_without_changes:
    - name: 'All parameters accepted'
{% endif %}

provision_minion:
  salt.runner:
    - name: cloud.profile
    - prof: saltify_host
    - instances:
      - {{ host }}
    - require:
      - preflight_check
    - vm_overrides:
        ssh_host: {{ host }}
        ssh_username: centos
        ssh_keyfile: /etc/salt/id_rsa
        minion:
          master:
            - 172.16.22.60
            - 172.16.22.115

sync_all:
  salt.function:
    - tgt: {{ host }}
    - name: saltutil.sync_all

#execute_highstate:
# salt.state:
#   - tgt: {{ host }}
#   - highstate: True

Command run

for i in {1..8}; do curl localhost:8000/hook/deploy_minion -d host=test-${i}; done
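
The 1-second-sleep mitigation mentioned above was simply the same loop with a pause between webhooks, along these lines:

for i in {1..8}; do curl localhost:8000/hook/deploy_minion -d host=test-${i}; sleep 1; done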

master (3).log

As can be seen from the log, deploy_minion.sls rendered in this order:
test-1
test-2
test-3
test-4
test-5
test-6
test-7
test-8

But deploy.sls rendered in this order (note that test-4 never rendered and test-6 rendered twice):
test-8
test-2
test-3
test-1
test-7
test-6
test-6
test-5

I have not found the cause of the issue yet.

Versions Report

Salt Version:
           Salt: 2019.2.0

Dependency Versions:
           cffi: Not Installed
       cherrypy: unknown
       dateutil: Not Installed
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.7.2
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.5.6
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 2.7.5 (default, Jun 20 2019, 20:27:34)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 15.3.0
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.1.4

System Versions:
           dist: centos 7.6.1810 Core
         locale: UTF-8
        machine: x86_64
        release: 3.10.0-862.14.4.el7.x86_64
         system: Linux
        version: CentOS Linux 7.6.1810 Core
Labels: Bug, CS-R1, CS-S1, Magnesium, ZD, severity-high


All 24 comments

This seems to happen more if the orchestrations that the reactor runs are long-running. Moving to another setup that uses test.arg or some other orchestration that is over quickly doesn't trigger the issue.
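
For comparison, a quick orchestration along these lines (a hypothetical orch/quick.sls, not from the original report) returns almost immediately and did not trigger the problem:

# /srv/salt/orch/quick.sls (hypothetical quick orchestration)
quick_check:
  salt.function:
    - name: test.arg
    - tgt: '*'
    - arg:
      - hello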

zd-3944

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.

this is not stale

Thank you for updating this issue. It is no longer marked as stale.

@whytewolf this won't get marked stale any further; we updated the config, so this shouldn't be a problem any longer. Let me know if you see it on other tickets too, you shouldn't.

@sagetherage Yep, I ended up getting a flood of these when the timer was adjusted, but it all looks good now.

@whytewolf Yes, we didn't communicate that very well (at all) and should have, and although we are on a better path, we still want to iterate on the config or use GH Actions. LMK if you have other ideas, love to hear it!

ZD-4993

I wanted to add that we also see this inconsistent reactor-to-orchestration behavior: the reactor fails to pass data, or will even run a server ID twice and skip others. To duplicate this, add the states below and accept 15+ minions at once.

Details of flow
1) Reactor salt/minion/*/start calls salt://reactor_state.sls
2) state reactor_state.sls uses runner.state.orchestrate to call orchestrate_state.sls
3) reactor_state.sls fails to pass event_data to orchestrate_state.sls

salt -V
Salt Version:
           Salt: 2019.2.0

Dependency Versions:
           cffi: 1.6.0
       cherrypy: Not Installed
       dateutil: 1.5
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.7.2
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: 0.21.1
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: Not Installed
      pycparser: 2.14
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 2.7.5 (default, Sep 26 2019, 13:23:47)
   python-gnupg: 2.3.1
         PyYAML: 3.11
          PyZMQ: 15.3.0
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.1.4

System Versions:
           dist: redhat 7.8 Maipo
         locale: UTF-8
        machine: x86_64
        release: 3.10.0-1127.el7.x86_64
         system: Linux
        version: Red Hat Enterprise Linux Server 7.8 Maipo

#

reactor.conf

reactor:
  - 'salt/minion/*/start':
    - salt://reactor_state.sls

#

reactor_state.sls

reactor_state_call:
  runner.state.orchestrate:
    - timeout: 2000
    - args:
        - mods: orchestrate_state
        - pillar:
            event_tag: {{ tag }}
            event_data: {{ data|json }}

#

orchestrate_state.sls

{% set tag = salt.pillar.get('event_tag', {}) %}
{% set data = salt.pillar.get('event_data', {}) %}

orchestrate_state_data:
  test.configurable_test_state:
    - name: Data and tag pillar
    - changes: False
    - result: True
    - comment: |
        event_data: {{ data }}
        tag: {{ tag }}

{% if 'id' not in data %}
orchestrate_state-notify-on-Data-failure:
  test.configurable_test_state:
    - name: Data or pillar Failure on Orchestration Call
    - changes: False
    - result: False
    - comment: |
        Failed to get proper pillar values for event_data in orchstandardconf orchestration
        event_data: {{ data }}
{% else %}
orchestrate_state-success:
  test.configurable_test_state:
    - name: Data success on Orchestration Call
    - changes: False
    - result: True
    - comment: |
        Finished ok
        event_data: {{ data }}
{% endif %}

#

Here is the output of a failed job and a successful job

failed job "20200428210139495968"

from the event bus

salt/run/20200428210508971833/ret {
    "_stamp": "2020-04-28T21:05:09.741246",
    "fun": "runner.jobs.list_job",
    "fun_args": [
        "20200428210139495968"
    ],
    "jid": "20200428210508971833",
    "return": {
        "EndTime": "2020, Apr 28 21:01:50.904429",
        "Error": "Cannot contact returner or no job with this jid",
        "Result": {
            "gcp6387testsse01.earaa6387.gcp.slb.com_master": {
                "return": {
                    "_stamp": "2020-04-28T21:01:39.509888",
                    "fun": "runner.state.orchestrate",
                    "fun_args": [
                        {
                            "mods": "orchestrate_state",
                            "pillar": {
                                "event_data": {
                                    "_stamp": "2020-04-28T21:01:39.212370",
                                    "cmd": "_minion_event",
                                    "data": "Minion gcp6387temp36 started at Tue Apr 28 21:01:39 2020",
                                    "id": "gcp6387temp36",
                                    "pretag": null,
                                    "tag": "salt/minion/gcp6387temp36/start"
                                },
                                "event_tag": "salt/minion/gcp6387temp36/start"
                            }
                        }
                    ],
                    "jid": "20200428210139495968",
                    "return": {
                        "data": {
                            "gcp6387testsse01.earaa6387.gcp.slb.com_master": {
                                "test_|-orchestrate_state-notify-on-Data-failure_|-Data or pillar Failure on Orchestration Call_|-configurable_test_state": {
                                    "__id__": "orchestrate_state-notify-on-Data-failure",
                                    "__run_num__": 1,
                                    "__sls__": "orchestrate_state",
                                    "changes": {},
                                    "comment": "Failed to get proper pillar values for event_data in orchstandardconf orchestration \nevent_data: {} \n",
                                    "duration": 2.84,
                                    "name": "Data or pillar Failure on Orchestration Call",
                                    "result": false,
                                    "start_time": "21:01:50.889726"
                                },
                                "test_|-orchestrate_state_data_|-Data and tag pillar_|-configurable_test_state": {
                                    "__id__": "orchestrate_state_data",
                                    "__run_num__": 0,
                                    "__sls__": "orchestrate_state",
                                    "changes": {},
                                    "comment": "event_data: {} \ntag: {}\n",
                                    "duration": 3.148,
                                    "name": "Data and tag pillar",
                                    "result": true,
                                    "start_time": "21:01:50.885958"
                                }
                            }
                        },
                        "outputter": "highstate",
                        "retcode": 1
                    },
                    "success": false,
                    "user": "Reactor"
                }
            }
        },
        "StartTime": "2020, Apr 28 21:01:39.495968",
        "jid": "20200428210139495968"
    },
    "success": true,
    "user": "root"
}

[root@gcp6387testsse01 ~]# salt-run jobs.list_job 20200428210139495968
EndTime:
    2020, Apr 28 21:01:50.904429
Error:
    Cannot contact returner or no job with this jid
Result:
    ----------
    gcp6387testsse01.earaa6387.gcp.slb.com_master:
        ----------
        return:
            ----------
            _stamp:
                2020-04-28T21:01:39.509888
            fun:
                runner.state.orchestrate
            fun_args:
                |_
                  ----------
                  mods:
                      orchestrate_state
                  pillar:
                      ----------
                      event_data:
                          ----------
                          _stamp:
                              2020-04-28T21:01:39.212370
                          cmd:
                              _minion_event
                          data:
                              Minion gcp6387temp36 started at Tue Apr 28 21:01:39 2020
                          id:
                              gcp6387temp36
                          pretag:
                              None
                          tag:
                              salt/minion/gcp6387temp36/start
                      event_tag:
                          salt/minion/gcp6387temp36/start
            jid:
                20200428210139495968
            return:
                ----------
                data:
                    ----------
                    gcp6387testsse01.earaa6387.gcp.slb.com_master:
                        ----------
                        test_|-orchestrate_state-notify-on-Data-failure_|-Data or pillar Failure on Orchestration Call_|-configurable_test_state:
                            ----------
                            __id__:
                                orchestrate_state-notify-on-Data-failure
                            __run_num__:
                                1
                            __sls__:
                                orchestrate_state
                            changes:
                                ----------
                            comment:
                                Failed to get proper pillar values for event_data in orchstandardconf orchestration
                                event_data: {}
                            duration:
                                2.84
                            name:
                                Data or pillar Failure on Orchestration Call
                            result:
                                False
                            start_time:
                                21:01:50.889726
                        test_|-orchestrate_state_data_|-Data and tag pillar_|-configurable_test_state:
                            ----------
                            __id__:
                                orchestrate_state_data
                            __run_num__:
                                0
                            __sls__:
                                orchestrate_state
                            changes:
                                ----------
                            comment:
                                event_data: {}
                                tag: {}
                            duration:
                                3.148
                            name:
                                Data and tag pillar
                            result:
                                True
                            start_time:
                                21:01:50.885958
                outputter:
                    highstate
                retcode:
                    1
            success:
                False
            user:
                Reactor
StartTime:
    2020, Apr 28 21:01:39.495968
jid:
    20200428210139495968

OK job, from the event bus:

salt/run/20200428210624555837/ret {
    "_stamp": "2020-04-28T21:06:25.339035",
    "fun": "runner.jobs.list_job",
    "fun_args": [
        "20200428210149620344"
    ],
    "jid": "20200428210624555837",
    "return": {
        "EndTime": "2020, Apr 28 21:01:54.637924",
        "Error": "Cannot contact returner or no job with this jid",
        "Result": {
            "gcp6387testsse01.earaa6387.gcp.slb.com_master": {
                "return": {
                    "_stamp": "2020-04-28T21:01:49.636384",
                    "fun": "runner.state.orchestrate",
                    "fun_args": [
                        {
                            "mods": "orchestrate_state",
                            "pillar": {
                                "event_data": {
                                    "_stamp": "2020-04-28T21:01:45.194576",
                                    "cmd": "_minion_event",
                                    "data": "Minion gcp6387temp38 started at Tue Apr 28 21:01:45 2020",
                                    "id": "gcp6387temp38",
                                    "pretag": null,
                                    "tag": "salt/minion/gcp6387temp38/start"
                                },
                                "event_tag": "salt/minion/gcp6387temp38/start"
                            }
                        }
                    ],
                    "jid": "20200428210149620344",
                    "return": {
                        "data": {
                            "gcp6387testsse01.earaa6387.gcp.slb.com_master": {
                                "test_|-orchestrate_state-success_|-Data success on Orchestration Call_|-configurable_test_state": {
                                    "__id__": "orchestrate_state-success",
                                    "__run_num__": 1,
                                    "__sls__": "orchestrate_state",
                                    "changes": {},
                                    "comment": "Finished ok \nevent_data: {u'_stamp': u'2020-04-28T21:01:45.194576', u'cmd': u'_minion_event', u'data': u'Minion gcp6387temp38 started at Tue Apr 28 21:01:45 2020', u'id': u'gcp6387temp38', u'pretag': None, u'tag': u'salt/minion/gcp6387temp38/start'} \n",
                                    "duration": 0.31,
                                    "name": "Data success on Orchestration Call",
                                    "result": true,
                                    "start_time": "21:01:54.636565"
                                },
                                "test_|-orchestrate_state_data_|-Data and tag pillar_|-configurable_test_state": {
                                    "__id__": "orchestrate_state_data",
                                    "__run_num__": 0,
                                    "__sls__": "orchestrate_state",
                                    "changes": {},
                                    "comment": "event_data: {u'_stamp': u'2020-04-28T21:01:45.194576', u'cmd': u'_minion_event', u'data': u'Minion gcp6387temp38 started at Tue Apr 28 21:01:45 2020', u'id': u'gcp6387temp38', u'pretag': None, u'tag': u'salt/minion/gcp6387temp38/start'} \ntag: salt/minion/gcp6387temp38/start\n",
                                    "duration": 0.417,
                                    "name": "Data and tag pillar",
                                    "result": true,
                                    "start_time": "21:01:54.636010"
                                }
                            }
                        },
                        "outputter": "highstate",
                        "retcode": 0
                    },
                    "success": true,
                    "user": "Reactor"
                }
            }
        },
        "StartTime": "2020, Apr 28 21:01:49.620344",
        "jid": "20200428210149620344"
    },
    "success": true,
    "user": "root"
}

[root@gcp6387testsse01 ~]# salt-run jobs.list_job 20200428210149620344
EndTime:
    2020, Apr 28 21:01:54.637924
Error:
    Cannot contact returner or no job with this jid
Result:
    ----------
    gcp6387testsse01.earaa6387.gcp.slb.com_master:
        ----------
        return:
            ----------
            _stamp:
                2020-04-28T21:01:49.636384
            fun:
                runner.state.orchestrate
            fun_args:
                |_
                  ----------
                  mods:
                      orchestrate_state
                  pillar:
                      ----------
                      event_data:
                          ----------
                          _stamp:
                              2020-04-28T21:01:45.194576
                          cmd:
                              _minion_event
                          data:
                              Minion gcp6387temp38 started at Tue Apr 28 21:01:45 2020
                          id:
                              gcp6387temp38
                          pretag:
                              None
                          tag:
                              salt/minion/gcp6387temp38/start
                      event_tag:
                          salt/minion/gcp6387temp38/start
            jid:
                20200428210149620344
            return:
                ----------
                data:
                    ----------
                    gcp6387testsse01.earaa6387.gcp.slb.com_master:
                        ----------
                        test_|-orchestrate_state-success_|-Data success on Orchestration Call_|-configurable_test_state:
                            ----------
                            __id__:
                                orchestrate_state-success
                            __run_num__:
                                1
                            __sls__:
                                orchestrate_state
                            changes:
                                ----------
                            comment:
                                Finished ok
                                event_data: {u'_stamp': u'2020-04-28T21:01:45.194576', u'cmd': u'_minion_event', u'data': u'Minion gcp6387temp38 started at Tue Apr 28 21:01:45 2020', u'id': u'gcp6387temp38', u'pretag': None, u'tag': u'salt/minion/gcp6387temp38/start'}
                            duration:
                                0.31
                            name:
                                Data success on Orchestration Call
                            result:
                                True
                            start_time:
                                21:01:54.636565
                        test_|-orchestrate_state_data_|-Data and tag pillar_|-configurable_test_state:
                            ----------
                            __id__:
                                orchestrate_state_data
                            __run_num__:
                                0
                            __sls__:
                                orchestrate_state
                            changes:
                                ----------
                            comment:
                                event_data: {u'_stamp': u'2020-04-28T21:01:45.194576', u'cmd': u'_minion_event', u'data': u'Minion gcp6387temp38 started at Tue Apr 28 21:01:45 2020', u'id': u'gcp6387temp38', u'pretag': None, u'tag': u'salt/minion/gcp6387temp38/start'}
                                tag: salt/minion/gcp6387temp38/start
                            duration:
                                0.417
                            name:
                                Data and tag pillar
                            result:
                                True
                            start_time:
                                21:01:54.636010
                outputter:
                    highstate
                retcode:
                    0
            success:
                True
            user:
                Reactor
StartTime:
    2020, Apr 28 21:01:49.620344
jid:
    20200428210149620344

I was able to deploy 30 to 50 minions without issues by adding some randomness to the orchestration...

{% set random_start = range(1, 30) | random %}
pause:
  salt.runner:
    - name: test.sleep
    - s_time: {{ random_start }}

ZD-4955

I've been working to make this a minimal reproducible case to make this easier to work on, and just to confirm that it's still an issue. Short answer is that this is reproducible in 3001. Resubmitting below with the bug template.


Description

Near-concurrent reactions that call orchestrations are not reliable. Orchestrations may be dropped, or may be duplicated.

Setup

/etc/salt/master.d/reactor.conf:

reactor:
  - 'salt/job/*/ret/*':
      - '/srv/reactor/gh54045/reaction_state_finished.sls'

/srv/reactor/gh54045/reaction_state_finished.sls:

{% set i = data.get('fun_args')[1].get('pillar').get('i') %}
{% do salt.log.debug("GH54045 reactor ({})".format(i)) %}

reaction_state_finished:
  runner.state.orchestrate:
    - args:
        - mods: orch.gh54045.state_finished
        - pillar:
            data: {{ data|json }}

/srv/salt/orch/gh54045/state_finished.sls:

{% set i = salt.pillar.get('data').get('fun_args')[1].get('pillar').get('i') %}
{% do salt.log.debug("GH54045 orchestration ({})".format(i)) %}

/srv/salt/gh54045/init.sls:

{% set minion = salt.pillar.get('my_minion', 'minion1') %}

{% for i in range(10) %}
send_notification_{{ i }}:
  salt.state:
    - tgt: {{ minion }}
    - sls:
       - gh54045.succeed
    - pillar:
         i: {{ i }}
{% endfor %}

/srv/salt/gh54045/succeed.sls:

{% set i = salt.pillar.get('i', 'UNKNOWN') %}

succeeded:
  test.succeed_without_changes:
    - name: "SUCCEEDED: iteration {{ i }}"

Steps to Reproduce the behavior

Run:

salt-run state.orch gh54045

Watch or grep the master logs after the fact for GH54045. Here's a representative run:

2020-06-24 15:33:52,451 [salt.loaded.int.module.logmod:42  ][DEBUG   ][7729] GH54045 reactor (0)
2020-06-24 15:33:52,976 [salt.loaded.int.module.logmod:42  ][DEBUG   ][7729] GH54045 reactor (1)
2020-06-24 15:33:53,433 [salt.loaded.int.module.logmod:42  ][DEBUG   ][7729] GH54045 reactor (2)
2020-06-24 15:33:53,887 [salt.loaded.int.module.logmod:42  ][DEBUG   ][7729] GH54045 orchestration (1)
2020-06-24 15:33:53,927 [salt.loaded.int.module.logmod:42  ][DEBUG   ][7729] GH54045 reactor (3)
2020-06-24 15:33:54,399 [salt.loaded.int.module.logmod:42  ][DEBUG   ][7729] GH54045 reactor (4)
2020-06-24 15:33:54,934 [salt.loaded.int.module.logmod:42  ][DEBUG   ][7729] GH54045 reactor (5)
2020-06-24 15:33:55,257 [salt.loaded.int.module.logmod:42  ][DEBUG   ][7729] GH54045 orchestration (4)
2020-06-24 15:33:55,435 [salt.loaded.int.module.logmod:42  ][DEBUG   ][7729] GH54045 reactor (6)
2020-06-24 15:33:55,705 [salt.loaded.int.module.logmod:42  ][DEBUG   ][7729] GH54045 orchestration (5)
2020-06-24 15:33:55,907 [salt.loaded.int.module.logmod:42  ][DEBUG   ][7729] GH54045 reactor (7)
2020-06-24 15:33:56,177 [salt.loaded.int.module.logmod:42  ][DEBUG   ][7729] GH54045 orchestration (6)
2020-06-24 15:33:56,385 [salt.loaded.int.module.logmod:42  ][DEBUG   ][7729] GH54045 reactor (8)
2020-06-24 15:33:56,634 [salt.loaded.int.module.logmod:42  ][DEBUG   ][7729] GH54045 orchestration (7)
2020-06-24 15:33:56,834 [salt.loaded.int.module.logmod:42  ][DEBUG   ][7729] GH54045 reactor (9)
2020-06-24 15:33:57,555 [salt.loaded.int.module.logmod:42  ][DEBUG   ][7729] GH54045 orchestration (9)
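
The GH54045 messages can be pulled out after the fact with, for example (assuming the default master log file location):

grep GH54045 /var/log/salt/master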

Expected behavior

All 10 orchestrations (0-9) should be shown in the logs. Note that 0, 2, 3, and 8 are missing from this example run. Occasionally, there are also duplicate numbers in the GH54045 orchestration log messages.

Making the state parallel and concurrent

If I add to /srv/salt/gh54045/init.sls:

    - parallel: True
    - concurrent: True
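
In context, each salt.state call in init.sls presumably ends up looking like this (a sketch; the exact placement of the two options is my assumption, not taken from the original comment):

send_notification_{{ i }}:
  salt.state:
    - tgt: {{ minion }}
    - sls:
       - gh54045.succeed
    - parallel: True
    - concurrent: True
    - pillar:
         i: {{ i }}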

With those options added, the situation is much worse. E.g.:

2020-06-24 15:41:23,019 [salt.loaded.int.module.logmod:42  ][DEBUG   ][7729] GH54045 reactor (3)
2020-06-24 15:41:23,085 [salt.loaded.int.module.logmod:42  ][DEBUG   ][7729] GH54045 reactor (8)
2020-06-24 15:41:23,160 [salt.loaded.int.module.logmod:42  ][DEBUG   ][7729] GH54045 reactor (4)
2020-06-24 15:41:23,229 [salt.loaded.int.module.logmod:42  ][DEBUG   ][7729] GH54045 reactor (0)
2020-06-24 15:41:23,240 [salt.loaded.int.module.logmod:42  ][DEBUG   ][7729] GH54045 reactor (7)
2020-06-24 15:41:23,252 [salt.loaded.int.module.logmod:42  ][DEBUG   ][7729] GH54045 reactor (2)
2020-06-24 15:41:23,354 [salt.loaded.int.module.logmod:42  ][DEBUG   ][7729] GH54045 reactor (1)
2020-06-24 15:41:23,459 [salt.loaded.int.module.logmod:42  ][DEBUG   ][7729] GH54045 reactor (6)
2020-06-24 15:41:23,485 [salt.loaded.int.module.logmod:42  ][DEBUG   ][7729] GH54045 reactor (5)
2020-06-24 15:41:23,543 [salt.loaded.int.module.logmod:42  ][DEBUG   ][7729] GH54045 reactor (9)
2020-06-24 15:41:28,457 [salt.loaded.int.module.logmod:42  ][DEBUG   ][7729] GH54045 orchestration (0)
2020-06-24 15:41:28,522 [salt.loaded.int.module.logmod:42  ][DEBUG   ][7729] GH54045 orchestration (0)
2020-06-24 15:41:29,119 [salt.loaded.int.module.logmod:42  ][DEBUG   ][7729] GH54045 orchestration (2)
2020-06-24 15:41:29,264 [salt.loaded.int.module.logmod:42  ][DEBUG   ][7729] GH54045 orchestration (5)
2020-06-24 15:41:29,269 [salt.loaded.int.module.logmod:42  ][DEBUG   ][7729] GH54045 orchestration (5)
2020-06-24 15:41:29,310 [salt.loaded.int.module.logmod:42  ][DEBUG   ][7729] GH54045 orchestration (5)

Versions Report

salt --versions-report

Salt Version:
           Salt: 3001

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: Not Installed
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
         Jinja2: 2.11.1
        libgit2: Not Installed
       M2Crypto: 0.35.2
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.6.2
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: Not Installed
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 3.6.8 (default, Apr  2 2020, 13:34:55)
   python-gnupg: Not Installed
         PyYAML: 3.13
          PyZMQ: 17.0.0
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.5.3
            ZMQ: 4.1.4

System Versions:
           dist: centos 7 Core
         locale: UTF-8
        machine: x86_64
        release: 5.3.0-59-generic
         system: Linux
        version: CentOS Linux 7 Core

ZD-5336

I think I got bitten by this bug on 2019.2.5, but it is hard to reduce my states to a minimal test case. Unfortunately, it looks like Salt reactors can't be relied upon.

When I run the steps to reproduce suggested above, I get the attached log:
gh54045.log

@oeuftete I was able to replicate your results using your example. @Ch3LL mentioned testing with this PR: https://github.com/saltstack/salt/pull/56513 -- and it appears to solve the issue in your example. I'm seeing all reactor and orchestration logs without any duplicate or missing indices in the log messages. Would you mind checking out that PR to see if you get similar results?
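
For anyone else wanting to try it, one way to test the PR against a source checkout is roughly (a sketch, not an official procedure):

# fetch the PR as a patch and apply it to a local checkout of salt
git clone https://github.com/saltstack/salt.git && cd salt
curl -L https://github.com/saltstack/salt/pull/56513.patch | git apply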

@theskabeater Yes, I just tested now the current master (04a0068dc1) + a patch generated from #56513 and I'm seeing all of the reaction orchestrations happen now, once only, in the parallel and non-parallel cases. Great!

I'm also going to link in the issue at https://github.com/saltstack/salt/issues/58369 to make sure this issue is visible there.

Since https://github.com/saltstack/salt/pull/56513 is merged, I am closing this issue.

Are there PRs other than #56513 that this fix is dependent on? 3002.2 appears to have all the modifications from that PR and I'm still seeing this issue.

# salt-call -V
Salt Version:
          Salt: 3002.2

Dependency Versions:
          cffi: 1.11.5
      cherrypy: Not Installed
      dateutil: 2.6.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 2.10.1
       libgit2: Not Installed
      M2Crypto: 0.35.2
          Mako: Not Installed
       msgpack: 0.6.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     pycparser: 2.14
      pycrypto: Not Installed
  pycryptodome: Not Installed
        pygit2: Not Installed
        Python: 3.6.8 (default, Apr 16 2020, 01:36:27)
  python-gnupg: 0.4.6
        PyYAML: 3.12
         PyZMQ: 19.0.0
         smmap: Not Installed
       timelib: 0.2.4
       Tornado: 4.5.3
           ZMQ: 4.3.3

System Versions:
          dist: centos 8 Core
        locale: UTF-8
       machine: x86_64
       release: 4.18.0-193.28.1.el8_2.x86_64
        system: Linux
       version: CentOS Linux 8 Core

@lomeroe I will reopen this. The important memory leak fixes that went into 3002.2 reverted the changes that resolved this issue. I believe that pending PR https://github.com/saltstack/salt/pull/58853 will re-resolve this issue.

@oeuftete good info, thanks!

Looks like #58853 was merged and will be in the Aluminium release; closing this issue.

ZD-6199: seen on 3002.2 with no reactors involved, but with multiple concurrent orchestrations using inline pillar data.
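
For context, that pattern can be approximated by launching several orchestrations concurrently with inline pillar, roughly (a sketch, using a hypothetical orchestration named my_orch):

# launch several orchestrations at once, each with its own inline pillar (sketch)
for i in 1 2 3 4; do
  salt-run state.orchestrate my_orch pillar="{\"i\": $i}" &
done
wait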
