Salt: Scheduled highstates from Windows-minions not returning to master

Created on 7 Aug 2018 · 28 comments · Source: saltstack/salt

Description of Issue/Question

This issue was something we've run into in the past (#38371) and was fixed in #38923.
On Linux, this works fine. The "workaround returner" I created to work around this in the past was no longer necessary.

However, we recently installed a few Windows machines with Salt and noticed that it was showing the very same behavior as with the previous issue:

  • A salt-call state.highstate properly returns the salt/job/* event to the master
  • A scheduled state.highstate does not return the salt/job/* event to the master

Setup

  • Master on Ubuntu 16.04
  • Minion on Windows Server 2016

Steps to Reproduce Issue

  • Run Salt-minion on Windows
  • Setup a schedule to do a highstate every X minutes
  • Wait for the schedule to execute
  • Listen to the event bus on the master: salt-run state.event pretty=True
  • Notice that only the minion/refresh/* and salt/auth events are being returned but no salt/job/* events for scheduled highstates
  • Run a manual highstate salt-call state.highstate on the Windows minion
  • Notice that the event is being returned to the master
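
The steps above can be reproduced with a minimal schedule in the minion configuration (a hedged sketch: the schedule name, interval, and config path are illustrative, not taken from the affected setup):

```yaml
# c:\salt\conf\minion  (default config location for Windows installs of this era)
schedule:
  highstate:
    function: state.highstate
    seconds: 300
```

Restart the salt-minion service after changing the config so the new schedule is loaded.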

Versions Report

Master

Salt Version:
           Salt: 2018.3.2

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 2.4.2
      docker-py: Not Installed
          gitdb: 0.6.4
      gitpython: 1.0.1
          ioflo: Not Installed
         Jinja2: 2.8
        libgit2: Not Installed
        libnacl: 1.6.1
       M2Crypto: Not Installed
           Mako: 1.0.3
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: 1.3.7
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 2.7.12 (default, Dec  4 2017, 14:50:18)
   python-gnupg: 0.3.8
         PyYAML: 3.11
          PyZMQ: 15.2.0
           RAET: Not Installed
          smmap: 0.9.0
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.1.4

System Versions:
           dist: Ubuntu 16.04 xenial
         locale: UTF-8
        machine: x86_64
        release: 4.4.0-1031-aws
         system: Linux
        version: Ubuntu 16.04 xenial

Minion

Salt Version:
           Salt: 2018.3.2

Dependency Versions:
           cffi: 1.10.0
       cherrypy: 10.2.1
       dateutil: 2.6.1
      docker-py: Not Installed
          gitdb: 2.0.3
      gitpython: 2.1.3
          ioflo: Not Installed
         Jinja2: 2.9.6
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: 1.0.6
   msgpack-pure: Not Installed
 msgpack-python: 0.4.8
   mysql-python: Not Installed
      pycparser: 2.17
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 2.7.14 (v2.7.14:84471935ed, Sep 16 2017, 20:25:58) [MSC v.1500 64 bit (AMD64)]
   python-gnupg: 0.4.1
         PyYAML: 3.12
          PyZMQ: 16.0.3
           RAET: Not Installed
          smmap: 2.0.3
        timelib: 0.2.4
        Tornado: 4.5.1
            ZMQ: 4.1.6

System Versions:
           dist:
         locale: cp1252
        machine: AMD64
        release: 2016Server
         system: Windows
        version: 2016Server 10.0.14393  Multiprocessor Free
Labels: Bug, Windows, severity-medium, team-windows

All 28 comments

I am unable to replicate this.

I have assigned the following states.

[root@salt ~]# tail -c +0 /srv/salt/*.sls
==> /srv/salt/sched.sls <==
highstate:
  schedule.present:
    - function: state.apply
    - seconds: 30

==> /srv/salt/test.sls <==
test highstate:
  test.succeed_with_changes: []

==> /srv/salt/top.sls <==
base:
  '*':
    - test

I used schedule.present to schedule the highstate function in the minion process, and I am seeing the job returns:

[root@salt ~]# salt-run state.event tagmatch='salt/job/*' pretty=true
salt/job/20180807152950409718/ret/salt.c.gtmanfred-1263.internal    {
    "_stamp": "2018-08-07T15:29:50.411386",
    "arg": [],
    "cmd": "_return",
    "fun": "state.apply",
    "fun_args": [],
    "id": "salt.c.gtmanfred-1263.internal",
    "jid": "20180807152950409718",
    "out": "highstate",
    "pid": 7212,
    "retcode": 0,
    "return": {
        "test_|-test highstate_|-test highstate_|-succeed_with_changes": {
            "__id__": "test highstate",
            "__run_num__": 0,
            "__sls__": "test",
            "changes": {
                "testing": {
                    "new": "Something pretended to change",
                    "old": "Unchanged"
                }
            },
            "comment": "Success!",
            "duration": 0.482,
            "name": "test highstate",
            "result": true,
            "start_time": "15:29:50.036213"
        }
    },
    "schedule": "highstate",
    "success": true,
    "tgt": "salt.c.gtmanfred-1263.internal",
    "tgt_type": "glob"
}
salt/job/20180807152956772370/ret/windows.c.gtmanfred-1263.internal {
    "_stamp": "2018-08-07T15:29:56.774412",
    "arg": [],
    "cmd": "_return",
    "fun": "state.apply",
    "fun_args": [],
    "id": "windows.c.gtmanfred-1263.internal",
    "jid": "20180807152956772370",
    "out": "highstate",
    "pid": 3556,
    "retcode": 0,
    "return": {
        "test_|-test highstate_|-test highstate_|-succeed_with_changes": {
            "__id__": "test highstate",
            "__run_num__": 0,
            "__sls__": "test",
            "changes": {
                "testing": {
                    "new": "Something pretended to change",
                    "old": "Unchanged"
                }
            },
            "comment": "Success!",
            "duration": 0.0,
            "name": "test highstate",
            "result": true,
            "start_time": "15:29:56.094000"
        }
    },
    "schedule": "highstate",
    "success": true,
    "tgt": "windows.c.gtmanfred-1263.internal",
    "tgt_type": "glob"
}
salt/job/20180807153019489257/ret/salt.c.gtmanfred-1263.internal    {
    "_stamp": "2018-08-07T15:30:19.491089",
    "arg": [],
    "cmd": "_return",
    "fun": "state.apply",
    "fun_args": [],
    "id": "salt.c.gtmanfred-1263.internal",
    "jid": "20180807153019489257",
    "out": "highstate",
    "pid": 7341,
    "retcode": 0,
    "return": {
        "test_|-test highstate_|-test highstate_|-succeed_with_changes": {
            "__id__": "test highstate",
            "__run_num__": 0,
            "__sls__": "test",
            "changes": {
                "testing": {
                    "new": "Something pretended to change",
                    "old": "Unchanged"
                }
            },
            "comment": "Success!",
            "duration": 0.568,
            "name": "test highstate",
            "result": true,
            "start_time": "15:30:19.083520"
        }
    },
    "schedule": "highstate",
    "success": true,
    "tgt": "salt.c.gtmanfred-1263.internal",
    "tgt_type": "glob"
}
salt/job/20180807153026754965/ret/windows.c.gtmanfred-1263.internal {
    "_stamp": "2018-08-07T15:30:26.756872",
    "arg": [],
    "cmd": "_return",
    "fun": "state.apply",
    "fun_args": [],
    "id": "windows.c.gtmanfred-1263.internal",
    "jid": "20180807153026754965",
    "out": "highstate",
    "pid": 1336,
    "retcode": 0,
    "return": {
        "test_|-test highstate_|-test highstate_|-succeed_with_changes": {
            "__id__": "test highstate",
            "__run_num__": 0,
            "__sls__": "test",
            "changes": {
                "testing": {
                    "new": "Something pretended to change",
                    "old": "Unchanged"
                }
            },
            "comment": "Success!",
            "duration": 0.0,
            "name": "test highstate",
            "result": true,
            "start_time": "15:30:26.094000"
        }
    },
    "schedule": "highstate",
    "success": true,
    "tgt": "windows.c.gtmanfred-1263.internal",
    "tgt_type": "glob"
}

Can you provide me with more information to replicate this?

Thanks,
Daniel

Very odd. I have just tested it again by adding a schedule:

salt 'web01.local' schedule.add test_highstate function='state.apply' seconds=30

Which - according to schedule.list gets added like:

      test_highstate:
        enabled: true
        function: state.apply
        jid_include: true
        maxrunning: 1
        name: test_highstate
        seconds: 30

I do not see any highstate returns using your state.event parameters. I do, however, see (skipped event tags) salt/auth and minion/refresh/ID events coming through every 30 seconds.
The minion logs do show that a highstate is being executed, it's just not being returned.

What are the differences between your setup and mine? A different Salt/Python/Windows version or edition, by any chance?

@syphernl We are also seeing this in 2018.3.3. Were you able to find any other information?

There are no return events from Windows minions (we execute highstate every 3 hours via the scheduler). Events are being returned from the Linux minions.

That being said, highstate is being executed correctly.

What's kind of odd is that highstate emails from the highstate_email.highstate config are occurring, even when the event isn't emitted.

schedule:
  highstate:
    function: state.highstate
    minutes: 180
    splay: 600
    returner: highstate
    return_config: highstate_email

highstate_email.highstate:
  report_failures: True
  report_changes: True
  report_everything: False
  report_format: html
  report_delivery: smtp

@saltstack/team-triage can y'all take a look at verifying this?

Looks like someone else is experiencing the same problem.

Thanks,
Daniel

So out of three deployments I have access to, only one of them has this problem with the Windows minions. The thing is, all of these deployments are practically identical: minions are installed the same way and the master is the same Docker image. The only difference I can think of is that the problematic deployment is the oldest (it was upgraded from 2017 versions of SaltStack).

This is reproducible in newly installed Windows minions.

Are there any logs I could look through?

Looks like someone else had this issue from December of 2017 - #44779.

Ah, @gtmanfred found an unfortunate workaround...

On a hunch I disabled multiprocessing on a Windows minion:

multiprocessing: false

Using this to verify (originally nothing was returned to the master):

salt windows schedule.add test_ping function='test.ping' seconds=30

And now scheduled functions are returning to the master. This doesn't appear to be a good fix, though, as I believe there are several outstanding issues with running Windows minions with multiprocessing disabled.
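
For completeness, the workaround amounts to this minion-config fragment (a sketch only; as noted above, disabling multiprocessing on Windows has known side effects):

```yaml
# Windows minion config: run scheduled jobs in threads instead of spawned
# processes (workaround only; see the caveat about outstanding issues above)
multiprocessing: false
```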

Log of good minion (Linux):

2019-02-02 19:02:27,212 [salt.utils.event :739 ][DEBUG   ][126612] Sending event: tag = __schedule_return; data = {u'fun_args': [], u'jid': u'req', u'return': True, u'retcode': 0, u'success': True, u'schedule': u'test_ping', u'cmd': u'_return', u'pid': 126612, u'_stamp': '2019-02-03T01:02:27.211963', u'fun': u'test.ping', u'id': u'linux'}
2019-02-02 19:02:27,212 [salt.utils.schedule:782 ][DEBUG   ][126612] schedule.handle_func: Removing /var/cache/salt/minion/proc/20190202190227188208
2019-02-02 19:02:27,213 [salt.minion      :2261][DEBUG   ][124488] Minion of 'salt.service' is handling event tag '__schedule_return'
2019-02-02 19:02:27,214 [salt.minion      :1863][INFO    ][124488] Returning information for job: req

Log of bad minion (Windows):

2019-02-02 18:49:38,221 [salt.utils.event :739 ][DEBUG   ][2780] Sending event: tag = __schedule_return; data = {u'fun_args': [], u'jid': u'req', u'return': True, u'retcode': 0, u'success': True, u'schedule': u'test_ping', u'cmd': u'_return', u'pid': 2780, u'_stamp': '2019-02-03T00:49:38.222000', u'fun': u'test.ping', u'id': u'windows'}
2019-02-02 18:49:38,221 [salt.utils.schedule:782 ][DEBUG   ][2780] schedule.handle_func: Removing c:\salt\var\cache\salt\minion\proc\20190202184938112000

Log of a bad minion with multiprocessing disabled (Windows):

2019-02-02 19:20:24,964 [salt.utils.event :739 ][DEBUG   ][2984] Sending event: tag = __schedule_return; data = {u'fun_args': [], u'jid': u'req', u'return': True, u'retcode': 0, u'success': True, u'schedule': u'test_ping', u'cmd': u'_return', u'pid': 2984, u'_stamp': '2019-02-03T01:20:24.965000', u'fun': u'test.ping', u'id': u'windows'}
2019-02-02 19:20:24,964 [salt.utils.schedule:782 ][DEBUG   ][2984] schedule.handle_func: Removing c:\salt\var\cache\salt\minion\proc\20190202192024808000
2019-02-02 19:20:24,964 [salt.minion      :2261][DEBUG   ][2984] Minion of 'salt.service' is handling event tag '__schedule_return'
2019-02-02 19:20:24,964 [salt.minion      :1863][INFO    ][2984] Returning information for job: req

So it looks like the event isn't being handled at all by the minion:

https://github.com/saltstack/salt/blob/4e1788e57b475102aa1e753d20a37a1ca3ad6076/salt/minion.py#L2638-L2672

Unfortunately, I no longer work for SaltStack, so someone from @saltstack/team-triage will need to take over managing this issue.

Multiprocessing makes things a little slower on Windows, and I've had to increase the timeout when working with Windows minions. I think the default is something like 5 seconds; I set it to something like 30 for my Windows boxes, and even longer (300) if I'm working on something I know takes a while to run, LGPO for example.
From the command line you can add -t 30 to the salt command.
There's also a config option for it that is simply timeout. If you set it there, you wouldn't have to pass it with every CLI command.
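
As a sketch, the config-file equivalent of passing -t 30 on the command line looks like this (the path assumes a default Linux master install):

```yaml
# /etc/salt/master
timeout: 30
```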

Is there still a need for multiprocessing = true under Windows @twangboy (basically is multiprocessing = false a reliable workaround)?

multiprocessing=True is the default for Windows. There are problems with using multiprocessing=False. It's been years, so I can't remember what the problems are. I think there's an issue with how jobs are handled/cleared.

This is what needs to be fixed: https://github.com/saltstack/salt/issues/48882#issuecomment-441038972
In other words, find_job needs to be an internal function that does not start a new thread. I had a go at it, but could not work out how to stop it from creating a new thread. It's a simple change that would make a large difference: you would not need to use the -t option.
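
For context, this is what the polling looks like from the CLI (the minion id and jid below are taken from the example output earlier in this thread): while the master waits for a slow job, it repeatedly issues saltutil.find_job against the target, and each such call is itself a new job on the minion, which means a new spawned process when multiprocessing is enabled on Windows.

```shell
# Check whether a given jid is still running on the minion; this is what
# the master does internally while waiting for a job to return.
salt 'windows.c.gtmanfred-1263.internal' saltutil.find_job 20180807152956772370
```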

It would be nice if someone on @saltstack/team-core could implement find_job within minion.py without starting a thread; it would save 10 seconds and make a heap of difference on Windows.

@damon-atkins traveling at the moment but once I'm home I can look at this one.

@garethgreenaway Is this issue somewhere on your radar?

I would be happy to provide more information with current versions of SaltStack on Windows 2016/2019. I do believe it's still happening.

I'm testing SaltStack, and unfortunately its performance on Windows is much worse than on Linux :( grains_cache=True helped a little bit (a simple test.ping went from 15s on my workstation Vagrant setup to 7s), but it's still slow; on Linux I get a response after 2s. And we're talking about test.ping! That's quite a problem for a SaltStack evaluation.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.

Up.

Thank you for updating this issue. It is no longer marked as stale.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.

.

Thank you for updating this issue. It is no longer marked as stale.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.

reopen

Thank you for updating this issue. It is no longer marked as stale.

Kicking this back into triage to get movement on it. @DmitryKuzmenko, can you please re-triage this issue this week?

I suspect this is a duplicate of, or part of, the problem in #51621: the master calls find_job every 10 seconds, and Windows can take 10 seconds to execute it, depending on how the schedule works.
