Salt: daily highstate fails after 2019.2 upgrade

Created on 21 May 2019 · 38 comments · Source: saltstack/salt

After upgrading masters and minions to 2019.2, I observed this behavior:

  • the daily scheduled highstate fails to run UNLESS a manual pillar refresh has been run recently (within the last 12 hours or so), and even that doesn't always make it run
  • when the highstate runs, if there are any failures, I receive 20 or so reports of failure per minion via the returner, as opposed to just one with 2018.3.4; it appears that the minion tries state.apply over and over again for about two hours, after which it stops

This happens on two different masters.

Here is the pillar/schedule.sls for one of the masters:

```
schedule:
  highstate:
    function: state.highstate
    when: 4:30am
    splay: 600
    maxrunning: 1
    returner: highstate
```

Downgrading salt and salt-minion on the minions fixed the problem.
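
For anyone who ends up doing the same, the pin is roughly the following as a state; the release suffix here is an assumption, and downgrading a minion from inside its own state run can be disruptive, so an out-of-band `yum downgrade` may be simpler:

```
# a sketch only; the exact CentOS release string is assumed, not confirmed
pin-salt-minion:
  pkg.installed:
    - name: salt-minion
    - version: 2018.3.4-1.el7
```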

Here is one of the upgraded minions:

Salt Version:
           Salt: 2019.2.0

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 1.5
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.7.2
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: 0.31.0
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.5.6
   mysql-python: 1.2.5
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 2.7.5 (default, Apr  9 2019, 14:30:50)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 15.3.0
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.1.4

System Versions:
           dist: centos 7.6.1810 Core
         locale: UTF-8
        machine: x86_64
        release: 3.10.0-957.12.2.el7.x86_64
         system: Linux
        version: CentOS Linux 7.6.1810 Core

Here is the master:
```
Salt Version:
Salt: 2019.2.0

Dependency Versions:
cffi: 1.6.0
cherrypy: Not Installed
dateutil: 1.5
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
ioflo: Not Installed
Jinja2: 2.7.2
libgit2: 0.26.3
libnacl: Not Installed
M2Crypto: 0.31.0
Mako: Not Installed
msgpack-pure: Not Installed
msgpack-python: 0.5.6
mysql-python: 1.2.5
pycparser: 2.14
pycrypto: 2.6.1
pycryptodome: Not Installed
pygit2: 0.26.4
Python: 2.7.5 (default, Apr 9 2019, 14:30:50)
python-gnupg: Not Installed
PyYAML: 3.11
PyZMQ: 15.3.0
RAET: Not Installed
smmap: Not Installed
timelib: 0.2.4
Tornado: 4.2.1
ZMQ: 4.1.4

System Versions:
dist: centos 7.6.1810 Core
locale: UTF-8
machine: x86_64
release: 3.10.0-957.12.2.el7.x86_64
system: Linux
version: CentOS Linux 7.6.1810 Core
```
Thanks.

Bug Core Pillar Regression severity-critical severity-medium v3000.1

All 38 comments

I am also experiencing this problem. Does this affect Windows minions more than other ones? Are there any suggestions for the best workaround until this is fixed (such as using cmd.run with salt-call)?

I think it's reasonable to conjecture that this bug is the same as, or is highly related to, #53135

I'm not seeing a lot of similarity with #53135, but it's possible I'm missing something.

My minions are all on CentOS 7; no Windows.

Anyway, I'm trying out cmd.run with "salt-call state.apply" as the argument in my schedule. I'll let folks know if that's an adequate workaround.

Thanks!
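
Roughly what I mean is a schedule entry along these lines (a sketch, not my exact config):

```
schedule:
  highstate-via-salt-call:
    function: cmd.run
    args:
      - salt-call state.apply
    when: 4:30am
    maxrunning: 1
```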

@jbeaird I still get the same problem when I do it with salt-call. Either I downgrade or just schedule restarts. I'm going to schedule restarts now. This isn't ideal but I guess it won't make that much difference to me.

Note that the highstate seems to work right after a minion start (I do mine as part of an orchestration reaction).

Thanks @H20-17 . Are you just going to add cron jobs (or the Windows equivalent, if the minions are Windows) that run salt-call on each box?
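
If we did go the cron route, I'd expect to manage it with something like this (a sketch; the path and timing are placeholders):

```
salt-call-highstate-cron:
  cron.present:
    - name: /usr/bin/salt-call state.apply > /dev/null 2>&1
    - user: root
    - minute: 30
    - hour: 4
```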

@jbeaird I'm using pillar-based scheduling, plus an orchestration that runs highstate (on the minion) in reaction to the minion start event. I think this will be OK for my purposes, but there is a limit on the number of orchestrations that will run at once (10 is the default, I think), so it could be a problem in some setups; it won't be a big deal for my use case.
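
A rough sketch of the wiring I mean is below; the file names, SLS paths, and state IDs are placeholders rather than my actual config:

```
# master config, e.g. /etc/salt/master.d/reactor.conf
reactor:
  - 'salt/minion/*/start':
    - salt://reactor/highstate-on-start.sls

# salt://reactor/highstate-on-start.sls
run_highstate_orch:
  runner.state.orchestrate:
    - args:
      - mods: orch.highstate-on-start
      - pillar:
          target_minion: {{ data['id'] }}

# salt://orch/highstate-on-start.sls
highstate-started-minion:
  salt.state:
    - tgt: {{ pillar['target_minion'] }}
    - highstate: True
```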

Thanks for the info.

One of the labels is 2019.2.2, but since this isn't closed I figure it hasn't been fixed yet.

* when the highstate runs, if there are any failures, I receive 20 or so reports of failure per minion via the returner, as opposed to just one with 2018.3.4; it appears that the minion tries state.apply over and over again for about two hours, after which it stops

@jbeaird What were those errors exactly? Trying to reproduce the bug, I got this error on repeat:

[INFO    ] Beacon load configuration invalid, not running.
Configuration for load beacon must be a list.

Thanks, @Akm0d. Sorry for not clarifying: the errors were state.apply failures, not errors per se.

I should note that this has followed me into CentOS 7.7, and persists even with the latest 2019.2 versions.

@jbeaird This is my top priority right now and I'm trying to fix it. Do you have any logs related to the issue?

Thanks very much! I couldn't ever find anything relevant in the logs.

What advice do you have for how I can provide you with decent debug data, given that in this case a scheduled highstate simply isn't running?

@jbeaird sure, what is the latest release you have seen this on? We are up to 2019.2.3 now.

Show me any errors or evidence about how you know the scheduled job isn't running; I want to reproduce exactly the same bug so I can fix it properly.

I will look into this too. I haven't tried running scheduled highstates since 2019.2.0 because of this, but I'll see what happens now.

I'm running a highstate every 15 minutes now on a test minion and I'm failing to reproduce the issue (schedules from pillar on the master). When I was experiencing this I was running a highstate just once a week on every minion (also scheduled from pillar on the master). When it was happening, it would happen on just about every Windows minion, and all it took was the one scheduled highstate. I will keep my test minion highstating all night (every 15 minutes). It's possible that the issue was resolved inadvertently by other changes (other bug fixes, perhaps?).

I'll let you guys know if I find anything. For the time being everything is working. (I'm now fully up to date with 2019.2.3 (py3) on both the master and the Windows minion.)

No indication of the problem yet. I don't think I can reproduce this any more. I'll comment again in the morning.

I am unable to reproduce the bug now. From my end it appears to have been fixed somewhere along the line. If I see it come up again I'll resume trying to triangulate it. Thanks.

Thanks @H20-17 for all the work.

I need to upgrade my master to the latest update, which I will do today.

I'll upgrade a bunch of minions, too, and let you know what happens.

In my experience, running the highstate more often than every 8-12 hours keeps the bug from showing up, so I'm going to leave it at every 24 hours for a few days.

Thanks, @Akm0d for putting this at the top of your list.

Thanks. I did not know that the time interval made a difference. That's an interesting find.

I am now testing with a 24 hour interval (debug level logging).

@H20-17, is that just a matter of running in the foreground on minions and master?

@jbeaird I'm running the minion in the foreground with the following command: salt-minion.bat --log-file-level=debug. I'm not doing any master side debugging as of yet. I'm going to see what happens first.

Thanks. I'm running in the foreground on a few minions, too.

Alright. I'm hoping this is somewhat helpful or at least might give us a direction to move in.

Recap: all minions that report to one master, and that master itself, have been upgraded to the latest version. Yesterday I set three minions and the master to run in the foreground.

One of the states in my highstate makes sure that the salt-minion service is running, so on the three minions I was running in the foreground, the highstate ran and told the salt-minion service to start. Those boxes now look like this:

[root@host ~]# ps auxwww | grep salt
root 2646 0.0 0.0 112712 964 pts/1 S+ 10:05 0:00 grep --color=auto salt
root 6889 0.0 0.5 314180 21768 pts/0 S+ Jan22 0:00 /usr/bin/python /usr/bin/salt-minion -l debug
root 6892 0.1 1.3 791584 51616 pts/0 Sl+ Jan22 1:16 /usr/bin/python /usr/bin/salt-minion -l debug
root 6894 0.0 0.5 395456 20256 pts/0 S+ Jan22 0:00 /usr/bin/python /usr/bin/salt-minion -l debug
root 25321 0.0 0.5 314188 21760 ? Ss 04:34 0:00 /usr/bin/python /usr/bin/salt-minion
root 25328 0.1 1.3 711484 53772 ? Sl 04:34 0:25 /usr/bin/python /usr/bin/salt-minion
root 25333 0.0 0.5 395456 20236 ? S 04:34 0:00 /usr/bin/python /usr/bin/salt-minion
root 31842 0.0 0.5 395456 20236 ? S Jan21 0:00 /usr/bin/python /usr/bin/salt-minion
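
(For reference, the service state I mentioned above is essentially the following; the state ID is a placeholder:)

```
salt-minion-service:
  service.running:
    - name: salt-minion
    - enable: True
```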

Drilling down a bit, on this particular host, we have another job scheduled, not highstate, to run every morning, but it started this morning and has kept running every few minutes. Here's the pillar schedule file:

```
schedule:
  oh-my-zsh:
    function: state.apply
    args:
      - library.name.user
    # run between 4:10-4:20am
    # - pull in the updates to the oh-my-zsh repo
    # - do not use the highstate returner (we don't need email alerts about the
    #   active development we pull in daily from this repo)
    # - complete before scheduled nightly state.apply starts at 4:30am
    # - minimize chances that the scheduled nightly state.apply will have changes
    #   because of a fresh commit to the oh-my-zsh repo
    when: 4:10am
    # splay job start time over 600 seconds (10 minutes)
    splay: 600
```

And here's the debug data for what has been running on that minion every few minutes since this morning:

[DEBUG ] schedule: Job oh-my-zsh was scheduled with jid_include, adding to cache (jid_include defaults to True)
[DEBUG ] schedule: Job oh-my-zsh was scheduled with a max number of 1
[INFO ] Running scheduled job: oh-my-zsh (runtime missed by 0.594065 seconds)
[DEBUG ] Subprocess SignalHandlingMultiprocessingProcess-1:101-Schedule-oh-my-zsh added
[DEBUG ] schedule.handle_func: adding this job to the jobcache with data {u'fun_args': [], u'jid': u'20200123095749634377', u'schedule': 'oh-my-zsh', u'pid': 1456, u'fun': 'state.apply', u'id': u'host.wheaton.edu'}
[DEBUG ] LazyLoaded state.apply
[DEBUG ] LazyLoaded saltutil.is_running
[DEBUG ] LazyLoaded grains.get
[DEBUG ] LazyLoaded config.get
[DEBUG ] key: test, ret: _|-
[DEBUG ] Initializing new AsyncZeroMQReqChannel for (u'/etc/salt/pki/minion', u'host.wheaton.edu', u'tcp://172.20.1.132:4506', u'aes')
[DEBUG ] Initializing new AsyncAuth for (u'/etc/salt/pki/minion', u'host.wheaton.edu', u'tcp://172.20.1.132:4506')
[DEBUG ] Connecting the Minion to the Master URI (for the return server): tcp://172.20.1.132:4506
[DEBUG ] Trying to connect to: tcp://172.20.1.132:4506
[DEBUG ] Gathering pillar data for state run
[DEBUG ] Determining pillar cache
[DEBUG ] Initializing new AsyncZeroMQReqChannel for (u'/etc/salt/pki/minion', u'host.wheaton.edu', u'tcp://172.20.1.132:4506', u'aes')
[DEBUG ] Initializing new AsyncAuth for (u'/etc/salt/pki/minion', u'host.wheaton.edu', u'tcp://172.20.1.132:4506')
[DEBUG ] Connecting the Minion to the Master URI (for the return server): tcp://172.20.1.132:4506
[DEBUG ] Trying to connect to: tcp://172.20.1.132:4506
[DEBUG ] salt.crypt.get_rsa_key: Loading private key
[DEBUG ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG ] Closing AsyncZeroMQReqChannel instance
[DEBUG ] Finished gathering pillar data for state run
[INFO ] Loading fresh modules for state activity
[DEBUG ] LazyLoaded jinja.render
[DEBUG ] LazyLoaded yaml.render
[DEBUG ] In saltenv 'base', looking at rel_path 'library/name/user.sls' to resolve 'salt://library/name/user.sls'
[DEBUG ] In saltenv 'base', ** considering ** path '/var/cache/salt/minion/files/base/library/name/user.sls' to resolve 'salt://library/name/user.sls'
[DEBUG ] compile template: /var/cache/salt/minion/files/base/library/name/user.sls
[DEBUG ] Jinja search path: [u'/var/cache/salt/minion/files/base']
[DEBUG ] Initializing new AsyncZeroMQReqChannel for (u'/etc/salt/pki/minion', u'host.wheaton.edu', u'tcp://172.20.1.132:4506', u'aes')
[DEBUG ] Initializing new AsyncAuth for (u'/etc/salt/pki/minion', u'host.wheaton.edu', u'tcp://172.20.1.132:4506')
[DEBUG ] Connecting the Minion to the Master URI (for the return server): tcp://172.20.1.132:4506
[DEBUG ] Trying to connect to: tcp://172.20.1.132:4506
[PROFILE ] Time (in seconds) to render '/var/cache/salt/minion/files/base/library/name/user.sls' using 'jinja' renderer: 0.0152320861816
[DEBUG ] Rendered data from file: /var/cache/salt/minion/files/base/library/name/user.sls:
# Install and configure ZSH
install ZSH:
  pkg.installed:
    - pkgs:
      - zsh

install oh-my-zsh:
  git.latest:
    - name: https://github.com/robbyrussell/oh-my-zsh.git
    - target: /home/username/.oh-my-zsh
    - user: username
    - require:
      - cmd: create name's home dir
      - pkg: install ZSH

manage oh-my-zsh permissions:
  file.directory:
    - name: /home/username/.oh-my-zsh
    - user: username
    - group: domain users
    - dir_mode: 700
    - recurse:
      - user
      - group
    - require:
      - git: install oh-my-zsh

install antigen:
  file.managed:
    - name: /home/username/.antigen.zsh
    - user: username
    - group: domain users
    - source: salt://library/name/antigen.zsh
    - require:
      - git: install oh-my-zsh

set .zshrc:
  file.managed:
    - name: /home/username/.zshrc
    - user: username
    - group: domain users
    - source: salt://library/name/.zshrc
    - require:
      - file: install antigen

# Make sure there is my user
create name's home dir:
  cmd.run:
    - name: /usr/sbin/mkhomedir_helper username 0077
    - creates: /home/username

# make sure there is a user .ssh directory
ssh directory:
  file.directory:
    - name: /home/username/.ssh
    - user: username
    - group: domain users
    - dir_mode: 700
    - require:
      - cmd: create name's home dir

# ssh public key
make username's public key:
  file.managed:
    - name: /home/username/.ssh/authorized_keys
    - user: username
    - group: domain users
    - mode: 600
    - source: salt://library/name/authorized_keys
    - require:
      - file: ssh directory

[DEBUG ] Results of YAML rendering:
OrderedDict([(u'install ZSH', OrderedDict([(u'pkg.installed', [OrderedDict([(u'pkgs', [u'zsh'])])])])), (u'install oh-my-zsh', OrderedDict([(u'git.latest', [OrderedDict([(u'name', u'https://github.com/robbyrussell/oh-my-zsh.git')]), OrderedDict([(u'target', u'/home/username/.oh-my-zsh')]), OrderedDict([(u'user', u'username')]), OrderedDict([(u'require', [OrderedDict([(u'cmd', u"create name's home dir")]), OrderedDict([(u'pkg', u'install ZSH')])])])])])), (u'manage oh-my-zsh permissions', OrderedDict([(u'file.directory', [OrderedDict([(u'name', u'/home/username/.oh-my-zsh')]), OrderedDict([(u'user', u'username')]), OrderedDict([(u'group', u'domain users')]), OrderedDict([(u'dir_mode', 700)]), OrderedDict([(u'recurse', [u'user', u'group'])]), OrderedDict([(u'require', [OrderedDict([(u'git', u'install oh-my-zsh')])])])])])), (u'install antigen', OrderedDict([(u'file.managed', [OrderedDict([(u'name', u'/home/username/.antigen.zsh')]), OrderedDict([(u'user', u'username')]), OrderedDict([(u'group', u'domain users')]), OrderedDict([(u'source', u'salt://library/name/antigen.zsh')]), OrderedDict([(u'require', [OrderedDict([(u'git', u'install oh-my-zsh')])])])])])), (u'set .zshrc', OrderedDict([(u'file.managed', [OrderedDict([(u'name', u'/home/username/.zshrc')]), OrderedDict([(u'user', u'username')]), OrderedDict([(u'group', u'domain users')]), OrderedDict([(u'source', u'salt://library/name/.zshrc')]), OrderedDict([(u'require', [OrderedDict([(u'file', u'install antigen')])])])])])), (u"create name's home dir", OrderedDict([(u'cmd.run', [OrderedDict([(u'name', u'/usr/sbin/mkhomedir_helper username 0077')]), OrderedDict([(u'creates', u'/home/username')])])])), (u'ssh directory', OrderedDict([(u'file.directory', [OrderedDict([(u'name', u'/home/username/.ssh')]), OrderedDict([(u'user', u'username')]), OrderedDict([(u'group', u'domain users')]), OrderedDict([(u'dir_mode', 700)]), OrderedDict([(u'require', [OrderedDict([(u'cmd', u"create name's home dir")])])])])])), (u"make username's public key", OrderedDict([(u'file.managed', [OrderedDict([(u'name', u'/home/username/.ssh/authorized_keys')]), OrderedDict([(u'user', u'username')]), OrderedDict([(u'group', u'domain users')]), OrderedDict([(u'mode', 600)]), OrderedDict([(u'source', u'salt://library/name/authorized_keys')]), OrderedDict([(u'require', [OrderedDict([(u'file', u'ssh directory')])])])])]))])
[PROFILE ] Time (in seconds) to render '/var/cache/salt/minion/files/base/library/name/user.sls' using 'yaml' renderer: 0.0042450428009
[DEBUG ] LazyLoaded config.option
[DEBUG ] LazyLoaded pkg.install
[DEBUG ] LazyLoaded pkg.installed
[DEBUG ] schedule.handle_func: Adding splay of 328 seconds to next run.
[DEBUG ] key: ifttt.secret_key, ret: |-
[DEBUG ] key: ifttt:secret_key, ret: |-
[DEBUG ] key: pushbullet.api_key, ret: |-
[DEBUG ] key: pushbullet:api_key, ret: |-
[DEBUG ] LazyLoaded systemd.booted
[DEBUG ] You should upgrade pyOpenSSL to at least 0.14.1 to enable the use of X509 extensions in the tls module
[DEBUG ] key: victorops.api_key, ret: |-
[DEBUG ] key: victorops:api_key, ret: |-
[DEBUG ] DSC: Only available on Windows systems
[DEBUG ] Module PSGet: Only available on Windows systems
[DEBUG ] LazyLoaded platform.is_windows
[DEBUG ] Could not LazyLoad pkg.ex_mod_init: 'pkg.ex_mod_init' is not available.
[INFO ] Running state [install ZSH] at time 09:57:51.526328
[INFO ] Executing state pkg.installed for [install ZSH]
[DEBUG ] Could not LazyLoad pkg.resolve_capabilities: 'pkg.resolve_capabilities' is not available.
[INFO ] Executing command [u'rpm', u'-qa', u'--queryformat', u'%{NAME}|-%{EPOCH}|-%{VERSION}|-%{RELEASE}|-%{ARCH}|-(none)|-%{INSTALLTIME}'] in directory '/root'
[INFO ] All specified packages are already installed
[INFO ] Completed state [install ZSH] at time 09:57:53.265552 (duration_in_ms=1739.224)
[INFO ] Executing command [u'git', u'--version'] in directory '/root'
[DEBUG ] stdout: git version 1.8.3.1
[DEBUG ] LazyLoaded git.latest
[DEBUG ] LazyLoaded cmd.run
[INFO ] Running state [/usr/sbin/mkhomedir_helper username 0077] at time 09:57:53.285793
[INFO ] Executing state cmd.run for [/usr/sbin/mkhomedir_helper username 0077]
[INFO ] /home/username exists
[INFO ] Completed state [/usr/sbin/mkhomedir_helper username 0077] at time 09:57:53.287441 (duration_in_ms=1.648)
[INFO ] Running state [https://github.com/robbyrussell/oh-my-zsh.git] at time 09:57:53.287824
[INFO ] Executing state git.latest for [https://github.com/robbyrussell/oh-my-zsh.git]
[INFO ] Executing command [u'git', u'config', u'--global', u'--get-regexp', u'filter\.lfs\.'] in directory '/root'
[DEBUG ] retcode: 1
[INFO ] Checking remote revision for https://github.com/robbyrussell/oh-my-zsh.git
[INFO ] Executing command [u'git', u'ls-remote', u'https://github.com/robbyrussell/oh-my-zsh.git'] as user 'username' in directory '/home/username'
[DEBUG ] env command: (u'su', u'-s', '/bin/bash', u'-', u'username', u'-c', '/usr/bin/python')
[DEBUG ] stdout: 4e45e12dc355e3ba34e7e40ce4936fb222f0155c HEAD
4e45e12dc355e3ba34e7e40ce4936fb222f0155c refs/heads/master
...
I'm cutting a bunch of these out for brevity's sake...
...
974b111e08e9845cda5bea17212c523eeef45985 refs/pull/999/merge
[INFO ] Executing command [u'git', u'for-each-ref', u'--format', u'%(refname:short)', u'refs/heads/'] as user 'username' in directory '/home/username/.oh-my-zsh'
[DEBUG ] env command: (u'su', u'-s', '/bin/bash', u'-', u'username', u'-c', '/usr/bin/python')
[DEBUG ] stdout: master
[INFO ] Executing command [u'git', u'for-each-ref', u'--format', u'%(refname:short)', u'refs/tags/'] as user 'username' in directory '/home/username/.oh-my-zsh'
[DEBUG ] env command: (u'su', u'-s', '/bin/bash', u'-', u'username', u'-c', '/usr/bin/python')
[INFO ] Checking local revision for /home/username/.oh-my-zsh
[INFO ] Executing command [u'git', u'rev-parse', u'HEAD'] as user 'username' in directory '/home/username/.oh-my-zsh'
[DEBUG ] env command: (u'su', u'-s', '/bin/bash', u'-', u'username', u'-c', '/usr/bin/python')
[DEBUG ] stdout: 4e45e12dc355e3ba34e7e40ce4936fb222f0155c
[INFO ] Checking local branch for /home/username/.oh-my-zsh
[INFO ] Executing command [u'git', u'rev-parse', u'--abbrev-ref', u'HEAD'] as user 'username' in directory '/home/username/.oh-my-zsh'
[DEBUG ] env command: (u'su', u'-s', '/bin/bash', u'-', u'username', u'-c', '/usr/bin/python')
[DEBUG ] stdout: master
[INFO ] Executing command [u'git', u'remote', u'--verbose'] as user 'username' in directory '/home/username/.oh-my-zsh'
[DEBUG ] env command: (u'su', u'-s', '/bin/bash', u'-', u'username', u'-c', '/usr/bin/python')
[DEBUG ] stdout: origin https://github.com/robbyrussell/oh-my-zsh.git (fetch)
origin https://github.com/robbyrussell/oh-my-zsh.git (push)
[INFO ] Executing command [u'git', u'diff', u'HEAD'] as user 'username' in directory '/home/username/.oh-my-zsh'
[DEBUG ] env command: (u'su', u'-s', '/bin/bash', u'-', u'username', u'-c', '/usr/bin/python')
[INFO ] Executing command [u'git', u'merge-base', u'--is-ancestor', u'4e45e12dc355e3ba34e7e40ce4936fb222f0155c', u'4e45e12dc355e3ba34e7e40ce4936fb222f0155c'] as user 'username' in directory '/home/username/.oh-my-zsh'
[DEBUG ] env command: (u'su', u'-s', '/bin/bash', u'-', u'username', u'-c', '/usr/bin/python')
[INFO ] Executing command [u'git', u'rev-parse', u'--abbrev-ref', u'master@{upstream}'] as user 'username' in directory '/home/username/.oh-my-zsh'
[DEBUG ] env command: (u'su', u'-s', '/bin/bash', u'-', u'username', u'-c', '/usr/bin/python')
[DEBUG ] stdout: origin/master
[INFO ] Executing command [u'git', u'ls-remote', u'--tags', u'origin'] as user 'username' in directory '/home/username/.oh-my-zsh'
[DEBUG ] env command: (u'su', u'-s', '/bin/bash', u'-', u'username', u'-c', '/usr/bin/python')
[INFO ] Executing command [u'git', u'rev-parse', u'HEAD'] as user 'username' in directory '/home/username/.oh-my-zsh'
[DEBUG ] env command: (u'su', u'-s', '/bin/bash', u'-', u'username', u'-c', '/usr/bin/python')
[DEBUG ] stdout: 4e45e12dc355e3ba34e7e40ce4936fb222f0155c
[INFO ] Repository /home/username/.oh-my-zsh is up-to-date
[INFO ] Completed state [https://github.com/robbyrussell/oh-my-zsh.git] at time 09:57:55.771382 (duration_in_ms=2483.557)
[DEBUG ] LazyLoaded file.directory
[INFO ] Running state [/home/username/.oh-my-zsh] at time 09:57:55.778111
[INFO ] Executing state file.directory for [/home/username/.oh-my-zsh]
[INFO ] The directory /home/username/.oh-my-zsh is in the correct state
[INFO ] Completed state [/home/username/.oh-my-zsh] at time 09:57:57.057594 (duration_in_ms=1279.483)
[INFO ] Running state [/home/username/.antigen.zsh] at time 09:57:57.058069
[INFO ] Executing state file.managed for [/home/username/.antigen.zsh]
[DEBUG ] Initializing new AsyncZeroMQReqChannel for (u'/etc/salt/pki/minion', u'host.wheaton.edu', u'tcp://172.20.1.132:4506', u'aes')
[DEBUG ] Initializing new AsyncAuth for (u'/etc/salt/pki/minion', u'host.wheaton.edu', u'tcp://172.20.1.132:4506')
[DEBUG ] Connecting the Minion to the Master URI (for the return server): tcp://172.20.1.132:4506
[DEBUG ] Trying to connect to: tcp://172.20.1.132:4506
[DEBUG ] In saltenv 'base', looking at rel_path 'library/name/antigen.zsh' to resolve 'salt://library/name/antigen.zsh'
[DEBUG ] In saltenv 'base', ** considering ** path '/var/cache/salt/minion/files/base/library/name/antigen.zsh' to resolve 'salt://library/name/antigen.zsh'
[INFO ] File /home/username/.antigen.zsh is in the correct state
[INFO ] Completed state [/home/username/.antigen.zsh] at time 09:57:57.092618 (duration_in_ms=34.549)
[INFO ] Running state [/home/username/.zshrc] at time 09:57:57.093287
[INFO ] Executing state file.managed for [/home/username/.zshrc]
[DEBUG ] In saltenv 'base', looking at rel_path 'library/name/.zshrc' to resolve 'salt://library/name/.zshrc'
[DEBUG ] In saltenv 'base', ** considering ** path '/var/cache/salt/minion/files/base/library/name/.zshrc' to resolve 'salt://library/name/.zshrc'
[INFO ] File /home/username/.zshrc is in the correct state
[INFO ] Completed state [/home/username/.zshrc] at time 09:57:57.109653 (duration_in_ms=16.366)
[INFO ] Running state [/home/username/.ssh] at time 09:57:57.110128
[INFO ] Executing state file.directory for [/home/username/.ssh]
[INFO ] The directory /home/username/.ssh is in the correct state
[INFO ] Completed state [/home/username/.ssh] at time 09:57:57.111696 (duration_in_ms=1.567)
[INFO ] Running state [/home/username/.ssh/authorized_keys] at time 09:57:57.112288
[INFO ] Executing state file.managed for [/home/username/.ssh/authorized_keys]
[DEBUG ] In saltenv 'base', looking at rel_path 'library/name/authorized_keys' to resolve 'salt://library/name/authorized_keys'
[DEBUG ] In saltenv 'base', ** considering ** path '/var/cache/salt/minion/files/base/library/name/authorized_keys' to resolve 'salt://library/name/authorized_keys'
[INFO ] File /home/username/.ssh/authorized_keys is in the correct state
[INFO ] Completed state [/home/username/.ssh/authorized_keys] at time 09:57:57.129189 (duration_in_ms=16.9)
[DEBUG ] File /var/cache/salt/minion/accumulator/139640369767056 does not exist, no need to cleanup
[DEBUG ] LazyLoaded state.check_result
[DEBUG ] SaltEvent PUB socket URI: /var/run/salt/minion/minion_event_f84fe5becb_pub.ipc
[DEBUG ] SaltEvent PULL socket URI: /var/run/salt/minion/minion_event_f84fe5becb_pull.ipc
[DEBUG ] Sending event: tag = schedule_return; data = {u'fun_args': ['library.name.user'], u'jid': u'req', u'return': {u'file|-install antigen|-/home/username/.antigen.zsh_|-managed': {u'comment': u'File /home/username/.antigen.zsh is in the correct state', u'name': u'/home/username/.antigen.zsh', u'start_time': '09:57:57.058069', u'result': True, u'duration': 34.549, u'run_num': 4, u'sls': u'library.name.user', u'changes': {}, u'id': u'install antigen'}, u'file_|-manage oh-my-zsh permissions_|-/home/username/.oh-my-zsh_|-directory': {u'comment': u'The directory /home/username/.oh-my-zsh is in the correct state', u'name': u'/home/username/.oh-my-zsh', u'start_time': '09:57:55.778111', u'result': True, u'duration': 1279.483, u'run_num': 3, u'sls': u'library.name.user', u'changes': {}, u'id': u'manage oh-my-zsh permissions'}, u'file_|-ssh directory_|-/home/username/.ssh_|-directory': {u'comment': u'The directory /home/username/.ssh is in the correct state', u'name': u'/home/username/.ssh', u'start_time': '09:57:57.110129', u'result': True, u'duration': 1.567, u'run_num': 6, u'sls': u'library.name.user', u'changes': {}, u'id': u'ssh directory'}, u'git_|-install oh-my-zsh_|-https://github.com/robbyrussell/oh-my-zsh.git_|-latest': {u'comment': u'Repository /home/username/.oh-my-zsh is up-to-date', u'name': u'https://github.com/robbyrussell/oh-my-zsh.git', u'start_time': '09:57:53.287825', u'result': True, u'duration': 2483.557, u'run_num': 2, u'sls': u'library.name.user', u'changes': {}, u'id': u'install oh-my-zsh'}, u'pkg_|-install ZSH_|-install ZSH_|-installed': {u'comment': u'All specified packages are already installed', u'name': u'zsh', u'start_time': '09:57:51.526328', u'result': True, u'duration': 1739.224, u'run_num': 0, u'sls': u'library.name.user', u'changes': {}, u'id': u'install ZSH'}, u"file_|-make username's public key_|-/home/username/.ssh/authorized_keys_|-managed": {u'comment': u'File /home/username/.ssh/authorized_keys is in the correct state', u'name': u'/home/username/.ssh/authorized_keys', u'start_time': '09:57:57.112289', u'result': True, u'duration': 16.9, u'run_num': 7, u'sls': u'library.name.user', u'changes': {}, u'id': u"make username's public key"}, u'file_|-set .zshrc_|-/home/username/.zshrc_|-managed': {u'comment': u'File /home/username/.zshrc is in the correct state', u'name': u'/home/username/.zshrc', u'start_time': '09:57:57.093287', u'result': True, u'duration': 16.366, u'run_num': 5, u'sls': u'library.name.user', u'changes': {}, u'id': u'set .zshrc'}, u"cmd_|-create name's home dir_|-/usr/sbin/mkhomedir_helper username 0077_|-run": {u'comment': u'/home/username exists', u'id': u"create name's home dir", u'name': u'/usr/sbin/mkhomedir_helper username 0077', u'duration': 1.648, u'run_num': 1, u'start_time': '09:57:53.285793', u'sls': u'library.name.user', u'changes': {}, u'result': True}}, u'retcode': 0, u'success': True, u'schedule': 'oh-my-zsh', u'cmd': u'_return', u'pid': 1456, u'_stamp': '2020-01-23T15:57:57.135470', u'fun': 'state.apply', u'id': u'host.wheaton.edu'}
[DEBUG ] schedule.handle_func: Removing /var/cache/salt/minion/proc/20200123095749634377
[DEBUG ] Subprocess SignalHandlingMultiprocessingProcess-1:101-Schedule-oh-my-zsh cleaned up

I don't know (a) why that scheduled job keeps running every few minutes, (b) whether it could be caused by having more than one minion process running, or (c) whether this is at all related to the original problem for which this issue was opened. On some other minions, which I am not running in the foreground, I can see a highstate run that repeated over and over again this morning, even though there were no failures in the run. The salt-minion logs on those minions are empty, so I don't have any data to report from them. It seems like it could be the same kind of problem as the one I /do/ have debug data for here, even though that data is not from a highstate run.

Hope this helps somehow. Let me know what to try next. Thanks much!

I'm going to try to reproduce the issue with what we've got; I'll let you know if I need more info. Thank you!

Was not able to test today. Will try to have a test result by Monday.

I'm pretty sure a bunch of my minions didn't run highstate today but am having a hard time verifying that.

I've added some whitespace to a file.managed file and downgraded some of the minions back to 2018.3.3-1 to see what happens tomorrow morning: whether the highstate runs on some of them and not on others.

Sorry guys, I have had no luck reproducing the issue with my test minion and a daily scheduled highstate. When I had the problem I was highstating approximately 20 or so minions on any given day, at a specific time with a 120-second splay (I believe the splay should have been much bigger; that must have been an oversight). Every PC in my fleet gets highstated once a week, but the day depends on which minion it is. I'm going to stop testing this now and go back to weekly highstates of my minions. If I observe anything I will let you guys know, and then I will do more testing involving more minions.

The machines I rolled back to 2018.3.3-1 are back to what I consider "normal" behavior: the highstate runs daily and, when there's a state.apply failure, it only attempts to run once.

On those downgraded minions, the file I added whitespace to has a timestamp of Saturday morning, which would have been the next highstate run. I also have returner output for that Saturday morning run. On the minions running 2019.2.3-1, the file was not changed and there was no highstate output via the returner, so no daily highstate ran.

After rebooting one of the machines that hadn't run highstate since I upgraded the minion, the highstate ran at the expected time, about 15 hours after the reboot.

Catching up on this and I might have been able to reproduce it. Need to do a bit more testing.

Still digging into the fix for this, but the cause is a conflict between when and splay. If you remove splay (not ideal, obviously), I'm confident that your scheduled state runs will run when expected and only once per day.
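
For anyone who needs an interim workaround, that means the schedule entry from the top of this issue becomes simply the original pillar minus splay:

```
schedule:
  highstate:
    function: state.highstate
    when: 4:30am
    maxrunning: 1
    returner: highstate
```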

Thanks @garethgreenaway !

Can you think of any simple workarounds we can use instead of splay?
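
One idea I've considered, though I haven't tested it, is deriving a small per-minion minute offset in the pillar itself instead of using splay; a sketch, assuming Salt's md5 Jinja filter is available when the pillar renders:

```
{% set offset = ((grains['id'] | md5)[:2] | int(base=16)) % 10 %}

schedule:
  highstate:
    function: state.highstate
    when: '4:{{ 30 + offset }}am'
    maxrunning: 1
    returner: highstate
```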

@jbeaird If you're able to live patch your minions, this patch against a 2019.2.3 install will fix the issue:

```
diff --git a/salt/utils/schedule.py b/salt/utils/schedule.py
index 3ec12f576d..b32c2690bc 100644
--- a/salt/utils/schedule.py
+++ b/salt/utils/schedule.py
@@ -1003,7 +1003,7 @@ class Schedule(object):

                     if when < now - loop_interval and \
                             not data.get('_run', False) and \
-                            not data.get('run', False) and \
+                            not run and \
                             not data['_splay']:
                         data['_next_fire_time'] = None
                         data['_continue'] = True
@@ -1059,7 +1059,7 @@ class Schedule(object):

                 if when < now - loop_interval and \
                         not data.get('_run', False) and \
-                        not data.get('run', False) and \
+                        not run and \
                         not data['_splay']:
                     data['_next_fire_time'] = None
                    data['_continue'] = True
```

Thanks again, @garethgreenaway !

Closing as fixed in 3000.1.
