Any scheduled job that is defined through pillar and has `run_on_start: true` runs on every pillar refresh (even if the job definition is unchanged), thereby ignoring the scheduled interval.
Pillar:

```yaml
schedule:
  test_schedule:
    function: test.ping
    minutes: 1
    run_on_start: true
```
Now run `saltutil.refresh_pillar` and watch the minion log for `test_schedule`.
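For convenience, here is a minimal Python driver for the repro step using Salt's `LocalClient` (the minion id `testminion` is an assumption; any targeting works):

```python
# Trigger the pillar refresh from Python instead of the salt CLI.
# Equivalent to: salt 'testminion' saltutil.refresh_pillar
import salt.client

client = salt.client.LocalClient()
print(client.cmd("testminion", "saltutil.refresh_pillar"))
# Then watch /var/log/salt/minion: test_schedule runs immediately after
# every refresh, even though the schedule definition did not change.
```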
Based on this comment: https://github.com/saltstack/salt/issues/51467#issuecomment-538340602
`salt --versions-report`:

```
Salt Version:
          Salt: 2019.2.2

Dependency Versions:
          cffi: Not Installed
      cherrypy: Not Installed
      dateutil: 2.6.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
         ioflo: Not Installed
        Jinja2: 2.10
       libgit2: Not Installed
       libnacl: Not Installed
      M2Crypto: Not Installed
          Mako: Not Installed
  msgpack-pure: Not Installed
msgpack-python: 0.5.6
  mysql-python: Not Installed
     pycparser: Not Installed
      pycrypto: 2.6.1
  pycryptodome: Not Installed
        pygit2: Not Installed
        Python: 3.6.9 (default, Nov 7 2019, 10:44:02)
  python-gnupg: 0.4.1
        PyYAML: 3.12
         PyZMQ: 16.0.2
          RAET: Not Installed
         smmap: Not Installed
       timelib: Not Installed
       Tornado: 4.5.3
           ZMQ: 4.2.5

System Versions:
          dist: Ubuntu 18.04 bionic
        locale: UTF-8
       machine: x86_64
       release: 4.15.0-72-generic
        system: Linux
       version: Ubuntu 18.04 bionic
```
@max-arnold any way you can open a PR for this?
@sagetherage Unfortunately, I have no idea why it happens and I'm not familiar with the Salt code that deals with scheduling.
@max-arnold We will try to get to this in Magnesium. I cannot commit to the work ATM, but I have it assigned to Gareth and will watch it, reassigning as needed as we go through the release cycle.
I think this happens because `refresh_pillar` triggers a 'delete job' followed by an 'add job' for each pillar schedule job, even if the job definition hasn't changed between refreshes. The `run_on_start` logic is likely triggered somewhere in the 'add job' call stack, which seems to assume it is only called once in the lifecycle of the minion. That is, 'add job' probably doesn't consider the case where `refresh_pillar` removes and re-adds the same job.
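To make the hypothesis concrete, here is a toy model of the suspected delete/add flow. This is a sketch, not Salt's actual code; the `_next_fire_time` key and class names are illustrative:

```python
import time

class MiniScheduler:
    def __init__(self):
        self.jobs = {}  # name -> definition plus internal bookkeeping

    def run(self, name):
        print(f"running {name}")

    def add_job(self, name, definition):
        job = dict(definition)
        # Internal state analogous to Salt's next-fire tracking.
        job["_next_fire_time"] = time.time() + job["minutes"] * 60
        self.jobs[name] = job
        if job.get("run_on_start"):
            # Fires on *every* add, including a re-add of the same job.
            self.run(name)

    def delete_job(self, name):
        self.jobs.pop(name, None)

    def refresh_pillar(self, pillar_schedule):
        # Suspected behavior: unconditional delete + add wipes internal
        # timers and re-triggers run_on_start even when nothing changed.
        for name, definition in pillar_schedule.items():
            self.delete_job(name)
            self.add_job(name, definition)

sched = MiniScheduler()
pillar = {"test_schedule": {"function": "test.ping",
                            "minutes": 1,
                            "run_on_start": True}}
sched.refresh_pillar(pillar)  # prints "running test_schedule"
sched.refresh_pillar(pillar)  # prints again despite no change
```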
There's a related quirk: if you have a job that's scheduled via pillar to run, say, every 60 minutes, like this:

```yaml
schedule:
  test_schedule:
    function: test.ping
    minutes: 60
```
And for whatever reason you're calling `refresh_pillar` every 30 minutes (perhaps because you've automated it, or because the pillar changes frequently and you refresh manually), then `test_schedule` will never be triggered: the underlying timer restarts from zero on every refresh, so it never reaches the 60-minute mark.
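A tiny simulation makes the failure mode clear (the 60-minute job and 30-minute refresh cadence from the example above are the only assumptions):

```python
# A 60-minute interval job whose internal countdown is reset by a
# pillar refresh every 30 minutes never gets a chance to fire.
interval = 60        # job interval, minutes
refresh_every = 30   # pillar refresh cadence, minutes

elapsed = 0
fired = False
for minute in range(1, 181):          # simulate three hours
    elapsed += 1
    if elapsed >= interval:           # would the job fire now?
        print(f"job fires at minute {minute}")
        fired = True
        elapsed = 0
    if minute % refresh_every == 0:   # refresh wipes the timer
        elapsed = 0

if not fired:
    print("job never fired: the timer is reset before reaching 60")
```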
It seems like this has the potential to be the cause of many "why doesn't my pillar-scheduled job run consistently" type issues. It also seems like a fairly complex problem: the pillar-refresh behavior needs to be well defined and its edge cases documented.
I'll take a stab at describing how I think it should work. For existing jobs (identified by name), ideally:

- if nothing in the job definition changed, leave the existing job (and its internal timer state) untouched;
- if something changed, replace the job with the new definition.

However, I'm not sure how difficult it would be to define 'something changed'. Maybe the Salt engine already has some deep dictionary comparison that can be used there (a sketch of one such check follows below). If that's too difficult to pull off in the implementation, a fallback is to document the current behavior as the rule: a pillar refresh replaces all pillar-defined jobs and resets their timers. Or perhaps the problem only affects `run_on_start: true` and certain kinds of interval schedules, so the edge case only applies to those.
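For illustration, here is one way the 'something changed' check could look. It assumes Salt's internal bookkeeping keys are underscore-prefixed (e.g. `_next_fire_time`); the helper names are hypothetical, not Salt's API:

```python
def public_view(value):
    """Recursively drop underscore-prefixed (internal) keys."""
    if isinstance(value, dict):
        return {k: public_view(v) for k, v in value.items()
                if not k.startswith("_")}
    if isinstance(value, list):
        return [public_view(v) for v in value]
    return value

def job_changed(old_job, new_job):
    # == on nested dicts/lists is already a deep comparison in Python,
    # so stripping the internal keys first is all that's needed.
    return public_view(old_job) != public_view(new_job)

old = {"function": "test.ping", "minutes": 60,
       "_next_fire_time": 1234567890.0}
new = {"function": "test.ping", "minutes": 60}
print(job_changed(old, new))  # False -> keep the job and its timers
```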
Was finally able to take a look at this one. The root of the issue is that when pillar is refreshed, all the schedule items are replaced regardless of any changes, so the internal values that track when the last run took place are wiped out. I have a fix worked up that only swaps a pillar schedule item if its non-internal values have changed. I just need to work out some tests for a PR.
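A hedged sketch of what "only swap pillar items if the non-internal values have changed" could look like; the underscore-prefix assumption and helper names are illustrative, not the actual PR:

```python
def _public(job):
    # Assumption: internal bookkeeping keys are underscore-prefixed.
    return {k: v for k, v in job.items() if not k.startswith("_")}

def apply_pillar_schedule(current_jobs, pillar_schedule):
    """Build the new job table, preserving internal state (last-run
    tracking, next-fire times) for entries whose definition is
    unchanged."""
    new_jobs = {}
    for name, definition in pillar_schedule.items():
        existing = current_jobs.get(name)
        if existing is not None and _public(existing) == _public(definition):
            new_jobs[name] = existing          # unchanged: keep internal values
        else:
            new_jobs[name] = dict(definition)  # changed or new: replace
    return new_jobs

current = {"test_schedule": {"function": "test.ping", "minutes": 60,
                             "_last_run": 1234567890.0}}
refreshed = {"test_schedule": {"function": "test.ping", "minutes": 60}}
result = apply_pillar_schedule(current, refreshed)
print(result["test_schedule"]["_last_run"])  # internal state preserved
```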