Salt: Pillar-based schedule runs on each pillar refresh

Created on 8 Jan 2020 · 5 Comments · Source: saltstack/salt

Description of Issue

Any scheduled job that is defined through pillar and has run_on_start: true runs on every pillar refresh (even if the pillar is unchanged), ignoring the scheduled interval.

Steps to Reproduce Issue

Pillar:

schedule:
  test_schedule:
    function: test.ping
    minutes: 1
    run_on_start: true

Now run saltutil.refresh_pillar and watch the minion log for test_schedule.

Based on this comment: https://github.com/saltstack/salt/issues/51467#issuecomment-538340602

Versions Report

salt --versions-report

Salt Version:
           Salt: 2019.2.2

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 2.6.1
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.10
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.5.6
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 3.6.9 (default, Nov  7 2019, 10:44:02)
   python-gnupg: 0.4.1
         PyYAML: 3.12
          PyZMQ: 16.0.2
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.5.3
            ZMQ: 4.2.5

System Versions:
           dist: Ubuntu 18.04 bionic
         locale: UTF-8
        machine: x86_64
        release: 4.15.0-72-generic
         system: Linux
        version: Ubuntu 18.04 bionic

Aluminium Bug phase-execute severity-medium status-in-prog


max-arnold


All 5 comments

@max-arnold any way you can open a PR for this?

sagetherage on 26 Mar 2020

@sagetherage Unfortunately, I have no idea why it happens and I'm not familiar with the Salt code that deals with scheduling.

max-arnold on 26 Mar 2020

@max-arnold we will try to get to this in Magnesium, but I cannot commit to the work at the moment. I have it assigned to Gareth and will watch it and attempt to reassign as we go through the release cycle.

sagetherage on 29 Jul 2020


I think this happens because refresh_pillar causes a 'delete job' followed by an 'add job' for each pillar schedule job, even if the job definition hasn't changed between refreshes. The 'run_on_start' handling is likely triggered somewhere in the 'add job' call stack, which probably assumes it's only being called once in the lifecycle of the minion. That is, 'add job' likely doesn't consider the case where refresh_pillar removes a job and adds the same job back.
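To make that hypothesis concrete, here is a toy model of it. ToyScheduler and all of its methods are made up for illustration and are not Salt's scheduler code; the only assumptions are that a refresh deletes and re-adds every pillar job and that a freshly added job has no memory of having started.

```python
class ToyScheduler:
    """Hypothetical stand-in for a minion scheduler; not Salt's implementation."""

    def __init__(self):
        self.jobs = {}
        self.runs = []  # names of jobs that fired, in order

    def add_job(self, name, definition):
        # Adding a job creates fresh state, so the scheduler believes
        # the job has never been evaluated before.
        self.jobs[name] = {"definition": dict(definition), "started": False}

    def delete_job(self, name):
        self.jobs.pop(name, None)

    def evaluate(self):
        # A job with run_on_start fires the first time it is evaluated.
        for name, job in self.jobs.items():
            if job["definition"].get("run_on_start") and not job["started"]:
                self.runs.append(name)
            job["started"] = True

    def refresh_pillar(self, pillar_schedule):
        # Assumed behaviour: every pillar job is deleted and re-added,
        # whether or not its definition changed.
        for name, definition in pillar_schedule.items():
            self.delete_job(name)
            self.add_job(name, definition)


pillar = {
    "test_schedule": {"function": "test.ping", "minutes": 1, "run_on_start": True}
}
sched = ToyScheduler()
sched.refresh_pillar(pillar)  # initial load
sched.evaluate()
for _ in range(3):
    sched.refresh_pillar(pillar)  # pillar is unchanged each time
    sched.evaluate()
print(len(sched.runs))  # one run per refresh, not one run in total
```

In this model the job fires once for the initial load and once more for each of the three unchanged refreshes, which matches the behaviour reported above.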

There's a related quirk - if you have a job that's scheduled via pillar to run, say, every 60 minutes, like this:

schedule:
  test_schedule:
    function: test.ping
    minutes: 60

And for whatever reason you're calling refresh_pillar every 30 minutes -- perhaps because you've automated it or because it's changing frequently and you're doing it manually -- the test_schedule job will never be triggered because the underlying timer starts over at zero when the pillar is refreshed. That is, the timer never reaches 60 minutes to trigger the job.
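The arithmetic can be checked with a minute-by-minute simulation. This is only a sketch: count_runs is a hypothetical helper, and it assumes that a refresh simply resets the job's elapsed-time counter to zero.

```python
def count_runs(interval_min, refresh_every_min, total_min):
    """Count job runs when each refresh resets the elapsed-time counter."""
    elapsed = 0
    runs = 0
    for minute in range(1, total_min + 1):
        elapsed += 1
        if elapsed >= interval_min:
            runs += 1
            elapsed = 0
        if refresh_every_min and minute % refresh_every_min == 0:
            elapsed = 0  # assumed: the refresh discards the job's timer
    return runs


one_day = 24 * 60
print(count_runs(60, 30, one_day))    # pillar refreshed every 30 minutes
print(count_runs(60, None, one_day))  # pillar never refreshed
```

With a 60 minute interval and a refresh every 30 minutes, the counter never gets past 30, so the job does not run at all over the simulated day. Without refreshes it runs 24 times.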

It seems like this has the potential to be the cause of many "why doesn't my pillar scheduled job run consistently" type issues. It also seems like a fairly complex problem where the refresh pillar behavior needs to be well-defined and edge cases documented.

I'll take a stab at describing how I think maybe it should work:

Jobs can be added to pillar and will take effect on refresh pillar
Jobs can be deleted from pillar and will be removed on refresh pillar

For existing jobs (identified by name), ideally:

Jobs will only be removed and re-added if something changed in their definition

However, I'm not sure how difficult it would be to define 'something changed'. Maybe the Salt engine already has some deep dictionary comparison that can be used there.
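For what it's worth, if the job definitions end up as plain Python dicts (an assumption here, I haven't checked how Salt stores them), the built-in == operator already compares nested dicts and lists by value and ignores key order. definition_changed below is a hypothetical helper, not an existing Salt function.

```python
def definition_changed(old_job, new_job):
    """True if two job definitions differ anywhere, including nested values."""
    return old_job != new_job


old = {"function": "test.ping", "minutes": 60, "kwargs": {"tags": ["a", "b"]}}
# Same content, different key order: should count as unchanged.
same = {"minutes": 60, "kwargs": {"tags": ["a", "b"]}, "function": "test.ping"}
# One nested value differs: should count as changed.
changed = {"function": "test.ping", "minutes": 60, "kwargs": {"tags": ["a", "c"]}}

print(definition_changed(old, same))
print(definition_changed(old, changed))
```

This would only work if the stored job contains nothing but what came from pillar. Any extra state the scheduler writes into the same dict would have to be excluded from the comparison.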

If that's too difficult to pull off in the implementation, maybe this is the rule (with documentation):

Changes to existing jobs will not be applied during refresh_pillar and will only take effect on minion restart. This gives the developer control over when 'run_on_start' fires and when the timer resets.

Or perhaps the problem only affects 'run_on_start: true' and certain types of incremental schedules, so the edge case only applies to those.

jholloway7 on 11 Aug 2020

Was finally able to take a look at this one. The root of the issue is that when pillar is refreshed, all the schedule items are replaced regardless of whether anything changed, so the internal values that track when the last run took place are wiped out. I have a fix worked up that checks to make sure we only swap pillar items if the non-internal values have changed. Just need to work out some tests for a PR.
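A rough sketch of that idea, not the actual patch: public_items and merge_pillar_schedule are hypothetical names, the rule that internal keys start with an underscore is an assumption made for illustration, and _last_run is a made-up example of such a key.

```python
def public_items(job):
    """Drop bookkeeping keys; assumes internal keys start with an underscore."""
    return {k: v for k, v in job.items() if not str(k).startswith("_")}


def merge_pillar_schedule(current, incoming):
    """Build the new schedule, keeping unchanged jobs and their internal state."""
    merged = {}
    for name, new_job in incoming.items():
        old_job = current.get(name)
        if old_job is not None and public_items(old_job) == public_items(new_job):
            merged[name] = old_job  # unchanged: keep the tracked run state
        else:
            merged[name] = dict(new_job)  # new or changed: start fresh
    return merged  # jobs missing from incoming are dropped


current = {
    "test_schedule": {"function": "test.ping", "minutes": 60, "_last_run": 1200}
}
unchanged = merge_pillar_schedule(
    current, {"test_schedule": {"function": "test.ping", "minutes": 60}}
)
changed = merge_pillar_schedule(
    current, {"test_schedule": {"function": "test.ping", "minutes": 30}}
)
print("_last_run" in unchanged["test_schedule"])
print("_last_run" in changed["test_schedule"])
```

An unchanged job keeps its tracked state across the refresh, so neither the interval timer nor the run-on-start decision is reset. A job whose pillar definition actually changed is swapped in with fresh state.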