Nomad v0.8.7 (21a2d93eecf018ad2209a5eab6aae6c359267933+CHANGES)
Red Hat Enterprise Linux Server release 7.6 (Maipo)
I had a periodic job scheduled during the DST rollover this past weekend that was rapidly and repeatedly evaluated at 6 AM UTC when it was scheduled for 2 AM local/ET. For reference, ET was UTC-5 before the change and UTC-4 after. The job ran many thousands of times before it was caught, and I believe several times that many allocations were created that never ended up being placed (I posted about that in #4532).
The flood eventually crippled the cluster: Nomad was tracking so much state that the OOM killer came out in force, and even aside from that, Nomad was mostly unresponsive as far as I can tell. I was able to stop the stage-a-restart-services job, but I also seemed to need to mark the nodes as ineligible to get the allocations per node to drain from the thousands down to the dozen or so that is normal.
Running a job scheduled the same way across the DST change should reproduce this, but I honestly don't have time to bring up another Nomad instance in a VM where I have the privileges to set the date.
job "stage-a-restart-services" {
type = "batch"
periodic {
cron = "0 0 2 * * * *"
time_zone = "Local"
}
datacenters = ["a"]
group "restart-services" {
task "restart-services" {
leader = true
driver = "raw_exec"
env {
TIER = "stage"
SITE = "a"
}
config {
command = "/home/fds/dsotm/FDSdsotm_misc/bin/restart_services"
}
}
task "store-logs" {
driver = "raw_exec"
config {
command = "/home/fds/dsotm/FDSdsotm_misc/bin/store_logs"
args = ["/home/fds/dsotm/log/${NOMAD_JOB_NAME}"]
}
}
}
}
/home/fds/dsotm/FDSdsotm_misc/bin/restart_services is a short Python script that thankfully didn't actually do anything this time around.
We experienced the same issue with a periodic job configured with a 30 * * * * schedule and the time zone "America/New_York" on Nomad 0.8.6.
The first problematic allocation was started at 2019-03-10T06:30:11.736Z. Once that allocation completed, a new allocation for the same job instance was created. This cycle continued until this morning, when we manually stopped the parent job and the child job instance. Ultimately, over 4000 allocations were created; virtually all completed successfully.
Attempting to manually stop just the child job (while leaving the parent periodic job registered) was unsuccessful. Nomad would accept the DELETE request, and the child job would sometimes briefly be marked as dead, but would almost immediately return to a running state with all of the old allocations still in place.
After completely stopping the job (both parent and child) we were able to successfully re-register it. However, it is now not being scheduled for execution at all. Looking at other periodic jobs, the ones that aren't stuck in continuous execution loops appear not to have run at all since the EST/EDT switchover.
Hey there
Since this issue hasn't had any activity in a while, we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep it open and take another look.
Thanks!
This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem :+1:
Thank you so much for reporting this, and sorry for the long delay. We plan to investigate and remedy it soon.
The issue here is that our cron library doesn't handle daylight saving transitions well. There are two complications: the library we use is deprecated and unmaintained[1], and the DST problem is a known, unresolved issue there[2]. We'll investigate our options and address this soon.
In the meantime, we recommend setting time_zone = "UTC" on periodic jobs if possible, either in general or at least around DST transitions (the sketch below shows why this sidesteps the problem).
[1] https://github.com/gorhill/cronexpr
[2] https://github.com/gorhill/cronexpr/pull/17
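To make the failure mode concrete, here is a minimal Go sketch (not from the Nomad codebase; the expression and transition date come from the reports above) that asks gorhill/cronexpr for the fire times immediately before the 2019-03-10 spring-forward instant, first with an America/New_York reference time and then with a UTC one:

package main

import (
	"fmt"
	"time"

	"github.com/gorhill/cronexpr"
)

func main() {
	// The expression from the second report above: minute 30 of every hour.
	expr := cronexpr.MustParse("30 * * * *")

	// One minute before the 2019-03-10 spring-forward instant, when
	// 02:00 EST jumps directly to 03:00 EDT.
	loc, err := time.LoadLocation("America/New_York")
	if err != nil {
		panic(err)
	}
	start := time.Date(2019, time.March, 10, 1, 59, 0, 0, loc)

	// cronexpr evaluates relative to the location of the time it is given,
	// so this exercises the DST-sensitive path; on affected versions the
	// printed times around the transition illustrate the bug...
	fmt.Println("America/New_York:")
	for _, next := range expr.NextN(start, 4) {
		fmt.Println("  ", next)
	}

	// ...while a UTC reference time never crosses a transition at all.
	fmt.Println("UTC:")
	for _, next := range expr.NextN(start.UTC(), 4) {
		fmt.Println("  ", next)
	}
}

This also shows why the UTC recommendation above works: UTC has no transition for the library to mishandle.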
Providing an update here with my notes.
We have two options:
1. Fix the gorhill/cronexpr library to handle DST properly. Sadly, the DST PR https://github.com/gorhill/cronexpr/pull/17 fails some of our tests: in some cases it gets into infinite recursion, causing a stack overflow.
2. Migrate to another maintained library. https://github.com/robfig/cron is a very reasonable candidate: its handling of DST passed our tests (see the sketch at the end of this note), and the library is well maintained and commonly used.
The downside of switching libraries is that cronexpr supports some cron expression extensions not supported by any other library I looked at, so we risk introducing subtle compatibility changes: L (last day of month), W (nearest weekday), and # (further constraints on days). These are trickier to implement while ensuring we adhere to gorhill/cronexpr semantics properly. My current inclination is to check whether robfig/cron would welcome contributions for these extensions; their SpecSchedule struct would need to change significantly. If not, I would suggest fixing cronexpr as-is.
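To make the trade-off concrete, here is a hedged comparison sketch (assuming robfig/cron v3 and current gorhill/cronexpr; the exact error text may differ). It walks a robfig/cron schedule across the same 2019-03-10 transition, then shows that an L day-of-month expression parses under cronexpr but is rejected by robfig/cron's standard parser:

package main

import (
	"fmt"
	"time"

	"github.com/gorhill/cronexpr"
	"github.com/robfig/cron/v3"
)

func main() {
	// robfig/cron can carry the time zone inside the spec itself.
	sched, err := cron.ParseStandard("CRON_TZ=America/New_York 30 * * * *")
	if err != nil {
		panic(err)
	}

	loc, err := time.LoadLocation("America/New_York")
	if err != nil {
		panic(err)
	}

	// Walk the schedule across the 2019-03-10 spring-forward transition;
	// each Next call returns a strictly later activation time.
	t := time.Date(2019, time.March, 10, 1, 59, 0, 0, loc)
	for i := 0; i < 4; i++ {
		t = sched.Next(t)
		fmt.Println(t)
	}

	// The compatibility risk: cronexpr accepts the "L" (last day of month)
	// extension, while robfig/cron's standard parser does not.
	const spec = "0 2 L * *"
	if _, err := cronexpr.Parse(spec); err == nil {
		fmt.Println("cronexpr accepts:", spec)
	}
	if _, err := cron.ParseStandard(spec); err != nil {
		fmt.Println("robfig/cron rejects:", spec, "->", err)
	}
}

If we switched, existing specs that use these extensions would need either a translation shim or upstream support, which is exactly the contribution question raised above.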