While refactoring plank in https://github.com/kubernetes/test-infra/pull/3887 I noticed that the pendingJobs cache is never updated when jobs transition from pending to success/failure. My theory is that maxConcurrency (introduced in https://github.com/kubernetes/test-infra/pull/3576) stops working after a while, but I haven't tested this yet and I may be wrong. A rough sketch of the suspected failure mode follows below.
/cc @nlandolfi @spxtr
/area prow
/kind bug
/kind theory
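For illustration, a minimal toy sketch of the suspected failure mode, not the actual plank code: `pendingJobs`, `maxConcurrency`, and `tryStart` are made-up names, and the cache is assumed to be a simple per-job counter that only ever increments.

```go
package main

import "fmt"

// Toy model of the suspected bug: the per-job counter is incremented when a
// job starts but never decremented when one finishes, so once it reaches the
// cap nothing of that type can ever start again.
func main() {
	const maxConcurrency = 4
	pendingJobs := map[string]int{}

	tryStart := func(job string) bool {
		if pendingJobs[job] >= maxConcurrency {
			fmt.Printf("Not starting another instance of %s, already %d running.\n", job, pendingJobs[job])
			return false
		}
		pendingJobs[job]++
		return true
	}

	job := "ami_build_origin_int_rhel_build"
	for i := 0; i < maxConcurrency; i++ {
		tryStart(job) // fills the counter up to the cap
	}
	// All of these instances may have finished by now, but nothing resets the
	// counter, so every later attempt is refused forever:
	tryStart(job)
}
```

If this is what happens, the counter can only grow, which would match a job type never being scheduled again once enough instances have run.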
Right
If I recall correctly, I figured that they would just be started in the next synchronization.
Perhaps we should delete them from the cache? I didn't like this originally because it seemed like I would need to _order_ the jobs or do two passes; otherwise it would still be possible to see a job that looked maxed out, skip it, and only later in the same sync see a job of the same type that had gone from pending -> success/failure.
Thoughts?
Why not handle pending jobs first and then start new jobs? It seems that you don't even need the cache.
yeah you're right
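A rough sketch of that two-pass idea; the types and names here are illustrative, not the actual plank implementation:

```go
package main

import "fmt"

// Minimal stand-ins for the real ProwJob types; names and fields are made up.
type jobState string

const (
	triggered jobState = "triggered"
	pending   jobState = "pending"
	success   jobState = "success"
)

type prowJob struct {
	name           string
	maxConcurrency int
	state          jobState
}

// syncOnce sketches the two-pass idea: count jobs that are still pending
// first, then start triggered jobs only while the per-job count stays below
// the cap. The count is rebuilt from the job list on every sync, so there is
// no long-lived cache that can go stale.
func syncOnce(jobs []*prowJob) {
	pendingCount := map[string]int{}

	// Pass 1: pending jobs. Jobs that already finished simply aren't counted.
	for _, j := range jobs {
		if j.state == pending {
			pendingCount[j.name]++
		}
	}

	// Pass 2: triggered jobs.
	for _, j := range jobs {
		if j.state != triggered {
			continue
		}
		if j.maxConcurrency > 0 && pendingCount[j.name] >= j.maxConcurrency {
			fmt.Printf("Not starting another instance of %s, already %d running.\n", j.name, pendingCount[j.name])
			continue
		}
		j.state = pending // "start" the job
		pendingCount[j.name]++
	}
}

func main() {
	jobs := []*prowJob{
		{name: "ami_build_origin_int_rhel_build", maxConcurrency: 2, state: pending},
		{name: "ami_build_origin_int_rhel_build", maxConcurrency: 2, state: success}, // finished: no longer counted
		{name: "ami_build_origin_int_rhel_build", maxConcurrency: 2, state: triggered},
	}
	syncOnce(jobs)
	fmt.Println(jobs[2].state) // "pending": the finished job no longer blocks it
}
```

Because the per-job count is rebuilt from the job list on every sync, finished jobs naturally stop counting against maxConcurrency and no separate cache has to be kept in sync.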
I got to reproduce this today and it verified my theory: this is not working.
{"job":"ami_build_origin_int_rhel_build","level":"info","msg":"Not starting another instance of ami_build_origin_int_rhel_build, already 4187 running.","time":"2017-09-01T11:10:55Z"}
{"job":"ami_build_origin_int_rhel_build","level":"info","msg":"Not starting another instance of ami_build_origin_int_rhel_build, already 4187 running.","time":"2017-09-01T11:10:55Z"}
{"level":"info","msg":"Sync time: 1.444286043s","time":"2017-09-01T11:10:56Z"}
{"job":"ami_build_origin_int_rhel_build","level":"info","msg":"Not starting another instance of ami_build_origin_int_rhel_build, already 4187 running.","time":"2017-09-01T11:11:25Z"}
{"job":"ami_build_origin_int_rhel_build","level":"info","msg":"Not starting another instance of ami_build_origin_int_rhel_build, already 4187 running.","time":"2017-09-01T11:11:25Z"}
{"job":"ami_build_origin_int_rhel_build","level":"info","msg":"Not starting another instance of ami_build_origin_int_rhel_build, already 4187 running.","time":"2017-09-01T11:11:25Z"}
{"level":"info","msg":"Sync time: 1.555490479s","time":"2017-09-01T11:11:26Z"}
{"job":"ami_build_origin_int_rhel_build","level":"info","msg":"Not starting another instance of ami_build_origin_int_rhel_build, already 4187 running.","time":"2017-09-01T11:11:54Z"}
{"job":"ami_build_origin_int_rhel_build","level":"info","msg":"Not starting another instance of ami_build_origin_int_rhel_build, already 4187 running.","time":"2017-09-01T11:11:55Z"}
{"job":"ami_build_origin_int_rhel_build","level":"info","msg":"Not starting another instance of ami_build_origin_int_rhel_build, already 4187 running.","time":"2017-09-01T11:11:55Z"}
{"level":"info","msg":"Sync time: 1.572635702s","time":"2017-09-01T11:11:56Z"}
{"job":"ami_build_origin_int_rhel_build","level":"info","msg":"Not starting another instance of ami_build_origin_int_rhel_build, already 4187 running.","time":"2017-09-01T11:12:25Z"}
{"job":"ami_build_origin_int_rhel_build","level":"info","msg":"Not starting another instance of ami_build_origin_int_rhel_build, already 4187 running.","time":"2017-09-01T11:12:25Z"}
{"job":"ami_build_origin_int_rhel_build","level":"info","msg":"Not starting another instance of ami_build_origin_int_rhel_build, already 4187 running.","time":"2017-09-01T11:12:25Z"}
{"level":"info","msg":"Sync time: 1.420743977s","time":"2017-09-01T11:12:26Z"}
{"job":"ami_build_origin_int_rhel_build","level":"info","msg":"Not starting another instance of ami_build_origin_int_rhel_build, already 4187 running.","time":"2017-09-01T11:12:55Z"}
{"job":"ami_build_origin_int_rhel_build","level":"info","msg":"Not starting another instance of ami_build_origin_int_rhel_build, already 4187 running.","time":"2017-09-01T11:12:56Z"}
{"job":"ami_build_origin_int_rhel_build","level":"info","msg":"Not starting another instance of ami_build_origin_int_rhel_build, already 4187 running.","time":"2017-09-01T11:12:57Z"}
I am going to work on a fix later today
/assign
Noting that this is definitely broken and a fix is necessary for migrating more jobs to prow without exceeding project quotas. Thanks for working on this @kargakis :smiley:
I could use some eyeballs on https://github.com/kubernetes/test-infra/pull/4328. I was testing it yesterday and it works fine; I ran an image with the changes the whole day without any issue. Once that PR merges, I will open one with the same changes for plank.
I am looking at #4328 currently.