While refactoring plank in https://github.com/kubernetes/test-infra/pull/3887 I noticed that the pendingJobs cache is never updated when jobs transition from pending to success/failure. My theory is that maxConcurrency (introduced in https://github.com/kubernetes/test-infra/pull/3576) stops working after a while, but I haven't tested this yet and I may be wrong. A rough sketch of the suspected failure mode follows below.
/cc @nlandolfi @spxtr
/area prow
/kind bug
/kind theory
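For illustration, a minimal toy sketch of the suspected failure mode, not the actual plank code: `pendingJobs`, `maxConcurrency`, and `tryStart` are made-up names, and the cache is assumed to be a simple per-job counter that only ever increments.

```go
package main

import "fmt"

// Toy model of the suspected bug: the per-job counter is incremented when a
// job starts but never decremented when one finishes, so once it reaches the
// cap nothing of that type can ever start again.
func main() {
	const maxConcurrency = 4
	pendingJobs := map[string]int{}

	tryStart := func(job string) bool {
		if pendingJobs[job] >= maxConcurrency {
			fmt.Printf("Not starting another instance of %s, already %d running.\n", job, pendingJobs[job])
			return false
		}
		pendingJobs[job]++
		return true
	}

	job := "ami_build_origin_int_rhel_build"
	for i := 0; i < maxConcurrency; i++ {
		tryStart(job) // fills the counter up to the cap
	}
	// All of these instances may have finished by now, but nothing resets the
	// counter, so every later attempt is refused forever:
	tryStart(job)
}
```

If this is what happens, the counter can only grow, which would match a job type never being scheduled again once enough instances have run.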
Right
If I recall correctly, I figured that they would just be started in the next synchronization.
Perhaps we should delete them from the cache? I didn't like this originally because it seemed like I would need to _order_ the jobs or do two passes; otherwise it would still be possible to see a job that looked maxed out, skip it, and only later in the same sync see a job of the same type that had gone from pending -> success/failure.
Thoughts?
Why not handle pending jobs first and then start new jobs? It seems that you don't even need the cache.
yeah you're right
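A rough sketch of that two-pass idea; the types and names here are illustrative, not the actual plank implementation:

```go
package main

import "fmt"

// Minimal stand-ins for the real ProwJob types; names and fields are made up.
type jobState string

const (
	triggered jobState = "triggered"
	pending   jobState = "pending"
	success   jobState = "success"
)

type prowJob struct {
	name           string
	maxConcurrency int
	state          jobState
}

// syncOnce sketches the two-pass idea: count jobs that are still pending
// first, then start triggered jobs only while the per-job count stays below
// the cap. The count is rebuilt from the job list on every sync, so there is
// no long-lived cache that can go stale.
func syncOnce(jobs []*prowJob) {
	pendingCount := map[string]int{}

	// Pass 1: pending jobs. Jobs that already finished simply aren't counted.
	for _, j := range jobs {
		if j.state == pending {
			pendingCount[j.name]++
		}
	}

	// Pass 2: triggered jobs.
	for _, j := range jobs {
		if j.state != triggered {
			continue
		}
		if j.maxConcurrency > 0 && pendingCount[j.name] >= j.maxConcurrency {
			fmt.Printf("Not starting another instance of %s, already %d running.\n", j.name, pendingCount[j.name])
			continue
		}
		j.state = pending // "start" the job
		pendingCount[j.name]++
	}
}

func main() {
	jobs := []*prowJob{
		{name: "ami_build_origin_int_rhel_build", maxConcurrency: 2, state: pending},
		{name: "ami_build_origin_int_rhel_build", maxConcurrency: 2, state: success}, // finished: no longer counted
		{name: "ami_build_origin_int_rhel_build", maxConcurrency: 2, state: triggered},
	}
	syncOnce(jobs)
	fmt.Println(jobs[2].state) // "pending": the finished job no longer blocks it
}
```

Because the per-job count is rebuilt from the job list on every sync, finished jobs naturally stop counting against maxConcurrency and no separate cache has to be kept in sync.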
I got to reproduce this today and it verified my theory: this is not working.
{"job":"ami_build_origin_int_rhel_build","level":"info","msg":"Not starting another instance of ami_build_origin_int_rhel_build, already 4187 running.","time":"2017-09-01T11:10:55Z"}
{"job":"ami_build_origin_int_rhel_build","level":"info","msg":"Not starting another instance of ami_build_origin_int_rhel_build, already 4187 running.","time":"2017-09-01T11:10:55Z"}
{"level":"info","msg":"Sync time: 1.444286043s","time":"2017-09-01T11:10:56Z"}
{"job":"ami_build_origin_int_rhel_build","level":"info","msg":"Not starting another instance of ami_build_origin_int_rhel_build, already 4187 running.","time":"2017-09-01T11:11:25Z"}
{"job":"ami_build_origin_int_rhel_build","level":"info","msg":"Not starting another instance of ami_build_origin_int_rhel_build, already 4187 running.","time":"2017-09-01T11:11:25Z"}
{"job":"ami_build_origin_int_rhel_build","level":"info","msg":"Not starting another instance of ami_build_origin_int_rhel_build, already 4187 running.","time":"2017-09-01T11:11:25Z"}
{"level":"info","msg":"Sync time: 1.555490479s","time":"2017-09-01T11:11:26Z"}
{"job":"ami_build_origin_int_rhel_build","level":"info","msg":"Not starting another instance of ami_build_origin_int_rhel_build, already 4187 running.","time":"2017-09-01T11:11:54Z"}
{"job":"ami_build_origin_int_rhel_build","level":"info","msg":"Not starting another instance of ami_build_origin_int_rhel_build, already 4187 running.","time":"2017-09-01T11:11:55Z"}
{"job":"ami_build_origin_int_rhel_build","level":"info","msg":"Not starting another instance of ami_build_origin_int_rhel_build, already 4187 running.","time":"2017-09-01T11:11:55Z"}
{"level":"info","msg":"Sync time: 1.572635702s","time":"2017-09-01T11:11:56Z"}
{"job":"ami_build_origin_int_rhel_build","level":"info","msg":"Not starting another instance of ami_build_origin_int_rhel_build, already 4187 running.","time":"2017-09-01T11:12:25Z"}
{"job":"ami_build_origin_int_rhel_build","level":"info","msg":"Not starting another instance of ami_build_origin_int_rhel_build, already 4187 running.","time":"2017-09-01T11:12:25Z"}
{"job":"ami_build_origin_int_rhel_build","level":"info","msg":"Not starting another instance of ami_build_origin_int_rhel_build, already 4187 running.","time":"2017-09-01T11:12:25Z"}
{"level":"info","msg":"Sync time: 1.420743977s","time":"2017-09-01T11:12:26Z"}
{"job":"ami_build_origin_int_rhel_build","level":"info","msg":"Not starting another instance of ami_build_origin_int_rhel_build, already 4187 running.","time":"2017-09-01T11:12:55Z"}
{"job":"ami_build_origin_int_rhel_build","level":"info","msg":"Not starting another instance of ami_build_origin_int_rhel_build, already 4187 running.","time":"2017-09-01T11:12:56Z"}
{"job":"ami_build_origin_int_rhel_build","level":"info","msg":"Not starting another instance of ami_build_origin_int_rhel_build, already 4187 running.","time":"2017-09-01T11:12:57Z"}
I am going to work on a fix later today
/assign
Noting that this is definitely broken and a fix is necessary for migrating more jobs to prow without exceeding project quotas. Thanks for working on this @kargakis :smiley:
I could use some eyeballs on https://github.com/kubernetes/test-infra/pull/4328. I was testing it yesterday and it works fine; I ran an image with the changes the whole day without any issue. Once that PR merges, I will open one with the same changes for plank.
I am looking at #4328 currently.