Test-infra: Kubernetes CI Policy: merge-blocking jobs must run in dedicated cluster

Created on 31 Jul 2020 · 13 comments · Source: kubernetes/test-infra

Part of https://github.com/kubernetes/test-infra/issues/18551

Why this is necessary:

  • we believe that declaring Guaranteed QoS for a job's pods may not defend them against BestEffort or Burstable pods hogging all resources on the same node
  • a cluster that only has Guaranteed pods is far more likely to respect resource requirements
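For context: Kubernetes assigns a pod the Guaranteed QoS class only when every container in it sets resource requests equal to limits for both CPU and memory, and Guaranteed pods are evicted last under node resource pressure. A minimal, hypothetical pod spec illustrating the shape merge-blocking jobs would need (the name, image, and sizes are placeholders, not values from any real job config):

```yaml
# Hypothetical example: requests == limits for every container
# yields Guaranteed QoS, which the kubelet evicts last when a
# node comes under resource pressure.
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-example   # placeholder name
spec:
  containers:
  - name: test
    image: gcr.io/example/test:latest   # placeholder image
    resources:
      requests:
        cpu: "2"
        memory: 4Gi
      limits:                # identical to requests => Guaranteed
        cpu: "2"
        memory: 4Gi
```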

Decisions to make:

  • For all of the jobs being migrated to k8s-infra-prow-build, we're going to use the same node pool as everything else. Pinning to a dedicated node pool would require much more boilerplate (or possibly augmenting prow's preset feature). If, after migrating everything, we find we _do_ need a dedicated node pool, then we'll pay the added cost.
  • Jobs that can't be migrated quickly will remain in the default google.com-owned k8s-prow-builds cluster until such time as they can be migrated. They will continue to compete for resources until they have migrated, and we will be reliant on test-infra-oncall to provide any visibility into the behavior of such jobs.
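The per-job boilerplate that a dedicated node pool would require might look roughly like the following sketch. The label value and taint key/value here are assumptions for illustration, not anything actually provisioned in k8s-infra-prow-build:

```yaml
# Hypothetical per-job pod spec fragment: schedule onto a labeled,
# tainted node pool that other workloads cannot land on.
spec:
  nodeSelector:
    cloud.google.com/gke-nodepool: pool-merge-blocking   # assumed pool name
  tolerations:
  - key: dedicated            # assumed taint applied to the pool's nodes
    operator: Equal
    value: merge-blocking
    effect: NoSchedule
```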

Jobs to migrate:

TODO:

area/jobs area/release-eng sig/release wg/k8s-infra

All 13 comments

/sig testing
/sig release
/wg k8s-infra
/area release-eng
/area jobs

I'm trialing migration of three different kinds of jobs to get a feel for whether the rest can be migrated over in a similar fashion, or whether more preparation is needed:

Based on how these go, I'll break out the other TODOs into help-wanted issues.

/cc

OK, I've broken out everything I think can be done right now, tagged it as help-wanted, and added it to the CI Policy Improvements - Next Priority column.

The remaining TODOs need some unblocking work.

We may need to roll some jobs back or adjust some limits. We've started bumping into a quota we're having difficulty raising: https://github.com/kubernetes/k8s.io/issues/1132#issuecomment-678389503

IP quota has been bumped to a reasonable level. I'm now less concerned about needing to mitigate or undo the migration of the jobs migrated thus far.

https://github.com/kubernetes/k8s.io/issues/1187 is about improving the cluster's I/O capacity; I'm wary of moving the pull-kubernetes-bazel jobs over until this is addressed.

https://github.com/kubernetes/k8s.io/issues/1231 - we're starting to hit the max node pool size of 90, so I'd like to increase it to 150.

Heya: +1 if we can get this updated with the checkbox marked for #18854

In the write-up, under

_TODO: determine gcp project / boskos requirements, provision_

This item is now closed:
_kubernetes/k8s.io#851 - scalability-project_

Is that TODO now complete?

Also wondering if the other items in the TODO section:

TODO: migrate jobs that require special projects
TODO: declare we are finished (or done enough to move on)

need an update?

@LappleApple https://github.com/kubernetes/test-infra/issues/19073 is the last remaining issue

Closed out the last remaining issue https://github.com/kubernetes/test-infra/issues/19073#issuecomment-756848275

All merge-blocking jobs run in k8s-infra-prow-build. Last thing to do is switch job config tests to fail instead of warn to prevent future violations of this policy.

Opened https://github.com/kubernetes/test-infra/pull/20416 to switch job config tests
