Test-infra: Re-evaluate set of merge-blocking jobs for kubernetes/kubernetes

Created on 7 Aug 2020 · 16 Comments · Source: kubernetes/test-infra

Pulling this out of https://github.com/kubernetes/kubernetes/issues/92937#issuecomment-662178519

Related to but not explicitly part of https://github.com/kubernetes/test-infra/issues/18551

One thing that's come up in discussion over why kubernetes PRs are so hard to merge (https://github.com/kubernetes/kubernetes/issues/92937) is whether we really need so many jobs to run for every single PR that is opened against kubernetes/kubernetes. There is a desire to see if we can trim the number of jobs down without sacrificing (too much) coverage.

Reasons for this are:

  • if we assume each job has a non-zero chance of flaking, fewer jobs means fewer chances for a PR to encounter a flake
  • if we assume jobs are flaking due to resource contention, fewer jobs running means more resources available for jobs to consume
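
The first reason can be made concrete with a little arithmetic. This is a minimal sketch, assuming every job flakes independently and at the same rate; the 2% rate and the job counts below are hypothetical illustration values, not measurements from this issue.

```python
def chance_of_any_flake(per_job_flake_rate: float, job_count: int) -> float:
    """Probability that at least one of `job_count` jobs flakes.

    Assumes jobs flake independently with the same rate, which is a
    simplification (resource contention would make flakes correlated).
    """
    return 1.0 - (1.0 - per_job_flake_rate) ** job_count


if __name__ == "__main__":
    # Hypothetical 2% per-job flake rate, not a measured value.
    for jobs in (18, 14, 12):
        print(f"{jobs} jobs: {chance_of_any_flake(0.02, jobs):.3f}")
```

Under these assumptions the chance of hitting at least one flake shrinks with every job removed, which is the intuition behind trimming the set.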

/priority important-soon
/area jobs
/sig testing
/sig release

FYI @BenTheElder @liggitt @kubernetes/ci-signal

area/jobs priority/important-longterm sig/release sig/testing

All 16 comments

Comment from @liggitt

I'd divide those into categories for improvement like this:

immediate: should probably be post-submits or periodics set to notify relevant owners:

  • pull-kubernetes-dependencies-canary
  • pull-kubernetes-e2e-gce-device-plugin-gpu (could keep a presubmit with always_run: false for manual triggering/testing if we want)
  • pull-kubernetes-files-remake

significantly overlapping tests ... can we move one to post-submit + notification:

  • pull-kubernetes-e2e-gce (cos+docker)
  • pull-kubernetes-e2e-gce-ubuntu-containerd (ubuntu+containerd ... already have containerd coverage via kind presubmits)

pair of performance tests (can we have just one performance presubmit?):

  • pull-kubernetes-kubemark-e2e-gce-big
  • pull-kubernetes-e2e-gce-100-performance

several tests that seem to just check ability to build... can we collapse these somehow:

  • pull-kubernetes-bazel-build
  • pull-kubernetes-typecheck
  • pull-kubernetes-cross

https://github.com/kubernetes/test-infra/pull/18728 - to address pull-kubernetes-e2e-gce-device-plugin-gpu

https://github.com/kubernetes/test-infra/pull/18649 - proposes demoting pull-kubernetes-e2e-gce in favor of pull-kubernetes-e2e-gce-ubuntu-containerd

https://github.com/kubernetes/test-infra/pull/18612 - moved pull-kubernetes-cross to optional and manually triggered

@mm4tt - re: scalability. We briefly talked about that offline too, and looking at past statistics, there aren't many things that were discovered by kubemark-500, especially recently. So it might indeed make sense to keep it just as a periodic (for faster detection compared to the 5k-node jobs).

Yep, I agree
@spiffxp, do you need help with removing the kubemark-500 presubmit? Let me know

Opened https://github.com/kubernetes/test-infra/pull/18788, leaves the presubmit around but manually triggered / optional. If you'd rather remove entirely or do something else let me know

I think changing it to optional and manually triggered is much better - thanks!

Still need to dedupe pull-kubernetes-e2e-gce / pull-kubernetes-e2e-gce-ubuntu-containerd, I dropped the ball on this one.

Brief update, here's one snapshot as I try to find the right way to slice this:

From 2020-07-01 to today, we've gone from 14 to 12 merge-blocking jobs running for every PR against the main branch of kubernetes/kubernetes:

  • pull-kubernetes-bazel-build
  • pull-kubernetes-bazel-test
  • pull-kubernetes-conformance-kind-ga-only-parallel
  • pull-kubernetes-dependencies
  • ~pull-kubernetes-e2e-gce~ (dropped by https://github.com/kubernetes/test-infra/pull/18832)
  • pull-kubernetes-e2e-gce-100-performance
  • pull-kubernetes-e2e-gce-ubuntu-containerd
  • pull-kubernetes-e2e-kind
  • pull-kubernetes-e2e-kind-ipv6 (added by https://github.com/kubernetes/test-infra/pull/18718)
  • ~pull-kubernetes-files-remake~ (dropped by https://github.com/kubernetes/test-infra/pull/18524)
  • pull-kubernetes-integration
  • ~pull-kubernetes-kubemark-e2e-gce-big~ (dropped by https://github.com/kubernetes/test-infra/pull/18788)
  • pull-kubernetes-node-e2e
  • pull-kubernetes-typecheck
  • pull-kubernetes-verify
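
The "14 to 12" figure can be checked against the list above. This is a small sketch, assuming the convention used in the list: a `~struck-through~` entry was dropped and an entry marked "added" did not exist at the start of the period. The PR links are shortened to "(dropped)" and "(added)" for brevity, and `count_jobs` is a hypothetical helper name.

```python
SNAPSHOT = """\
pull-kubernetes-bazel-build
pull-kubernetes-bazel-test
pull-kubernetes-conformance-kind-ga-only-parallel
pull-kubernetes-dependencies
~pull-kubernetes-e2e-gce~ (dropped)
pull-kubernetes-e2e-gce-100-performance
pull-kubernetes-e2e-gce-ubuntu-containerd
pull-kubernetes-e2e-kind
pull-kubernetes-e2e-kind-ipv6 (added)
~pull-kubernetes-files-remake~ (dropped)
pull-kubernetes-integration
~pull-kubernetes-kubemark-e2e-gce-big~ (dropped)
pull-kubernetes-node-e2e
pull-kubernetes-typecheck
pull-kubernetes-verify
"""


def count_jobs(snapshot: str) -> tuple[int, int]:
    """Return (jobs at start of period, jobs now) for an annotated list."""
    before = after = 0
    for raw in snapshot.splitlines():
        line = raw.strip()
        if not line:
            continue
        dropped = line.startswith("~")   # struck through: no longer blocking
        added = "(added" in line         # new since the start of the period
        if not added:
            before += 1
        if not dropped:
            after += 1
    return before, after


if __name__ == "__main__":
    print(count_jobs(SNAPSHOT))  # prints (14, 12)
```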

From 2020-07-01 to today, we've gone from 18 to 12 always-run jobs running for every PR against the main branch of kubernetes/kubernetes:

  • pull-kubernetes-bazel-build
  • pull-kubernetes-bazel-test
  • pull-kubernetes-conformance-kind-ga-only-parallel
  • pull-kubernetes-dependencies
  • ~pull-kubernetes-dependencies-canary~ (dropped by https://github.com/kubernetes/test-infra/pull/18421)
  • ~pull-kubernetes-e2e-gce~ (dropped by https://github.com/kubernetes/test-infra/pull/18832)
  • pull-kubernetes-e2e-gce-100-performance
  • ~pull-kubernetes-e2e-gce-device-plugin-gpu~ (dropped by https://github.com/kubernetes/test-infra/pull/18728)
  • pull-kubernetes-e2e-gce-ubuntu-containerd
  • pull-kubernetes-e2e-kind
  • pull-kubernetes-e2e-kind-ipv6 (added by https://github.com/kubernetes/test-infra/pull/18718)
  • ~pull-kubernetes-files-remake~ (dropped by https://github.com/kubernetes/test-infra/pull/18524)
  • pull-kubernetes-integration
  • ~pull-kubernetes-kubemark-e2e-gce-big~ (dropped by https://github.com/kubernetes/test-infra/pull/18788)
  • pull-kubernetes-node-e2e
  • ~pull-kubernetes-node-e2e-containerd~ (dropped by https://github.com/kubernetes/test-infra/pull/18356)
  • pull-kubernetes-typecheck
  • pull-kubernetes-verify
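
One way to read the difference between "always-run" and "merge-blocking" is through the `always_run` and `optional` settings mentioned earlier in this thread. The sketch below is an assumption-laden illustration, not how Prow itself computes this: it treats a job as merge-blocking when it always runs and is not optional, and it ignores other trigger mechanisms. The `Presubmit` class and the job names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Presubmit:
    """Hypothetical, simplified view of a presubmit job's settings."""
    name: str
    always_run: bool
    optional: bool


def always_run_jobs(jobs: list[Presubmit]) -> list[str]:
    """Jobs that run on every PR, whether or not they block merge."""
    return [j.name for j in jobs if j.always_run]


def merge_blocking_jobs(jobs: list[Presubmit]) -> list[str]:
    """Assumed definition: runs on every PR and is not optional."""
    return [j.name for j in jobs if j.always_run and not j.optional]


if __name__ == "__main__":
    # Hypothetical jobs, not the real kubernetes/kubernetes configuration.
    jobs = [
        Presubmit("example-required", always_run=True, optional=False),
        Presubmit("example-informational", always_run=True, optional=True),
        Presubmit("example-manual", always_run=False, optional=True),
    ]
    print(always_run_jobs(jobs))
    print(merge_blocking_jobs(jobs))
```

Under this reading, moving a job to "optional and manually triggered" removes it from both lists, while making it optional only removes it from the merge-blocking list.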

pull-kubernetes-dependencies could certainly be faster but is relatively cheap overall and has a pretty excellent pass rate.
pull-kubernetes-typecheck is reasonably expensive, similar to compiling though cheaper than actually cross compiling, but quick enough and very reliable.

The rest of these involve non-trivial building, at least at the beginning of the job, and tend to have higher flake risk (actual flake rates vary quite a bit over time...).

None of them is a terribly obvious candidate to remove at the moment, IMHO.

pull-kubernetes-verify is probably a candidate to parallelize better.
EDIT: even after having had pull-kubernetes-typecheck and pull-kubernetes-dependencies split out of it, it's still rather slow.

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

/remove-lifecycle rotten

related: https://github.com/kubernetes/test-infra/issues/6380 - document what the criteria are for merge-blocking
