Pulling this out of https://github.com/kubernetes/kubernetes/issues/92937#issuecomment-662178519
Related to but not explicitly part of https://github.com/kubernetes/test-infra/issues/18551
One thing that's come up in the discussion of why kubernetes PRs are so hard to merge (https://github.com/kubernetes/kubernetes/issues/92937) is whether we really need so many jobs to run for every single PR opened against kubernetes/kubernetes. There is a desire to see if we can trim the number of jobs down without sacrificing (too much) coverage.
Reasons for this are:
/priority important-soon
/area jobs
/sig testing
/sig release
FYI @BenTheElder @liggitt @kubernetes/ci-signal
Comment from @liggitt:
I'd divide those into categories for improvement like this:
immediate: should probably be post-submits or periodics set to notify relevant owners (see the periodic sketch after this list):
- pull-kubernetes-dependencies-canary
- pull-kubernetes-e2e-gce-device-plugin-gpu (could keep a presubmit with always_run: false for manual triggering/testing if we want)
- pull-kubernetes-files-remake
significantly overlapping tests ... can we move one to post-submit + notification:
- pull-kubernetes-e2e-gce (cos+docker)
- pull-kubernetes-e2e-gce-ubuntu-containerd (ubuntu+containerd ... already have containerd coverage via kind presubmits)
pair of performance tests (can we have just one performance presubmit?):
- pull-kubernetes-kubemark-e2e-gce-big
- pull-kubernetes-e2e-gce-100-performance
several tests that seem to just check ability to build... can we collapse these somehow:
- pull-kubernetes-bazel-build
- pull-kubernetes-typecheck
- pull-kubernetes-cross
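For the "periodics set to notify relevant owners" option, a periodic with TestGrid alerting is roughly what this would look like. This is a sketch, not a real job definition: the job name, dashboard, alert email, image tag, and command are all placeholders.

```yaml
periodics:
- name: ci-kubernetes-dependencies-canary  # hypothetical name for the demoted job
  interval: 6h
  decorate: true
  extra_refs:
  - org: kubernetes
    repo: kubernetes
    base_ref: master
  annotations:
    testgrid-dashboards: sig-testing-canaries         # placeholder dashboard
    testgrid-alert-email: relevant-owners@example.com # failure notifications go here
    testgrid-num-failures-to-alert: "1"
  spec:
    containers:
    - image: gcr.io/k8s-testimages/kubekins-e2e:latest-master  # illustrative tag
      command:
      - runner.sh
      args:
      - make
      - verify  # illustrative; the real job would run its specific check
```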
https://github.com/kubernetes/test-infra/pull/18728 - to address pull-kubernetes-e2e-gce-device-plugin-gpu
https://github.com/kubernetes/test-infra/pull/18649 - proposes demoting pull-kubernetes-e2e-gce in favor of pull-kubernetes-e2e-gce-ubuntu-containerd
https://github.com/kubernetes/test-infra/pull/18612 - moved pull-kubernetes-cross to optional and manually triggered
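For anyone following along, demoting a presubmit to optional and manually triggered comes down to two fields in its job definition. A rough sketch (the image and command here are illustrative, not the actual pull-kubernetes-cross config):

```yaml
presubmits:
  kubernetes/kubernetes:
  - name: pull-kubernetes-cross
    always_run: false  # no longer runs on every PR; trigger on demand with `/test pull-kubernetes-cross`
    optional: true     # when it does run, the result is not merge-blocking
    decorate: true
    spec:
      containers:
      - image: gcr.io/k8s-testimages/kubekins-e2e:latest-master  # illustrative tag
        command:
        - runner.sh
        args:
        - make
        - cross
```

This keeps the job available on demand without adding its cost to every PR.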
@mm4tt - re: the scalability-related jobs. We briefly talked about that offline too, and looking at past statistics, not much has been discovered by kubemark-500, especially recently. So it might indeed make sense to keep it just as a periodic (for faster detection compared to the 5k-node jobs).
Yep, I agree
@spiffxp, do you need help with removing the kubemark-500 presubmit? Let me know
Opened https://github.com/kubernetes/test-infra/pull/18788, leaves the presubmit around but manually triggered / optional. If you'd rather remove entirely or do something else let me know
I think changing it to optional and manually triggered is much better - thanks!
Still need to dedupe pull-kubernetes-e2e-gce / pull-kubernetes-e2e-gce-ubuntu-containerd, I dropped the ball on this one.
Brief update, here's one snapshot as I try to find the right way to slice this:
From 2020-07-01 to today, we've gone from 14 to 12 merge-blocking jobs, and from 18 to 12 always-run jobs, running for every PR against the main branch of kubernetes/kubernetes.
pull-kubernetes-dependencies could certainly be faster but is relatively cheap overall and has a pretty excellent pass rate.
pull-kubernetes-typecheck is reasonably expensive, similar to compiling though cheaper than actually cross-compiling, but quick enough and very reliable.
The rest of these involve non-trivial building, at least for the beginning of the job, and tend to have a higher flake risk (actual flake rates vary quite a bit over time...).
None of them are terribly obvious candidates to remove at the moment, IMHO.
pull-kubernetes-verify is probably a candidate to parallelize better.
EDIT: even after having pull-kubernetes-typecheck and pull-kubernetes-dependencies split out of it, it's still rather slow.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle rotten
related: https://github.com/kubernetes/test-infra/issues/6380 - document what the criteria are for merge-blocking