In particular, the soak-gci-gce-1.X job is blocking for stable2, but not for stable1 or stable3. As a result, the soak job will be blocking for only part of each release's lifecycle. We should have a consistent set of blocking jobs across all patch releases of a single branch. I think it would be more correct to define a specific set of blocking jobs for each release, rather than have rolling stable1-3 definitions.
If that's not possible, we should at least make sure the set of jobs is the same for all three (the downside being that it will then be hard to change the set of blocking jobs).
cc: @krzyzacy @BenTheElder @kubernetes/sig-release-bugs
cc @spiffxp
They are mostly the same, except that, for example, 1.12 doesn't have kubeadm upgrade jobs set up.
We can probably bring in a presubmit to enforce this.
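For illustration, here is a minimal sketch of what such a presubmit check could look like. The dashboard and job names below are hypothetical, and a real check would load the job sets from the Prow/testgrid config rather than hard-coding them:

```go
// Sketch of a presubmit-style consistency check: every release-blocking
// dashboard should define the same set of jobs as the reference set
// (after normalizing away version suffixes, which is omitted here).
package main

import (
	"fmt"
	"os"
	"sort"
)

// missingFrom returns the jobs present in want but absent from got.
func missingFrom(want, got map[string]bool) []string {
	var missing []string
	for job := range want {
		if !got[job] {
			missing = append(missing, job)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Hypothetical job sets; in practice these would be parsed from config.
	dashboards := map[string]map[string]bool{
		"sig-release-1.12-blocking": {
			"gce-cos-default": true, "bazel-build": true, "gce-soak": true,
		},
		"sig-release-1.11-blocking": {
			"gce-cos-default": true, "bazel-build": true,
		},
	}
	reference := map[string]bool{
		"gce-cos-default": true, "bazel-build": true, "gce-soak": true,
	}

	failed := false
	for name, jobs := range dashboards {
		if diff := missingFrom(reference, jobs); len(diff) > 0 {
			fmt.Printf("%s is missing blocking jobs: %v\n", name, diff)
			failed = true
		}
		if diff := missingFrom(jobs, reference); len(diff) > 0 {
			fmt.Printf("%s has extra blocking jobs: %v\n", name, diff)
			failed = true
		}
	}
	if failed {
		os.Exit(1) // fail the presubmit
	}
}
```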
I'll remove the soak-stable2 job from the release-blocking dashboard.
/area jobs
We absolutely should have the same jobs. The only time I can see that differing is if we add new blocking jobs in the current release that couldn't work against older releases.
A presubmit sounds good.
I haven't done a thorough audit, but last I checked we were also missing a scalability-related stable3 job.
/assign
I'll create a doc under sig-release to define a list of release-blocking jobs, and we can use that as a source of truth in the future.
/milestone v1.13
FYI @cjwagner @jberkus
The plan is to accomplish this before 2018-10-23, aka the Enhancements Freeze of the v1.13 release cycle:
I've done some comparison of master-blocking and 1.12-blocking; here's where they don't match. Note that I'm using the actual job names below instead of the labels you see in testgrid, because it's hard to figure out which job a label refers to:
Tests that are in 1.12-blocking with no equivalent in master-blocking:
Tests that are in master-blocking with no equivalent in 1.12-blocking:
Also, note that several of the test jobs for 1.12-blocking are named "beta" instead of "1.12", which suggests that those may not be version-specific.
I unified the names on the master-blocking dashboard with 1.12 a bit.
For the jobs with no equivalent: if we take 1.12 as the source of truth, we have both postsubmit and periodic bazel jobs, so we are probably fine keeping only one of them. @neolit123 might want to add a latest-release-on-master kubeadm job for consistency?
Also, do we really want all conformance tests from cloud providers to block the release?
> Also, note that several of the test jobs for 1.12-blocking are named "beta" instead of "1.12", which suggests that those may not be version-specific.
The release channels are defined at https://github.com/kubernetes/test-infra#release-branch-jobs--image-validation-jobs, so for each new release we can rename the testgrid dashboard without remaking all the jobs.
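As a rough illustration of how those rolling channels map to concrete versions (a simplified sketch; the README linked above is the authoritative definition):

```go
// Sketch: derive the rolling stable1-3 channels from the current master
// minor version, e.g. with master at 1.13, stable1=1.12, stable2=1.11,
// stable3=1.10. Jobs keep their channel names; only the dashboard labels
// need updating each release.
package main

import "fmt"

func stableChannels(masterMinor int) map[string]string {
	channels := map[string]string{}
	for i := 1; i <= 3; i++ {
		channels[fmt.Sprintf("stable%d", i)] = fmt.Sprintf("1.%d", masterMinor-i)
	}
	return channels
}

func main() {
	fmt.Println(stableChannels(13)) // map[stable1:1.12 stable2:1.11 stable3:1.10]
}
```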
> @neolit123 might want to add a latest-release-on-master kubeadm job for consistency?
I can add kubeadm-gce-stable-on-master to sig-release-master-blocking to make it consistent with -1.12-blocking.
> Also, do we really want all conformance tests from cloud providers to block the release?
That was raised as a question last week with @spiffxp and @BenTheElder.
/milestone v1.14
The jobs aren't yet identical. I think we can close this out as we split the dashboards into -blocking/-informing, etc. ref: kubernetes/sig-release#347
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle rotten
/milestone v1.15
ref: https://github.com/kubernetes/test-infra/issues/11977
/unassign
/sig release
/milestone v1.16
/cc @jberkus
Apparently "the gce reboot job" is informing in master, and blocking on all other branches:
/lifecycle stale
AFAIK, this issue has not been resolved.
@Katharine @BenTheElder ?
/remove-lifecycle stale
I'm pretty over capacity and don't remember what we wanted here.
It's entirely reasonable, IMO, to have different sets of jobs per release.
The config forker pretty much ensures that at each release we copy the jobs from master to that release branch.
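Roughly, that forking step amounts to copying each master job and substituting the version markers. A simplified sketch (the struct and field names are made up; the real config-forker in this repo operates on full Prow job configs):

```go
// Sketch of the renaming a config forker performs when a release branches.
package main

import (
	"fmt"
	"strings"
)

type Job struct {
	Name   string
	Branch string
}

// forkJob derives a release-branch job from its master counterpart, e.g.
// "ci-kubernetes-e2e-gce-cos-master" -> "ci-kubernetes-e2e-gce-cos-1-16".
func forkJob(master Job, version string) Job {
	return Job{
		Name:   strings.ReplaceAll(master.Name, "master", strings.ReplaceAll(version, ".", "-")),
		Branch: "release-" + version,
	}
}

func main() {
	master := Job{Name: "ci-kubernetes-e2e-gce-cos-master", Branch: "master"}
	fmt.Printf("%+v\n", forkJob(master, "1.16")) // {Name:ci-kubernetes-e2e-gce-cos-1-16 Branch:release-1.16}
}
```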
/assign
I was the last to touch config-forker, so I'll try to close the loop on this.
/milestone v1.18
/area release-eng
@BenTheElder @justaugustus
For the 1.16 release, the config-forker did not copy from master to the release; it copied a different, templated set of tests based on the 1.13 test set. That may have been fixed since; if so, that's one solution to this issue.
The core problem was that the config-forker was copying a set of jobs that wasn't based on any SIG Release-determined set, but was instead historical and impossible to update.
/lifecycle stale
/remove-lifecycle stale
Where are we on this, @Katharine?
This should've been resolved long ago, and the last set of inconsistent jobs expired out.
As for this comment:
> For the 1.16 release, the config-forker did _not_ copy from master to the release; it copied a different, templated set of tests based on the 1.13 test set. That may have been fixed since; if so, that's one solution to this issue.
This is (or should be, to my knowledge) impossible, unless some awful misconfiguration happened. Why do you think that happened?
Because the set of jobs when the 1.16 branch was created was different from the set of jobs in master; that's why. And when I asked about it, that's what the test-infra folks said had happened.
In particular, the slow performance tests had been moved from blocking to informing before the branch, but were back in blocking in the 1.16 set.
If it's copying from master now, then that's all good. I just wanted to check that it was.
I scanned through the release-blocking dashboards:
Once these three are closed, I'm going to call this closed unless there are any objections.
/close
Please re-open if you think there's anything left to do here
@spiffxp: Closing this issue.
In response to this:
> /close
> Please re-open if you think there's anything left to do here
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.