Sig-release: Decide on procedure for jobs entering (and leaving) Blocking

Created on 27 Aug 2019 · 26 Comments · Source: kubernetes/sig-release

Follow-up to #752

Currently, we have no established procedure for how a job gets added to Blocking (or Informing), and for that matter what happens when it's removed. Questions we need to answer are:

  1. How does a SIG propose a job for the sig-release boards?
  2. Does it start on master-informing or somewhere else?
  3. Where do we document why a job was included?
  4. Who approves a job being added?
  5. When we remove a job from blocking or informing, who gets notified, and where is the removal documented?

attn:
@alejandrox1 @spiffxp @justaugustus

/kind cleanup
/kind documentation
/milestone v1.16

area/release-team kind/cleanup kind/documentation lifecycle/rotten needs-priority sig/release

All 26 comments

/cc @Verolop @alenkacz
/sig release
/area release-team

I want to take the chance to describe the unicorn that would be good to see.

There is currently a lot of toil throughout the org: there are tons of jobs and tests in testgrid, not all of them maintained, and there are no standard channels of communication between SIGs and the release team.
This causes the CI signal team along with some very active contributors to be the ones creating issues and alerting other SIGs that something of theirs is broken and needs to be fixed.

I think an ideal setup would be to propose that all SIGs have their own "CI signal team".
In this case, each SIG would have some group of people that will monitor their own testgrid dashboards (e.g., the sig-api-machinery and sig-node dashboards in https://testgrid.k8s.io/ ).
Many SIGs already have dedicated sessions for triaging PRs and issues assigned to them, so this would be right up their alley (plus it would be a great way for new contributors to jump in!).

A couple of things to keep in mind: none of the tests in any of the SIG-release dashboards are unique. To give a few examples, all tests tagged [sig-cli] run in a SIG-cli owned job, and all kind-based tests run in either SIG-cluster-lifecycle or SIG-testing jobs.
There is already a duplication of tests which one could assume is aimed to provide a clear signal for the owning SIG and for the release team (in the case a test/job is deemed useful enough to measure the quality of a release).

In this manner, if each SIG had a group of people owning their testgrid dashboards, CI jobs, and tests, and used the same criteria that the release team uses to hold dashboards, jobs, and tests up to standard (i.e., https://github.com/kubernetes/sig-release/blob/master/release-blocking-jobs.md ), collaboration between SIGs and the release team would go a lot smoother.

To put all this into context, let me answer the proposed questions in the context of my plan:

  1. How does a SIG propose a job for the sig-release boards?

If all SIGs follow the same procedures and standards that the release team uses to gauge the usefulness and quality of a testgrid dashboard, job, or test (i.e., https://github.com/kubernetes/sig-release/blob/master/release-blocking-jobs.md ), then each SIG could easily propose their "blocking" jobs to be part of SIG-release.
SIGs would have more ownership in the process and we could distribute the work.

  2. Does it start on master-informing or somewhere else?

All jobs should start on the owning SIG's dashboards.
Ultimately, there is no one better equipped to handle an issue than the owning SIG.
This would help create a tighter feedback loop within the SIG and would reduce some of the existing toil within the release team.

  3. Where do we document why a job was included?

Documentation is great, but it should be short and simple; otherwise it runs the danger of falling behind.
Prow job annotations already have a field for adding a description.
Additionally, it may be good to include some info in READMEs alongside the job configurations (e.g., the https://github.com/kubernetes/test-infra/tree/master/config/jobs/kubernetes/sig-release configs).
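To make the annotation idea a bit more concrete, here is a minimal sketch of a check that flags jobs on a release dashboard that carry no description. The job names are hypothetical, the dictionaries only imitate the shape of a parsed Prow job config, and the annotation keys `testgrid-dashboards` and `description` are assumed to match what the test-infra configs use.

```python
# Hypothetical job entries, shaped like the parsed YAML of a Prow job config.
JOBS = [
    {
        "name": "ci-example-e2e",  # hypothetical job name
        "annotations": {
            "testgrid-dashboards": "sig-foo-blocking, sig-release-master-informing",
            "description": "Runs the example e2e suite; owned by SIG Foo.",
        },
    },
    {
        "name": "ci-example-undocumented",  # hypothetical job name
        "annotations": {"testgrid-dashboards": "sig-release-master-informing"},
    },
]


def undocumented_release_jobs(jobs, dashboard_prefix="sig-release-"):
    """Return names of jobs on a release dashboard that lack a description."""
    missing = []
    for job in jobs:
        annotations = job.get("annotations", {})
        # The dashboards annotation is assumed to be a comma-separated string.
        dashboards = [
            d.strip()
            for d in annotations.get("testgrid-dashboards", "").split(",")
        ]
        on_release_board = any(d.startswith(dashboard_prefix) for d in dashboards)
        if on_release_board and not annotations.get("description", "").strip():
            missing.append(job["name"])
    return missing


print(undocumented_release_jobs(JOBS))  # ['ci-example-undocumented']
```

A check along these lines could run against the job configs so that the "why was this job included" text lives next to the job itself.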

  4. Who approves a job being added?

Jobs would be approved by the owning SIG whenever a new one is proposed from SIG-foo/dashboardX to SIG-foo/blocking.
Once a job is in SIG-foo/blocking then it can be proposed to be release blocking and added to SIG-release/{blocking,informing}.
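The promotion path described here can be sketched as a small table of allowed moves. This is only an illustration under stated assumptions: "sig-foo" and "dashboardX" are the placeholders used above, and the table and function names are made up for the example.

```python
# Allowed promotions in this proposal; every name here is a placeholder.
ALLOWED_PROMOTIONS = {
    # A job first proves itself on the owning SIG's own dashboards.
    "sig-foo/dashboardX": {"sig-foo/blocking"},
    # Only a job in the SIG's blocking dashboard can be proposed to sig-release.
    "sig-foo/blocking": {"sig-release/informing", "sig-release/blocking"},
}


def can_promote(current, target):
    """Return True if a job on `current` may be proposed for `target`."""
    return target in ALLOWED_PROMOTIONS.get(current, set())


print(can_promote("sig-foo/dashboardX", "sig-foo/blocking"))      # True
print(can_promote("sig-foo/dashboardX", "sig-release/blocking"))  # False
```

The point of the table is that a job cannot skip the owning SIG's blocking dashboard on its way to a sig-release board.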

  5. When we remove a job from blocking or informing, who gets notified, and where is the removal documented?

The owner of the testgrid dashboard should always be notified.
It should never be the case that a job just disappears without the owning SIG's consent.
And, in this case, since all jobs/tests are replicated, there is no information lost from demoting a job from SIG-release/informing.

I would love to hear any feedback, comments, or criticisms you all may have on this topic.

Jorge:

Great proposal! I'll leave others to bring up the political difficulties, but you've really pinpointed the overall problem with CI Signal, and why it's so labor-intensive.

I do need to call out one fly in the ointment because I don't think most SIG-release members are aware of the problem:

> Documentation is great, but it should be short and simple otherwise it runs the danger of falling behind. Prow job annotations already have a field for adding a description.

It's actually currently very, very difficult to find the Prow job annotations for a specific job, starting from its name on Testgrid UI. See https://github.com/kubernetes/test-infra/issues/14018

> It's actually currently very, very difficult to find the Prow job annotations for a specific job, starting from its name on Testgrid UI. See kubernetes/test-infra#14018

Documentation-wise, there definitely is a lot of work that needs to be done. Just the fact that my reply in that issue was so long is a testament to that, ref https://github.com/kubernetes/test-infra/issues/14018#issuecomment-525499879 .

> Great proposal! I'll leave others to bring up the political difficulties, but you've really pinpointed the overall problem with CI Signal, and why it's so labor-intensive.

My proposal is a lot to ask but I think there are some reasonable steps that we could start with.

  • promoting to sig-release dashboards jobs / tests that already exist in the owning SIG's dashboard

    • before promoting a job/test to a release dashboard, we should be able to verify that it is working in the owning SIG's dashboard. It should be similar to the process for promoting a test to the Kubernetes conformance suite.

    • this would imply that the owner of the dashboard approves when a job is added or removed from said dashboard

  • we could start by asking technical chairs and leads to add a "testgrid" segment to their issue and PR triaging sessions.

    • 10 minutes to go through issues

    • 10 minutes to go through PRs

    • 5 minutes to go through testgrid

I love the proposal @alejandrox1 ❤️

Overall owning common jobs/tests that are run before release is a hard problem. I know that many people are fighting this even inside their organizations. I know that we are.

Since the day I joined the CI signal release team, I have wanted the toil of CI signal to be spread more evenly among all the SIGs. Establishing a full-blown team inside every SIG for dealing with tests might be too much to ask (?), but I think it would help even if every SIG appointed a person or group of people (it could be a rotating role) to be a frontend to the CI signal work for that SIG for the iteration. It could be for the duration of the release or any other period that makes sense.

Apart from being responsible for monitoring release jobs belonging to the SIG, this person would also be a first line of contact for the CI signal release team in case of any test failure triage.

It actually happened to me recently that we (the CI signal release team) were owning a flake in a SIG's job and I could not get hold of anyone in the SIG who would reply to me and help me decide the priority or triage the issue. The messages in the SIG's channel just went unnoticed and I did not really have any other connections within the SIG to push it.

To other topics:

  1. Yes, it should start in informing. We should define rules for qualifying into blocking (a certain number of failures in a time period).

  2. +1 on the documentation
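A rule of the "certain number of failures in a time period" kind could look roughly like the sketch below. The window length and failure limit are placeholder values chosen for illustration, not agreed numbers, and the function name and run format are assumptions.

```python
from datetime import datetime, timedelta


def qualifies_for_blocking(runs, now, window_days=14, max_failures=1):
    """Decide whether an informing job has been healthy enough to be proposed
    for blocking.

    runs: list of (finished_at, passed) tuples for the job.
    window_days and max_failures are placeholder thresholds.
    """
    cutoff = now - timedelta(days=window_days)
    recent = [passed for finished_at, passed in runs if finished_at >= cutoff]
    if not recent:
        # No runs in the window means there is no signal to judge the job by.
        return False
    failures = sum(1 for passed in recent if not passed)
    return failures <= max_failures


now = datetime(2019, 9, 30)
runs = [
    (datetime(2019, 9, 29), True),
    (datetime(2019, 9, 25), False),
    (datetime(2019, 9, 20), True),
    (datetime(2019, 8, 1), False),  # outside the 14-day window, ignored
]
print(qualifies_for_blocking(runs, now))                  # True
print(qualifies_for_blocking(runs, now, max_failures=0))  # False
```

Whatever the actual thresholds end up being, writing the rule down in this form would make promotion and demotion decisions reproducible.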

I think this is a great proposal @alejandrox1

While it will increase the toil on SIG attendees, I believe this is a good path forward to help close the loop between the SIGs and the Release team.

/milestone v1.17

Coming back to this...
There have been some recent discussions related to this topic that, in my opinion, make it clear that we need to make "ci signal" a broader effort.
All SIGs are independent entities, there is no SIG to rule them all.
With that in mind, I would like to propose that we enact "ci signal" and work with other SIGs as contributors.

To answer the problem statement of this issue: we (sig release) should ask SIGs whether they have CI jobs that offer useful signal for the release, or whether we could work with them on building such jobs. Ideally the tests that will offer us signal are already built, so for the most part we just need to "build" jobs that will run them.
It is important to keep in mind the differences between tests and jobs. Jobs we can ask for because we want to make sure that things are working. Tests, on the other hand, need insider knowledge. Only the people that work in a pertinent area of the project will know what tests are lacking, which need to be created (this should be part of KEPs already), etc.

There are a couple things we can do to get started:

  1. We could rally and occasionally look over the dashboards of other SIGs or drop in their slack channels and ask SIGs what they are doing to test X (if we need a job to test something in specific).
  2. Go over enhancements (triage them the way we triage normal issues) and make sure we can see a testgrid link for all of them.
  3. Triage the existing tags (e.g., conformance, serial, slow) for our tests and see if we are missing tests for important behaviors/features.

Would people be willing to try that out? I can outright volunteer to do so and to help people learn all that's necessary.

@alejandrox1 thanks for your work here! I am also happy to volunteer on this effort. A couple of things that stick out to me as potential questions:

  • Because there are already a lot of jobs in the sig-release dashboards, will we start by auditing what we already have? An alternative could be building out a new set of dashboards with the process you propose, then retiring the current dashboards when we feel that the new ones are stable. I realize this would likely be a lengthy process that would span multiple releases, but it is possible we could organize a team that would be willing to invest a large portion of time into this.
  • Could we codify the promotion process of jobs from individual sig dashboards to sig-release and from informing to blocking within sig-release? I know we currently have a number of tools in test-infra/experiment that automate various tasks, but it would be cool if we could make this automation a little more robust, potentially even with some sort of visual tracking of the promotion and demotion of jobs.
  • Similar to above, could we implement a timeline / cadence for jobs being promoted? Too much process can be inefficient, but it might be helpful to have some hard guidelines that say your test needs to go through "x steps of verification" before being included in the sig-release dashboards.

I likely do not have as much context as some others here, but hopefully these thoughts can be helpful!

Update on this: there is an RFC for a proposal to create a CI signal team (independent of the release team), xref https://github.com/kubernetes/sig-release/issues/966.
If that moves forward, we could potentially enact the policy we proposed here.

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@alejandrox1 @justaugustus are we going to do anything about this?

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

Reopening 'cause I wanna propose this for the ci signal subproject's charter
/reopen

@alejandrox1: Reopened this issue.

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

lol life lesson: remove the lifecycle rotten label
/reopen
/remove-lifecycle rotten

@alejandrox1: Reopened this issue.

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.
