Follow-up to #752
Currently, we have no established procedure for how a job gets added to Blocking (or Informing), and for that matter what happens when it's removed. Questions we need to answer are:
attn:
@alejandrox1 @spiffxp @justaugustus
/kind cleanup
/kind documentation
/milestone v1.16
/cc @Verolop @alenkacz
/sig release
/area release-team
I want to take the chance to describe the unicorn that would be good to see.
There is currently a lot of toil throughout the org: there are tons of jobs and tests in testgrid, not all of them maintained, and there are no standard channels of communication between SIGs and the release team.
This causes the CI signal team along with some very active contributors to be the ones creating issues and alerting other SIGs that something of theirs is broken and needs to be fixed.
I think an ideal setup would be to propose that all SIGs have their own "CI signal team".
In this case, each SIG would have a group of people who monitor their own testgrid dashboards (e.g., the sig-api-machinery and sig-node dashboards in https://testgrid.k8s.io/ ).
Many SIGs already have dedicated sessions for triaging PRs and issues assigned to them, so this would be right up their alley (plus it would be a great way for new contributors to jump in!).
A couple of things to keep in mind: none of the tests in any of the SIG-release dashboards are unique. To give a few examples, all tests tagged [sig-cli] run in a SIG-cli owned job, and all kind-based tests run in either SIG-cluster-lifecycle or SIG-testing jobs.
There is already a duplication of tests which one could assume is aimed to provide a clear signal for the owning SIG and for the release team (in the case a test/job is deemed useful enough to measure the quality of a release).
In this manner, if each SIG had a group of people owning their testgrid dashboards, CI jobs, and tests, and used the same criteria that the release team uses to hold dashboards, jobs, and tests up to standard (see https://github.com/kubernetes/sig-release/blob/master/release-blocking-jobs.md ), collaboration between SIGs and the release team would go a lot smoother.
To put all this into context, let me answer the proposed questions in the context of my plan:
If all SIGs follow the same procedures and standards that the release team uses to gauge the usefulness and quality of a testgrid dashboard, job, or test (see https://github.com/kubernetes/sig-release/blob/master/release-blocking-jobs.md ), then each SIG could easily propose their "blocking" jobs to be part of SIG-release.
SIGs would have more ownership in the process and we could distribute the work.
All jobs should start on the owning SIG's dashboards.
Ultimately, there is no one better equipped to handle an issue than the owning SIG.
This would help create a tighter feedback loop within the SIG and would reduce some of the existing toil within the release team.
Documentation is great, but it should be short and simple otherwise it runs the danger of falling behind.
Prow job annotations already have a field for adding a description.
Additionally, it may be good to include some info in READMEs alongside the job configurations (e.g., the configs under https://github.com/kubernetes/test-infra/tree/master/config/jobs/kubernetes/sig-release ).
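To make the "short description on every job" idea concrete, here is a minimal sketch of a check that flags jobs without one. It assumes the job configs have already been loaded into plain dicts and that the description sits under an `annotations` key named `description`; the helper name `jobs_missing_description` and the example job names are hypothetical.

```python
from typing import Dict, List


def jobs_missing_description(jobs: List[Dict]) -> List[str]:
    """Return the names of jobs whose annotations lack a non-empty description."""
    missing = []
    for job in jobs:
        # A job may have no annotations at all, so fall back to an empty dict.
        annotations = job.get("annotations") or {}
        description = str(annotations.get("description", "")).strip()
        if not description:
            missing.append(job.get("name", "<unnamed>"))
    return missing


# Hypothetical job entries, for illustration only.
example_jobs = [
    {"name": "ci-foo-e2e", "annotations": {"description": "Runs foo e2e tests."}},
    {"name": "ci-foo-unit", "annotations": {"testgrid-dashboards": "sig-foo-misc"}},
    {"name": "ci-foo-legacy"},
]

print(jobs_missing_description(example_jobs))
```

A check like this could run as a presubmit on the job configs so descriptions do not silently fall behind, though where it would live is an open question.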
Jobs would be approved by the owning SIG whenever a new one is proposed from SIG-foo/dashboardX to SIG-foo/blocking.
Once a job is in SIG-foo/blocking then it can be proposed to be release blocking and added to SIG-release/{blocking,informing}.
The owner of the testgrid dashboard should always be notified.
It should never be the case that a job just disappears without the owning SIG's consent.
And, in this case, since all jobs/tests are replicated, there is no information lost from demoting a job from SIG-release/informing.
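The promotion and demotion flow described above can be sketched as a small state model. This is only an illustration under assumptions: the stage names are placeholders for SIG-foo/dashboardX, SIG-foo/blocking, and SIG-release/{informing,blocking}; treating informing as the step before blocking is my assumption; and `Job`, `promote`, and `demote` are hypothetical names, not existing tooling.

```python
from dataclasses import dataclass, field
from typing import List

# Assumed linear ordering of stages; the names are placeholders.
STAGES = ["sig-dashboard", "sig-blocking", "release-informing", "release-blocking"]


@dataclass
class Job:
    name: str
    owning_sig: str
    stage: str = "sig-dashboard"  # all jobs start on the owning SIG's dashboards
    notifications: List[str] = field(default_factory=list)


def promote(job: Job, approved_by_owning_sig: bool) -> None:
    """Move a job one stage up; the owning SIG must approve."""
    if not approved_by_owning_sig:
        raise PermissionError(f"{job.owning_sig} has not approved promoting {job.name}")
    index = STAGES.index(job.stage)
    if index == len(STAGES) - 1:
        raise ValueError(f"{job.name} is already release-blocking")
    job.stage = STAGES[index + 1]


def demote(job: Job) -> None:
    """Move a job one stage down and record a notice for the owning SIG."""
    index = STAGES.index(job.stage)
    if index == 0:
        raise ValueError(f"{job.name} is already on its SIG dashboard")
    job.stage = STAGES[index - 1]
    job.notifications.append(f"notify {job.owning_sig}: {job.name} moved to {job.stage}")


job = Job("ci-foo-e2e", "sig-foo")
promote(job, approved_by_owning_sig=True)   # sig-dashboard -> sig-blocking
promote(job, approved_by_owning_sig=True)   # sig-blocking -> release-informing
demote(job)                                 # release-informing -> sig-blocking
print(job.stage, job.notifications)
```

The point of the model is that a job never skips the owning SIG's dashboards on the way up, and a demotion always leaves a notice for the owner.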
I would love to hear any feedback, comments, or criticisms you all may have on this topic.
Jorge:
Great proposal! I'll leave others to bring up the political difficulties, but you've really pinpointed the overall problem with CI Signal, and why it's so labor-intensive.
I do need to call out one fly in the ointment because I don't think most SIG-release members are aware of the problem:
Documentation is great, but it should be short and simple otherwise it runs the danger of falling behind. Prow job annotations already have a field for adding a description.
It's actually currently very, very difficult to find the Prow job annotations for a specific job, starting from its name on Testgrid UI. See https://github.com/kubernetes/test-infra/issues/14018
It's actually currently very, very difficult to find the Prow job annotations for a specific job, starting from its name on Testgrid UI. See kubernetes/test-infra#14018
Documentation-wise, there definitely is a lot of work that needs to be done. Just the fact that my reply in that issue was so long is a testament to that, ref https://github.com/kubernetes/test-infra/issues/14018#issuecomment-525499879 .
Great proposal! I'll leave others to bring up the political difficulties, but you've really pinpointed the overall problem with CI Signal, and why it's so labor-intensive.
My proposal is a lot to ask but I think there are some reasonable steps that we could start with.
I love the proposal @alejandrox1 ❤️
Overall owning common jobs/tests that are run before release is a hard problem. I know that many people are fighting this even inside their organizations. I know that we are.
Ever since the day I joined the CI signal release team, I have wanted the toil of CI signal to be spread more among all the SIGs. Establishing a full-blown team inside every SIG for dealing with tests might be too much to ask (?) but I think it would help even if every SIG appointed a person or group of people (it could be a rotating role) to be the frontend to the CI signal work for that SIG for the iteration. It could be for the duration of the release or any other period that makes sense.
Besides monitoring the release jobs belonging to the SIG, this person would also be the first line of contact for the CI signal release team whenever a test failure needs triage.
It actually happened to me recently that we (the CI signal release team) were owning a flake in a SIG's job and I could not get hold of anyone in the SIG who would reply and help me decide the priority or triage the issue. The messages in the SIG's channel just went unnoticed and I did not really have any other connections within the SIG to push it.
To other topics:
Yes, it should start in informing. We should define rules for qualifying into blocking (e.g., no more than a certain number of failures in a time period).
+1 on the documentation
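A rule of the kind mentioned above (a certain number of failures in a time period) could be sketched like this. The function name and every threshold here are hypothetical placeholders; the real criteria would have to come from release-blocking-jobs.md or whatever we agree on.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def qualifies_for_blocking(
    runs: List[Tuple[datetime, bool]],
    now: datetime,
    window: timedelta = timedelta(days=14),  # placeholder time period
    max_failures: int = 2,                   # placeholder failure budget
    min_runs: int = 10,                      # placeholder minimum sample size
) -> bool:
    """True if the job ran often enough and failed rarely enough in the window."""
    # Each run is a (finished_at, passed) pair; keep only runs inside the window.
    recent = [passed for finished, passed in runs if now - window <= finished <= now]
    if len(recent) < min_runs:
        return False
    failures = sum(1 for passed in recent if not passed)
    return failures <= max_failures


now = datetime(2019, 9, 15)
# Twelve daily runs with a single failure three days ago.
stable = [(now - timedelta(days=i), i != 3) for i in range(12)]
# Twelve daily runs where every other run fails.
flaky = [(now - timedelta(days=i), i % 2 == 0) for i in range(12)]

print(qualifies_for_blocking(stable, now))
print(qualifies_for_blocking(flaky, now))
```

Requiring a minimum number of runs keeps a job that barely runs from qualifying just because it has had no chance to fail.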
I think this is a great proposal @alejandrox1
While it will increase the toil on SIG attendees, I believe this is a good path forward to help close the loop between the SIGs and the release team.
/milestone v1.17
Coming back to this...
There have been some recent discussions related to this topic that, in my opinion, make it clear that we need to make "ci signal" a broader effort.
All SIGs are independent entities, there is no SIG to rule them all.
With that in mind, I would like to propose that we enact "ci signal" by working with other SIGs as contributors.
To answer the problem statement of this issue: we (sig release) should ask other SIGs whether they have CI jobs that offer useful signal for the release, or whether we could work with them on building some. Ideally the tests that will offer us signal are already built, so for the most part we just need to "build" jobs that will run them.
It is important to keep in mind the differences between tests and jobs. Jobs we can ask for because we want to make sure that things are working. Tests, on the other hand, need insider knowledge. Only the people that work in a pertinent area of the project will know what tests are lacking, which need to be created (this should be part of KEPs already), etc.
There are a couple things we can do to get started:
Would people be willing to try that out? I can outright volunteer to do so and to help people learn all that's necessary.
@alejandrox1 thanks for your work here! I am also happy to volunteer on this effort. A couple of things that stick out to me as potential questions:
sig-release dashboards, will we start by auditing what we already have? An alternative could be building out a new set of dashboards with the process you propose, then retiring the current dashboards when we feel that the new ones are stable. I realize this would likely be a lengthy process that would span multiple releases, but it is possible we could organize a team that would be willing to invest a large portion of time into this.
test-infra/experiment that automate various tasks, but it would be cool if we could make this automation a little more robust, potentially even with some sort of visual tracking of the promotion and demotion of jobs.
I likely do not have as much context as some others here, but hopefully these thoughts can be helpful!
Update on this: there is an RFC for a proposal to create a CI signal team (independent of the release team), xref https://github.com/kubernetes/sig-release/issues/966 .
If that moves forward, then we could potentially enact the policy we proposed here.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
@alejandrox1 @justaugustus are we going to do anything about this?
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
Reopening 'cas i wanna propose this for the ci signal subproject's charter
/reopen
@alejandrox1: Reopened this issue.
@fejta-bot: Closing this issue.
lol life lesson: remove the lifecycle rotten label
/reopen
/remove-lifecycle rotten
@alejandrox1: Reopened this issue.
@fejta-bot: Closing this issue.