Community: [umbrella] k/k-wide triage workflow improvements

Created on 18 Mar 2019 · 39 Comments · Source: kubernetes/community

This is an overview of ideas I've collected over the last 6 months as triage lead on the release team. The initial discussion related to point 1 can be found here: https://groups.google.com/forum/#!topic/kubernetes-sig-contribex/BvGmOQ0v5f0 . The rest should be discussed further in a meeting; the 1.14 release retro is a good candidate.

Series of items and features that would be beneficial if implemented:

  1. All issues hitting K/K are auto-labeled as 'needs-sig-triage' or something similar.
    addressed here: https://github.com/kubernetes/test-infra/pull/11818
                 |
            open bug/PR
                 |
                 V
        WAITING-ROOM: needs-sig, needs-sig-triage
                 |         ^
            (assign SIG)   |
                 |         |
                 V         |
      --> TRIAGE: needs-sig-triage<----
     |       /           \          |
     |  (close with   (verify)      |
     |   reason)          |         |
     |      |             V         |
      -- CLOSED      BACKLOG: kind/*, priority/*
                          |
                      (assign or claim)
                          |
                          V
                       IN-PROGRESS: assignee
  1. SIGs are tasked by definition to regularly search all issues and appropriately label them / categorize them. This is made much easier by implementing point 1.

  2. Each SIG has a dedicated project/Kanban board where current, upcoming, and milestoned work is visible at a glance, with columns like Backlog, In Progress, Release-Blocking, etc. cc @parispittman @idvoretskyi on boards but for broader project usage

    Case in point: https://github.com/orgs/kubernetes/projects/8 , the SIG-Windows board has worked great, both for them as a SIG and sig-release / release issue triage.

  3. After a SIG reviews the new ticket (issue), it gets an appropriate category - either via direct labels or via Project Board automated labels. thockin suggested the use of triage labels, which are a bit legacy and should be reworked in tandem with project boards to get the desired workflow.
    An example on a project board: issues moved from 'backlog' to 'in progress' automatically get a 'triage/inprogress' label (or something similar). Label automation, project boards, and search queries should all integrate seamlessly and complement each other in the final iteration of the new workflow.

  4. Release team specific: Based on all of the above, incoming 'milestoned' work belongs to SIGs, and it should be a SIG's responsibility to control and estimate what can be done for each release cycle, with the release team stepping in only when needed (as the release approaches). Standard calendar checkpoints in release-readiness will further help - this is what the 'Enhancements Deadline' stands for, but it doesn't cover work outside of new features, and that work's fate is usually left for the release team to ponder.

  5. Therefore, a prototype flowchart is: New Ticket -> SIG -> Labeling or Deletion <-> Project Boards <-> Re-labeling based on current status <-> Release Team is able to view status at any time via project boards

  6. For all above, mass rework of labels is needed.
    'priority' labels are a subject of discussion in every release cycle as it's a fuzzy concept in itself, should be reworked with ideas such as 'impact' and 'importance' in mind,
    'triage' labels are a bit old and currently mostly unused but can be very helpful if properly reworked and integrated into a standard system,
    'kind' labels can be further reworked as there are many issues that do not belong in any current 'kind' (cc @BenTheElder)
    deletion of unwanted labels or re-work into other ones,
    addition of new labels like 'needs-sig-triage', 'release-blocking', 'wontfix' etc.
    related initial issue for 'triage' labels: https://github.com/kubernetes/community/issues/3455

  7. and with that all, rework of the old document located in https://github.com/kubernetes/community/blob/master/contributors/guide/issue-triage.md and possibly updating many others

Other generic improvements include:

  • Mechanism that auto-applies milestone in PRs that are merged out of code freeze, so the full list of PRs included in 1.14 is easily grepped
    (issue is here https://github.com/kubernetes/test-infra/issues/11611)

  • Label that signifies an issue/PR that changes something outside of core k/k, whether it's testing, releng, automation, dependencies, or external bundles like fluentd-gcp. Currently there's only kind/cleanup, which is rather vague. Label variety should be encouraged: with proper standardization, clear rules, and automation around them, labels can be easily understood and used.

    Related issue + doc on defining external dependencies in progress:
    https://github.com/kubernetes/website/issues/12328
    https://docs.google.com/document/d/1WA8N7C48nkJmme9a96DU0o9jBpeycPhht8WF-Eam9QQ/edit?usp=sharing

  • Labels that indicate whether a ticket is release-blocking or good-to-have, e.g. (kind/release-blocking | kind/good-to-have)

  • Label + mechanism that automatically shifts a Ticket to the next milestone a few days after Freeze hits - this automates punting of 'good-to-have' stuff to the next milestone

  • Any ticket in the release-blocking column of a board automatically gets a kind/release-blocking label - this way, anyone can search github issues and PRs via label:kind/release-blocking+milestone:v1.14 query

  • Further improvements on how enhancements are handled here:
    https://github.com/kubernetes/sig-release/issues/539
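
The label-plus-milestone query mentioned in the list above can be assembled programmatically. This is a minimal sketch, assuming the proposed kind/release-blocking label exists (it is only a proposal in this issue) and that the standard GitHub search page is used; the helper name `release_blocking_search_url` is hypothetical.

```python
from urllib.parse import urlencode


def release_blocking_search_url(repo: str, milestone: str,
                                label: str = "kind/release-blocking") -> str:
    """Build a GitHub search URL for open items with a given label and milestone.

    The default label is the one proposed in this issue, not an existing label.
    """
    # Search qualifiers are space-separated; urlencode turns spaces into '+'.
    query = f"repo:{repo} is:open label:{label} milestone:{milestone}"
    return "https://github.com/search?" + urlencode({"q": query, "type": "issues"})


print(release_blocking_search_url("kubernetes/kubernetes", "v1.14"))
```

Keeping the label as a parameter means the same helper would still work if the label ends up under priority/ rather than kind/.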

/sig release pm contributor-experience

tl;dr make ticket management easier for everyone

kind/feature lifecycle/rotten priority/important-soon sig/contributor-experience sig/release sig/testing

All 39 comments

@kubernetes/sig-release @kubernetes/sig-contributor-experience-feature-requests
@thockin @guineveresaenger @nikhita @idvoretskyi @justaugustus @BenTheElder @neolit123
@kubernetes/sig-testing @fejta @cjwagner @BenTheElder

Summarized the points for discussion in the 1.14 retro doc
https://docs.google.com/document/d/1he2axf3adOIk3gA3vxFAewejtE2tm3Wl1NA1p-ooXpo/edit#

/assign

/assign

/milestone May

This is an umbrella issue, so moving out of the current milestone.

/milestone Next

@fejta @spiffxp there seems to be enough work here on the Prow side to have this fulfill an Epic for us.

Here's the state machine as a chart, based on what I have in my head:

State Machine

| State | Description | Entry Criteria | Bot Actions | Human Actions |Exit Criteria |
|---|---|---|---|---|---|
| Open | Default state when an issue is opened | N/A | needs/sig and needs/triage are applied | One or more sig/* labels are applied | Has sig/* label |
| Triage | SIG triages issue to determine if it needs more info, should be closed, or moved to the backlog | Has sig/* and needs/triage label | N/A | Needs info: send /needs info, Closed: send /close <reason>, Backlog: apply kind/* and priority/* | Has closed/* OR kind/* and priority/* |
| Closed/Complete | SIG has determined that issue was completed or cannot be completed | Has closed/* label | needs/triage, needs/info are removed | Can send /reopen to reopen the issue | N/A, complete state |
| Backlog | SIG has determined that issue is relevant and should be picked up by a SIG member | Has kind/* and priority/* label | needs/triage, needs/info are removed | Assign the issue - self: /lifecycle active, /assign (applies lifecycle/active), org member: /assign <org-member> | Has lifecycle/active label |
| In Progress | SIG member has begun work on the issue | Has lifecycle/active label | N/A | Work the issue, send /close [<reason>] | Has closed/* OR stale labels |
| Stale | Issue has been open for some interval without an update | Issue has been open 30 days without an update | lifecycle/{needs-attention,stale,rotten} is applied, lifecycle/active is removed | Active: send /lifecycle active, Close: send /close [<reason>], Freeze: send /lifecycle frozen | Has lifecycle/active, closed/*, or lifecycle/frozen |
| Frozen | Issue is a long-term priority for the SIG and should not be subject to stale labels | Has lifecycle/frozen label | lifecycle/{needs-attention,stale,rotten} is removed | Close: send /close [<reason>], Unfreeze: send /remove-lifecycle frozen | Has closed/* label OR lifecycle/frozen is removed |
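
The table above can be read as a function from an issue's label set to a state. The following is a rough sketch of that reading, not an actual Prow plugin: the function names are hypothetical, and the precedence between states (closed first, then frozen, stale, and so on) is an assumption, since the table does not say which state wins when several entry criteria match.

```python
from fnmatch import fnmatch

# Labels the table groups under "stale labels".
STALE_LABELS = {"lifecycle/needs-attention", "lifecycle/stale", "lifecycle/rotten"}


def has(labels: set[str], pattern: str) -> bool:
    """True if any label matches a glob pattern such as 'sig/*'."""
    return any(fnmatch(label, pattern) for label in labels)


def triage_state(labels: set[str]) -> str:
    """Map a label set to a state name from the table.

    The order of checks is an assumed precedence, not part of the proposal.
    """
    if has(labels, "closed/*"):
        return "Closed/Complete"
    if "lifecycle/frozen" in labels:
        return "Frozen"
    if labels & STALE_LABELS:
        return "Stale"
    if "lifecycle/active" in labels:
        return "In Progress"
    if has(labels, "kind/*") and has(labels, "priority/*"):
        return "Backlog"
    if has(labels, "sig/*") and "needs/triage" in labels:
        return "Triage"
    return "Open"


print(triage_state({"sig/network", "needs/triage"}))
```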

Labels

Needs

  • needs/sig
  • needs/triage
  • needs/more-info

Closed

  • closed/complete
  • closed/support
  • closed/duplicate|dupe
  • closed/not-reproducible|no-repro
  • closed/unresolved

Lifecycle

  • lifecycle/active
  • lifecycle/needs-attention
  • lifecycle/stale
  • lifecycle/rotten
  • lifecycle/frozen

Priority

  • priority/critical-urgent
  • priority/important-soon
  • priority/important-longterm

Actions

  • Rename needs-* labels to needs/ and allow for /needs commands
  • Rename triage/needs-information to needs/[more-]info and the remaining triage/* to closed/*
  • Deprecate unused priority/* labels
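
The two rename actions above amount to a small mapping from old label names to new ones. This is a sketch under stated assumptions: the function name is hypothetical, and `needs/more-info` is picked arbitrarily from the two spellings the action leaves open (`needs/info` or `needs/more-info`).

```python
def renamed_label(old: str) -> str:
    """Return the proposed new name for a label, or the label unchanged."""
    if old == "triage/needs-information":
        # Assumption: 'more-info' is chosen over the shorter 'info' variant.
        return "needs/more-info"
    if old.startswith("triage/"):
        # Remaining triage/* labels become closed/*.
        return "closed/" + old[len("triage/"):]
    if old.startswith("needs-"):
        # needs-* labels move under the needs/ prefix.
        return "needs/" + old[len("needs-"):]
    return old


for name in ["needs-sig", "triage/duplicate", "triage/needs-information", "kind/bug"]:
    print(name, "->", renamed_label(name))
```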

/sig testing

Thanks so much for putting this together, @nikopen!
Allow me to comment on a few of these items...

  1. All issues hitting K/K are auto-labeled as 'needs-sig-triage' or something similar.
    addressed here: kubernetes/test-infra#11818

Agreed. This is a great first step with the immediate impact of being able to search by a single label, instead of an aggregate of them.

I'm in favor of needs-triage or needs/triage.

  1. SIGs are tasked by definition to regularly search all issues and appropriately label them / categorize them. This is made much easier by implementing point 1.

Are SIGs indeed tasked with this by definition or is it an undocumented expectation?

  1. Each SIG has a dedicated project/Kanban board where current, upcoming, and milestoned work is visible at a glance, with columns like Backlog, In Progress, Release-Blocking, etc. cc @parispittman @idvoretskyi on boards but for broader project usage

    Case in point: https://github.com/orgs/kubernetes/projects/8 , the SIG-Windows board has worked great, both for them as a SIG and sig-release / release issue triage.

Agreed that this would be beneficial on the SIG level, but the Release Team would still have to run through multiple boards to get an idea of what's happening. Perhaps a dashboard would be more useful?

  1. After a SIG reviews the new ticket (issue), it gets an appropriate category - either via direct labels or via Project Board automated labels. thockin suggested the use of triage labels, which are a bit legacy and should be reworked in tandem with project boards to get the desired workflow.
    An example on a project board: issues moved from 'backlog' to 'in progress' automatically get a 'triage/inprogress' label (or something similar). Label automation, project boards, and search queries should all integrate seamlessly and complement each other in the final iteration of the new workflow.

A few things here...

  • We shouldn't encourage direct application of labels, as not everyone has access to direct apply.
  • AFAIK, project boards don't support automated labels.
  • The triage labels I've seen seem to be more accurately described as post-triage labels. They really only seem useful in the case that an issue is closed and we want to grep the reason for that after the fact, hence the suggestion above to rename them to closed/*. Issues assigned and in progress could instead be searched via lifecycle/active. Any other states seem to be covered by the state chart above.

  1. Release team specific: Based on all of the above, incoming 'milestoned' work belongs to SIGs, and it should be a SIG's responsibility to control and estimate what can be done for each release cycle, with the release team stepping in only when needed (as the release approaches). Standard calendar checkpoints in release-readiness will further help - this is what the 'Enhancements Deadline' stands for, but it doesn't cover work outside of new features, and that work's fate is usually left for the release team to ponder.

What do you think we can do to improve this, without too much friction?

  1. Therefore, a prototype flowchart is: New Ticket -> SIG -> Labeling or Deletion <-> Project Boards <-> Re-labeling based on current status <-> Release Team is able to view status at any time via project boards

What are we trying to glean here? Completeness of the task? Last updated time?
Again, I think a dashboard would ultimately be more useful to the Release Team here.

  1. For all above, mass rework of labels is needed.
    'priority' labels are a subject of discussion in every release cycle as it's a fuzzy concept in itself, should be reworked with ideas such as 'impact' and 'importance' in mind,
    'triage' labels are a bit old and currently mostly unused but can be very helpful if properly reworked and integrated into a standard system,
    'kind' labels can be further reworked as there are many issues that do not belong in any current 'kind' (cc @BenTheElder)
    deletion of unwanted labels or re-work into other ones,
    addition of new labels like 'needs-sig-triage', 'release-blocking', 'wontfix' etc.
    related initial issue for 'triage' labels: #3455

Agreed on some of the rework (see above), but I think we should punt on doing anything with the kind/*, priority/* labels in the near term. I only say that because these labels lead to some bikeshedding and I don't think refactoring them is strictly necessary to move this forward.

  1. and with that all, rework of the old document located in https://github.com/kubernetes/community/blob/master/contributors/guide/issue-triage.md and possibly updating many others

+1.

Other generic improvements include:

  • Mechanism that auto-applies milestone in PRs that are merged out of code freeze, so the full list of PRs included in 1.14 is easily grepped
    (issue is here kubernetes/test-infra#11611)

+1.

Let's land the standard and then reassess adding other label types.

  • Labels that indicate whether a ticket is release-blocking or good-to-have, e.g. (kind/release-blocking | kind/good-to-have)

release-blocking would probably be a priority; good-to-have I'm not sure about. Same opinion around punting this until we land the workflow.

  • Label + mechanism that automatically shifts a Ticket to the next milestone
    a few days after Freeze hits - this automates punting of 'good-to-have' stuff to the next milestone

+1.

  • Any ticket in the release-blocking column of a board automatically gets a kind/release-blocking label - this way, anyone can search github issues and PRs via label:kind/release-blocking+milestone:v1.14 query

I need to ponder how the board interaction would work, as this functionality doesn't exist natively.

I still owe a response for the enhancements tracking stuff. I'll add notes to that issue.

@nikopen -- Also, this is meaty and impactful enough now that it's deserving of a KEP.
Let's kick around thoughts on the state machine before moving forward with that.

Excellent, thanks for the lengthy responses!

more thoughts ----

_state machine / labels_

  • standardizing /needs and /lifecycle for the triage workflows looks good!
  • /closed labels good!
  • /priority is sometimes neglected and a bit fuzzy as a notion, could be optional
  • /lifecycle probably needs one more - lifecycle _ready_ or similar - to indicate a ticket that is triaged and ready to be picked up - lifecycle active could mean more like in progress.

as in, needs triage -> lifecycle ready -> lifecycle active,
or
needs triage -> lifecycle/backlog -> lifecycle/ready -> lifecycle/active

depending on how many columns of actions are decided. big list of items to-do, smaller list of prioritized, smaller list of in-progress?

  • as lifecycle will likely be used more often, stale and rotten timeouts could be increased to 50 days stale -> 120 days rotten or similar

  • agree with actions
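
The suggested timeouts (50 days stale -> 120 days rotten) can be sketched as a threshold check. Two things here are assumptions rather than part of the suggestion: both thresholds are counted from the last activity (rather than 120 days after going stale), and frozen issues are skipped as in the state table. The function name is hypothetical.

```python
from typing import Optional


def staleness_label(days_inactive: int, labels: set[str],
                    stale_after: int = 50, rotten_after: int = 120) -> Optional[str]:
    """Return the lifecycle label that inactivity would earn, or None."""
    # Frozen issues are exempt from stale labels.
    if "lifecycle/frozen" in labels:
        return None
    # Assumption: both thresholds are measured from the last update.
    if days_inactive >= rotten_after:
        return "lifecycle/rotten"
    if days_inactive >= stale_after:
        return "lifecycle/stale"
    return None


print(staleness_label(60, {"sig/release"}))
```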

on the rest:

1 _open needs-triage PR_

  • let's create needs/ labels with lazy consensus and then merge as needs/triage?
  • needs/triage exit criteria can be sig + kind + lifecycle/ready,active,frozen -- priority optional?
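
The exit criteria floated above (sig + kind + one of lifecycle/ready, active, frozen, with priority possibly optional) can be written as a predicate. A minimal sketch; the function name and the `require_priority` switch are hypothetical, and lifecycle/ready is a label proposed in this thread rather than an existing one.

```python
# lifecycle/ready is proposed in this thread; the other two appear in the state table.
TRIAGED_LIFECYCLES = {"lifecycle/ready", "lifecycle/active", "lifecycle/frozen"}


def triage_done(labels: set[str], require_priority: bool = False) -> bool:
    """True if the labels satisfy the suggested needs/triage exit criteria."""
    has_sig = any(label.startswith("sig/") for label in labels)
    has_kind = any(label.startswith("kind/") for label in labels)
    has_lifecycle = bool(labels & TRIAGED_LIFECYCLES)
    has_priority = any(label.startswith("priority/") for label in labels)
    return has_sig and has_kind and has_lifecycle and (has_priority or not require_priority)


print(triage_done({"sig/release", "kind/bug", "lifecycle/ready"}))
```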

2 _written definitions of workflow_
it should be in a final doc as guidelines or best practices, replacing the older docs like this one , best to make sure it accommodates the needs of most teams from the get-go through lazy consensus etc

3 _release team / many boards_
what would that dashboard be made of?

4 _project boards_

  • perhaps GitHub Actions can be utilized to instrument a template board with the basics (most basic being: if a ticket has sig/XYZ + needs/triage, then put it into the "triage" column of the "sig/XYZ" board) - then each SIG/team can adapt it to their needs. I'm looking into that with GitHub
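
The basic routing rule described above can be expressed independently of whatever automation ends up driving it. This sketch only computes where an issue should go and makes no GitHub API or Actions calls; the function name is hypothetical, and naming boards after their sig/XYZ label is an assumption taken from the example in the bullet.

```python
def triage_board_targets(labels: set[str]) -> list[tuple[str, str]]:
    """Return (board, column) pairs for an issue that still needs triage.

    Each sig/XYZ label on an issue with needs/triage maps to the "triage"
    column of the board named after that label.
    """
    if "needs/triage" not in labels:
        return []
    # Sorted so the result is stable regardless of set ordering.
    return sorted((label, "triage") for label in labels if label.startswith("sig/"))


print(triage_board_targets({"sig/network", "sig/apps", "needs/triage"}))
```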

5 _automatic milestoned work_

  • cross-SIG planning meetings every 2-3 months, plans/enhancements for the next release, review scheduling, etc? can make some proposals in KEP form
  • kicking off a beta process around this by the next release cycle?
  • over time after these processes are implemented backlogs will be clearer and easier to predict
  • in progress column of project board synced with applying milestone on specific issues
  • main handling through Enhancements process to be improved separately / merging over time / fuzzy

7 _label rework_
agree it's hard, proposed ones as above are good to start with

+1 to rest

let's move forward with 1. and create a KEP^^

A few thoughts:

  • We have "mass reworked" labels a couple of times, and it's always extremely impactful on the contributor experience. Nobody is sure what they mean, who can apply them and how. Or which combination of labels is the right one. We need to tread lightly here.
  • Do we really need close/* labels? These get stale if something is reopened. Is this something we need to or want to track via labels?
  • Project boards are complicated and can get out of hand easily. I'd rather not force them on folks.
  • Priorities are unclear today, and don't really become clearer here. Having an established guideline for "prioritization" would be helpful.

Will think and look at this more after code freeze.

@cblecker ^^

  • Agree on impact and confusion, thus having short and concise docs, announcements and clear "what changed" comms really helps with adoption. It's an important part of delivering it to 'prod'
  • There can be a rule that "if issue/PR is open, remove all close/ labels". They're the replacement of the old 'triage' labels, but could do without them too, not too important for the results.
  • Idea is to have all changes as 'guidelines' and as a solid 'default template' (e.g. project board structure) that accommodates the needs of most folks, so it makes sense to get adopted. Again it won't be mandatory but ideally highly recommended ;) Forcing stuff is generally not good.
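
The rule suggested above ("if issue/PR is open, remove all close/ labels") is a simple filter on the label set. A minimal sketch with a hypothetical function name; the prefix is a parameter because the thread uses both `close/` and `closed/` for these labels.

```python
def labels_after_reopen(labels: set[str], prefix: str = "closed/") -> set[str]:
    """Drop every label under the given prefix when an issue is reopened."""
    return {label for label in labels if not label.startswith(prefix)}


print(sorted(labels_after_reopen({"sig/network", "kind/bug", "closed/duplicate"})))
```

This addresses the concern that such labels go stale on reopen, at the cost of losing the record of why the issue was closed the first time.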

on the docs we can work with sig-docs folks like @zacharysarah @Bradamant3 to have them as simple as possible

To add to @cblecker's point - when I initially proposed the triage labels they were actually close/*, along with auto closing of issues. Some folks liked the name, but after further discussion with the community we voted to go with triage/* instead of close/* - https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/kubernetes-dev/c8J8VOYeDB8/39kYEYkrBwAJ and https://groups.google.com/forum/#!searchin/kubernetes-sig-contribex/sahdev%7Csort:date/kubernetes-sig-contribex/IoENWO_2p2g/9_8TEHRVAQAJ (the second link is based on a search because kubernetes-wg-contribex was renamed and doesn't exist now) Thanks!

Good to know!

Ultimately in their current state they just confuse people and they're old/mostly unused, so it's better to switch them back to close/ with some extra automation or even remove them.
It's necessary to change/remove them in order to move forward.

I don't understand why it's necessary to change/remove them in order to move forward, but I am totally fine with the overall decision here. It may be helpful to keep all or some of them as they are, or with some renaming if they're confusing, and try to educate triage engineers about their usage. As @cblecker mentioned, close/* may be confusing as well. The idea of using triage/* labels was to use them mostly in conjunction with closing issues. More like,

  • An issue was closed (or should close) because of a particular triage outcome (support, not-reproducible, duplicate etc), which could potentially speed up closing, give a better understanding of the reason behind a close, and provide a better query to see how many support or not-reproducible etc issues we run into.

  • Such a label can provide SIGs a high level understanding of an issue per basic findings from a new or experienced contributor (e.g. an issue was already attempted to reproduce unsuccessfully or found a probable duplicate with a provided link of duplicate as a comment).

  • Potentially can be used for auto closing with ease (e.g. an issue that is identified as not-reproducible or needs-information could be closed more quickly than by waiting for the normal auto close when the issue reporter doesn't provide the requested information within a certain number of days)

Also, when I search issues with label:triage/support or label:triage/unresolved, they seem to be widely used. Thanks!

@spzala There's a lot of scattered context throughout this and other tickets that you might have missed.

The first actionable item of this ticket is a continuation of the discussion in this link, which is about live issue triage by @thockin. 99% of the triage label usage you can see in your search queries is for sig-network, mostly by @thockin, and used provisionally because changing them needs consensus and communication, has a big impact, et cetera. It would be great if we can resolve this.

I think people would be very happy to reclaim triage/ labels by changing name, have a way to categorize new incoming issues via needs-triage, and utilize the existing lifecycle labels (or other) for issue status.

Happy to hear other suggestions if closed/ is confusing, though I think it's pretty straightforward - focused on giving context to closed issues.

We could also move this specific discussion to this issue which is focused on this topic, so we can ideally resolve it soon:
https://github.com/kubernetes/community/issues/3455

I'll ping @thockin to weigh in on this.

Thanks @nikopen and sounds good. I have no objection with changes if that's making things simpler, some of those triage/* labels are probably pre-triage sort of findings with intention to help speed up triage by SMEs. closed/* is good, but it may not help identifying issues that are candidates for close. Agree that new issues always needs-triage.

With this triage label (similarly to having 1 global triage label now), how do we differentiate between different SIGs that need to look at an issue? Say, something is tagged apps, architecture, and apimachinery... if the first person (say in apimachinery) comes along and marks it as triaged, that's unhelpful for the other SIGs.

@vllry -- There would be an additional label (lifecycle/ready), which when applied, would remove needs-triage.

While needs-triage is applied, SIGs can search on that and assign to members of other SIGs as required.

Does that sound okay?

The alternative I see is having per-SIG triage labels, which I think would get messy quickly.

Indeed, if someone deems that an issue is triaged and is ready to be worked on, then applying the ready label would be enough to remove needs-triage and signify any number of SIGs attached to the issue that it can move forward.
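
The behaviour described here, where applying the ready label clears needs-triage for every SIG on the issue at once, can be sketched as a label-update function. The function name is hypothetical and lifecycle/ready is the label proposed in this thread, not an existing one.

```python
def apply_label(labels: set[str], new_label: str) -> set[str]:
    """Return the label set after adding new_label.

    Applying the proposed lifecycle/ready label also removes needs-triage,
    regardless of how many sig/* labels are attached.
    """
    updated = set(labels) | {new_label}
    if new_label == "lifecycle/ready":
        updated.discard("needs-triage")
    return updated


before = {"sig/apps", "sig/architecture", "sig/api-machinery", "needs-triage"}
print(sorted(apply_label(before, "lifecycle/ready")))
```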

An issue might be cross-SIG and that's up to the participating SIGs to determine how they will work together or not, there's little Labels can do to help in this case - it's context for actual comments / comms.

Generally it's an edge case that can be handled in other ways.

I can create a dedicated DevStats dashboard around this if needed.

/priority important-soon

/cc

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@nikopen -- Are you coming back to this one?
/remove-lifecycle stale
/lifecycle frozen

I've opened the following PRs, to carry the work, as it has stalled out:

Both PRs are on _explicit hold_ until we/I produce a phase 0 KEP for issue triage.
(I hope to get that out to you all this cycle.)

/unassign @nikopen

/remove-sig pm
/area enhancements

Mislabeled:
/remove-area enhancements

/remove-lifecycle frozen

When can we hope to see this in action?

@thockin -- I had some Releng work to do with anago, but will be picking this up later in the week and next week.

The PR is already mostly complete here: https://github.com/kubernetes/test-infra/pull/16298

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.
