Sig-release: plan a short cycle stability release

Created on 23 Sep 2019 · 20 Comments · Source: kubernetes/sig-release

It's been proposed that we do a short (one month?) release cycle that only includes stability fixes and test enhancements. Could a trial of this happen for 1.21 at the end of 2020? If so, that needs to be planned and communicated starting perhaps as early as the 1.19 cycle. For example, this would mean that people don't get to add features (new, different, scary?!), perhaps even if they missed readiness in 1.20; that the criteria for feature versus bugfix need to be tightened; that we need to establish better criteria for critical-urgent versus important-longterm changes; that we need to decide whether this release would count as a deprecation point for version skew; and so on.

The milestone versions and dates above are for concrete example only: We should discuss, pick a time to try this, and back up from there in time to do the proper pre-planning to make it successful.

@calebamiles @justaugustus @thockin
/sig release
/area release-eng

area/release-eng kind/feature lifecycle/stale priority/important-longterm sig/release

tpepper

👍7 🚀3 🎉3

Most helpful comment

Let me begin by saying that I strongly support fixing bugs, stabilizing components, adding/deflaking tests, improving documentation, and graduating features (and have a track record of working to make those things happen).

There seems to be an impression that the best way to move the project in that direction is to designate particular releases as "stability releases" and forbid any feature-related work in that release. With a project like Kubernetes, the number of distributed efforts in progress, and the length of time it takes to progress features through various maturity stages, pausing feature work for 25% (Q4 proposal) or 50% (every other release proposal) of development time seems problematic.

It also presumes that all work can be categorized neatly into "fix" or "feature" buckets. Is adding a field to an existing beta API to address user feedback, resolve performance/stability issues, and move closer to GA a fix or a feature? Case in point: the timeout field added to admission webhooks in v1.14 was required to improve responsiveness and resolve specific bugs and performance issues in a backwards-compatible way.

In my experience, properly progressing a feature or API to GA serves as a forcing function for burning down bug backlogs, adding scale and conformance tests, demonstrating consistently green CI, and writing the final user guides and samples. As an example, the extensibility work in v1.15 (beta) and v1.16 (GA) was definitely feature work, but also resolved many long-standing bugs, test flakes, and user-reported usability issues, added conformance tests covering 100% of the API, and greatly improved documentation. I consider the extensibility work done over those two releases exemplary, and would be happy to see more feature development follow that as a model. Blocking that work for a release because of a project-wide feature embargo, even though it greatly improved the health and stability of those components, would seem counter-productive to the stated stability goals.

I think it would be more helpful to have metrics for SIGs / subproject / component health over time. For example:

Features (feature gate count, maturity level)
APIs (resource count, maturity level)
Bugs (count, severity)
Tests (flakes, coverage)
Regressions (backports)

Those metrics (and their trajectory over time) should inform local decisions about where effort is required and where quality/stability is a problem, not an arbitrary "no feature work in one release a year" approach.
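As a rough illustration of what tracking such metrics over time could look like, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `HealthSnapshot` type, its field names, the `worsening_metrics` helper, the release labels, and the numbers are all hypothetical and do not come from any existing Kubernetes tooling.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class HealthSnapshot:
    """One release's health numbers for a component (hypothetical fields)."""
    release: str
    open_bugs: int
    flaky_tests: int
    backported_regressions: int


def worsening_metrics(history: Sequence[HealthSnapshot]) -> List[str]:
    """Return the metric names that are higher in the newest snapshot
    than in the oldest one. Fewer than two snapshots means no trajectory."""
    if len(history) < 2:
        return []
    first, last = history[0], history[-1]
    names = ("open_bugs", "flaky_tests", "backported_regressions")
    return [name for name in names if getattr(last, name) > getattr(first, name)]


# Made-up numbers for two placeholder releases of one component.
history = [
    HealthSnapshot("v1.x", open_bugs=40, flaky_tests=12, backported_regressions=3),
    HealthSnapshot("v1.y", open_bugs=35, flaky_tests=15, backported_regressions=3),
]
print(worsening_metrics(history))
```

A component whose list of worsening metrics stays non-empty across releases would be a candidate for focused stability work, independent of what any other component is doing.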

Motivation and enforcement are open questions with either approach, but I think the localized, metrics-driven approach would have a better chance of actually promoting stability, and of course-correcting where needed and when needed.

A fixed "stability release" approach is vulnerable to contributors simply shifting time during that release to working on features in branches, and holding them for merge until the beginning of the next release.

A metrics-driven approach would continue to identify particular components / subprojects / SIGs as needing to focus on health/quality/stability until the work was actually put in to improve those metrics.

liggitt on 26 Sep 2019

👍6 ❤5 🚀2

All 20 comments

tpepper on 24 Sep 2019

My thoughts are captured here, but I'll transpose to this thread later: https://twitter.com/stephenaugustus/status/1167273011885596673?s=19

justaugustus on 24 Sep 2019

Some historical context: https://docs.google.com/document/d/16TTI6K-s941UnlQs5uZSWUw58erUbuYbRkKxN5kDXs0/edit?usp=drivesdk

Thanks for the share, @jdumars!

justaugustus on 26 Sep 2019

+1 to this! It’d be a great move for the project in my opinion 👏🏼

onlydole on 26 Sep 2019

❤3

/assign

justaugustus on 26 Sep 2019

/kind feature
/priority important-longterm

justaugustus on 26 Sep 2019

On the SIG Release agenda for 10/7.

justaugustus on 26 Sep 2019

Yeah, ultimately the end goal is a more stable product, and given the diversity of bug / feature ratios across SIGs, only a metrics-driven approach is likely to succeed (there is simply no rule that applies to all SIGs).

smarterclayton on 26 Sep 2019

mhomaid on 26 Sep 2019

I strongly agree with @liggitt above. E.g., the request management / fairness & priority effort is both a "feature" and a desperately needed stability fix.

Additionally, people who have been working super hard on something do not deserve to wait half a year to merge it.

Additionally, I think stability needs to be a way of life. Every release should be a stability release, or at least more stable than the previous one; we should not have every other release be a non-stable release!

lavalamp on 27 Sep 2019

👍2

@lavalamp I think we can all agree that stability needs to be a way of life. Sadly, that is not how we humans tend to engineer things: it's in our nature to gravitate towards new shiny things vs. maintaining the things we already have.

@liggitt brings up two great points:

Using metrics to drive decision-making. Establishing metrics per-SIG/component/etc. is a great idea and project in its own right. With those metrics there needs to be some measure of accountability and periodic check-in. Long term I hope owners will evaluate the health of their components and have the power to declare a feature pause (or raise the bar on new features such that they must measurably improve the stability). And if this is adopted, I think it is important to establish a periodic cadence so contributors are not caught off guard if the SIG/component they are working on declares a feature pause.
API progression bakes in stability/bug fixes. Does declaring that an API will not progress in a "stabilization release" reduce the incentive to continue the promotion work? I would also counter that scope creep is a destabilizing force - are there examples of a component becoming less stable because new use cases were taken on during the feature promotion process?

adambkaplan on 4 Oct 2019

If we had feature branches and feature branch CI, could every release be a short cycle stability release, inasmuch as only stable, proven branch content is merged to master, with broader integration happening ahead of merge?

tpepper on 21 Oct 2019

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
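The thresholds in the bot message above can be read as a small state function. The sketch below is only an illustration of that reading, not the bot's actual implementation: the function name, the state strings, and the choice of inclusive boundaries are assumptions, and closing is left out because the message only says it happens "eventually".

```python
STALE_AFTER_DAYS = 90       # "Issues go stale after 90d of inactivity"
ROT_AFTER_EXTRA_DAYS = 30   # "rot after an additional 30d of inactivity"


def lifecycle_state(days_inactive: int) -> str:
    """Map days of inactivity to a lifecycle label.

    Boundaries are treated as inclusive, which is an assumption;
    the bot message does not say what happens exactly at 90 or 120 days.
    """
    if days_inactive >= STALE_AFTER_DAYS + ROT_AFTER_EXTRA_DAYS:
        return "rotten"
    if days_inactive >= STALE_AFTER_DAYS:
        return "stale"
    return "fresh"


for days in (10, 90, 119, 120):
    print(days, lifecycle_state(days))
```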

fejta-bot on 8 Mar 2020

/remove-lifecycle stale

palnabarun on 25 Mar 2020

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot on 23 Jun 2020

Closing, as we have been discussing continuous stability efforts rather than dedicated stability releases.

LappleApple on 16 Oct 2020

This issue represents a discussion history, but has not gone anywhere specific after the first request. In the meantime we have more current discussion in https://github.com/kubernetes/sig-release/issues/1290 around updating the cadence. I'm going to close this one now instead of continually kicking it forward.

tpepper on 16 Oct 2020

Copying some comments from https://github.com/kubernetes/sig-release/discussions/1290, in order to call maintenance/stability releases out-of-scope for the release cadence KEP:

@youngnick:

I agree with @spiffxp that whatever we end up doing, we should acknowledge that calendar Q4 is substantially quieter than other quarters, with US Kubecon rolling into US Thanksgiving, rolling into the December festive season.

I think that any plan to change the release cadence needs to take that as a prime consideration, whether it's keeping four releases a year and marking the Q4 one as minimal features, spreading three releases across the year, or some other solution.

@jberkus: