Enhancements: Taint Based Eviction

Created on 20 Jan 2017  ·  111 Comments  ·  Source: kubernetes/enhancements

Feature Description

  • One-line feature description (can be used as a release note):
  • Primary contact (assignee): @gmarek
  • Responsible SIGs: @kubernetes/sig-scheduling-feature-requests
  • KEP: https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/20200127-taint-based-evictions.md
  • Reviewer(s) - (for LGTM) recommend having 2+ reviewers (at least one from code-area OWNERS file) agreed to review. Reviewers from multiple companies preferred:
  • Approver (likely from SIG/area to which feature belongs):
  • Feature target (which target equals to which milestone):

    • Alpha release target (x.y)

    • Beta release target (x.y) 1.13

    • Stable release target (x.y)

kind/feature sig/node sig/scheduling stage/stable tracked/no

All 111 comments

This is finished except for documentation.

NoExecute taint effect is now in Beta (as part of moving taints/tolerations to Beta), and taint-based eviction for node problems is in Alpha.

The PRs involved were:

  • kubernetes/kubernetes#39469: Forgiveness (tolerationSeconds) API changes
  • kubernetes/kubernetes#39914: Forgiveness (tolerationSeconds) library changes
  • kubernetes/kubernetes#41414 and kubernetes/kubernetes#41815: defaultTolerationSeconds admission controller
  • kubernetes/kubernetes#41896: Make DaemonSets survive taints
  • kubernetes/kubernetes#40355: TaintController
  • kubernetes/kubernetes#41014: Refactoring
  • kubernetes/kubernetes#41133: NodeController sets node taints instead of deleting pods
  • kubernetes/test-infra#197: Enable e2e test suite in Jenkins
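As a rough illustration of the forgiveness (tolerationSeconds) idea behind these PRs, the sketch below models how long a running pod may stay on a node once NoExecute taints appear. It is a simplified model, not the TaintController's code: the `Taint` and `Toleration` classes and the `eviction_delay` helper are hypothetical names, and the taint key is the one documented for the stable feature rather than anything spelled out in this thread.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str              # e.g. "NoSchedule" or "NoExecute"
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key matches any taint key
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""         # empty effect matches any effect
    toleration_seconds: Optional[int] = None  # None means "tolerate forever"


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Return True if the toleration matches the taint."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.key and tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    return tol.value == taint.value


def eviction_delay(taints: Sequence[Taint],
                   tolerations: Sequence[Toleration]) -> Optional[int]:
    """Seconds until eviction: 0 means now, None means never (hypothetical helper)."""
    delays = []
    for taint in taints:
        if taint.effect != "NoExecute":
            continue  # only NoExecute taints evict pods that are already running
        match = next((t for t in tolerations if tolerates(t, taint)), None)
        if match is None:
            return 0  # an untolerated NoExecute taint evicts immediately
        if match.toleration_seconds is not None:
            delays.append(max(match.toleration_seconds, 0))
    # The shortest bounded toleration wins; no bounded toleration means no eviction.
    return min(delays) if delays else None


unreachable = Taint("node.kubernetes.io/unreachable", "NoExecute")
forgiving = Toleration(key=unreachable.key, operator="Exists",
                       effect="NoExecute", toleration_seconds=300)
print(eviction_delay([unreachable], [forgiving]))  # 300
```

A pod with no matching toleration gets a delay of 0, while a toleration without `toleration_seconds` keeps the pod bound indefinitely.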

@davidopp @gmarek @kevin-wangzefeng please, provide us with the release notes and documentation PR (or links) at the features spreadsheet.

Regarding "taint-based eviction for node problems is in Alpha": nobody is available to move it to beta in 1.7, so it will stay in alpha in 1.7.

@davidopp I can help with this and target 1.7. One question: I saw that https://github.com/kubernetes/kubernetes/pull/40355 is already enabled by default, so can you please explain what you mean by moving this to beta in 1.7?

@gyliu513 - taint controller is enabled by default, but using taints instead of direct evictions in case of node problems isn't. To move it to beta there's at least one thing to be done (besides renaming stuff from alpha to beta), which is to rewrite/write new NodeController unit tests, as they currently assume direct evictions. There are some very basic tests for taint-based evictions, but they should be drastically extended (e.g. cover all master-disruption logic).
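To make the distinction above concrete, here is a minimal sketch contrasting the two paths: deleting pods directly versus only tainting the node and leaving eviction to the taint controller. The function name `plan_node_reaction` and its return shape are hypothetical, timeouts and rate limiting are left out, and the taint keys are the ones documented for the stable feature (the alpha-era names differed).

```python
# Taint keys as documented for the stable feature; an assumption here,
# since the thread does not name them.
NOT_READY = "node.kubernetes.io/not-ready"
UNREACHABLE = "node.kubernetes.io/unreachable"


def plan_node_reaction(ready_status, pods_on_node, taint_based):
    """Decide what a node controller would do for a node's Ready condition.

    ready_status is "True", "False" or "Unknown".
    Hypothetical helper: real controllers also apply timeouts and rate limits.
    """
    plan = {"add_taints": [], "delete_pods": []}
    if ready_status == "True":
        return plan  # healthy node, nothing to do
    if taint_based:
        # Taint-based path: mark the node and let each pod's tolerations
        # decide whether and when that pod is evicted.
        key = NOT_READY if ready_status == "False" else UNREACHABLE
        plan["add_taints"].append((key, "NoExecute"))
    else:
        # Direct path: the controller itself deletes every pod on the node.
        plan["delete_pods"] = list(pods_on_node)
    return plan


print(plan_node_reaction("Unknown", ["web-1", "web-2"], taint_based=True))
print(plan_node_reaction("Unknown", ["web-1", "web-2"], taint_based=False))
```

In the taint-based path the controller never touches pods, which is why the existing unit tests that assert on direct deletions would need to be rewritten.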

I think we can only do this in 1.7 if @gmarek has the bandwidth to do all the reviews and define what we need to do to move it to beta. (Sounds like he's already done most of the second thing above.)

@gmarek do you have time?

Yes, I can find time for reviews and I can find time to think through what needs to be done, if there's someone willing to work on it.

@davidopp @gmarek I will be the volunteer for this. Can you please assign this to me? @gmarek will go through this feature and propose something to you

assigned

@gyliu513 OK - let me know if you need some directions.

@davidopp @gmarek I've updated the feature description to fit the new template. Please, fill the empty fields in the new template (their actual state was unclear).

Move to Beta is goal for 1.7.

@gmarek I want to split the work into two tasks:
1) Rename from alpha to beta.
2) Update unit test for node controller to cover more cases.

Comments?

Renaming involves updating feature gates as well. I'd add:

  1. Update docs

Figuring out a solution to the discussion that starts here https://github.com/kubernetes/kubernetes/issues/44445#issuecomment-300851815 is a blocker to moving this to beta.

BTW I'm pretty nervous that we haven't really had people try out the alpha feature. Unless we can get some significant uses of the alpha feature very soon, we shouldn't move this to beta in 1.7. This feature can cause significant disruption/problems if there are bugs.

@davidopp so, have you finally agreed on the feature goal for 1.7? Will you keep it as alpha or promote to beta? cc @kubernetes/sig-scheduling-feature-requests

I don't think we should move it to beta in 1.7. Nobody has stepped forward to coordinate getting users of the alpha version of the feature, and until we have people successfully using the alpha it's not safe to move it to beta.

@davidopp thanks, will remove from 1.7 milestone.

@davidopp @timothysc This will be moved to beta in v1.8, right?

Maybe. It has lower priority than moving other NodeConditions to Taints.

@gmarek @davidopp @kubernetes/sig-scheduling-feature-requests any updates for 1.8? Is this feature still on track for the release?

Yes, AFAIK (though I was on vacation for the past two weeks). All PRs are merged, we just lack an e2e test. @k82cn

@k82cn can you please write e2e tests for this?

@gmarek can you confirm this is ready for Beta? And you feel it is safe to enable by default (switch the NodeController path from using status for eviction to using taints for eviction)?

If it does not have e2e tests I do not think we should enable it...

As far as I understood from @k82cn, this would stay in alpha for one release and become beta in v1.9...
I think we renamed a couple of constants, that should be reflected in release notes for v1.8

wow, sorry for the late response. let's add this into my queue for 1.9.

This feature will still be alpha in 1.9 as we discussed, but I'll add an e2e test for it. We'll upgrade to beta (enabled by default) when there are enough users of this feature :).

@kubernetes/sig-node-feature-requests

@k82cn @davidopp What's blocking this to graduate to beta?

@luxas I'd really like a few "real" Kubernetes users to try this out before we move it to Beta (hence enable by default). It's a dangerous change because in the worst case it can evict all the pods in the cluster. It also needs (more) e2e tests; looks like @k82cn offered to write some.

BTW there was also an unofficial agreement with sig-node that we would transfer ownership of this from sig-scheduling to sig-node. So I think they should make the decision about when it moves to Beta, but this is my opinion.

@luxas I'd really like a few "real" Kubernetes users to try this out before we move it to Beta

@davidopp How can we get these "real" users? Do we have an action plan for that?

Not really.

Pretty much what David wrote. We need more data and more tests. Current NC unit tests are bad but they cover the original behavior pretty well. We do have some basic unit tests for this behavior (and an e2e test for it: https://k8s-testgrid.appspot.com/google-gce#gce-taint-evict - which also shows that it is possible to write a non-flaky e2e test :), but we need to noticeably increase the unit test coverage for the new behavior.

Testing part is probably a week-ish of work, but finding someone to test is probably much more work.

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

davidopp@ How can we get these "real" users? Do we have an action plan for that?

Can we enable this by default in some test/dev tools, e.g. local cluster? It's really a challenge for users to try this feature in a prod environment :).

/cc @luxas @davidopp @gmarek

/lifecycle frozen

The best we can do in OSS is to create a dedicated test suite for this. If someone is running tests internally it would be good to know, but I don't think we ever figured out how to gather such data.

@gmarek
Any plans for this in 1.11?

If so, can you please ensure the feature is up-to-date with the appropriate:

  • Description
  • Milestone
  • Assignee(s)
  • Labels:

    • stage/{alpha,beta,stable}

    • sig/*

    • kind/feature

cc @idvoretskyi

/remove-lifecycle frozen

@justaugustus, the draft plan is 1.13 if no one helps with that. We (sig-scheduling) would like to get TaintNodesByCondition and ScheduleDaemonSetPod graduated to beta first. Assigning to myself to follow up :)

/assign

/remove-lifecycle rotten

This feature currently has no milestone, so we'd like to check in and see if there are any plans for this in Kubernetes 1.12.

If so, please ensure that this issue is up-to-date with ALL of the following information:

  • One-line feature description (can be used as a release note):
  • Primary contact (assignee):
  • Responsible SIGs:
  • Design proposal link (community repo):
  • Link to e2e and/or unit tests:
  • Reviewer(s) - (for LGTM) recommend having 2+ reviewers (at least one from code-area OWNERS file) agreed to review. Reviewers from multiple companies preferred:
  • Approver (likely from SIG/area to which feature belongs):
  • Feature target (which target equals to which milestone):

    • Alpha release target (x.y)

    • Beta release target (x.y)

    • Stable release target (x.y)

Set the following:

  • Description
  • Assignee(s)
  • Labels:

    • stage/{alpha,beta,stable}

    • sig/*

    • kind/feature

Once this feature is appropriately updated, please explicitly ping @justaugustus, @kacole2, @robertsandoval, @rajendar38 to note that it is ready to be included in the Features Tracking Spreadsheet for Kubernetes 1.12.


Please note that Features Freeze is tomorrow, July 31st, after which any incomplete Feature issues will require an Exception request to be accepted into the milestone.

In addition, please be aware of the following relevant deadlines:

  • Docs deadline (open placeholder PRs): 8/21
  • Test case freeze: 8/28

Please make sure all PRs for features have relevant release notes included as well.

Happy shipping!

P.S. This was sent via automation

/milestone next-milestone

/milestone v1.13

@k82cn can you please update this issue with your level of confidence and what's pending in terms of PRs, tests and docs for this feature to wrap up in 1.13? Considering the various test stability issues we ran into last cycle, it would be great to land the open PRs sooner in this cycle so we have enough time to watch the CI signals and stabilize.

Currently Code Slush for 1.13 is Nov 9th and Freeze starts on Nov 15th. Thanks

For now, we do not have any outstanding issues. We need to increase code coverage for this feature, so several PRs for e2e/integration tests will be opened later. Maybe @Huang-Wei and @ravisantoshgudimetla can suggest when those PRs will be ready; I can help review them ASAP :)

FWIW I put a /hold on https://github.com/kubernetes/kubernetes/pull/69824 because it seems to me like not everything has been finished before promoting to beta. Am I misunderstanding what that PR is doing?

@k82cn thanks for the update above

@Huang-Wei and @ravisantoshgudimetla I see only 1/7 tasks complete in #69533. Is there any ETA for the other tasks to be complete and when this will be enabled in master? As mentioned earlier, we would like to get this feature code and test complete sooner rather than later so we have enough time to monitor the test results in master and stabilize. Is it possible to get all open PRs wrangled by the end of this month?

@AishSundar we're tracking this as our highest priority. We're optimistic about finishing them by the end of this month. Among the remaining 6: one might be a duplicate, one is a docs update, and one is a performance enhancement (good to have, not a blocker). The other 3 are being reviewed and should be merged soon.

Thanks for the update @Huang-Wei. could you point me to the performance enhancement PR/ issue plz? thanks

Hi @Huang-Wei @davidopp @ravisantoshgudimetla , I'm the docs wrangler for the 1.13 release. Could you please open a placeholder PR for the docs for this enhancement against the dev-1.13 branch of k/website and send me a link? If you already have a docs PR open, or if this doesn't require docs in k/website, please let me know.

The deadline for placeholder PRs for the 1.13 release is November 8. So it's important to make a docs PR as soon as possible.

If you have any questions about any of this, I'm happy to help. You can also message me on slack (I'm tfogo there too). 😀

Thanks! ✨

@tfogo there is already a docs PR: https://github.com/kubernetes/website/pull/10765.

It's being reviewed right now. cc/ @k82cn @bsalamat @ravisantoshgudimetla that

The deadline for placeholder PRs for the 1.13 release is November 8.

@Huang-Wei Awesome thanks!

@Huang-Wei - I'm an enhancements shadow checking in on this issue. It sounds like progress is being made. As an FYI, code slush is 11/9 and code freeze is 11/15. Any concerns about making those dates?

@claurence We're trying to enhance one final e2e test. Right now we have PR https://github.com/kubernetes/kubernetes/pull/70681 ongoing, and it's plan A. If its changes are too aggressive or can't make the date, we also have a plan B PR.

So generally I'm optimistic on making the 11/15 date.

@k82cn @Huang-Wei I assume this enhancement is complete now for 1.13. Let me know if you are tracking any more pending PRs on your end.

I think we need to keep this open until it is GA.

Oh yes, we will keep this open, but there is no PR pending for going Beta in 1.13, right?

@AishSundar we're good for going Beta in 1.13.

@gmarek @k82cn @Huang-Wei Hello - I'm the enhancements lead for 1.14 and I'm checking in on this issue to see what work (if any) is being planned for the 1.14 release. Enhancements freeze is Jan 29th and I want to remind you that all enhancements must have a KEP. I don't see a KEP for this enhancement; if there is one, can you please drop a link to it? Thanks.

Are we planning to promote this to GA in 1.14?

That'll be great if any volunteer can help on that; I can help on review part. If no volunteer, we'll move to next release.

@gmarek @k82cn @Huang-Wei 👋 1.14 release enhancements shadow here. We are less than a week out from 1.14 enhancements freeze. Friendly ping on @claurence comment above.

@gmarek @k82cn @Huang-Wei Hello, I'm one of the 1.14 release enhancements shadows. Enhancements freeze is tomorrow and we still need a KEP. Otherwise you will be asked to file an exception.

@gmarek @k82cn @Huang-Wei since there is no KEP for this issue yet we will be removing it from the 1.14 milestone. To have it added back in please file an exception - information on the exception process can be found here: https://github.com/kubernetes/sig-release/blob/master/releases/EXCEPTIONS.md

Hello @gmarek @k82cn @Huang-Wei , I'm the Enhancement Lead for 1.15. Is this feature going to be graduating alpha/beta/stable stages in 1.15? Please let me know so it can be tracked properly and added to the spreadsheet. This will also need a KEP to be included.

Once coding begins, please list all relevant k/k PRs in this issue so they can be tracked properly.

Hi @gmarek @k82cn @Huang-Wei, I'm the 1.16 Enhancement Lead/Shadow. Is this feature going to be graduating alpha/beta/stable stages in 1.16? Please let me know so it can be added to the 1.16 Tracking Spreadsheet. If it's not graduating, I will remove it from the milestone and change the tracked label.

Once coding begins or if it already has, please list all relevant k/k PRs in this issue so they can be tracked properly.

As a reminder, every enhancement requires a KEP in an implementable state with Graduation Criteria explaining each alpha/beta/stable stages requirements.

Milestone dates are Enhancement Freeze 7/30 and Code Freeze 8/29.

Thank you.

Hey there @gmarek @k82cn @Huang-Wei , 1.17 Enhancements shadow here. I wanted to check in and see if you think this Enhancement will be graduating to alpha/beta/stable in 1.17?

The current release schedule is:

  • Monday, September 23 - Release Cycle Begins
  • Tuesday, October 15, EOD PST - Enhancements Freeze
  • Thursday, November 14, EOD PST - Code Freeze
  • Tuesday, November 19 - Docs must be completed and reviewed
  • Monday, December 9 - Kubernetes 1.17.0 Released

If you do, I'll add it to the 1.17 tracking sheet (https://bit.ly/k8s117-enhancement-tracking). Once coding begins please list all relevant k/k PRs in this issue so they can be tracked properly. 👍

Please note that all enhancements should have a KEP, the KEP PR should be merged, the KEP should be in an implementable state, have a testing plan and graduation criteria.

Thanks!

/lifecycle stale

/remove-lifecycle stale

Hey there @gmarek @k82cn @Huang-Wei -- 1.18 Enhancements shadow here. I wanted to check in and see if you think this Enhancement will be graduating to [alpha|beta|stable] in 1.18 or having a major change in its current level?

The current release schedule is:

  • Monday, January 6th - Release Cycle Begins
  • Tuesday, January 28th EOD PST - Enhancements Freeze
  • Thursday, March 5th, EOD PST - Code Freeze
  • Monday, March 16th - Docs must be completed and reviewed
  • Tuesday, March 24th - Kubernetes 1.18.0 Released

To be included in the release,

  1. The KEP PR must be merged
  2. The KEP must be in an implementable state
  3. The KEP must have test plans and graduation criteria.

If you would like to include this enhancement, once coding begins please list all relevant k/k PRs in this issue so they can be tracked properly. 👍

We'll be tracking enhancements here: http://bit.ly/k8s-1-18-enhancements

Thanks! :)

@palnabarun Thanks for the reminder. Given that it's been beta since 1.13 and we haven't had any blocking issues, we'd like to promote it to GA in 1.18.

(BTW: in 1.17, we refactored the TaintBasedEviction e2e test into an integration test for portability, and also fixed some flakiness)

@Huang-Wei Thank you for the updates. I will update the tracking sheet accordingly.

I see that this enhancement predates the KEP process. Just a nit here, in order to be able to track this enhancement, we need a merged KEP, in an implementable state and with test plans and graduation criteria.

/stage stable

/milestone v1.18

I see that this enhancement predates the KEP process. Just a nit here, in order to be able to track this enhancement, we need a merged KEP, in an implementable state and with test plans and graduation criteria.

/cc @damemi ^^ Could you please update the KEP as well? Thanks.

@Huang-Wei @damemi Is this (https://github.com/kubernetes/enhancements/blob/master/keps/sig-scheduling/20200114-taint-based-evictions.md) the KEP for this enhancement?

@palnabarun yes, and this is the parent issue for the tasks that need to be done: https://github.com/kubernetes/kubernetes/issues/87161

@damemi Awesome! I see that the KEP now satisfies all criteria for the Enhancements Freeze.

Thank you for all the efforts. :)

I went ahead and updated the issue comment with the link to the KEP.

@damemi Hi, is the move of #1450 from scheduling to node still in progress? Will it be done before the v1.18 KEP freeze?

@skilxn-go I opened a PR to move the KEP here: https://github.com/kubernetes/enhancements/pull/1510

Thanks, got it

Hi @damemi, just a friendly reminder that the Code Freeze will go into effect on Thursday 5th March.

Can you please link all the k/k PRs or any other PRs which should be tracked for this enhancement?

Thank You :)

Hi @palnabarun, we have an umbrella issue which links to the issues/PRs that are in the works for this: https://github.com/kubernetes/kubernetes/issues/87161

Thank you @damemi for updating this. :)

Hey @damemi -

Seth here, Docs shadow on the 1.18 release team.

Does the enhancement work planned for 1.18 require any new docs or modifications to existing docs?

If not, can you please update the 1.18 Enhancement Tracker Sheet (or let me know and I'll do so).

If doc updates are required, reminder that the placeholder PRs against k/website (branch dev-1.18) are due by Friday, Feb 28th.

Let me know if you have any questions!

@sethmccombs IIUC, given I have the doc PR opened against dev-1.18 branch (https://github.com/kubernetes/website/pull/19302), there's no need to update any sheet, right?

@ingvagabund you got it, I'll update the Enhancement tracking sheet!

Hi @damemi, this is a reminder that we are just two days away from Code Freeze on 5th March.

By the Code Freeze, all the relevant PRs should be merged, or else you would need to file an exception request.

@palnabarun I've updated the 3 PRs that this is waiting on to see if they will merge by code freeze.

  1. https://github.com/kubernetes/kubernetes/pull/88152
  2. https://github.com/kubernetes/kubernetes/pull/87487
  3. https://github.com/kubernetes/website/pull/19302

@palnabarun actually before those 3 can merge, we need to get the KEP move approved: https://github.com/kubernetes/enhancements/pull/1510

@damemi I see that the PRs are blocked on approvals at the moment. Do you think they will make it before the deadline?

Code Freeze is today EOD.

Please file an exception if you think the PRs might slip the deadline.

I think we will need more time to get the approvals, what's the process to file an exception?

@jeremyrickard thanks, exception filed

@damemi The exception request was approved. :)

Hi @damemi, since this enhancement graduated to Stable this release :rocket:, the status can now be set to be Implemented.

Can you please update the status? After that, we will close this issue.

@palnabarun sure, opened that here: https://github.com/kubernetes/enhancements/pull/1625

Thank you @damemi :)

The corresponding enhancement has graduated to Stable. :partying_face:

Closing this issue.

/close

@palnabarun: Closing this issue.

Edit: The issue description here had a 404 link. Pointed it to the correct one.
