Lots of enterprises have a limited number of IPs per node for Pod IP address allocation. Currently the Kubernetes scheduler schedules a pod onto a node even when there are no more pod IPs available on that node, and the pod gets stuck in ContainerCreating. This means the pod cannot run, even when there might be other nodes in the cluster with available IP addresses. An IPv4 address is a core resource (similar to CPU and memory) needed by any Kubernetes pod; without one, a pod cannot run.
Currently the only mechanism available to resolve this issue is to use Extended Resources. Cluster operators need to do extra work: create an extended resource, write an admission controller to inject a request for this extended resource into every pod that is not using hostNetwork, and label each node with the capacity for the extended resource. This is a lot of work for a very common and core use case in running Kubernetes clusters. Extended resources make more sense for uncommon use cases like allocating GPUs.
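For illustration, a minimal sketch of that workaround, assuming a hypothetical extended resource name example.com/pod-ips, a node named node-1, and a recent client-go; an admission webhook would then inject a request for one unit of this resource into every non-hostNetwork pod:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (sketch only; error handling kept minimal).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Advertise 110 assignable pod IPs on node-1 as an extended resource.
	// "example.com/pod-ips", "node-1" and the count are made up; "~1" escapes
	// the "/" in the JSON-Patch path.
	patch := []byte(`[{"op":"add","path":"/status/capacity/example.com~1pod-ips","value":"110"}]`)
	node, err := client.CoreV1().Nodes().Patch(context.TODO(), "node-1",
		types.JSONPatchType, patch, metav1.PatchOptions{}, "status")
	if err != nil {
		panic(err)
	}
	fmt.Println("node capacity now:", node.Status.Capacity)
}
```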
If we can populate the node with the number of IPs available on that node (probably the kubelet would do that), then one of the scheduler predicates could account for it, assuming the pod being scheduled is not hostNetwork and needs only one IP (are there cases we are aware of where a pod needs more than one IP?).
Including scheduling, network, and node folks to help comment on the feasibility of this approach and alleviate a common pain point for cluster operators.
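As a rough sketch of what such a predicate could check, here is a standalone function over core/v1 types; the capacity key network.alpha.kubernetes.io/pod-ips is made up for illustration, and a real implementation would live in the scheduler itself:

```go
package ippredicate

import (
	v1 "k8s.io/api/core/v1"
)

// hypotheticalIPCapacityKey is a made-up capacity key under which the kubelet
// (or a network agent) could publish the number of assignable pod IPs on a node.
const hypotheticalIPCapacityKey v1.ResourceName = "network.alpha.kubernetes.io/pod-ips"

// NodeHasFreePodIP reports whether scheduling one more non-hostNetwork pod onto
// the node would stay within its published pod-IP capacity. Nodes that do not
// publish the capacity are treated as unlimited.
func NodeHasFreePodIP(node *v1.Node, podsOnNode []*v1.Pod, pod *v1.Pod) bool {
	if pod.Spec.HostNetwork {
		return true // hostNetwork pods do not consume a pod IP
	}
	capQty, ok := node.Status.Capacity[hypotheticalIPCapacityKey]
	if !ok {
		return true // node does not advertise an IP capacity
	}
	var used int64
	for _, p := range podsOnNode {
		if p.Spec.HostNetwork || p.Status.Phase == v1.PodSucceeded || p.Status.Phase == v1.PodFailed {
			continue // these pods hold no pod IP
		}
		used++ // assumes exactly one IP per running pod (see the discussion below)
	}
	return used < capQty.Value()
}
```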
@kubernetes/sig-node-feature-requests @kubernetes/sig-network-feature-requests
@krmayankk: GitHub didn't allow me to assign the following users: sig-node.
Note that only kubernetes members and repo collaborators can be assigned.
For more information please see the contributor guide
In response to this:
/assign sig-node
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@bsalamat
/sig node
/sig network
Thanks, @krmayankk. My knowledge about IP management in K8s is not strong at all. I have several questions that may help hash out the details:
I can answer those!
Is the number of IPs fixed on a node or it could be changed dynamically?
The number of IPs on a node can be changed dynamically. It's ultimately up to each network plugin how it performs IP assignment, so this can vary a lot.
Should we consider each Pod needs one IP or this is not a valid assumption (SIG network)?
We can't assume this either - IPv6, as well as other looming developments, will leave us in a situation where multiple IP addresses per pod are possible.
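For instance, with dual-stack networking (which arrived later than this discussion), a single pod already carries more than one IP in its status, so per-pod accounting can't assume exactly one. A small sketch using the core/v1 types, with invented addresses:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// countPodIPs shows why "one pod == one IP" is not a safe assumption: in a
// dual-stack cluster, pod.Status.PodIPs typically holds both an IPv4 and an
// IPv6 address for the same pod.
func countPodIPs(pod *v1.Pod) int {
	return len(pod.Status.PodIPs)
}

func main() {
	// Hypothetical dual-stack pod status, values invented for illustration.
	pod := &v1.Pod{
		Status: v1.PodStatus{
			PodIPs: []v1.PodIP{
				{IP: "10.244.1.5"},
				{IP: "fd00:10:244:1::5"},
			},
		},
	}
	fmt.Println("IPs assigned to this pod:", countPodIPs(pod)) // prints 2
}
```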
Is an IP considered released as soon as a Pod terminates?
Again, this is up to the individual network implementation - Kubernetes doesn't have any knowledge of how IPAM is performed, so we probably can't make this assumption (though it's probably almost always true).
I think in order to do this, we'll probably need to define additional interfaces between network implementations and Kubernetes (e.g. extend the CNI spec) so that k8s can check on available IP space by asking the plugins themselves. Even then, it's not really great because a lot of implementations _don't_ have a limited number of IPs per node, making this feature not very relevant to those implementations (e.g. Calico, Weave).
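To make that concrete, here is a purely illustrative sketch of what a plugin might report if the CNI contract grew a hypothetical capacity query; nothing like this exists in the CNI spec today:

```go
package cnicapacity

// IPPoolStatus is a hypothetical response a network plugin could return if the
// CNI contract grew a "capacity" query. Nothing like this exists in the CNI
// spec today; the shape is purely illustrative.
type IPPoolStatus struct {
	// Total is the number of pod IPs the plugin can assign on this node, or -1
	// if the plugin has no per-node limit (e.g. cluster-wide IPAM such as
	// Calico or Weave).
	Total int `json:"total"`
	// Assigned is the number of IPs currently handed out on this node.
	Assigned int `json:"assigned"`
}

// Available reports how many more pods could be given an IP on this node, or
// -1 if the plugin imposes no per-node limit.
func (s IPPoolStatus) Available() int {
	if s.Total < 0 {
		return -1
	}
	if s.Assigned >= s.Total {
		return 0
	}
	return s.Total - s.Assigned
}
```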
Is there active work already happening for this?
It sounds like there are multiple design decisions to consider, which should be hashed out following the KEP process.
Pinging @kubernetes/sig-scheduling-feature-requests & @kubernetes/sig-network-feature-requests for feedback.
/kind feature
/assign @krmayankk
/stage alpha
This also may not be just per node. For example you might assign subnets by rack. Then any pod getting scheduled on a node in that rack needs an IP from that subnet. This is more a topology scheduling issue somewhat similar to #561 for storage. We should coordinate to make sure that the APIs are as similar as possible.
Thanks @caseydavenport, I think your suggestion about extending the CNI spec so that Kubernetes can ask the network plugin about IP availability makes perfect sense. You mention that this doesn't apply to Calico since you don't have per-node limits, but you still have a limited pool size, right? What happens if the pool is out of IPs? Does the Pod remain Pending?
I can see two possible ways of thinking about this:
1. Account for IP availability per node.
2. Account for IP availability per topology domain (e.g. per rack or subnet), as in the comment above.
The first is simpler to implement and would resolve this problem for the majority of providers where IPs are limited per node. The second needs more thinking. We could do this in phases, with phase 1 being per node and phase 2 being per topology, although I need some experts to weigh in on how phase 2 would be achieved. Thoughts?
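For illustration only, a rough sketch of the two phases as data shapes; every name below is invented:

```go
package ipcapacity

// Phase 1 (hypothetical): each node advertises its own assignable-IP count and
// the scheduler checks it per node.
type NodeIPCapacity struct {
	NodeName string
	FreeIPs  int
}

// Phase 2 (hypothetical): nodes are grouped by a topology key (e.g. a rack
// label) and the free-IP count belongs to the whole group, similar to how
// topology-aware volume scheduling treats storage.
type TopologyIPPool struct {
	TopologyKey   string // e.g. "topology.example.com/rack"
	TopologyValue string // e.g. "rack-17"
	FreeIPs       int
}

// FitsNode is the phase-1 check: does this node still have a free pod IP?
func FitsNode(c NodeIPCapacity) bool { return c.FreeIPs > 0 }

// FitsTopology is the phase-2 check: does the node's rack/subnet pool still
// have a free pod IP?
func FitsTopology(p TopologyIPPool) bool { return p.FreeIPs > 0 }
```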
Weave Net also shares IP address space dynamically across nodes. No host, rack or namespace limits.
What happens if the pool is out of ip's ? Does the Pod remain Pending ?
I think it will be ContainerCreating, since it has been scheduled to a node but set-up hasn't completed.
CNI populating annotations on the Node
Definitely not - CNI is orchestrator-agnostic. It could be done by individual implementations - Calico, Weave, etc.
scheduler talking to CNI would be a new code path
Suggest you raise an enhancement request at https://github.com/containernetworking/cni/issues, ideally with a PR with the proposed API.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Hello @krmayankk , I'm the Enhancement Lead for 1.15. Is this feature going to be graduating alpha/beta/stable stages in 1.15? Please let me know so it can be tracked properly and added to the spreadsheet. This also needs a KEP to be implemented.
Once coding begins, please list all relevant k/k PRs in this issue so they can be tracked properly.
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle rotten
Hi @krmayankk , I'm a 1.16 Enhancement Shadow. Is this feature going to be graduating to alpha/beta/stable stages in 1.16? Please let me know so it can be added to the 1.16 Tracking Spreadsheet. If it's not graduating, I will remove it from the milestone and change the tracked label.
Once coding begins or if it already has, please list all relevant k/k PRs in this issue so they can be tracked properly.
As a reminder, every enhancement requires a KEP in an implementable state with Graduation Criteria explaining each alpha/beta/stable stages requirements.
Milestone dates are Enhancement Freeze 7/30 and Code Freeze 8/29.
Thank you.
Hey there @krmayankk , 1.17 Enhancements shadow here. I wanted to check in and see if you think this Enhancement will be graduating to alpha/beta/stable in 1.17?
The current release schedule is:
If you do, I'll add it to the 1.17 tracking sheet (https://bit.ly/k8s117-enhancement-tracking). Once coding begins please list all relevant k/k PRs in this issue so they can be tracked properly. 👍
Thanks!
There is no KEP for this, so nothing is happening on it. I no longer plan to work on this; anyone can take it up, starting with writing a KEP.
Thank you @krmayankk for the update. :)
I will remove this from the enhancements sheet.
We have a similar request to account for IPs as a resource when making pod scheduling decisions. We also thought of extending the node capacity with an extended resource to account for IPs, but it requires extra maintenance and a custom admission controller, as described in the description.
@krmayankk Have you worked around this problem some other way? Just curious, as this may be a good feature for Kubernetes overall.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
There is no KEP for this, so nothing is happening on it. I no longer plan to work on this; anyone can take it up, starting with writing a KEP.
As per the above,
/unassign @krmayankk
Hey there -- 1.18 Enhancements shadow here. I wanted to check in and see if you think this Enhancement will be graduating to [alpha|beta|stable] in 1.18?
The current release schedule is:
Monday, January 6th - Release Cycle Begins
Tuesday, January 28th EOD PST - Enhancements Freeze
Thursday, March 5th, EOD PST - Code Freeze
Monday, March 16th - Docs must be completed and reviewed
Tuesday, March 24th - Kubernetes 1.18.0 Released
To be included in the release, this enhancement must have a merged KEP in the implementable
status. The KEP must also have graduation criteria and a Test Plan defined.
If you would like to include this enhancement, once coding begins please list all relevant k/k PRs in this issue so they can be tracked properly. 👍
We'll be tracking enhancements here: http://bit.ly/k8s-1-18-enhancements
Thanks!
As a reminder:
Tuesday, January 28th EOD PST - Enhancements Freeze
Enhancements Freeze is in 7 days. If you seek inclusion in 1.18 please update as requested above.
Thanks!
Kirsten you can untrack this one.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
Hi, 1.19 Enhancements shadow here. I wanted to check in and see if you think this Enhancement will be graduating in 1.19?
In order to have this be part of the release:
The KEP PR must be merged in an implementable state
The KEP must have test plans
The KEP must have graduation criteria.
The current release schedule is:
Monday, April 13: Week 1 - Release cycle begins
Tuesday, May 19: Week 6 - Enhancements Freeze
Thursday, June 25: Week 11 - Code Freeze
Thursday, July 9: Week 14 - Docs must be completed and reviewed
Tuesday, August 4: Week 17 - Kubernetes v1.19.0 released
Please let me know and I'll add it to the 1.19 tracking sheet (http://bit.ly/k8s-1-19-enhancements). Once coding begins please list all relevant k/k PRs in this issue so they can be tracked properly. 👍
Thanks!
As a reminder, enhancements freeze is tomorrow, May 19th EOD PST. In order to be included in 1.19, all KEPs must be implementable with graduation criteria and a test plan.
Thanks.
Unfortunately the deadline for the 1.19 Enhancement freeze has passed. For now this is being removed from the milestone and 1.19 tracking sheet. If there is a need to get this in, please file an enhancement exception.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
Hi all,
Enhancements Lead here. This doesn't have any assignee or KEP, so just confirming no plans for this in 1.20?
Thanks
Kirsten
@kikisdeliveryservice yep, that's right. Nobody is working on this for 1.20 (at least not on the sig-network side)