Lots of enterprises have a limited number of IPs per node for Pod IP address allocation. Currently the Kubernetes scheduler schedules a pod onto a node even when there are no more pod IPs available on that node, and the pod gets stuck in ContainerCreating. This means the pod cannot run, even when there might be other nodes in the cluster with available IP addresses. An IPv4 address is a core resource (similar to CPU and memory) needed by any Kubernetes pod; without one, a pod cannot run.
Currently the only mechanism available to resolve this issue is to use Extended Resources. Cluster operators need to do extra work: create an extended resource, write an admission controller to inject a request for this extended resource into every pod that is not using hostNetwork, and label each node with the capacity for the extended resource. This is a lot of work for a very common and core use case in running Kubernetes clusters. Extended resources make more sense for uncommon use cases like allocating GPUs.
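For illustration, a minimal sketch of that workaround, assuming a hypothetical extended resource name example.com/pod-ips, a node named node-1, and a recent client-go; an admission webhook would then inject a request for one unit of this resource into every non-hostNetwork pod:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (sketch only; error handling kept minimal).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Advertise 110 assignable pod IPs on node-1 as an extended resource.
	// "example.com/pod-ips", "node-1" and the count are made up; "~1" escapes
	// the "/" in the JSON-Patch path.
	patch := []byte(`[{"op":"add","path":"/status/capacity/example.com~1pod-ips","value":"110"}]`)
	node, err := client.CoreV1().Nodes().Patch(context.TODO(), "node-1",
		types.JSONPatchType, patch, metav1.PatchOptions{}, "status")
	if err != nil {
		panic(err)
	}
	fmt.Println("node capacity now:", node.Status.Capacity)
}
```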
If we can populate the node with the number of IPs available on that node (probably the kubelet would do that), then one of the scheduler predicates could account for it, assuming the pod being scheduled is not hostNetwork and needs only one IP (are there cases we are aware of where a pod needs more than one IP?).
Including scheduling, network, and node folks to help comment on the feasibility of this approach and alleviate a common pain point for cluster operators.
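As a rough sketch of what such a predicate could check, here is a standalone function over core/v1 types; the capacity key network.alpha.kubernetes.io/pod-ips is made up for illustration, and a real implementation would live in the scheduler itself:

```go
package ippredicate

import (
	v1 "k8s.io/api/core/v1"
)

// hypotheticalIPCapacityKey is a made-up capacity key under which the kubelet
// (or a network agent) could publish the number of assignable pod IPs on a node.
const hypotheticalIPCapacityKey v1.ResourceName = "network.alpha.kubernetes.io/pod-ips"

// NodeHasFreePodIP reports whether scheduling one more non-hostNetwork pod onto
// the node would stay within its published pod-IP capacity. Nodes that do not
// publish the capacity are treated as unlimited.
func NodeHasFreePodIP(node *v1.Node, podsOnNode []*v1.Pod, pod *v1.Pod) bool {
	if pod.Spec.HostNetwork {
		return true // hostNetwork pods do not consume a pod IP
	}
	capQty, ok := node.Status.Capacity[hypotheticalIPCapacityKey]
	if !ok {
		return true // node does not advertise an IP capacity
	}
	var used int64
	for _, p := range podsOnNode {
		if p.Spec.HostNetwork || p.Status.Phase == v1.PodSucceeded || p.Status.Phase == v1.PodFailed {
			continue // these pods hold no pod IP
		}
		used++ // assumes exactly one IP per running pod (see the discussion below)
	}
	return used < capQty.Value()
}
```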
@kubernetes/sig-node-feature-requests @kubernetes/sig-network-feature-requests
@krmayankk: GitHub didn't allow me to assign the following users: sig-node.
Note that only kubernetes members and repo collaborators can be assigned.
For more information please see the contributor guide
In response to this:
/assign sig-node
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@bsalamat
/sig node
/sig network
Thanks, @krmayankk. My knowledge about IP management in K8s is not strong at all. I have several questions that may help hash out the details:
I can answer those!
Is the number of IPs fixed on a node or it could be changed dynamically?
The number of IPs on a node can be changed dynamically. It's ultimately up to each network plugin how it performs IP assignment, so this can vary a lot.
Should we consider each Pod needs one IP or this is not a valid assumption (SIG network)?
We can't assume this either - IPv6, as well as other looming developments, will leave us in a situation where multiple IP addresses per pod are possible.
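For instance, with dual-stack networking (which arrived later than this discussion), a single pod already carries more than one IP in its status, so per-pod accounting can't assume exactly one. A small sketch using the core/v1 types, with invented addresses:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// countPodIPs shows why "one pod == one IP" is not a safe assumption: in a
// dual-stack cluster, pod.Status.PodIPs typically holds both an IPv4 and an
// IPv6 address for the same pod.
func countPodIPs(pod *v1.Pod) int {
	return len(pod.Status.PodIPs)
}

func main() {
	// Hypothetical dual-stack pod status, values invented for illustration.
	pod := &v1.Pod{
		Status: v1.PodStatus{
			PodIPs: []v1.PodIP{
				{IP: "10.244.1.5"},
				{IP: "fd00:10:244:1::5"},
			},
		},
	}
	fmt.Println("IPs assigned to this pod:", countPodIPs(pod)) // prints 2
}
```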
Is an IP considered released as soon as a Pod terminates?
Again, this is up to the individual network implementation - Kubernetes doesn't have any knowledge of how IPAM is performed, so we probably can't make this assumption (though it's probably almost always true).
I think in order to do this, we'll probably need to define additional interfaces between network implementations and Kubernetes (e.g. extend the CNI spec) so that k8s can check on available IP space by asking the plugins themselves. Even then, it's not really great because a lot of implementations _don't_ have a limited number of IPs per node, making this feature not very relevant to those implementations (e.g. Calico, Weave).
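To make that concrete, here is a purely illustrative sketch of what a plugin might report if the CNI contract grew a hypothetical capacity query; nothing like this exists in the CNI spec today:

```go
package cnicapacity

// IPPoolStatus is a hypothetical response a network plugin could return if the
// CNI contract grew a "capacity" query. Nothing like this exists in the CNI
// spec today; the shape is purely illustrative.
type IPPoolStatus struct {
	// Total is the number of pod IPs the plugin can assign on this node, or -1
	// if the plugin has no per-node limit (e.g. cluster-wide IPAM such as
	// Calico or Weave).
	Total int `json:"total"`
	// Assigned is the number of IPs currently handed out on this node.
	Assigned int `json:"assigned"`
}

// Available reports how many more pods could be given an IP on this node, or
// -1 if the plugin imposes no per-node limit.
func (s IPPoolStatus) Available() int {
	if s.Total < 0 {
		return -1
	}
	if s.Assigned >= s.Total {
		return 0
	}
	return s.Total - s.Assigned
}
```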
Is there active work already happening for this?
It sounds like there are multiple design decisions to consider, which should be hashed out following the KEP process.
Pinging @kubernetes/sig-scheduling-feature-requests & @kubernetes/sig-network-feature-requests for feedback.
/kind feature
/assign @krmayankk
/stage alpha
This also may not be just per node. For example you might assign subnets by rack. Then any pod getting scheduled on a node in that rack needs an IP from that subnet. This is more a topology scheduling issue somewhat similar to #561 for storage. We should coordinate to make sure that the APIs are as similar as possible.
Thanks @caseydavenport, I think your suggestion about extending the CNI spec so that Kubernetes can ask the network plugin about IP availability makes perfect sense. You mention that this doesn't apply to Calico since you don't have per-node limits, but you still have a limited pool size, right? What happens if the pool is out of IPs? Does the Pod remain Pending?
I can see two possible ways of thinking about this:
1. Account for IP availability per node.
2. Account for IP availability per topology domain (e.g. per rack or subnet), as in the comment above.
The first is simpler to implement and would resolve this problem for the majority of providers where IPs are limited per node. The second needs more thinking. We could do this in phases, with phase 1 being per node and phase 2 being per topology, although I need some experts to weigh in on how phase 2 would be achieved. Thoughts?
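For illustration only, a rough sketch of the two phases as data shapes; every name below is invented:

```go
package ipcapacity

// Phase 1 (hypothetical): each node advertises its own assignable-IP count and
// the scheduler checks it per node.
type NodeIPCapacity struct {
	NodeName string
	FreeIPs  int
}

// Phase 2 (hypothetical): nodes are grouped by a topology key (e.g. a rack
// label) and the free-IP count belongs to the whole group, similar to how
// topology-aware volume scheduling treats storage.
type TopologyIPPool struct {
	TopologyKey   string // e.g. "topology.example.com/rack"
	TopologyValue string // e.g. "rack-17"
	FreeIPs       int
}

// FitsNode is the phase-1 check: does this node still have a free pod IP?
func FitsNode(c NodeIPCapacity) bool { return c.FreeIPs > 0 }

// FitsTopology is the phase-2 check: does the node's rack/subnet pool still
// have a free pod IP?
func FitsTopology(p TopologyIPPool) bool { return p.FreeIPs > 0 }
```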
Weave Net also shares IP address space dynamically across nodes. No host, rack or namespace limits.
What happens if the pool is out of ip's ? Does the Pod remain Pending ?
I think it will be ContainerCreating, since it has been scheduled to a node but set-up hasn't completed.
CNI populating annotations on the Node
Definitely not - CNI is orchestrator-agnostic. It could be done by individual implementations - Calico, Weave, etc.
scheduler talking to CNI would be a new code path
Suggest you raise an enhancement request at https://github.com/containernetworking/cni/issues, ideally with a PR with the proposed API.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Hello @krmayankk , I'm the Enhancement Lead for 1.15. Is this feature going to be graduating alpha/beta/stable stages in 1.15? Please let me know so it can be tracked properly and added to the spreadsheet. This also needs a KEP to be implemented.
Once coding begins, please list all relevant k/k PRs in this issue so they can be tracked properly.
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle rotten
Hi @krmayankk , I'm a 1.16 Enhancement Shadow. Is this feature going to be graduating to alpha/beta/stable stages in 1.16? Please let me know so it can be added to the 1.16 Tracking Spreadsheet. If it's not graduating, I will remove it from the milestone and change the tracked label.
Once coding begins or if it already has, please list all relevant k/k PRs in this issue so they can be tracked properly.
As a reminder, every enhancement requires a KEP in an implementable state with Graduation Criteria explaining each alpha/beta/stable stages requirements.
Milestone dates are Enhancement Freeze 7/30 and Code Freeze 8/29.
Thank you.
Hey there @krmayankk , 1.17 Enhancements shadow here. I wanted to check in and see if you think this Enhancement will be graduating to alpha/beta/stable in 1.17?
The current release schedule is:
If you do, I'll add it to the 1.17 tracking sheet (https://bit.ly/k8s117-enhancement-tracking). Once coding begins please list all relevant k/k PRs in this issue so they can be tracked properly. 👍
Thanks!
There is no KEP for this, so nothing is happening on it. I no longer plan to work on this; anyone can take it up, starting with writing a KEP.
Thank you @krmayankk for the update. :)
I will remove this from the enhancements sheet.
We have a similar request to account for IPs as a resource when making pod scheduling decisions. We also thought of extending the node capacity with an extended resource to account for IPs, but it requires extra maintenance and a custom admission controller, as described in the description.
@krmayankk Have you worked around this problem some other way? Just curious, as this may be a good feature for Kubernetes overall.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
There is no KEP for this, so nothing is happening on it. I no longer plan to work on this; anyone can take it up, starting with writing a KEP.
As per the above,
/unassign @krmayankk
Hey there -- 1.18 Enhancements shadow here. I wanted to check in and see if you think this Enhancement will be graduating to [alpha|beta|stable] in 1.18?
The current release schedule is:
Monday, January 6th - Release Cycle Begins
Tuesday, January 28th EOD PST - Enhancements Freeze
Thursday, March 5th, EOD PST - Code Freeze
Monday, March 16th - Docs must be completed and reviewed
Tuesday, March 24th - Kubernetes 1.18.0 Released
To be included in the release, this enhancement must have a merged KEP in the implementable
status. The KEP must also have graduation criteria and a Test Plan defined.
If you would like to include this enhancement, once coding begins please list all relevant k/k PRs in this issue so they can be tracked properly. 👍
We'll be tracking enhancements here: http://bit.ly/k8s-1-18-enhancements
Thanks!
As a reminder:
Tuesday, January 28th EOD PST - Enhancements Freeze
Enhancements Freeze is in 7 days. If you seek inclusion in 1.18 please update as requested above.
Thanks!
Kirsten you can untrack this one.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
Hi, 1.19 Enhancements shadow here. I wanted to check in and see if you think this Enhancement will be graduating in 1.19?
In order to have this be part of the release:
The KEP PR must be merged in an implementable state
The KEP must have test plans
The KEP must have graduation criteria.
The current release schedule is:
Monday, April 13: Week 1 - Release cycle begins
Tuesday, May 19: Week 6 - Enhancements Freeze
Thursday, June 25: Week 11 - Code Freeze
Thursday, July 9: Week 14 - Docs must be completed and reviewed
Tuesday, August 4: Week 17 - Kubernetes v1.19.0 released
Please let me know and I'll add it to the 1.19 tracking sheet (http://bit.ly/k8s-1-19-enhancements). Once coding begins please list all relevant k/k PRs in this issue so they can be tracked properly. 👍
Thanks!
As a reminder, enhancements freeze is tomorrow, May 19th EOD PST. In order to be included in 1.19, all KEPs must be implementable with graduation criteria and a test plan.
Thanks.
Unfortunately the deadline for the 1.19 Enhancement freeze has passed. For now this is being removed from the milestone and 1.19 tracking sheet. If there is a need to get this in, please file an enhancement exception.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
Hi all,
Enhancements Lead here. This doesn't have any assignee or KEP, so just confirming no plans for this in 1.20?
Thanks
Kirsten
@kikisdeliveryservice yep, that's right. Nobody is working on this for 1.20 (at least not on the sig-network side)