When a Z2JH deployment grows, bigger and bigger nodes are typically used, but at some point these nodes will refuse to accept more pods, because every Kubernetes cluster has an upper limit on pods per node related to IP address range allocations. Since the limitation relates to IP addresses, the Kubernetes cluster's choice of a Container Network Interface (CNI) implementation can sometimes influence the limitation.
This is exploratory documentation on the various mitigation strategies and the state of these issues across the cloud providers. Perhaps it will find its way into the Z2JH guide or another place later, once we have a better overview of the situation.
GKE has a default limit of 110 pods per node, which can be configured lower (but not higher) depending on network IP range allocations.
While 110 pods per node is a hard upper limit, you can configure a lower number of pods per node to conserve IP address space.
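As a sketch, lowering the pods-per-node value could be done at cluster creation time on GKE; the cluster name and the value 64 below are illustrative placeholders:

```shell
# Create a VPC-native GKE cluster where each node gets IP space
# sized for at most 64 pods instead of the default 110.
gcloud container clusters create my-cluster \
  --enable-ip-alias \
  --default-max-pods-per-node 64
```

A lower value makes GKE allocate a smaller IP range per node, so the same cluster-wide pod CIDR can support more nodes.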
On EKS, the nodes in the cluster will have different pod limits depending on their machine type.
_Comment from @yuvipanda on Gitter:_
[...] but by default, EKS varies wildly based on what kind of node you are using. It's defined in here. This is a problem - if we use an m4.large node, I think it has enough RAM & CPU to run many hubs. But, it can hold only 28 pods! Between kube-system and hub pods, that's not nearly enough. So if you have two hubs (staging + prod), you end up needing 2 nodes instead of 1!
_Comment from @yuvipanda on Gitter:_
If you use Calico, you'll be easily able to get 100 pods per node, and based on performance up to several hundred no problems. But if you use a 'native' VPC solution like EKS' default, it's going to be more severely limited.
_Comment from @yuvipanda on Gitter:_
On Azure, 'kubenet' (which isn't 'VPC native') gives you a lot more pods than the default (Azure CNI). You're gonna get slightly more network latency with kubenet, but totally worth it for all our use cases.
This is great, @consideRatio! I'll try to provide some more context here.
Understanding Kubernetes's networking model helps a lot with understanding the 'max pod per node' limitation.
So each time a pod is created, Kubernetes must:

1. Assign the pod its own IP address, unique within the cluster.
2. Make sure network traffic to that IP address actually gets routed to the pod on its node.
Kubernetes uses a standard - the Container Network Interface (CNI) - to accomplish these tasks. The cluster admin configures the Kubernetes cluster to use a particular CNI implementation - there are a million of them - based on user needs, the hardware in use, the network topology, or just their whims.
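For illustration, the CNI implementation is selected through a small config file on each node (conventionally under `/etc/cni/net.d/`); a minimal sketch for the standard `bridge` plugin might look like this, where the network name, bridge name, and subnet are made-up example values:

```json
{
  "cniVersion": "0.4.0",
  "name": "example-pod-network",
  "type": "bridge",
  "bridge": "cni0",
  "ipam": {
    "type": "host-local",
    "subnet": "10.22.0.0/16"
  }
}
```

The `type` field names the CNI plugin binary the kubelet invokes for each pod, and the `ipam` section controls where pod IPs come from - which is exactly where the per-node limits discussed here originate.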
Cloud providers offer their own networking APIs that can do dynamic, interesting things. One common API they offer is the ability to provision a new network interface and then attach it to a running node. These network interfaces can then be assigned multiple IP addresses. You can attach many such network interfaces to a given node - and any traffic in the network to an IP address of any of those interfaces is automatically routed to the node! On AWS these are Elastic Network Interfaces, on GCE they're just Network Interfaces, etc. This is pretty fast, and CNI implementations often just piggyback on this functionality.
On AWS, if you use the default [AWS VPC CNI plugin](https://docs.aws.amazon.com/eks/latest/userguide/pod-networking.html), roughly the following happens when a new pod is created:

1. A free secondary IP address is picked from one of the Elastic Network Interfaces attached to the node (if none is free, a new network interface is attached first).
2. That IP address is assigned to the pod, making it directly routable inside the VPC.
This is very fast, efficient, and means other services in your VPC can talk to your pods very easily.
But (possibly) creating a new network interface and IP for each pod has one very severe limitation - on EC2, only a limited number of network interfaces can be attached to any given node, and each can have only a limited number of IPs! This number depends on the size of the node. And since the number of pods depends on the number of IPs and network interfaces attachable to the node, using the default CNI plugin on EKS severely limits the maximum number of pods you can have on each node!
This file lists the maximum number of pods allowed for each node type. Let's take m5.large. It has 8G of RAM and 2 CPUs. If I give each user pod a RAM guarantee of 128MB, I should be able to fit 62 user pods in there. But an m5.large instance can only have 3 network interfaces, with 10 IPs per interface. So this limits us to an absolute maximum of 30 IPs per node. One IP is for the node itself, leaving 29 for pods - which is the max number of pods allowed on an m5.large.
This is wasteful - even though you could fit 62 user pods, Kubernetes won't schedule any more there after 29 pods, even though a lot of memory is still available. Bad for cost. You could start using m5.xlarge instances, but they too allow only 58 pods - far fewer than the number of user pods you can theoretically schedule on them.
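The arithmetic above can be sketched in code. EKS computes the published limits as `interfaces × (IPs per interface − 1) + 2`: one IP on each interface belongs to the interface itself, and the `+2` accounts for pods that run on the host network. The instance numbers below are AWS's ENI limits quoted in the comment:

```python
def eks_max_pods(num_interfaces: int, ips_per_interface: int) -> int:
    """IP-based pod limit for a given EC2 instance type, as EKS computes it."""
    # One IP per interface is reserved for the interface itself;
    # +2 covers pods that use the host's network directly.
    return num_interfaces * (ips_per_interface - 1) + 2

def memory_fit(node_ram_mb: int, pod_guarantee_mb: int) -> int:
    """How many pods fit purely by memory guarantee (ignoring system overhead)."""
    return node_ram_mb // pod_guarantee_mb

# m5.large: 3 interfaces x 10 IPs each, 8 GiB RAM, 128 MiB per user pod
ip_limit = eks_max_pods(3, 10)         # 29, matching the published limit
ram_limit = memory_fit(8 * 1024, 128)  # 64 before system overhead
print(min(ip_limit, ram_limit))        # the IP limit wins: 29

# m5.xlarge: 4 interfaces x 15 IPs each
print(eks_max_pods(4, 15))             # 58
```

The gap between `ram_limit` and `ip_limit` is exactly the waste described above: the node's memory could host twice as many user pods as its IP allocation permits.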
So, if you're running a teaching hub on EKS and want to pack user pods in as densely as possible, you must use a custom CNI plugin. The cost savings are worth it! EKS doesn't let you actively choose one on creation, which sucks - but hopefully they'll fix it sometime!
Azure makes it slightly easier - you can choose which CNI plugin you want at cluster creation! Their Azure CNI plugin has the same problems mentioned here by default, although theoretically you can change that number (I haven't tried). The kubenet plugin gives you a maximum of 110 pods per node, which will be familiar to GKE users.
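As a sketch, choosing kubenet at cluster creation might look like this; the resource group and cluster names are illustrative placeholders:

```shell
# Create an AKS cluster using the kubenet network plugin instead of
# the default Azure CNI, trading a little network latency for a
# higher pods-per-node ceiling.
az aks create \
  --resource-group my-rg \
  --name my-cluster \
  --network-plugin kubenet
```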
There are probably many wrong bits here, but hope this was useful :)