Zero-to-jupyterhub-k8s: insufficient pods and the interaction with the hook-image-puller daemonset

Created on 31 Aug 2020  路  7Comments  路  Source: jupyterhub/zero-to-jupyterhub-k8s

Background

On GKE for example, one is limited to 110 pods per node no matter what configuration option is used. Other k8s deployments will have different limits but still have limits.

We have some challenges with regards to this limitation and the system to automatically download new images ahead of time on certain nodes. We do it by creating a deamonset that creates one pod per node. The hook puller daemonset is created temporary with a Helm hook as part of chart upgrades to ensure the new images is available before the actual chart upgrade. The continuous puller daemonset is created as part of the chart to ensure newly added nodes get the images.

Goal outcomes

  1. To avoid the upgrade failures following having a hook daemonset trying to create a pod on a node capped by a pod limit.
  2. To avoid using unnecessary pods.

Solution brainstorming

I'm not sure, but we could consider:

  1. changing the pod priority which influences scheduling order of pending pods and could potentially get pods evicted if they have a negative priority if i recall correctly.
  2. making the hook puller not want to schedule on nodes that have full capacity somehow, i think this may be hard though.
  3. modifying the hook-image-awaiter that inspects the hook pre-puller daemonset to see if its pods are ready and avoid waiting for pods in a pending state that hasn't been scheduled to avoid this kind of reliability issue.

I think I like option 3 the most, its quite simple logic update in the image awaiter pod. It will need to introduce some delay if pods are pending though as they will always be pending the initial second they are scheduled or so I figure. This tackles the both goal 1 and 2 for the hook-image-puller daemonset, but would fail to address the second goal for the continuous image puller daemonset. Perhaps that daemonset should get a placeholder kind of priority as well?

Suggested changes

  1. We update the logic of the hook-image-awaiter go binary we have developed to not just wait for the number of pods created to all be ready, but to instead wait mindful that some pods may get stuck in pending and accept that.
  2. We update the continuous image puller daemonsets pod priority to be like the user placeholders if scheduling.podPriority is enabled, which allows pre-pulling to run only on nodes which isn't full of users or other pods.

Ping @yuvipanda as I know you are already mindful of these issues, what do you think about these suggested changes to address them?

bug enhancement

Most helpful comment

I think we already schedule one pod total, and use initContainers for the actual image pulls.

If we do indeed make this not fail if it can't be scheduled on some nodes, I'd like it to be a flag. IMO for hubs I maintain, when the cluster is in a state like this I do want the deploy to fail, so a human can look at it.

All 7 comments

Thanks for writing this up, @consideRatio.

If my understanding is correct, what's happening here is:

  1. Daemonset schedules pods on all nodes
  2. But on some nodes, pods can't actually run, even though the scheduler has assigned them that node. This could be because of InsufficientPods condition, but could also be other reasons (NodeNotReady, NetworkPlugin issues, etc)
  3. Our awaiter waits until everything is 'Running' before it returns. Instead, we should somehow never 'get stuck' due to cluster being in a bad state.

Does this seem like a reasonable understanding of the issue, @consideRatio?

@yuvipanda yes, that is exactly right, in a single sentence: to not have hook-image-awaiter get stuck waiting for something other than actual image pulling that make sense to wait for.

Aside from that, I also considered reducing the pods from the continuous image puller by lowering the podpriority to be the same like user-placeholder pods, which makes them evict if needed. I entangled this suggestion in the same issue as the more pressing one above.

Hmmmm... I wonder if helm upgrade --wait will run into issues with a daemonset pod failing to get ready as well btw, which makes for another reason to not be that happy about --wait.

Heh, I still like --wait since if it fails we can consider the deployment to have 'failed' :D

I'm guessing the reason this is a problem is that MaxPods being reached doesn't prevent a daemonset from being scheduled. I wonder if the broader kubernetes community has dealt with this - can't be just us.

+1 on fiddling with the continuous image puller.

I think option 3 sounds good. Together with some documentation that points out that the awaiter is a optional niceness which is configured to "never fail the deploy at the risk of not waiting long enough. For example when the puller pods can't be scheduled on all nodes."

Side question: do we schedule one pod per image or one pod with one container per image? Could we do the latter (if we aren't already) to reduce the number of pods that need to be scheduled for people who need to pull many images?

I think we already schedule one pod total, and use initContainers for the actual image pulls.

If we do indeed make this not fail if it can't be scheduled on some nodes, I'd like it to be a flag. IMO for hubs I maintain, when the cluster is in a state like this I do want the deploy to fail, so a human can look at it.

Regarding --wait flag

Hmmmm... I wonder if helm upgrade --wait will run into issues with a daemonset pod failing to get ready as well btw, which makes for another reason to not be that happy about --wait.

It seems that helm upgrade --wait will ignore daemonset's pods readiness etc, which is actually quite nice for us if we want to avoid upgrade failures.

      --wait                         if set, will wait until all Pods, PVCs, Services, and minimum number of Pods of a Deployment, StatefulSet, or ReplicaSet are in a ready state before marking the release as successful. It will wait for as long as --timeout

Due to this, I'll create a PR proposing the priority of continuous image puller DS to become like the user-placeholder-pods, this way they will not block user pods nor upgrades. Also, they are less useful if a node is full anyhow.

Regarding configuration flags

If we do indeed make this not fail if it can't be scheduled on some nodes, I'd like it to be a flag. IMO for hubs I maintain, when the cluster is in a state like this I do want the deploy to fail, so a human can look at it.

While the daemonset pods failure to schedule won't block anything no matter because helm upgrade --wait doesn't care, we can let the hook-image-awaiter logic be conditional based on a flag still. My PR proposal will change the default behavior though as I think it makes common sense and don't see any issues other than the quite extreme edge case issue described below.

A minuscule compromise - potential image download wait on full nodes

If we accept hook puller pods to not be scheduled in the hook-image-awaiter, and the continuous puller pods to be evicted, the following could happen.

  1. A full node doesn't have a puller pod on it.
  2. An helm upgrade changes the image to use
  3. A user leaves the full node, and a new user quickly took a place at the previously full node.
  4. The newly arrived user will need to wait for the image download.

I consider this a edge case not even worth documenting could happen at this point.

Resolved by #1762, #1763, and #1787. We now avoid the upgrade failures following having a hook daemonset trying to create a pod on a node capped by a pod limit, and we avoid letting unnecessary pods block real users from nodes.

Was this page helpful?
0 / 5 - 0 ratings