On GKE for example, one is limited to 110 pods per node no matter what configuration option is used. Other k8s deployments will have different limits but still have limits.
We have some challenges with regards to this limitation and the system to automatically download new images ahead of time on certain nodes. We do it by creating a deamonset that creates one pod per node. The hook puller daemonset is created temporary with a Helm hook as part of chart upgrades to ensure the new images is available before the actual chart upgrade. The continuous puller daemonset is created as part of the chart to ensure newly added nodes get the images.
I'm not sure, but we could consider:
I think I like option 3 the most, its quite simple logic update in the image awaiter pod. It will need to introduce some delay if pods are pending though as they will always be pending the initial second they are scheduled or so I figure. This tackles the both goal 1 and 2 for the hook-image-puller daemonset, but would fail to address the second goal for the continuous image puller daemonset. Perhaps that daemonset should get a placeholder kind of priority as well?
Ping @yuvipanda as I know you are already mindful of these issues, what do you think about these suggested changes to address them?
Thanks for writing this up, @consideRatio.
If my understanding is correct, what's happening here is:
Does this seem like a reasonable understanding of the issue, @consideRatio?
@yuvipanda yes, that is exactly right, in a single sentence: to not have hook-image-awaiter get stuck waiting for something other than actual image pulling that make sense to wait for.
Aside from that, I also considered reducing the pods from the continuous image puller by lowering the podpriority to be the same like user-placeholder pods, which makes them evict if needed. I entangled this suggestion in the same issue as the more pressing one above.
Hmmmm... I wonder if helm upgrade --wait will run into issues with a daemonset pod failing to get ready as well btw, which makes for another reason to not be that happy about --wait.
Heh, I still like --wait since if it fails we can consider the deployment to have 'failed' :D
I'm guessing the reason this is a problem is that MaxPods being reached doesn't prevent a daemonset from being scheduled. I wonder if the broader kubernetes community has dealt with this - can't be just us.
+1 on fiddling with the continuous image puller.
I think option 3 sounds good. Together with some documentation that points out that the awaiter is a optional niceness which is configured to "never fail the deploy at the risk of not waiting long enough. For example when the puller pods can't be scheduled on all nodes."
Side question: do we schedule one pod per image or one pod with one container per image? Could we do the latter (if we aren't already) to reduce the number of pods that need to be scheduled for people who need to pull many images?
I think we already schedule one pod total, and use initContainers for the actual image pulls.
If we do indeed make this not fail if it can't be scheduled on some nodes, I'd like it to be a flag. IMO for hubs I maintain, when the cluster is in a state like this I do want the deploy to fail, so a human can look at it.
Hmmmm... I wonder if helm upgrade --wait will run into issues with a daemonset pod failing to get ready as well btw, which makes for another reason to not be that happy about --wait.
It seems that helm upgrade --wait will ignore daemonset's pods readiness etc, which is actually quite nice for us if we want to avoid upgrade failures.
--wait if set, will wait until all Pods, PVCs, Services, and minimum number of Pods of a Deployment, StatefulSet, or ReplicaSet are in a ready state before marking the release as successful. It will wait for as long as --timeout
Due to this, I'll create a PR proposing the priority of continuous image puller DS to become like the user-placeholder-pods, this way they will not block user pods nor upgrades. Also, they are less useful if a node is full anyhow.
If we do indeed make this not fail if it can't be scheduled on some nodes, I'd like it to be a flag. IMO for hubs I maintain, when the cluster is in a state like this I do want the deploy to fail, so a human can look at it.
While the daemonset pods failure to schedule won't block anything no matter because helm upgrade --wait doesn't care, we can let the hook-image-awaiter logic be conditional based on a flag still. My PR proposal will change the default behavior though as I think it makes common sense and don't see any issues other than the quite extreme edge case issue described below.
If we accept hook puller pods to not be scheduled in the hook-image-awaiter, and the continuous puller pods to be evicted, the following could happen.
I consider this a edge case not even worth documenting could happen at this point.
Resolved by #1762, #1763, and #1787. We now avoid the upgrade failures following having a hook daemonset trying to create a pod on a node capped by a pod limit, and we avoid letting unnecessary pods block real users from nodes.
Most helpful comment
I think we already schedule one pod total, and use initContainers for the actual image pulls.
If we do indeed make this not fail if it can't be scheduled on some nodes, I'd like it to be a flag. IMO for hubs I maintain, when the cluster is in a state like this I do want the deploy to fail, so a human can look at it.