Zero-to-jupyterhub-k8s: PR Discussion - Optimizing auto scaling through self destruction

Created on 18 Sep 2019 · 15 Comments · Source: jupyterhub/zero-to-jupyterhub-k8s

@yuvipanda suggested some optimizations relating to the user-placeholder pods. I like these ideas, so I'll elaborate on them here. To understand the optimizations, we first need to understand the issues we currently have, and we can do that with a thought experiment!

Thought experiment

  1. An autoscaling k8s cluster has nodes that can fit 10 users and there are 5 user placeholder pods. Currently the first and only node has 7 users and 3 placeholders, and 2 placeholders are pending.
  2. A second node is added as there were pending pods, and is now starting up slowly.
  3. The first new user arrives, and a placeholder is pre-empted on the only available node to make room for the new user. If the user had been able to schedule without pre-empting a placeholder pod, the pre-emption would not have happened.
  4. The second node becomes ready for pod scheduling, and is starting to pull images thanks to the continuous pre-puller daemonset pod that quickly scheduled there. Currently 8 users and 2 placeholders are on the first node, and the second node has 3 placeholders on it.
  5. The second new user arrives, and is scheduled on the node with the most resources used that still has room, which is the second node. This node has not yet pulled all the images though, so the user needs to wait. The second node now has 1 user and 3 placeholders on it.

    Optimization opportunity (OO#1): Force scheduling of user pods on nodes with image locality.

  6. The image pre-pulling on the second node by the continuous pre-puller daemonset pod completes.

  7. The third new user arrives, and is scheduled on the second node that now has 2 users and 3 placeholders on it.

    Optimization Opportunity (OO#2): Force scheduling on the truly busiest node, where user-placeholder pods shouldn't count. Auto-scaling down a node will then become easier.

Analysis

  1. The user-placeholder pods help trigger auto scaling of nodes ahead of time by ending up in a pending state after being pre-empted by user pods filling up the available nodes. Well done user-placeholder pods, and thanks k8s for the pod priority mechanism! We made scale-up happen ahead of time!
  2. When the new node has become available for scheduling, user pods will no longer pre-empt user-placeholder pods in order to schedule, as they can simply schedule on the new node; the kube-scheduler will not evict unless it is required to in order to schedule a pod. The problem is that this node still isn't ready for the user pods, as they want their images already pulled on it.

Solution ideas

Required node labels

We could indirectly maintain pre-emption of placeholder pods until image locality is available, by ourselves making sure to set node labels that the user pods require.

The reason we need this hack is that image locality cannot be enforced; it can only improve the scheduler's score for a node when ranking potential nodes to schedule on. The node would need to be filtered out of consideration entirely for the scheduler to consider pre-emption of lower-priority pods like the placeholder pods.

To hack around this, we could set a hard affinity for the user pod to nodes that have a custom label set on them when the images are available. We could then make the pre-puller daemonset pods maintain this label, for example by removing it in the first init-container and adding it back in the last init-container.
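As a sketch, the hard affinity the user pods would carry could look like this, expressed as a Python dict in the k8s API shape. The label name `hub.jupyter.org/images-are-pulled` is a hypothetical choice, and maintaining it would be the pre-puller's job:

```python
def images_pulled_node_affinity():
    """Sketch of a hard node affinity a user pod could carry so the
    scheduler filters out nodes that have not finished pre-pulling images.

    The label name "hub.jupyter.org/images-are-pulled" is hypothetical;
    the pre-puller pods would be responsible for adding/removing it.
    """
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {
                                "key": "hub.jupyter.org/images-are-pulled",
                                "operator": "In",
                                "values": ["true"],
                            }
                        ]
                    }
                ]
            }
        }
    }
```

Because this is a required (hard) affinity, nodes without the label are filtered out entirely, which is what makes pre-emption of placeholders on the remaining nodes possible.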

Solution discussion

This solution would achieve the image locality optimization, but fail at the busy-node optimization.

Timed rescheduling of user-placeholder pods

What if we systematically rescheduled the user-placeholder pods to the least busy user node somehow? Such rescheduling could be done externally to the placeholder pods, or by the placeholder pods themselves.

user-placeholder pod external rescheduling logic

  • They could for example be externally descheduled by a pod running the kubernetes-sigs/descheduler, if it is later improved to support a descheduling strategy that makes sense for us.

For example, if there was a cronjob strategy we could deschedule all the user-placeholder pods at the same time. Another example would be a strategy that descheduled pods after a minute of lifetime.

user-placeholder pod internal rescheduling logic

They could also be internally descheduled, not by crashing though, as that only restarts the container, but by speaking with the Kubernetes API server and deleting themselves after a configurable amount of time. For this to work, we would attach a ServiceAccount, bound through an RBAC RoleBinding to a Role that grants permission to delete pods in the namespace. We would use the k8s Downward API to expose the name of the pod to the logic in the container, using the k8s Go client, so it can ask for self-destruction.
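A sketch of that self-destruction logic, in Python rather than Go for brevity. `POD_NAME`, `POD_NAMESPACE`, and `SELF_DESTRUCT_SECONDS` are assumed environment variables (the first two exposed via the Downward API), and `delete_pod` stands in for the real API client call authorized by the Role described above:

```python
import os
import time


def should_self_destruct(started_at: float, now: float, lifetime_seconds: float) -> bool:
    """Return True once the pod has outlived its configured lifetime."""
    return now - started_at >= lifetime_seconds


def main(delete_pod):
    # POD_NAME and POD_NAMESPACE are assumed to be exposed via the
    # k8s Downward API; SELF_DESTRUCT_SECONDS is our configurable lifetime.
    name = os.environ["POD_NAME"]
    namespace = os.environ["POD_NAMESPACE"]
    lifetime = float(os.environ.get("SELF_DESTRUCT_SECONDS", "60"))

    started_at = time.time()
    while not should_self_destruct(started_at, time.time(), lifetime):
        time.sleep(1)

    # delete_pod would wrap something like
    # CoreV1Api().delete_namespaced_pod(name, namespace),
    # authorized through the Role/RoleBinding described above.
    delete_pod(name, namespace)
```

On deletion, the StatefulSet controller recreates the placeholder pod, which then goes through scheduling again, which is the rescheduling effect we are after.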

Affinities to improve further

We can improve this strategy of rescheduling user-placeholder pods further by setting a soft anti-affinity of the user pods for the user-placeholder pods, and vice versa. This way we would avoid the situation where a user pod ends up on a node with only placeholders because that node was considered busier, even though there was a less busy node with only real users.

It is important to note that these affinities are either met or unmet: having five user-placeholder pods on a node would be as bad as having one for a soft pod anti-affinity. This means that if we, for example, rescheduled pods one at a time, we could end up in a situation where these affinities fail to be useful.
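A sketch of what such a soft anti-affinity term could look like, again as a Python dict in the k8s API shape. The `component` label values match the ones this chart already uses, but the helper itself is hypothetical:

```python
def soft_anti_affinity(disliked_component: str, weight: int = 100):
    """Sketch of a soft (preferred) pod anti-affinity term, parameterized
    on the component label to dislike, so the same helper can express both
    "user pods dislike placeholders" and "placeholders dislike user pods".
    """
    return {
        "podAntiAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": weight,
                    "podAffinityTerm": {
                        "labelSelector": {
                            "matchExpressions": [
                                {
                                    "key": "component",
                                    "operator": "In",
                                    "values": [disliked_component],
                                }
                            ]
                        },
                        # Spread per node, as nodes are our topology of interest.
                        "topologyKey": "kubernetes.io/hostname",
                    },
                }
            ]
        }
    }
```

A user pod would get `soft_anti_affinity("user-placeholder")` and a placeholder pod `soft_anti_affinity("singleuser-server")`, giving the mutual dislike described above.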

Improve further again - let placeholder pods use the default scheduler

Currently, we are scheduling the user-placeholder pods with the user-scheduler, which packs them together. We don't want this, so we should instead use the default scheduler, which will try to distribute workloads evenly across the nodes through various scoring mechanisms.

This change should be made no matter what, I think.

https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/d25bcd9e38aca2734f90e51c2bc9a0bf9d90ca3e/jupyterhub/templates/scheduling/user-placeholder/statefulset.yaml#L33-L35

Solution discussion

This would be both an image locality optimization and an optimization to schedule on busier nodes, increasing the likelihood of being able to scale down a node. It would also be very plausible to implement.

Conclusions

The internal, configurable, timed pod deletion of user-placeholder pods seems like the best idea to me. I also like the idea of making a generic self-destruct binary. If no such Go binary, along with a Docker image, exists already, I'd like to make one as a standalone micro open source project!

Retrospective - What is the desired dynamics and the desired outcomes?

We want to figure out the simplest rule set that reaches as far as possible towards the _desired dynamics_, which we understand to lead to the _desired outcomes_. To figure this out, we first need to clarify the desired outcomes.

Desired outcomes

  • We minimize the wait required for users to get started, given a fixed number of user-placeholder pods; this wait may involve waiting for image pulling.
  • We minimize the average time required for the cluster-autoscaler to remove a node in an underutilized cluster.
  • We retain good performance for non-autoscaling clusters.

Ideal rule set

I think this single rule would lead to the desired outcomes:

  • Schedule real user pods on the node with the most resource utilization, but ignore the resources utilized by user-placeholder pods, and pre-empt them if a user pod later needs to fit there.

Currently, this isn't the case, because the user-scheduler will consider the resource requests of user-placeholder pods, and also won't consider pre-empting a user-placeholder pod when it is possible to schedule without doing so.
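To make the rule concrete, here is a toy sketch of it as a node-picking function. The data shapes are invented for illustration, and all pods are assumed to request equal resources:

```python
def pick_node_for_user_pod(nodes):
    """Pick the node with the highest real (non-placeholder) utilization
    that still has room for one more user, counting placeholder slots as
    pre-emptible free space.

    `nodes` is a list of dicts like:
      {"name": ..., "capacity": 10, "users": 7, "placeholders": 3}
    where all pods are assumed to have equal resource requests.
    """
    candidates = [
        n for n in nodes
        # Placeholders don't block scheduling: they can be pre-empted,
        # so only real users count against capacity.
        if n["users"] < n["capacity"]
    ]
    if not candidates:
        # No node fits even after pre-emption; a pending pod would
        # trigger the cluster autoscaler instead.
        return None
    # Most truly busy node first: ignore placeholder utilization.
    return max(candidates, key=lambda n: n["users"])["name"]
```

Note how a node packed with placeholders scores no higher than an empty one, which is exactly the behavior the user-scheduler currently lacks.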

Desired decisions during various relevant cluster states

Assume we can move around user placeholder pods, but not real user pods once they are scheduled. How would we move around the placeholder pods, and how would we schedule the real user pods?

  1. The cluster has a low average resource utilization and could scale down if pods moved around properly.
  2. The cluster could not fit all pods on one node fewer, and one node has pre-pullers that haven't finished pulling yet.

Work to be done

  • [ ] Verify that the cluster autoscaler can scale down a node due to low resource utilization even though some unimportant pods, like the user placeholders, have recently moved around on it.
  • [ ] Search for existing self-destruct (kubectl delete $THIS_POD) software.
  • [ ] Make a self-destruct go binary and expose it in a public Docker image. I'd like to do this task myself!
  • [ ] Stop using the user-scheduler to schedule the user-placeholder pods.
  • Add anti-affinities in two ways:

    • [ ] User pods should dislike placeholder pods.

    • [ ] Placeholder pods should dislike user pods.

    • NOTE: This may end up being technically hard. When I originally wrote this logic I did not know about fromYaml or fromJson, which allow us to parse a template's output back into an object again. If I recall correctly, related code is written once in Helm templates within _scheduling-helpers.tpl, which is used for the placeholder pods, and then again within jupyterhub_config.py for the real users. We can sustain the DRY principle if we really want to.

  • [ ] Use and configure a self-destruct binary, using such a Docker image instead of the pause container we currently use for user-placeholder pods.


All 15 comments

I've missed these amazingly thought-out and researched issues while I've been out, @consideRatio <3

The self-destruct needs to happen in only the following case:

  1. There is a new, empty node
  2. We are not on it

We also need to try to make sure our self-destruct would actually schedule us onto the new node, which might not always happen. Otherwise we'll end up with pod churn, which can cause problems. I deleted 60 pods a second for an hour and discovered that nodes just do not like that, and fail!

This makes me think we need the destruction to happen via a program with a global view of all pods and nodes rather than one with only a local view of itself. I wrote up a jq + bash script that does that, although it was far too aggressive. My hope is that we can use this descheduler strategy: https://github.com/kubernetes-sigs/descheduler#removepodsviolatinginterpodantiaffinity. It won't remove user pods since they are standalone, but it should remove the statefulset pods. I'll check if it only cares about hard or soft affinities.

Based on https://github.com/kubernetes-sigs/descheduler/blob/master/pkg/descheduler/strategies/pod_antiaffinity.go#L100, the descheduler doesn't care about soft anti-affinity, so it isn't useful for us.

Could the image pre-puller pod cordon the node or otherwise taint it, so that user pods and placeholder pods, can't schedule on it? Then when it exits/finishes pulling it uncordons the node which becomes available for user pods to schedule on.

Another thought I had: could user pods have a required during scheduling anti-affinity to image puller pods? Would that prevent them from scheduling on a node with an active puller or also from scheduling on a node with a completed puller?

@yuvipanda
I didn't understand the "we are not on it" idea. Do you mean these points to be a rule set to indicate that we should trigger rescheduling of user-placeholder pods? I think a suitable trigger rule could be:

  1. There is a new node
  2. The node has become possible to schedule on for our pre-puller daemonset pods (they have the same set of hard affinities as the user pods and user-placeholder pods).

I don't want to tunnel vision about having this logic in place yet though, I want to make sure I'm happy about what kind of distribution of real user pods and user placeholder pods we would like to see in various scenarios, and only then how to foster this.

Hmmm... If we do want a trigger to reschedule user-placeholder pods, perhaps it could be sent out before creating a real user pod, or similar? Note that higher-priority pods will be scheduled before lower-priority pods if both are considered in the same cycle. We could also attempt to monitor pending pods, but the thing is that we need to react before the scheduling starts, and scheduling is also triggered by pending pods ^^. Also note that we would need to reschedule all the user-placeholder pods, not only some: most, but not all, of them may be out of the way of where the user pod really wants to go, and we don't know which specific pod will be blocking it...

@betatim

Could the image pre-puller pod cordon the node or otherwise taint it, so that user pods and placeholder pods, can't schedule on it? Then when it exits/finishes pulling it uncordons the node which becomes available for user pods to schedule on.

This is very much like how GPU nodes are handled: they are not made available until they have got their GPU devices attached and drivers installed. This requires a daemonset and communication with the API server using cluster-wide privileges. I think it is too crude an approach.

Another thought I had: could user pods have a required during scheduling anti-affinity to image puller pods? Would that prevent them from scheduling on a node with an active puller or also from scheduling on a node with a completed puller?

The puller pods all come from a daemonset, so they will be around at all times, not only during pulling. The pulling is done by init containers that never really start up; they just enter their main container, which is the pause container that simply puts the container to sleep. The closest practical idea like this in my mind is what I described under the "Required node labels" header in the original post. It only achieves one of the two relevant optimizations though, the image locality one, and doesn't contribute to reducing the scale-down wait times.

This is very much like how GPU nodes are handled: they are not made available until they have got their GPU devices attached and drivers installed. This requires a daemonset and communication with the API server using cluster-wide privileges. I think it is too crude an approach.

I think this is actually very relevant. Let's take the idea of 'Ready' from a Kubernetes node. For our purposes, there are three kinds of readiness:

  1. Ready to receive any pods. This is what Kubernetes currently counts as 'Ready'
  2. Ready to receive user placeholder pods. This is the same as (1)
  3. Ready to receive user pods. This is not something we capture yet - it should be true after the images have been pulled, and not before.

"pod ready++" (https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/0007-pod-ready%2B%2B.md) is a feature that acknowledges this need and makes it a first class feature for pods. Nothing like it exists for nodes yet though.

More discussion about exactly our needs in https://github.com/kubernetes/kubernetes/issues/75890

And a possible implementation coming in a future k8s version :) https://github.com/kubernetes/enhancements/pull/1003

@yuvipanda nice find!

It is quite a crude approach and still a lot of machinery. The crude parts, in my mind, are:

  • installing machinery that is more generic than we need
  • forbidding all pods except those we configure with tolerations to allow; and we cannot do this for all pods, as we are only in control of our own.

I'd like it more if the only thing done was to label the node in some way and add a hard affinity to that label for our real user pods. This could make everything work well even if multiple hubs run in the same cluster, etc. The implementation could be done within one or two init containers of the pre-puller pods, which run first and last, for example.

I agree! I'm currently running the following in a loop:

# Get all nodes
from kubernetes import client, config 
config.load_kube_config()

v1 = client.CoreV1Api()
namespace = 'datahub-prod'

attractor_label = 'hub.jupyter.org/attract-placeholders'

def label_newest_nodes():
    nodes = sorted(v1.list_node(label_selector='hub.jupyter.org/node-purpose=user').items, key=lambda n: n.metadata.creation_timestamp, reverse=True)

    labeling_event = False

    for i, node in enumerate(nodes):
        if i == 0:
            # First node, ensure it has our attractor label
            if attractor_label not in node.metadata.labels:
                # Our youngest node doesn't have this label!
                node.metadata.labels[attractor_label] = 'true'
                v1.patch_node(node.metadata.name, node)
                print(f'Adding label to {node.metadata.name}')
                labeling_event = True
        else:
            if attractor_label in node.metadata.labels:
                # Setting value to None removes the labels
                node.metadata.labels[attractor_label] = None
                v1.patch_node(node.metadata.name, node)
                print(f'Removing label from {node.metadata.name}')
                labeling_event = True

    if labeling_event:
        print('deleting pods')
        v1.delete_collection_namespaced_pod(namespace, label_selector='component=user-placeholder')

The statefulset for placeholder pods has the following:

      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - preference:
              matchExpressions:
              - key: hub.jupyter.org/attract-placeholders
                operator: In
                values:
                - "true"
            weight: 100
          - preference:
              matchExpressions:
              - key: hub.jupyter.org/node-purpose
                operator: In
                values:
                - user
            weight: 100
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: component
                  operator: In
                  values:
                  - singleuser-server
              topologyKey: kubernetes.io/hostname
            weight: 100

We set a label on the newest node, and then kill all the placeholder pods every time we change the label. I should make it so it only kills placeholder pods not on the newest node...

Let's see how this goes!

Dear pink friend

I write to you in order to process tough questions, confident you will look at them thoroughly.


Mental simulation of dynamics with required node labels

Assume we maintain a label on nodes that lack the latest images to be pulled. What would happen in general? Generally, I conclude that user-placeholder pods are only scheduled when there is room for both them and the user pods.

Event 1: We scale up because user-placeholder pods were evicted (perhaps even all of them) and real user pods went pending.

Event 2: The node becomes ready for pods, but not yet for user pods.

The pending user-placeholder pods will schedule there!

Event 3: A user pod leaves the already user ready node.

A pending real user pod will take its place if there are any, or a user-placeholder pod if there are no pending real user pods. Excellent!

Event 4: User pods drop away and we end up with low cluster utilization.

Undesired outcomes

Let U denote a user pod, P a placeholder pod, and + empty space on a node. Also assume the nodes only fit five spots.

  • User pods are scheduled on a node that hasn't yet pre-pulled the images
  • We have a PPP++ node and a U++++ node. The cluster autoscaler won't consider scaling down the first, as it is more than 50% allocated, even though it is allocated with pods it could relocate; and it won't consider scaling down the node with a user either, as it can never remove a user pod.

Could have been an issue, but wasn't...

  • The cluster autoscaler won't try to add a node if that wouldn't allow the pending pod to schedule on it, and it will consider required node affinities! So if we require user pods to have a label we set dynamically, we mess with the cluster-autoscaler's logic: it won't add a node for the sake of a user pod that is pending due to this. So, if we add this mechanism, there need to be user-placeholder pods triggering the autoscaling at all times. This could probably bottleneck the scale-up speed if users arrived in quick succession requiring more nodes than the pending user-placeholders could represent.

Ways to influence the dynamics

  • required affinity for a hub.jupyter.org/images-are-pulled: true node label
  • preferred affinity or anti-affinity, for real user pods or for user-placeholder pods
  • rescheduling of user-placeholder pods on certain triggers

When do we have certain issues btw?

  • Only during scale-up is the need for user pods to schedule on nodes with available images an issue.
  • Only during low resource utilization, where we in theory could remove a node, is the distribution of pods an issue. The distribution of pods will be influenced by past events, though...

Current thinking summarized

Making the real user pods require the dynamically managed images-are-pulled label is fine as long as there are user-placeholder pods that can still trigger scale-up; then we solve a lot of issues there in the only way I see as reasonable.

Then the issue that remains is that user pods may schedule on a "busy" node because it had lots of user-placeholder pods on it but no actual real users... I imagine the states below, which I think are plausible, and which would make various affinities less effective at avoiding the wrong scheduling behavior:

1: PPP++, UU+++ --- It is simply correct to schedule U on the right, but it is harder to tell where P should schedule, as the left node cannot scale down unless it is below 50% resource utilization; at the same time, this would only be an issue if we have more placeholder pods than 50% of a node.
2: PPPU+, UU+++ --- We should schedule U to the right again; in fact, in all the examples below, the right side has the most U's.
2: PPP++, PU+++
2: PPPU+, PUU++

Hmmm... So....

  • P should avoid filling up a node, as then a U cannot schedule there if it wants to schedule on the most resource-utilized node. Without relocating pods, this could always happen though, even with a single P.
  • P should avoid grouping up, to avoid giving the illusion of being the most resource-utilized node when a new U is scheduling...

@yuvipanda I think your script would struggle to resolve the situation alone. If you end up with a new node that is schedulable at all, the first pending pods to schedule would be the user pods, as they have higher priority. So, I figure the essence is that you cannot make the user-placeholder pods schedule first on the fresh node unless you also disallow the real user pods from scheduling there, using a label that you require as a hard affinity for the real users but not for the placeholders.

Oh hmm... Ah, but I guess the real users would choose not to schedule there in the first place; they would schedule on the most resource-utilized node, and THEN the user-placeholder pods would schedule on the attracting node...

Nevermind, complicated, may work!

I'm on a train of thought, but also on rails...

Four affinity rules to consider

  • U to U affinity: never bad unless we make additional gutsy assumptions, like it being better to schedule on PPPP+ than on U++++. We should probably ignore this assumption, which relates to the cluster autoscaler's 50% limit.
  • U to P anti-affinity: goes wrong only if Ps can block scale-down (PPP++, +++++), but let's ignore this 50% CA limit issue.
  • P to U anti-affinity: only bad if it tricks a future U into scheduling the wrong way due to the distribution it causes, or if the 50% CA limit is considered.
  • P to P affinity: could negate a P to U anti-affinity and make us schedule in a certain way here: UPPP+, +++++, which would force a later U to schedule on the empty node.

What are the essential affinities that should carry the most weight? How does the weighting work, by the way? Hmmm...

U to U is the greatest and always most important.
U anti P, and P anti U, are good for maintaining separation, but less important than the U to U affinity.
P to P... Hmmmm... Is the dislike of U most important? I think so, as grouped P's could block future U's from scheduling next to their own. So this can help slightly, but should be valued the least.

helpful tech

  • a cluster autoscaler that would still consider scaling down a node even though it has more than 50% utilization; but that is too invasive for this chart to influence
  • pre-emption even when it isn't strictly needed to schedule
  • affinity that scales linearly with the number of matching pods

I'm deploying https://github.com/berkeley-dsep-infra/datahub/pull/1050 now, will keep you posted on how it goes! It isn't a long term solution, just a fix for now.

Reference

I've now removed my workaround in https://github.com/berkeley-dsep-infra/datahub/pull/1657, since it was complex and difficult to use.
