This issue is about how to optimize the scheduling of BinderHub-specific _Build Pods_ (BPs).
Out of scope of this issue is the discussion on how to schedule the user pods, which could be done with image locality (ImageLocalityPriority configuration) in mind.
We need to decide how we actually want the build pods to be scheduled. There is no obvious way to do it: it is typically hard to optimize for both performance and auto-scaling viability, for example.
I'll now provide a boilerplate idea to start out from on how to schedule the build pods.
We must utilize a non-default scheduler. We could use the default kube-scheduler binary and customize its behavior through configuration, or we could make our own. Making our own scheduler is certainly possible, but I think it would add too much complexity.
We utilize a custom scheduler, but like z2jh's scheduler we deploy an official kube-scheduler binary with a customized configuration that we can reference from the build pod's specification using the spec.schedulerName field.
[spec.schedulerName] If specified, the pod will be dispatched by specified scheduler. If not specified, the pod will be dispatched by default scheduler. --- Kubernetes PodSpec documentation.
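To make the spec.schedulerName idea concrete, here is a minimal sketch of a build pod manifest that opts in to a non-default scheduler. The scheduler name, container name, and helper function are assumptions made up for illustration. BinderHub does not define them.

```python
# Hypothetical scheduler name; it must match the name the custom
# kube-scheduler deployment is configured to answer to.
BUILD_SCHEDULER_NAME = "binderhub-build-scheduler"


def build_pod_manifest(name, image, scheduler_name=BUILD_SCHEDULER_NAME):
    """Return a minimal Pod manifest that opts in to a non-default scheduler."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Pods without this field are handled by the default scheduler.
            "schedulerName": scheduler_name,
            "restartPolicy": "Never",
            "containers": [{"name": "builder", "image": image}],
        },
    }
```

Only the build pods would carry this field, so user pods and everything else in the cluster would keep using whatever scheduler they use today.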
We customize the kube-scheduler binary's behavior through a provided config, just like in z2jh, but we try to make use of node annotations somehow. For example, the build pod could annotate the node it runs on with the repo it is building, so that the scheduler can later try to place builds of that repo on the same node.
We make the BinderHub builder pod annotate the node by communicating with the k8s API. To allow this, some RBAC setup is required, for example like the RBAC for z2jh's user-scheduler. It will need a ServiceAccount, a ClusterRole, and a ClusterRoleBinding, where the ClusterRole allows reading and writing annotations on nodes. For an example of a pod communicating with the k8s API, we can borrow relevant parts from z2jh's image-awaiter, which also communicates with the k8s API.
We make the BinderHub image-cleaner binary also clean up the associated node annotations when it cleans nodes, which would require RBAC permissions similar to those the build pods need to annotate nodes in the first place.
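A rough sketch of the annotation step, assuming the official Kubernetes Python client is used. The annotation key and the helper names are hypothetical. The only real API relied on is `CoreV1Api.patch_node(name, body)`, which the ServiceAccount would need "patch" permission on nodes to call.

```python
# Hypothetical annotation key; nothing in BinderHub defines it.
REPO_ANNOTATION = "binderhub.example.org/last-built-repo"


def repo_annotation_patch(repo_url):
    """Build a patch body that sets the repo annotation on a node."""
    return {"metadata": {"annotations": {REPO_ANNOTATION: repo_url}}}


def annotate_node(api, node_name, repo_url):
    """Record on the node which repo was last built there.

    `api` is expected to behave like kubernetes.client.CoreV1Api.
    """
    return api.patch_node(node_name, repo_annotation_patch(repo_url))
```

The repo URL goes in the annotation value rather than the key, since annotation keys are restricted in length and character set while values are free-form strings.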
https://kubernetes.io/docs/concepts/scheduling/kube-scheduler/#kube-scheduler-implementation
https://github.com/kubernetes/kubernetes/tree/master/pkg/scheduler
From this video, you should pick up the role of a _scheduler_, and that the default binary that can act as a scheduler can consider _predicates_ (aka. filtering) and _priorities_ (aka. scoring).
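As a toy illustration of the filter-then-score flow, and not of how kube-scheduler is actually implemented, the two phases could be sketched like this. All function names and the dict fields are made up for the example.

```python
def pick_node(pod, nodes, predicates, priorities):
    """Filter nodes with predicates, then return the best scoring one.

    predicates: callables (pod, node) -> bool; all must pass.
    priorities: (callable (pod, node) -> number, weight) pairs.
    """
    feasible = [n for n in nodes if all(ok(pod, n) for ok in predicates)]
    if not feasible:
        return None  # nothing fits, so the pod would stay pending

    def total_score(node):
        return sum(weight * score(pod, node) for score, weight in priorities)

    return max(feasible, key=total_score)


# Toy data model: nodes and pods are plain dicts with made-up fields.
def has_enough_cpu(pod, node):
    return node["free_cpu"] >= pod["cpu"]


def already_built_repo(pod, node):
    return 1 if pod["repo"] in node["built_repos"] else 0
```

With `has_enough_cpu` as the only predicate and `already_built_repo` as the only priority, a build pod would land on a node that has room and has built the repo before, if such a node exists.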
@betatim and I were speaking a lot about this, and these are my notes that can help us implement something.
--cpu-shares.
It was great discussing this and seeing how we started with a fairly complicated idea like "write a custom scheduler" and now have a much simpler solution!
There is another implementation of rendezvous hashing here.
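For readers who have not met rendezvous hashing before, here is a minimal sketch, assuming the goal is to map a repository to a preferred node. The function names and the choice of SHA-256 are illustrative. Any stable hash would do.

```python
import hashlib


def rendezvous_rank(key, node):
    """Score a (key, node) pair with a stable hash."""
    digest = hashlib.sha256(f"{node}|{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def pick_node_for_repo(repo, nodes):
    """Return the node with the highest score for this repo."""
    if not nodes:
        raise ValueError("no nodes to choose from")
    return max(nodes, key=lambda node: rendezvous_rank(repo, node))
```

The useful property is that when a node disappears, only the repos that were assigned to that node move elsewhere. Every other repo keeps its node, which is what makes build caches on the remaining nodes stay useful.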
To get the possible values of the kubernetes.io/hostname label, it seems we can describe each of the DIND pods in the daemonset. Their Node field contains a value that is (on our GKE and OVH clusters) the same as the one used in the label. This means that to compute the list of possible node names we can describe the daemonset to get the Selector (e.g. name=ovh-dind), select pods with that (kubectl get pods -l name=ovh-dind), then inspect the Node field of each pod we found. This is nice because we don't need to inspect the nodes themselves, so we don't need a cluster level role.
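The pod-based lookup described above could be sketched as follows, assuming the official Kubernetes Python client. `list_namespaced_pod` and the `spec.node_name` attribute are part of that client, while the helper name is made up and the namespace is whatever the DIND daemonset runs in.

```python
def dind_node_names(api, namespace, label_selector):
    """Return the sorted, de-duplicated node names hosting the DIND pods.

    `api` is expected to behave like kubernetes.client.CoreV1Api. Listing
    pods in one namespace only needs a namespaced Role, not a ClusterRole.
    """
    pods = api.list_namespaced_pod(namespace, label_selector=label_selector)
    # Pods that are not scheduled yet have no node name; skip them.
    return sorted({p.spec.node_name for p in pods.items if p.spec.node_name})
```

A call such as `dind_node_names(api, namespace, "name=ovh-dind")` would then give the candidate list to feed into the node selection logic.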
The following would be my suggestion for how to split this into several steps that can be tackled individually:
at_most_every decorator from health.py to throttle API calls
What do you think? And do you want to tackle one of these already?
Maybe we can start a new issue on "Resource requests for build and DIND pods" to discuss what the options are and how to configure things.
@betatim I think this may be quite fun to implement, but I have a long list of things to work on already so I figure I'll leave this to you :)
I'd be very happy to review whenever work is done and continue the discussion on implementation aspects.
I've started work on 1 (and a little bit of 2). PR coming soon.
I think #949 and follow ups implemented this so I'll close this. Maybe we can make new issues with some of the possible improvements/ideas we had beyond what is implemented.