When running JupyterHub on k8s, we ideally want the scheduler to pack pods together onto nodes as much as possible. This helps a lot with autoscaling, since nodes that end up empty can be scaled down.
We currently sort of enable this by setting schedulerStrategy: pack, which uses podAffinity to do its thing. But podAffinity is not weighted: if you have two nodes with 2 and 80 pods each, a new pod can get scheduled on either one. This limits its effectiveness a fair bit!
After digging around and talking to more people in the Kubernetes community, I believe a real solution is:
- Run a separate kube-scheduler in our cluster, with a --scheduler-name set to something custom (so this will only schedule our pods)
- Set schedulerName for all our pods

https://kubernetes.io/docs/tasks/administer-cluster/configure-multiple-schedulers/ has more general info on this approach.
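To make the second half of that concrete, here is a minimal sketch of what setting schedulerName on a pod amounts to. The scheduler name, pod name and image are made up for illustration; the only real field involved is spec.schedulerName, and the value would have to match whatever name the custom scheduler is started with.

```python
import json

# Hypothetical name, chosen for this sketch only.
CUSTOM_SCHEDULER_NAME = "jupyterhub-user-scheduler"


def with_scheduler(pod_manifest, scheduler_name=CUSTOM_SCHEDULER_NAME):
    """Return a copy of a pod manifest with spec.schedulerName set."""
    patched = json.loads(json.dumps(pod_manifest))  # cheap deep copy
    patched.setdefault("spec", {})["schedulerName"] = scheduler_name
    return patched


# Illustrative user pod; the name and image are placeholders.
user_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "jupyter-example-user"},
    "spec": {
        "containers": [{"name": "notebook", "image": "jupyter/base-notebook"}]
    },
}

patched_pod = with_scheduler(user_pod)
print(json.dumps(patched_pod, indent=2))
```

Pods without this field keep going to the default scheduler, which is why the custom one only ever sees our pods.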
The defaults for the current algorithms are in https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/algorithmprovider/defaults/defaults_test.go
/cc @consideRatio who has been looking into this.
also /cc @minrk and @betatim - this will also help mybinder.org a lot if we can make this happen!
A lot of thanks to @msau42 and @bsalamat from the Kubernetes Slack for helping me out and steering me away from a more complex setup involving Scheduler Extenders!
https://github.com/kubernetes/kubernetes/pull/59401/files has info on the current policy.json defaults, so we can just take 'em and modify.
@yuvipanda great investigation and I really appreciate that you keep me updated with your findings!
> We currently sort of enable this by setting schedulerStrategy: pack, which uses podAffinity to do its thing. But podAffinity is not weighted: if you have two nodes with 2 and 80 pods each, a new pod can get scheduled on either one. This limits its effectiveness a fair bit!
So if you have two nodes, one with only a few pods and the other with many more, the scheduling is still random? It will not add together the weights of each singleuser pod it finds, or something similar?

I believe you're correct - user pods have an affinity for nodes that have > 0 pods on them, but no concept of "I should compare how _many_ pods are on each node and go to the one with the most pods". Which is super annoying :-P
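A toy sketch of that difference, using the 2 vs 80 pods example from above. This is not how kube-scheduler is actually implemented; the scoring functions, node names and the capacity of 100 pods per node are all assumptions made up for illustration.

```python
def affinity_score(pod_count):
    """Binary affinity: a node either has a matching pod or it does not."""
    return 1 if pod_count > 0 else 0


def packing_score(pod_count, capacity):
    """Score grows with how full the node already is; full nodes are excluded."""
    if pod_count >= capacity:
        return -1
    return pod_count / capacity


def best_nodes(nodes, score):
    """Return the names of all nodes tied for the highest score."""
    scores = {name: score(count) for name, count in nodes.items()}
    top = max(scores.values())
    return sorted(name for name, s in scores.items() if s == top)


nodes = {"node-a": 2, "node-b": 80}  # pods currently on each node
capacity = 100  # hypothetical per-node pod capacity

affinity_choice = best_nodes(nodes, affinity_score)
packing_choice = best_nodes(nodes, lambda count: packing_score(count, capacity))

print(affinity_choice)  # ['node-a', 'node-b'] -> a tie, either node can win
print(packing_choice)   # ['node-b'] -> the fuller node wins
```

With the binary score both nodes tie, so the new pod can land on either; only a score that depends on how full the node is will consistently pick the busier one.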
@consideRatio The default does not, but if you tweak the policy.json file you can make it do so! https://github.com/kubernetes/kubernetes/tree/master/pkg/scheduler/algorithm/priorities has the list of priorities that can be tweaked.
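As a rough sketch of what such a tweak could look like, the snippet below builds a policy document that gives MostRequestedPriority a high weight. It assumes the legacy scheduler Policy format and priority names as they appear in the sources linked in this thread; the weights are invented for illustration, and a real file would also need the default predicates and remaining priorities copied over from those defaults.

```python
import json

# Weights are made up for this sketch; names should be checked against
# the priorities directory linked above before use.
policy = {
    "kind": "Policy",
    "apiVersion": "v1",
    "priorities": [
        # Favour nodes whose resources are already heavily requested (packing).
        {"name": "MostRequestedPriority", "weight": 10},
        {"name": "NodeAffinityPriority", "weight": 1},
    ],
}

policy_json = json.dumps(policy, indent=2)
print(policy_json)
```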
Fixed by #891
@consideRatio awesome! Did you remove the schedulerStrategy: pack option, since that's not actually useful?
@yuvipanda Yep, it no longer affects anything. It only remains in Schema.yaml, where it is documented as no longer being of use, with a note that user-scheduler is to be preferred.