Zero-to-jupyterhub-k8s: Make 'pack' schedulerStrategy work properly

Created on 26 Feb 2018  ·  9 Comments  ·  Source: jupyterhub/zero-to-jupyterhub-k8s

When running JupyterHub on k8s, we ideally want the scheduler to pack pods together onto as few nodes as possible. This helps a lot with autoscaling, since sparsely used nodes can then be scaled down.

We currently sort of enable this by setting schedulerStrategy: pack, which uses podAffinity to do its thing. But podAffinity is not weighted by pod count: if you have two nodes with 2 and 80 pods respectively, a new pod can get scheduled on either one. This limits its effectiveness a fair bit!

After digging around and talking to more people in the kubernetes community, I believe a real solution is:

  1. Run another copy of the kube-scheduler in our cluster, with a --scheduler-name set to something custom (so this will only schedule our pods)
  2. Put the name of our kube-scheduler in schedulerName for all our pods
  3. Use a custom policy.json file to configure kube-scheduler to use https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/algorithm/priorities/most_requested.go as a priority function. This should do what we need it to do!
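As a rough sketch of steps 2 and 3, the following Python script generates a candidate policy.json and the pod spec fragment that points pods at the second scheduler. It assumes the legacy scheduler Policy file format (`kind: Policy`, `apiVersion: v1`) and that the priority in most_requested.go is registered as `MostRequestedPriority`. The scheduler name `user-scheduler` and the short predicate list are illustrative assumptions, not the upstream defaults; a real policy should start from the defaults and change only the priorities.

```python
import json

# Hypothetical name for the second scheduler. It must match the value
# passed to the extra kube-scheduler via --scheduler-name.
SCHEDULER_NAME = "user-scheduler"


def build_policy():
    """Return a scheduler Policy dict that prefers the fullest nodes."""
    return {
        "kind": "Policy",
        "apiVersion": "v1",
        # Illustrative subset of predicates (hard filters), not the full
        # default set. Copy the real defaults before using this.
        "predicates": [
            {"name": "PodFitsResources"},
            {"name": "PodFitsHostPorts"},
            {"name": "MatchNodeSelector"},
            {"name": "PodToleratesNodeTaints"},
        ],
        # The only priority (soft preference): favour nodes whose
        # resources are already the most requested.
        "priorities": [
            {"name": "MostRequestedPriority", "weight": 1},
        ],
    }


def pod_spec_fragment():
    """Return the part of a pod manifest that selects the custom scheduler."""
    return {"spec": {"schedulerName": SCHEDULER_NAME}}


if __name__ == "__main__":
    print(json.dumps(build_policy(), indent=2))
    print(json.dumps(pod_spec_fragment(), indent=2))
```

The first printed document would be saved as policy.json for the second kube-scheduler; the second shows the one field that has to be added to each user pod.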

https://kubernetes.io/docs/tasks/administer-cluster/configure-multiple-schedulers/ has more general info on this approach.

The defaults of the current algorithms are in https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/algorithmprovider/defaults/defaults_test.go

architecture

All 9 comments

/cc @consideRatio who has been looking into this.

also /cc @minrk and @betatim - this will also help mybinder.org a lot if we can make this happen!

A lot of thanks to @msau42 and @bsalamat from the Kubernetes Slack for helping me out and steering me away from a more complex setup involving Scheduler Extenders!

https://github.com/kubernetes/kubernetes/pull/59401/files has info on the current policy.json defaults, so we can just take 'em and modify.

@yuvipanda great investigation and I really appreciate that you keep me updated with your findings!

> We currently sort of enable this by setting schedulerStrategy: pack. It uses podAffinity to do its thing. But podAffinity is not weighted - if you have two nodes with 2 and 80 pods each, a new pod can get scheduled on either one. This limits its effectiveness a fair bit!

So if one node has only a few pods while the other has many more, the scheduling is still random? It does not add together the weights of each singleuser pod it finds, or something similar?


I believe you're correct - user pods have an affinity for nodes that have > 0 pods on them, but no concept of "I should compare how _many_ pods are on each node and go to the one with the most pods". Which is super annoying :-P

@consideRatio the default does not, but if you tweak the policy.json file you can make it do so! https://github.com/kubernetes/kubernetes/tree/master/pkg/scheduler/algorithm/priorities has a list of the priorities that can be tweaked.
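To illustrate why a most-requested priority would pack pods, here is a small Python model of that scoring. It is a sketch based on my reading of most_requested.go at the time, not the scheduler's actual code: each resource is scored as requested * 10 / capacity with integer division, CPU and memory scores are averaged, and the incoming pod's own requests are counted. The node sizes and per-pod requests are made-up numbers.

```python
MAX_PRIORITY = 10  # assumed top score per node in the scheduler of that era


def most_requested_score(requested, capacity):
    """Score one resource: the fuller it is, the higher the score."""
    if capacity == 0 or requested > capacity:
        return 0
    return (requested * MAX_PRIORITY) // capacity


def node_score(cpu_req, cpu_cap, mem_req, mem_cap):
    """Average the CPU and memory scores for one node."""
    cpu = most_requested_score(cpu_req, cpu_cap)
    mem = most_requested_score(mem_req, mem_cap)
    return (cpu + mem) // 2


def pick_node(nodes, pod_cpu, pod_mem):
    """Score every node as if the new pod were placed on it.

    nodes maps a node name to (cpu_requested, cpu_capacity,
    mem_requested, mem_capacity). Returns (best_node, all_scores).
    """
    scores = {
        name: node_score(cpu + pod_cpu, cpu_cap, mem + pod_mem, mem_cap)
        for name, (cpu, cpu_cap, mem, mem_cap) in nodes.items()
    }
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Hypothetical pods requesting 100 millicores and 512 MiB each,
    # on nodes with 16000 millicores and 65536 MiB.
    nodes = {
        "quiet": (2 * 100, 16000, 2 * 512, 65536),    # 2 pods
        "busy": (80 * 100, 16000, 80 * 512, 65536),   # 80 pods
    }
    best, scores = pick_node(nodes, pod_cpu=100, pod_mem=512)
    print(best, scores)
```

Under these assumptions the node with 80 pods scores higher than the node with 2 pods, so the new pod lands on the busy node, which is the comparison that podAffinity alone does not make.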

Fixed by #891

@consideRatio awesome! Did you remove the schedulerStrategy: pack option, since that's not actually useful?

@yuvipanda Yep, it no longer affects anything. It only remains in Schema.yaml, where it is documented as no longer being of use, with user-scheduler as the preferred alternative.
