I've been struggling to get the pre-puller to work. This is probably down to RBAC issues that I could get to the bottom of if I tried. However, while looking through how the pre-puller works, I noticed that it runs a series of tasks using helm hooks.
I was wondering if anyone has any thoughts on how well this works on a Kubernetes cluster which can autoscale?
Imagine the scenario where I deploy JupyterHub onto a cluster with two nodes, the image gets pre-pulled and JupyterHub gets up and running. Then some users log on, which causes the cluster autoscaler to add two more nodes. Do those two new nodes get the image pre-pulled as well? I have a feeling the answer is no.
As a workaround for both this issue and my general pre-pulling issues, I've been using a DaemonSet which runs a pre-pull container as an initContainer, followed by the Google pause container (original article). This results in the image being pulled on all nodes in the cluster, even those added later. Using the initContainer and pause also avoids the container constantly re-pulling the image and restarting.

    apiVersion: extensions/v1beta1
    kind: DaemonSet
    metadata:
      name: prepull
    spec:
      selector:
        matchLabels:
          name: prepull
      template:
        metadata:
          labels:
            name: prepull
        spec:
          initContainers:
            - name: prepull
              image: docker
              command: ["docker", "pull", "{{ .Values.singleuser.image.name }}:{{ .Values.singleuser.image.tag }}"]
              volumeMounts:
                - name: docker
                  mountPath: /var/run/docker.sock
          volumes:
            - name: docker
              hostPath:
                path: /var/run/docker.sock
          containers:
            - name: pause
              image: gcr.io/google_containers/pause

I can see some downsides. For instance, a notebook may be created on a node before the pre-puller has finished, causing the image to be pulled twice. But overall this feels neater and more robust than the current method. I would be interested to hear from others, particularly @yuvipanda.
heya!
I totally agree this method is better. The only problem, as you said, is notebooks getting scheduled onto a node before the image has been pulled. But that seems better than the alternative :)
My hope is that we write a simple controller that can do the following:
This should ensure that notebooks don't get scheduled before the image is pulled. If this is implemented as a CRD, that'd also allow us to wait in helm install / upgrade for pulling to happen (which makes deploy checks easy - I want helm install to fail if the image can't be found, for example).
It would be awesome if you could help us with the CRD-based controller, but if not, using this as a DaemonSet is very appropriate. We might want to document this in z2jh too.
I totally agree about the helm install failing if the image can't be found.
Some more thoughts:
Another possibility could be to run an internal registry. Could this be set up by zero-to-jupyterhub? Presumably pulling internally should be a lot faster than pulling from an external registry, and this might be more useful on large shared Kubernetes clusters.
@manics I imagine that would improve the startup time of a node, but it doesn't necessarily solve the issues above.
Since not everyone will be familiar with CRDs, I'm posting this talk from KubeCon which describes them.
I'm doing a lot of autoscaler-friendly work for v0.6, and I think setting up the pre-puller to work like this is important!
Here are the requirements I want for 0.6:
I'm going to play with a few options and see where I land!
Thanks for starting this, @jacobtomlinson!
Another requirement is that the images used for the prepuller themselves must be small! No point bringing in a large image to pull a smaller (or only a bit larger) image.
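One way to meet that requirement might be to drop the helper image altogether and let the kubelet do the pulling: run the single-user image itself as the init container with a trivial command. The manifest below is only a sketch under stated assumptions, not necessarily what ends up being implemented. It assumes the single-user image contains `/bin/sh`, and it reuses the names and Helm values from the DaemonSet earlier in this thread.

```yaml
# Hypothetical variant of the prepull DaemonSet: no docker image, no socket mount.
# apiVersion is copied from the manifest earlier in the thread; clusters running
# Kubernetes 1.16 or later need apps/v1 for DaemonSets instead.
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: prepull
spec:
  selector:
    matchLabels:
      name: prepull
  template:
    metadata:
      labels:
        name: prepull
    spec:
      initContainers:
        # Starting this container makes the kubelet pull the image onto the node.
        # Assumption: the image ships /bin/sh so the no-op command can run.
        - name: prepull
          image: "{{ .Values.singleuser.image.name }}:{{ .Values.singleuser.image.tag }}"
          command: ["/bin/sh", "-c", "echo 'Pulling complete'"]
      containers:
        # Small long-running container that keeps the pod Running,
        # so the init step is not repeated in a restart loop.
        - name: pause
          image: gcr.io/google_containers/pause
```

With this shape the only extra image a node downloads is the pause container, and the pull no longer depends on the node exposing `/var/run/docker.sock`.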
@jacobtomlinson not entirely done yet, but take a look at https://github.com/jupyterhub/zero-to-jupyterhub-k8s/pull/399 :)
Closed via #399 and #418!