# Zero-to-jupyterhub-k8s: Killing container with id docker://notebook:Need to kill Pod

Created on 8 Sep 2018 · 6 comments · Source: jupyterhub/zero-to-jupyterhub-k8s

Hi,

I've been using a modified version of `jupyter/datascience-notebook` without any problems for a while. Unfortunately, my hub is no longer able to spawn any pods ever since I rebuilt my singleuser image with a few slight modifications. For some reason unknown to me, kubelet kills the pod at the end of the container creation process:

```
Killing container with id docker://notebook:Need to kill Pod
```
The image built from the same Dockerfile works perfectly if I use `jupyter/minimal-notebook` as the base image. I have also tried an earlier version of `jupyter/datascience-notebook` (`jupyter/datascience-notebook:1085ca054a5f`), but with no success.

If it is of any help, below you can find my configuration, the relevant logs, and my compute environment.

## Contents of `config.yaml`

```yaml
proxy:
  secretToken: $TOKEN

hub:
  image:
    name: gcr.io/$PROJECT/jupyter-hub
    tag: latest
  extraConfig: |
    c.JupyterHub.cleanup_servers = False
    c.JupyterHub.spawner_class = 'kubespawner.KubeSpawner'
    c.KubeSpawner.start_timeout = 900

singleuser:
  uid: 0
  fsGid: 0
  cmd: "startup.sh"
  image:
    name: gcr.io/$PROJECT/jupyter-singleuser
    tag: latest
  storage:
    type: "static"
    homeMountPath: "/mnt"
    static:
      pvcName: "nfs-dev"
      subPath: "jupyter"

auth:
  type: custom
  custom:
    className: $AUTHENTICATOR
    config:
      secret_key: $KEY
```

## Cluster events

```
Events:
  Type    Reason                 Age  From                Message
  ----    ------                 ---  ----                -------
  Normal  Scheduled              6m   default-scheduler   Successfully assigned jupyter-zafer to $GKE_NODE
  Normal  SuccessfulMountVolume  6m   kubelet, $GKE_NODE  MountVolume.SetUp succeeded for volume "nfs-dev"
  Normal  Pulled                 6m   kubelet, $GKE_NODE  Container image "jupyterhub/k8s-network-tools:0.7.0" already present on machine
  Normal  Created                6m   kubelet, $GKE_NODE  Created container
  Normal  Started                6m   kubelet, $GKE_NODE  Started container
  Normal  Pulling                6m   kubelet, $GKE_NODE  pulling image "gcr.io/$PROJECT/jupyter-singleuser:latest"
  Normal  Pulled                 37s  kubelet, $GKE_NODE  Successfully pulled image "gcr.io/$PROJECT/jupyter-singleuser:latest"
  Normal  Created                35s  kubelet, $GKE_NODE  Created container
  Normal  Started                34s  kubelet, $GKE_NODE  Started container
  Normal  Killing                1s   kubelet, $GKE_NODE  Killing container with id docker://notebook:Need to kill Pod
```

## Hub logs

```
[I 2018-09-07 22:01:45.714 JupyterHub log:158] 302 GET /hub/login -> /user/zafer/lab/tree/ (@10.128.0.5) 29.34ms
[I 2018-09-07 22:01:45.901 JupyterHub log:158] 302 GET /user/zafer/lab/tree/ -> /hub/user/zafer/lab/tree/ (@10.128.0.5) 1.04ms
[I 2018-09-07 22:01:46.206 JupyterHub reflector:199] watching for pods with label selector='component=singleuser-server' in namespace dev
[I 2018-09-07 22:01:46.218 JupyterHub reflector:199] watching for events with field selector='involvedObject.kind=Pod' in namespace dev
[W 2018-09-07 22:01:56.220 JupyterHub base:714] User zafer is slow to start (timeout=10)
[I 2018-09-07 22:01:56.220 JupyterHub base:1056] zafer is pending spawn
[I 2018-09-07 22:01:56.229 JupyterHub log:158] 200 GET /hub/user/zafer/lab/tree/ ([email protected]) 10144.58ms
[I 2018-09-07 22:01:59.065 JupyterHub proxy:301] Checking routes
[I 2018-09-07 22:02:59.064 JupyterHub proxy:301] Checking routes
[I 2018-09-07 22:03:59.064 JupyterHub proxy:301] Checking routes
[I 2018-09-07 22:04:59.064 JupyterHub proxy:301] Checking routes
[I 2018-09-07 22:05:59.063 JupyterHub proxy:301] Checking routes
[I 2018-09-07 22:06:59.065 JupyterHub proxy:301] Checking routes
[W 2018-09-07 22:07:21.142 JupyterHub users:439] Stream closed while handling /hub/api/users/zafer/server/progress
[I 2018-09-07 22:07:21.142 JupyterHub log:158] 200 GET /hub/api/users/zafer/server/progress ([email protected]) 324458.57ms
[W 2018-09-07 22:07:21.522 JupyterHub users:439] Stream closed while handling /hub/api/users/zafer/server/progress
[I 2018-09-07 22:07:21.522 JupyterHub log:158] 200 GET /hub/api/users/zafer/server/progress ([email protected]) 201329.22ms
[W 2018-09-07 22:07:55.819 JupyterHub user:504] zafer's server never showed up at http://10.8.0.11:8888/user/zafer/ after 30 seconds. Giving up
[I 2018-09-07 22:07:55.820 JupyterHub spawner:1770] Deleting pod jupyter-zafer
[I 2018-09-07 22:07:59.065 JupyterHub proxy:301] Checking routes
[E 2018-09-07 22:08:08.807 JupyterHub gen:974] Exception in Future .finish_user_spawn() done, defined at /home/jovyan/.local/lib/python3.6/site-packages/jupyterhub/handlers/base.py:619> exception=TimeoutError("Server at http://10.8.0.11:8888/user/zafer/ didn't respond in 30 seconds",)> after timeout
Traceback (most recent call last):
  File "/home/jovyan/.local/lib/python3.6/site-packages/tornado/gen.py", line 970, in error_callback
    future.result()
  File "/home/jovyan/.local/lib/python3.6/site-packages/jupyterhub/handlers/base.py", line 626, in finish_user_spawn
    await spawn_future
  File "/home/jovyan/.local/lib/python3.6/site-packages/jupyterhub/user.py", line 522, in spawn
    raise e
  File "/home/jovyan/.local/lib/python3.6/site-packages/jupyterhub/user.py", line 496, in spawn
    resp = await server.wait_up(http=True, timeout=spawner.http_timeout)
  File "/home/jovyan/.local/lib/python3.6/site-packages/jupyterhub/utils.py", line 197, in wait_for_http_server
    timeout=timeout
  File "/home/jovyan/.local/lib/python3.6/site-packages/jupyterhub/utils.py", line 155, in exponential_backoff
    raise TimeoutError(fail_message)
TimeoutError: Server at http://10.8.0.11:8888/user/zafer/ didn't respond in 30 seconds
```

## Environment

```
❯ helm version
Client: &version.Version{SemVer:"v2.10.0", GitCommit:"9ad53aac42165a5fadc6c87be0dea6b115f93090", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.10.0", GitCommit:"9ad53aac42165a5fadc6c87be0dea6b115f93090", GitTreeState:"clean"}

❯ helm list
NAME      REVISION  UPDATED                  STATUS    CHART             APP VERSION  NAMESPACE
jhub-dev  2         Fri Sep 7 01:53:09 2018  DEPLOYED  jupyterhub-0.7.0  0.9.2        dev

❯ kubectl version
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.7", GitCommit:"dd5e1a2978fd0b97d9b78e1564398aeea7e7fe92", GitTreeState:"clean", BuildDate:"2018-04-19T00:05:56Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10+", GitVersion:"v1.10.6-gke.2", GitCommit:"384b4eaa132ca9a295fcb3e5dfc74062b257e7df", GitTreeState:"clean", BuildDate:"2018-08-15T00:10:14Z", GoVersion:"go1.9.3b4", Compiler:"gc", Platform:"linux/amd64"}

❯ gcloud compute instances list
NAME            MACHINE_TYPE   PREEMPTIBLE  STATUS
gke-instance-1  n1-standard-1               RUNNING
gke-instance-2  n1-standard-1               RUNNING
gke-instance-3  n1-standard-1               RUNNING
```

## All 6 comments

This usually happens when the PV can't be properly unmounted. Try SSHing into the node and looking for the mount entry of the stalled pod, then `umount -fl` it.

If it still fails, you can try `kubectl delete pod --force --grace-period 0`.
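
For reference, a rough sketch of that workflow on the node; the mount path and the pod UID below are placeholders/assumptions, not values taken from this cluster:

```sh
# On the node that was running the stuck pod (names and paths are placeholders)
mount | grep nfs-dev            # locate the stale mount entry for the PV
sudo umount -fl /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~nfs/nfs-dev

# If the pod is still stuck in Terminating, force-delete it:
kubectl delete pod jupyter-zafer --force --grace-period=0
```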

@clkao,

My issue is not that the pod is stuck in the Terminating state, but rather that it is terminated by kubelet without my intervention right after it is spawned:

```
Events:
  Type    Reason                 Age   From                Message
  ----    ------                 ----  ----                -------
  Normal  Created                35s   kubelet, $GKE_NODE  Created container
  Normal  Started                34s   kubelet, $GKE_NODE  Started container
  Normal  Killing                1s    kubelet, $GKE_NODE  Killing container with id docker://notebook:Need to kill Pod
```

As a result, I'm not able to spin up single-user servers for my users. I'm not quite sure how to debug this further since nothing in the logs indicates a problem with my configuration as far as I can tell.

oop, sorry I failed to read. ;)

It appears that somehow the `start_timeout` option is not taking effect, so the hub tried to kill the pod right after 30s.

@clkao `start_timeout` relates to how long the spawner is willing to wait for the pod to enter the running state, I think. It is important to allow several minutes in case a node needs to be added, an image needs to be pulled, or similar.

The 30-second thing is probably related to the Kubernetes concept of pod health. The pod must respond and behave healthily within a certain time, so perhaps that is where the 30 seconds comes from? I don't know for sure, but I don't think `start_timeout` is the issue here. I think the grace period relates to how long a pod may take to terminate, but what is the setting called for how long to wait before a pod becomes healthy? I don't know, but anyhow:

**Guess:**
The pod starts up, but perhaps it never ends up starting jupyter or something like that. What is the startup command? I think what's important is to see your Dockerfile in order to guess what went wrong.

I dug into the code and it looks like the 30s is determined by the `http_timeout` parameter. After increasing that number, I was able to get a shell into the container and found out that a recursive `chown` I was doing in my startup script never terminated:

```
# ps -ef --sort=start_time
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 18:31 ?        00:00:00 tini -g -- startup.sh --ip="0.0.0.0
root         5     1  0 18:31 ?        00:00:00 /bin/bash /usr/local/bin/startup.sh
root         6     5  6 18:31 ?        00:00:01 chown -R 1001:100 /home/jovyan
root         7     0  0 18:31 ?        00:00:00 /bin/bash
root        16     7  0 18:31 ?        00:00:00 ps -ef --sort=start_time
```

which I think is because the Julia folder is very large:

```
# du -sh /home/jovyan/.[!.]*
4.0K    /home/jovyan/.bash_logout
4.0K    /home/jovyan/.bashrc
72K     /home/jovyan/.cache
12K     /home/jovyan/.conda
16K     /home/jovyan/.config
608M    /home/jovyan/.julia
36K     /home/jovyan/.jupyter
4.0K    /home/jovyan/.profile
8.0K    /home/jovyan/.yarn
```

Anyways, removing that line fixed the issue. Thanks for pointing me in the right direction!
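
For anyone who lands here with the same symptom, here is a minimal sketch of how both timeouts could be raised through `config.yaml`, reusing the `hub.extraConfig` block from the configuration above (the 300-second value is illustrative, not the value actually used in this issue):

```yaml
hub:
  extraConfig: |
    # How long KubeSpawner waits for the pod to reach the Running state
    c.KubeSpawner.start_timeout = 900
    # How long the hub waits for the started notebook server to answer HTTP
    # requests; this is the 30s limit that was being hit here. The value is
    # only an example.
    c.KubeSpawner.http_timeout = 300
```

Raising `http_timeout` mostly buys time for debugging, though; the actual fix was dropping the `chown -R 1001:100 /home/jovyan` line from `startup.sh` (or, presumably, doing that chown once at image build time instead) so the server can come up promptly.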

Wieee nice! Thanks for explaining the resolution @zcesur!
