see: https://github.com/kubernetes/test-infra/issues/5700#issuecomment-350391209
additionally a previous issue: https://github.com/kubernetes/test-infra/issues/4892
I'm looking into this.
/assign
/priority critical-urgent
/area boskos
ps aux on the node gives a large number of entries like:
root 32762 0.0 0.0 0 0 ? Zs 14:08 0:01 [python2] <defunct>
which trace back to:
$ ps -p 32762 -o ppid
PPID
31580
$ ps -p 31580
PID TTY TIME CMD
31580 ? 00:00:22 janitor
├─pod86ecb194-da3b-11e7-962d-42010a800055
│ ├─6ad84509b179480cb2cfe4d06430e44193e0d05db1eca530a7026f6a2ffdca0e
│ │ └─31580 /janitor --service-account=/etc/service-account/service-account.json --resource-type=gce-project,gke-project,gci-qa-project,gpu-project --pool-size=20
│ └─bc0a8f92ba6ecb05b6ea13c52e3dcc5cf9b1b74bbf78a128f2bfaceba60fb05c
│ └─31362 /pause
For context on how excessive this is:
$ ps aux | grep defunct | wc -l
20640
Still looking for the root cause. We call gcloud a lot, and I believe gcloud is also Python, so it's possible that gcloud has a bug instead...
This node is now tainted (should have done this earlier), we will need to un-taint it after debugging.
$ kubectl taint nodes gke-prow-pool-n1-highmem-8-81ce4395-83j6 key=value:NoSchedule
node "gke-prow-pool-n1-highmem-8-81ce4395-83j6" tainted
What's weirder is that the zombie processes all seem to have the boskos janitor binary as their parent, which as far as I can tell only execs with .Run() and .CombinedOutput(...); both should ultimately call Wait(), so we shouldn't be seeing zombies. :confused:
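For reference, a minimal sketch of why those direct children should already be reaped (placeholder commands, not the actual janitor code): both helpers block until the child exits and collect its status.

```go
// Minimal sketch, not the actual janitor code: the commands below are
// placeholders. Both Run() and CombinedOutput() end up calling Wait(),
// which collects the child's exit status and removes its zombie entry.
package main

import (
	"log"
	"os/exec"
)

func main() {
	// Run() is Start() followed by Wait(), so the direct child is reaped.
	if err := exec.Command("gcloud", "version").Run(); err != nil {
		log.Printf("run failed: %v", err)
	}

	// CombinedOutput() also waits for the command to finish before returning.
	out, err := exec.Command("python2", "janitor.py", "--help").CombinedOutput()
	if err != nil {
		log.Printf("janitor.py call failed: %v", err)
	}
	log.Printf("output: %s", out)
}
```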
hummmm, xref https://github.com/kubernetes/kubernetes/issues/54313 - probably because of the million gcloud foo calls in the actual janitor script
I can probably add a health check for each janitor pod
@krzyzacy we specifically need to avoid process exhaustion on the host, so we should make sure these child processes get cleaned up.
edit: in fact the health check is probably unnecessary since the pod gets restarted; the problem is that in the meantime this can hurt the node.
@BenTheElder is the node still around? My current idea is that after each sync loop in the janitor, I can look for leaked gcloud processes and clean them up
@krzyzacy the node itself is currently fine, but it caused docker to restart, which broke jobs. I have the node quarantined for debugging since prow demand is low currently anyhow.
We can look for leaked processes and clean them up pretty easily, but this is a bit of a bandaid. They shouldn't be leaking to begin with.
Also, the rate at which they accumulate is pretty bad:
ps aux | wc -l
895
ps aux | grep defunct | wc -l
736
(this is after the pod has been restarted this evening)
/shrug
obviously ssh doesn't work here :-(
hummmm, in janitor.py we are only using subprocess.check_call and subprocess.check_output, both of which should be blocking though. Trying to find some references online.
Yeah, I'm starting to think gcloud might be leaking processes internally; I'm going to look at building the "bandaid" in the go janitor :(
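A rough idea of what that bandaid could look like, assuming we only need to reap dead children that got re-parented to the janitor (hypothetical helper, not in the current code):

```go
// Hypothetical "bandaid" sketch: after each sync loop, reap any children
// (including orphans re-parented to us as PID 1) that have already exited,
// so they don't linger as <defunct> entries.
package main

import (
	"log"
	"syscall"
)

// reapZombies collects exit statuses of any dead children without blocking,
// returning once no more zombies are pending.
func reapZombies() {
	for {
		var status syscall.WaitStatus
		// -1 means "any child"; WNOHANG means "don't block if none have exited".
		pid, err := syscall.Wait4(-1, &status, syscall.WNOHANG, nil)
		if err != nil || pid <= 0 {
			return
		}
		log.Printf("reaped pid %d (exit status %d)", pid, status.ExitStatus())
	}
}

func main() {
	// In the real janitor this would run after each sync loop; calling it
	// once here just for illustration.
	reapZombies()
}
```

One caveat: a blanket Wait4(-1, ...) like this can race with exec.Cmd.Wait() for the janitor's own direct children, which is part of why a dedicated reaper helper is attractive.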
There is actually a small library specifically for this use case: https://github.com/ramr/go-reaper
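If I'm reading its README right, wiring it in is roughly this (a sketch; double-check the exact API before relying on it):

```go
package main

import (
	// Assumed import path/API based on the project README; verify before use.
	reaper "github.com/ramr/go-reaper"
)

func main() {
	// Reap orphaned/zombie children in the background for the life of the process.
	go reaper.Reap()

	// ... existing janitor startup and sync loop would go here ...
}
```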
maybe we can also recycle the janitor pods (like set a max-lifetime for like 1h for each instance), if that's easier
"re-fixed" in https://github.com/kubernetes/test-infra/pull/5885
Additionally this is not particularly new, but for reference the complete diagnosis and solution are:
- Zombie processes pile up until they exhaust the node, causing docker ps to fail, so the kubelet health check fails and decides to restart docker, which interrupts our jobs on the node.
- We see python2 <defunct> piling up on the machine, and suspect the janitor since it appears to be the ppid for these zombie processes, and we know it spawns some janitor.py subprocesses with python 2.
- BUT:
  - the janitor go binary was previously the ENTRYPOINT in our image, making it PID 1 inside the container, where a normal linux host (not in docker) would have an init daemon instead
  - the janitor spawns some janitor.py subprocesses, and it seems to wait on them properly
  - janitor.py spawns gcloud processes, which are waited on properly (as far as I can tell)
  - gcloud / the google cloud sdk runs as python2 as well
  - zombie processes occur when the parent hasn't wait(...)ed on them or set SIGCHLD to SIG_IGN; they fill up on the machine, and we don't have much information to diagnose them because the kernel isn't required to keep anything for them besides the pid -> exit status
  - so when gcloud spawns a subprocess and then aborts, the subprocess is re-parented to the janitor binary (PID 1), and we see aborted python2 commands parented to janitor, causing our original suspicion
- This is the "PID1 problem" for docker.
You can solve this by making sure your ENTRYPOINT is a suitable init daemon (of varying sophistication); in our case we just need to make sure orphaned processes are reaped (wait(...)ed), so a shell is sufficient.
We do have to be careful, though, to make sure the shell doesn't outsmart us and directly exec the subprocess, replacing the shell (bash will do this if you bash -c 'cmd').
On Alpine's /bin/sh this doesn't seem to be an issue, but for good measure I tossed in /bin/sh -c 'echo started && /bin/janitor "$@"' to make sure it can't just exec the command.
For other use cases a real init process of varying complexity might be desired to cleanly exit children (see the PID1 problem link), but we're already otherwise resilient to failed subprocess calls so we really just want these dead process entries cleaned up. https://github.com/kubernetes/test-infra/pull/5885 does this.
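If we ever want to drop the shell entirely, a tiny Go shim could do the same job; a hypothetical sketch (not what the PR above does), assuming the real binary lives at /bin/janitor:

```go
// Hypothetical PID-1 shim: start the real janitor, reap every child that
// exits (including re-parented orphans such as aborted gcloud children),
// and exit with the janitor's own status.
package main

import (
	"os"
	"os/exec"
	"syscall"
)

func main() {
	cmd := exec.Command("/bin/janitor", os.Args[1:]...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Start(); err != nil {
		os.Exit(1)
	}

	exitCode := 0
	for {
		var status syscall.WaitStatus
		// Block until any child exits; as PID 1 we also collect orphans.
		pid, err := syscall.Wait4(-1, &status, 0, nil)
		if err != nil {
			break // no children left to wait for
		}
		if pid == cmd.Process.Pid {
			exitCode = status.ExitStatus()
			break // for this sketch, stop once the janitor itself is done
		}
	}
	os.Exit(exitCode)
}
```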