Test-infra: janitor leaks processes, destabilizing the node

Created on 8 Dec 2017 · 18 comments · Source: kubernetes/test-infra

see: https://github.com/kubernetes/test-infra/issues/5700#issuecomment-350391209
additionally a previous issue: https://github.com/kubernetes/test-infra/issues/4892

I'm looking into this.
/assign
/priority critical-urgent
/area boskos

area/boskos priority/critical-urgent ¯\_(ツ)_/¯

All 18 comments

ps aux on the node gives a large number of entries like:
root 32762 0.0 0.0 0 0 ? Zs 14:08 0:01 [python2] <defunct>

which trace back to:

$ ps -p 32762 -o ppid
 PPID
31580
$ ps -p 31580
  PID TTY          TIME CMD
31580 ?        00:00:22 janitor
    ├─pod86ecb194-da3b-11e7-962d-42010a800055
    │ ├─6ad84509b179480cb2cfe4d06430e44193e0d05db1eca530a7026f6a2ffdca0e
    │ │ └─31580 /janitor --service-account=/etc/service-account/service-account.json --resource-type=gce-project,gke-project,gci-qa-project,gpu-project --pool-size=20
    │ └─bc0a8f92ba6ecb05b6ea13c52e3dcc5cf9b1b74bbf78a128f2bfaceba60fb05c
    │   └─31362 /pause

for context into how excessive this is:

$ ps aux | grep defunct | wc -l
20640

still looking for the root cause; we call gcloud a lot, which I believe is also python, so it's possible that gcloud has a bug instead...

This node is now tainted (should have done this earlier), we will need to un-taint it after debugging.

$ kubectl taint nodes gke-prow-pool-n1-highmem-8-81ce4395-83j6 key=value:NoSchedule
node "gke-prow-pool-n1-highmem-8-81ce4395-83j6" tainted

What's weirder is that the zombie processes all seem to have the boskos janitor binary as their parent, which only execs subprocesses with .Run() and .CombinedOutput(...) as far as I can tell; both of those should ultimately call Wait(), so we shouldn't be seeing zombies. :confused:
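For reference, a minimal sketch (not the actual janitor code; the command names are just placeholders) of how Go's os/exec behaves here: both Run() and CombinedOutput() block until the direct child exits and collect its status, so the immediate child is always reaped; neither can reap grandchildren that get orphaned to PID1.

package main

import (
    "fmt"
    "os/exec"
)

func main() {
    // Run() starts the command and waits for it, so the direct child
    // is reaped when it exits.
    if err := exec.Command("python2", "janitor.py").Run(); err != nil {
        fmt.Println("janitor.py failed:", err)
    }

    // CombinedOutput() also waits internally (it calls Run under the hood)
    // and returns stdout+stderr once the child has exited.
    out, err := exec.Command("gcloud", "projects", "list").CombinedOutput()
    if err != nil {
        fmt.Println("gcloud failed:", err)
    }
    fmt.Printf("%s", out)

    // Neither call can reap *grandchildren*: if gcloud forks a helper and
    // exits before it, that helper is reparented to PID1, not to us.
}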

hummmm, xref https://github.com/kubernetes/kubernetes/issues/54313 - probably because of the million gcloud foo calls in the actual janitor script

I can probably add a health check for each janitor pod

@krzyzacy we specifically need to avoid process exhaustion on the host, so we should make sure these child processes get cleaned up.
edit: in fact the health check is probably unnecessary since the pod gets restarted; the problem is that in the meantime this can hurt the node.

@BenTheElder is the node still around? my current idea is: after each sync loop in the janitor, I can look for leaked gcloud processes and clean them up

@krzyzacy the node itself is currently fine, but it caused docker to restart which broke jobs. I have the node quarantined for debugging since prow demand is low currently anyhow.

We can look for leaked processes and clean them up pretty easily, but this is a bit of a band-aid; they shouldn't be leaking to begin with.

Also the rate they accumulate is pretty bad:

ps aux | wc -l
895
ps aux | grep defunct | wc -l
736

(this is after the pod has been restarted this evening)

/shrug
obviously ssh doesn't work here :-(

hummmm, in janitor.py we are only using subprocess.check_call and subprocess.check_output, both of which should be blocking though. Trying to find some reference online.

Yeah, I'm starting to think gcloud might be leaking processes internally; I'm looking at building the "band-aid" into the go janitor :(

There is actually a small library specifically for this use case: https://github.com/ramr/go-reaper
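Based on that library's README (treat the exact API as an assumption), using it would amount to starting a background reaper from main in the janitor binary:

package main

import (
    reaper "github.com/ramr/go-reaper"
)

func main() {
    // Reap orphaned children that get reparented to this process when it
    // runs as PID1 inside the container.
    go reaper.Reap()

    // ... start the janitor's normal sync loop here ...
}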

maybe we can also recycle the janitor pods (e.g. set a max lifetime of ~1h for each instance), if that's easier

Additionally this is not particularly new, but for reference the complete diagnosis and solution are:

  • zombie processes pile up, filling the PID space; this causes docker ps to fail, so the kubelet health check fails and decides to restart docker, which interrupts our jobs on the node.
  • we see lots of python2 <defunct> piling up on the machine, and suspect the janitor since it appears to be the ppid for these zombie processes, and we know it spawns some janitor.py subprocesses with python2

BUT:

  • the janitor go binary was previously the ENTRYPOINT in our image, making it PID1 inside the container, where a normal linux system (not in docker) would have an init daemon instead
  • when the janitor spawns janitor.py processes it seems to wait on them properly
  • these in turn spawn some gcloud processes, which are waited on properly (as far as I can tell)
  • fun fact: gcloud / the Google Cloud SDK runs as python2 as well
  • when a parent process exits before its children, those children become orphans, which the kernel reparents to PID1
  • when an orphaned process exits it still has an entry in the kernel's process table, because no parent ever wait(...)ed on it or set SIGCHLD to SIG_IGN; these entries pile up on the machine, and they don't give us much to diagnose with, because the kernel isn't required to keep anything for them beyond the pid -> exit status
  • all this means that if gcloud spawns a subprocess and then exits first, that subprocess is reparented to the janitor binary, so we see dead python2 commands parented to the janitor, which caused our original suspicion
  • we wasted a little time searching for a bug in the janitor, not immediately realizing these could be gcloud's, because I didn't remember that gcloud is also python2.x

This is the "PID1 problem" for docker.

You can solve this by making sure your ENTRYPOINT is a suitable init daemon of some kind; in our case we just need to make sure orphaned processes are reaped (wait(...)ed on), so a shell is sufficient.
We do have to be careful, though, to make sure the shell doesn't outsmart us and directly exec the subprocess, replacing the shell (bash will do this if you bash -c 'cmd').
On Alpine's /bin/sh this doesn't seem to be an issue, but for good measure I tossed in /bin/sh -c 'echo started && /bin/janitor "$@"' to make sure it can't just exec the command.

For other use cases a real init process of varying complexity might be desired to cleanly shut down children (see the PID1 problem link), but we're already otherwise resilient to failed subprocess calls, so we really just want these dead process entries cleaned up. https://github.com/kubernetes/test-infra/pull/5885 does this.
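For completeness: if one wanted to reap orphans inside the Go binary itself rather than wrapping it in a shell (roughly the idea behind the go-reaper library mentioned above; this is a minimal sketch, not what the PR actually implements), it would look something like a SIGCHLD-driven wait loop:

package main

import (
    "os"
    "os/signal"
    "syscall"
)

// reapOrphans collects the exit status of any child, including orphans that
// get reparented to this process when it runs as PID1 in the container, so
// they don't linger as <defunct> entries in the process table.
func reapOrphans() {
    sigs := make(chan os.Signal, 1)
    signal.Notify(sigs, syscall.SIGCHLD)
    for range sigs {
        for {
            var status syscall.WaitStatus
            // -1 means "any child"; WNOHANG keeps the inner loop non-blocking.
            pid, err := syscall.Wait4(-1, &status, syscall.WNOHANG, nil)
            if pid <= 0 || err != nil {
                break
            }
        }
    }
}

func main() {
    go reapOrphans()
    // ... run the janitor's normal sync loop here ...
}

One caveat with reaping in-process: the wait loop can race with os/exec's own Wait() on the janitor's direct children, which is part of why a shell wrapper in front of the binary is the simpler fix here.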
