Test-infra: janitor leaks processes, destabilizing the node

Created on 8 Dec 2017 · 18 comments · Source: kubernetes/test-infra

see: https://github.com/kubernetes/test-infra/issues/5700#issuecomment-350391209
additionally a previous issue: https://github.com/kubernetes/test-infra/issues/4892

I'm looking into this.
/assign
/priority critical-urgent
/area boskos

area/boskos priority/critical-urgent ¯\_(ツ)_/¯

All 18 comments

ps aux on the node gives a large number of entries like:
root 32762 0.0 0.0 0 0 ? Zs 14:08 0:01 [python2] <defunct>

which trace back to:

$ ps -p 32762 -o ppid
 PPID
31580
$ ps -p 31580
  PID TTY          TIME CMD
31580 ?        00:00:22 janitor
    ├─pod86ecb194-da3b-11e7-962d-42010a800055
    │ ├─6ad84509b179480cb2cfe4d06430e44193e0d05db1eca530a7026f6a2ffdca0e
    │ │ └─31580 /janitor --service-account=/etc/service-account/service-account.json --resource-type=gce-project,gke-project,gci-qa-project,gpu-project --pool-size=20
    │ └─bc0a8f92ba6ecb05b6ea13c52e3dcc5cf9b1b74bbf78a128f2bfaceba60fb05c
    │   └─31362 /pause

for context into how excessive this is:

$ ps aux | grep defunct | wc -l
20640

still looking for the root cause; we call gcloud a lot, which I believe is also python, so it's possible that gcloud has a bug instead...

This node is now tainted (should have done this earlier), we will need to un-taint it after debugging.

$ kubectl taint nodes gke-prow-pool-n1-highmem-8-81ce4395-83j6 key=value:NoSchedule
node "gke-prow-pool-n1-highmem-8-81ce4395-83j6" tainted

What's weirder is that the zombie processes all seem to have the boskos janitor binary as their parent, which only execs subprocesses with .Run() and .CombinedOutput(...) as far as I can tell; both of those should ultimately call Wait(), so we shouldn't be seeing zombies. :confused:
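For reference, a minimal sketch (not the actual janitor code; the command names are just placeholders) of how Go's os/exec behaves here: both Run() and CombinedOutput() block until the direct child exits and collect its status, so the immediate child is always reaped; neither can reap grandchildren that get orphaned to PID1.

package main

import (
    "fmt"
    "os/exec"
)

func main() {
    // Run() starts the command and waits for it, so the direct child
    // is reaped when it exits.
    if err := exec.Command("python2", "janitor.py").Run(); err != nil {
        fmt.Println("janitor.py failed:", err)
    }

    // CombinedOutput() also waits internally (it calls Run under the hood)
    // and returns stdout+stderr once the child has exited.
    out, err := exec.Command("gcloud", "projects", "list").CombinedOutput()
    if err != nil {
        fmt.Println("gcloud failed:", err)
    }
    fmt.Printf("%s", out)

    // Neither call can reap *grandchildren*: if gcloud forks a helper and
    // exits before it, that helper is reparented to PID1, not to us.
}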

hummmm, xref https://github.com/kubernetes/kubernetes/issues/54313 - probably because of the million gcloud foo calls in the actual janitor script

I can probably add a health check for each janitor pod

@krzyzacy we specifically need to avoid process exhaustion on the host, so we should make sure these child processes get cleaned up.
edit: in fact the health check is probably unnecessary since the pod gets restarted; the problem is that in the meantime this can hurt the node.

@BenTheElder is the node still around? my current idea is: after each sync loop in the janitor, I can look for leaked gcloud processes and clean them up

@krzyzacy the node itself is currently fine, but it caused docker to restart which broke jobs. I have the node quarantined for debugging since prow demand is low currently anyhow.

We can look for leaked processes and clean them up pretty easily, but this is a bit of a band-aid; they shouldn't be leaking to begin with.

Also the rate they accumulate is pretty bad:

ps aux | wc -l
895
ps aux | grep defunct | wc -l
736

(this is after the pod has been restarted this evening)

/shrug
obviously ssh doesn't work here :-(

hummmm, in janitor.py we are only using subprocess.check_call and subprocess.check_output, both of which should be blocking though. Trying to find some reference online.

Yeah, I'm starting to think gcloud might be leaking processes internally; I'm looking at building the "band-aid" into the go janitor :(

There is actually a small library specifically for this use case: https://github.com/ramr/go-reaper
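Based on that library's README (treat the exact API as an assumption), using it would amount to starting a background reaper from main in the janitor binary:

package main

import (
    reaper "github.com/ramr/go-reaper"
)

func main() {
    // Reap orphaned children that get reparented to this process when it
    // runs as PID1 inside the container.
    go reaper.Reap()

    // ... start the janitor's normal sync loop here ...
}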

maybe we can also recycle the janitor pods (e.g. set a max lifetime of ~1h for each instance), if that's easier

Additionally this is not particularly new, but for reference the complete diagnosis and solution are:

  • zombie processes pile up, filling the PID space; this causes docker ps to fail, so the kubelet health check fails and decides to restart docker, which interrupts our jobs on the node.
  • we see lots of python2 <defunct> piling up on the machine, and suspect the janitor since it appears to be the ppid for these zombie processes, and we know it spawns some janitor.py subprocesses with python2

BUT:

  • the janitor go binary was previously the ENTRYPOINT in our image, making it PID1 inside the container, where a normal linux system (not in docker) would have an init daemon instead
  • when the janitor spawns janitor.py processes it seems to wait on them properly
  • these in turn spawn some gcloud processes, which are waited on properly (as far as I can tell)
  • fun fact: gcloud / the Google Cloud SDK runs as python2 as well
  • when a parent process exits before its children, those children become orphans, which the kernel reparents to PID1
  • when an orphaned process exits it still has an entry in the kernel's process table, because no parent ever wait(...)ed on it or set SIGCHLD to SIG_IGN; these entries pile up on the machine, and they don't give us much to diagnose with, because the kernel isn't required to keep anything for them beyond the pid -> exit status
  • all this means that if gcloud spawns a subprocess and then exits first, that subprocess is reparented to the janitor binary, so we see dead python2 commands parented to the janitor, which caused our original suspicion
  • we wasted a little time searching for a bug in the janitor, not immediately realizing these could be gcloud's, because I didn't remember that gcloud is also python2.x

This is the "PID1 problem" for docker.

You can solve this by making sure your ENTRYPOINT is a suitable init daemon of some kind; in our case we just need to make sure orphaned processes are reaped (wait(...)ed on), so a shell is sufficient.
We do have to be careful, though, to make sure the shell doesn't outsmart us and directly exec the subprocess, replacing the shell (bash will do this if you bash -c 'cmd').
On Alpine's /bin/sh this doesn't seem to be an issue, but for good measure I tossed in /bin/sh -c 'echo started && /bin/janitor "$@"' to make sure it can't just exec the command.

For other use cases a real init process of varying complexity might be desired to cleanly shut down children (see the PID1 problem link), but we're already otherwise resilient to failed subprocess calls, so we really just want these dead process entries cleaned up. https://github.com/kubernetes/test-infra/pull/5885 does this.
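For completeness: if one wanted to reap orphans inside the Go binary itself rather than wrapping it in a shell (roughly the idea behind the go-reaper library mentioned above; this is a minimal sketch, not what the PR actually implements), it would look something like a SIGCHLD-driven wait loop:

package main

import (
    "os"
    "os/signal"
    "syscall"
)

// reapOrphans collects the exit status of any child, including orphans that
// get reparented to this process when it runs as PID1 in the container, so
// they don't linger as <defunct> entries in the process table.
func reapOrphans() {
    sigs := make(chan os.Signal, 1)
    signal.Notify(sigs, syscall.SIGCHLD)
    for range sigs {
        for {
            var status syscall.WaitStatus
            // -1 means "any child"; WNOHANG keeps the inner loop non-blocking.
            pid, err := syscall.Wait4(-1, &status, syscall.WNOHANG, nil)
            if pid <= 0 || err != nil {
                break
            }
        }
    }
}

func main() {
    go reapOrphans()
    // ... run the janitor's normal sync loop here ...
}

One caveat with reaping in-process: the wait loop can race with os/exec's own Wait() on the janitor's direct children, which is part of why a shell wrapper in front of the binary is the simpler fix here.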
