Charts: RabbitMQ chart Readiness/Liveness probes create zombies

Created on 11 Dec 2017 · 6 comments · Source: helm/charts

Is this a request for help?: No


Is this a BUG REPORT or FEATURE REQUEST?: BUG REPORT

Version of Helm and Kubernetes:
helm v2.6.2, kubernetes v1.7.9, kubectl v1.8.2

Which chart:
stable/rabbitmq, master branch, both bitnami rabbitmq images and official rabbitmq images

What happened:
The chart deploys the application successfully, but the Readiness/Liveness probes create zombie processes. After several hours the zombies consume so many PIDs that the node becomes starved for resources and starts failing to launch new containers. Previously running containers typically keep working as long as they don't spawn new processes, but any newly scheduled pods fail. Sometimes it takes 5 or 6 attempts to SSH to the node, with errors complaining it can't fork. Various system services also start crashing.

What you expected to happen:
Readiness/Liveness probes not to cause zombie processes.

How to reproduce it (as minimally and precisely as possible):

kubectl create namespace foo
helm install --name foo-rabbitmq --namespace foo charts/stable/rabbitmq --set persistence.enabled=false
kubectl -n foo exec -it $(kubectl get pods -n foo | grep Running | grep foo-rabbitmq | head -n 1 | awk '{print $1}') -- /bin/bash -c "ps aux | grep ' [Z]' | wc -l"
  • Watch the number of zombie processes increase at a rate of about 120 per minute (a polling loop is sketched below)
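To watch the count grow, a simple polling loop works (a rough sketch that reuses the pod lookup from the exec command above):

POD=$(kubectl get pods -n foo | grep Running | grep foo-rabbitmq | head -n 1 | awk '{print $1}')
while true; do
  # count processes in zombie (Z) state inside the rabbitmq container
  kubectl -n foo exec "$POD" -- /bin/bash -c "ps aux | grep ' [Z]' | wc -l"
  sleep 60
done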

Anything else we need to know:
I tried both the bitnami images (from 3.6.10-r2, 3.6.11-r7, etc all the way up to 3.6.14-r2) and the official rabbitmq images. Both do the same thing.

Ultimately it appears to be an issue where the rabbitmq-server process spawns children with PPID 0, so when they exit they never get reaped. I suspect this will be a problem for any container whose Readiness/Liveness probes cause a daemonized process to spawn children.

Zombie processes are also created when I kubectl exec into the pod and run the status command manually.
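Something along these lines reproduces it by hand (a rough sketch; I'm assuming the probes run rabbitmqctl status, so check the chart's probe definitions for the exact command):

kubectl -n foo exec -it "$POD" -- /bin/bash
# inside the container: each status call leaves one more zombie behind
rabbitmqctl status > /dev/null
ps -eo pid,ppid,stat,comm | awk '$3 ~ /Z/'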

The bitnami images use tini as the init process, and this sounds very similar to this issue: https://github.com/rook/rook/issues/724
I did try adding -s to the tini command line in the entrypoint shell script of the bitnami Docker image, but it did not change the behavior.
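For reference, this is roughly what I tried (illustrative only; the lines below are not the actual bitnami entrypoint, and as noted it did not help here because tini is not PID 1 in the pod):

# register tini as a child subreaper via the -s flag (or the TINI_SUBREAPER env var)
export TINI_SUBREAPER=1
exec tini -s -- rabbitmq-server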

The official rabbitmq image does not use tini. Instead, the entrypoint shell script execs rabbitmq-server directly, yet the probes still leave zombies behind.

BUT in both cases, PID 1 is /pause rather than the tini or rabbitmq-server process exec'd by the relevant entrypoint script.
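A quick way to confirm what PID 1 is inside the pod (a sketch, reusing the $POD lookup from above):

kubectl -n foo exec "$POD" -- ps -o pid,comm -p 1
# prints /pause when the pod's containers share the pause container's PID namespace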

Here is output from connecting directly to the kube nodes and watching the zombie count grow:

admin@ip-10-92-72-35:~$ for J in ip-10-92-73-20.ec2.internal ; do echo -n "$J: "; ssh core@$J 'ps aux | grep defunct | wc -l'; done
ip-10-92-73-20.ec2.internal: 251
admin@ip-10-92-72-35:~$ for J in ip-10-92-73-20.ec2.internal ; do echo -n "$J: "; ssh core@$J 'ps aux | grep defunct | wc -l'; done
ip-10-92-73-20.ec2.internal: 260
admin@ip-10-92-72-35:~$ for J in ip-10-92-73-20.ec2.internal ; do echo -n "$J: "; ssh core@$J 'ps aux | grep defunct | wc -l'; done
ip-10-92-73-20.ec2.internal: 499


All 6 comments

After reading about the pause container at https://www.ianlewis.org/en/almighty-pause-container, it's clear that the pause container should be reaping those zombies. We noticed this behavior about 1.5 weeks ago, around Dec 1, though I strongly believe it was happening before that and we just didn't realize it. Kube was running version 1.7.2 until Fri Dec 7, when a co-worker upgraded it from 1.7.2 to 1.7.9 that evening. We are still seeing the same unreaped-zombie behavior in 1.7.9 that we saw in 1.7.2 last week.

I have verified that we are not disabling PID namespace sharing. These are the kubelet args:

# cat /etc/sysconfig/kubelet
DAEMON_ARGS="--allow-privileged=true --cgroup-root=/ --cloud-provider=aws --cluster-dns=100.64.0.10 --cluster-domain=cluster.local --enable-debugging-handlers=true --eviction-hard=memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%,imagefs.available<10%,imagefs.inodesFree<5% --feature-gates=ExperimentalCriticalPodAnnotation=true --hostname-override=ip-10-92-73-174.ec2.internal --kubeconfig=/var/lib/kubelet/kubeconfig --network-plugin=cni --node-labels=kubernetes.io/role=node,node-role.kubernetes.io/node= --non-masquerade-cidr=100.64.0.0/10 --pod-infra-container-image=gcr.io/google_containers/pause-amd64:3.0 --pod-manifest-path=/etc/kubernetes/manifests --register-schedulable=true --require-kubeconfig=true --v=2 --cni-bin-dir=/opt/cni/bin/ --cni-conf-dir=/etc/cni/net.d/"
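For what it's worth, this is how I check on the node whether the app container shares the pause container's PID namespace (a sketch; it assumes the Docker runtime, and <container-id> is a placeholder):

docker ps | grep rabbitmq                                    # find the app container ID
docker inspect --format '{{.HostConfig.PidMode}}' <container-id>
# "container:<pause-id>" means a shared pod PID namespace; an empty value means a private one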

We are starting to suspect that a cluster configuration setting is causing this. We are pursuing a way to stop pause from being PID 1, so that the Docker entrypoint script becomes PID 1 by exec'ing the appropriate process. Another coworker has a Kube 1.7.6 cluster where PID 1 is never the pause container but is instead the app being executed (in a PHP pod), so we are trying to figure out what the differences are.

I have now resolved this. I built a new pause container from the master branch of kubernetes (kubernetes/build/pause), uploaded it to a public Docker repo, and configured a few of my nodes to use it with the --pod-infra-container-image argument. I deleted the pod a few times until the new one came up on a node using the new pause image. All zombie processes are now being reaped properly (rough steps are sketched below).

Once this new pause-amd64 release issue is taken care of at https://github.com/kubernetes/kubernetes/issues/50865, this will no longer be a problem.

Leaving this open for a bit to see if any conversation comes up around it, or recommendations if this should just be closed.
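For reference, a rough sketch of the steps described above (the repository and tag names are placeholders, and the exact build targets may differ between Kubernetes releases):

git clone https://github.com/kubernetes/kubernetes.git
cd kubernetes/build/pause
make                                      # build the pause image; check the Makefile for the exact targets
docker tag <built-image> myrepo/pause-amd64:custom
docker push myrepo/pause-amd64:custom
# then point the kubelet on each node at the new image:
#   --pod-infra-container-image=myrepo/pause-amd64:custom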

@mrballcb Thanks for taking the time to document this issue. I would suggest, if possible, that the relevant issues be marked as critical, since this bug causes cluster instability.
There is now an updated pause container available in the public repo:
gcr.io/google_containers/pause-amd64:3.1.
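Once nodes can pull it, pointing the kubelet at the released image should be enough (illustrative flag, matching the DAEMON_ARGS shown earlier):

--pod-infra-container-image=gcr.io/google_containers/pause-amd64:3.1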

Also worth mentioning this log I've seen, which seems to indicate that the container should be reaping zombies via tini but isn't set up to do so:
[WARN tini (28960)] Tini is not running as PID 1 and isn't registered as a child subreaper. Zombie processes will not be re-parented to Tini, so zombie reaping won't work. To fix the problem, use the -s option or set the environment variable TINI_SUBREAPER to register Tini as a child subreaper, or run Tini as PID 1.

I would like to mention that using pause:3.1 likely solves your problem because, by default, Kubernetes 1.7.x (on newer Docker versions) uses the same PID namespace for all containers in a pod. This behaviour was enabled in 1.7 and is disabled again in 1.8 and up. The kubelet flag --docker-disable-shared-pid is going to be deprecated in k8s 1.10. Without the shared PID namespace, zombies are not re-parented to the pause container; they are re-parented to PID 1 of the container's own PID namespace, which is the container's main process.

The issue has not resurfaced through Kube 1.8.10. Closing.
