Containerd: exec leaves many defunct/zombie processes

Created on 15 May 2020 · 4 Comments · Source: containerd/containerd

Description

Steps to reproduce the issue:
Not always reproducible, but it happens frequently.

Describe the results you received:
Here is the list of processes in container:

# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 Apr23 ?        00:01:34 /usr/local/bin/pilot-agent proxy --customConfigFile /etc/istio/proxy/envoy_static.json
root        21     1  0 Apr23 ?        03:39:45 /usr/local/bin/envoy -c /etc/istio/proxy/envoy_static.json --restart-epoch 0 --drain-time-s 45 --parent-shutdown-time-s 60 --service-cluster istio-proxy --service-node sidecar~xx.xx.xx.xx~
root      1069     1  0 May08 ?        00:00:00 [rprobe.sh] <defunct>
root      6228     1  0 Apr24 ?        00:00:00 [rprobe.sh] <defunct>
root      7433     1  0 11:40 ?        00:00:00 [rprobe.sh] <defunct>
root      9802     1  0 Apr25 ?        00:00:00 [rprobe.sh] <defunct>
root     11263     1  0 May09 ?        00:00:00 [rprobe.sh] <defunct>
root     11264     1  0 May09 ?        00:00:00 [tee] <defunct>
root     18890     0  0 13:37 pts/0    00:00:00 bash
root     21807 18890  0 14:07 pts/0    00:00:00 ps -ef
root     37151     1  0 May12 ?        00:00:00 [rprobe.sh] <defunct>
root     37376     1  0 May03 ?        00:00:00 [rprobe.sh] <defunct>
root     49359     1  0 May04 ?        00:00:00 [grep] <defunct>
root     53958     1  0 May06 ?        00:00:00 [tee] <defunct>
root     55185     1  0 Apr28 ?        00:00:00 [rprobe.sh] <defunct>
root     57663     1  0 Apr28 ?        00:00:00 [rprobe.sh] <defunct>
root     62584     1  0 May11 ?        00:00:00 [rprobe.sh] <defunct>

I ran crictl exec -it bash, and we can see that the parent of that bash process is 0.
There are several defunct processes. rprobe.sh is a script that the kubelet executes every 10 seconds, because we configured it as the readiness probe. I'm sure rprobe.sh does not invoke itself recursively. As I understand it, the parent of each rprobe.sh process should be 0, just like my bash process. Sometimes, however, we find many defunct processes whose parent has become PID 1, which is not set up to reap child processes. We see this frequently, though not always, and a single node can accumulate thousands of defunct processes.
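To see how many defunct processes each parent is holding, the listing above can be summarized per parent PID. This is a minimal sketch, assuming a procps-style `ps` that accepts `-e` and repeated `-o` options (a BusyBox `ps` may not); the function names are made up for illustration.

```shell
#!/bin/bash
# List defunct (zombie) processes as "PPID PID COMMAND".
# Assumption: procps-style ps; "name=" suppresses the column header.
list_zombies() {
    ps -e -o stat= -o ppid= -o pid= -o comm= |
        awk '$1 ~ /^Z/ { print $2, $3, $4 }'
}

# Count zombies per parent PID, printed as "PPID COUNT".
count_zombies_by_parent() {
    list_zombies | awk '{ n[$1]++ } END { for (p in n) print p, n[p] }'
}

count_zombies_by_parent
```

Run inside the affected container, a large count next to parent 1 would match the symptom reported here.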

Describe the results you expected:
All exec processes should be reaped.

Output of containerd --version:

containerd github.com/containerd/containerd v1.2.7-10-g28d34be 28d34be6650cb60fac87a4b8a2132685867e0955

Any other relevant information:

kind/bug

All 4 comments

Thanks for reporting. Could you provide more information, such as the pod description? Thanks.

  containers:
  - args:
    - proxy
    - --customConfigFile
    - /etc/istio/proxy/envoy_static.json
    env:
    - name: OWNING_NS_NAME
      value: xxx
    image: xxx-sidecar
    imagePullPolicy: IfNotPresent
    name: istio-proxy
    ports:
    - containerPort: 80
      protocol: TCP
    - containerPort: 443
      protocol: TCP
    readinessProbe:
      exec:
        command:
        - /opt/rprobe.sh
      failureThreshold: 1
      initialDelaySeconds: 30
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    resources:
      limits:
        cpu: 100m
        memory: 1G
      requests:
        cpu: 10m
        memory: 100M

@fuweid it is a sidecar container.

If the script /opt/rprobe.sh runs longer than timeoutSeconds and forks child processes, it can cause this issue.
For example, create a script like the one below and set timeoutSeconds to 1:

#!/bin/bash
sleep 10

It will spawn two processes:

4 S root     36218     0  1  80   0 -  4514 wait   08:08 ?        00:00:00 /bin/bash /opt/rprobe.sh
0 S root     36234 36218  0  80   0 -  1094 hrtime 08:08 ?        00:00:00 sleep 10

When the readiness probe times out, the bash process is killed directly, and the sleep process is reparented to PID 1:

0 S root     36234     1  0  80   0 -  1094 hrtime 08:08 ?        00:00:00 sleep 10

The PID 1 process does not wait for this process, so it is never reaped. When the sleep process eventually exits, it stays in the defunct state.
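The reparenting step described above can be observed outside Kubernetes. This is a minimal sketch, assuming bash, a procps-style `ps`, and a `sleep` that accepts fractional seconds. The inline `bash -c` command is a stand-in for /opt/rprobe.sh, not the real script, and the `kill -KILL` only imitates what the comment says happens on a probe timeout.

```shell
#!/bin/bash
# Stand-in for the probe: a shell that forks a child and waits on it.
pidfile="$(mktemp)"
bash -c 'sleep 5 & echo "$!" > "$1"; wait' _ "$pidfile" &
probe_pid=$!

# Give the shell a moment to fork its child and record the child's PID.
sleep 0.5
child_pid="$(cat "$pidfile")"
ppid_before="$(ps -o ppid= -p "$child_pid" | tr -d ' ')"

# Kill only the shell. SIGKILL cannot be caught, so the shell has no
# chance to clean up or signal its child.
kill -KILL "$probe_pid"
wait "$probe_pid" 2>/dev/null
sleep 0.2
ppid_after="$(ps -o ppid= -p "$child_pid" | tr -d ' ')"

echo "before: parent of sleep is $ppid_before (probe shell is $probe_pid)"
echo "after:  parent of sleep is $ppid_after"

# Clean up the orphan and the temporary file.
kill "$child_pid" 2>/dev/null
rm -f "$pidfile"
```

On an ordinary host the orphan is adopted by an init process or subreaper that reaps it, so no zombie is left behind. In the container from this report, PID 1 is pilot-agent, which according to the report does not reap, so each orphan ends up defunct once it exits.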

Thanks for pointing that out. So it's not a bug, but misuse.
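One way to avoid the misuse is to make the probe script finish on its own before timeoutSeconds expires, so that the shell is still alive to reap its children. This is a minimal sketch, assuming GNU coreutils `timeout` is present in the image; `run_bounded` is a hypothetical helper, and the 0.5 second deadline is an illustrative value chosen to stay under the timeoutSeconds of 1 in the pod spec above.

```shell
#!/bin/bash
# Hypothetical helper: run a command under a deadline enforced inside
# the script. Assumption: GNU coreutils `timeout` is available.
run_bounded() {
    local deadline="$1"
    shift
    # Send TERM at the deadline, then KILL one second later if the
    # command is still running. timeout waits for the command, and
    # bash waits for timeout, so nothing is left orphaned.
    timeout -k 1 "$deadline" "$@"
}

# `sleep 10` stands in for a slow check; a real probe would run its
# own command here.
run_bounded 0.5 sleep 10
status=$?
echo "check exited with status $status"
```

`timeout` exits with status 124 when the deadline is hit, which the probe can return as a failure. Raising timeoutSeconds above the script's worst-case run time is an alternative with the same goal.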
