Description
Steps to reproduce the issue:
Not always reproducible, but it happens frequently.
Describe the results you received:
Here is the list of processes in the container:
# ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 Apr23 ? 00:01:34 /usr/local/bin/pilot-agent proxy --customConfigFile /etc/istio/proxy/envoy_static.json
root 21 1 0 Apr23 ? 03:39:45 /usr/local/bin/envoy -c /etc/istio/proxy/envoy_static.json --restart-epoch 0 --drain-time-s 45 --parent-shutdown-time-s 60 --service-cluster istio-proxy --service-node sidecar~xx.xx.xx.xx~
root 1069 1 0 May08 ? 00:00:00 [rprobe.sh] <defunct>
root 6228 1 0 Apr24 ? 00:00:00 [rprobe.sh] <defunct>
root 7433 1 0 11:40 ? 00:00:00 [rprobe.sh] <defunct>
root 9802 1 0 Apr25 ? 00:00:00 [rprobe.sh] <defunct>
root 11263 1 0 May09 ? 00:00:00 [rprobe.sh] <defunct>
root 11264 1 0 May09 ? 00:00:00 [tee] <defunct>
root 18890 0 0 13:37 pts/0 00:00:00 bash
root 21807 18890 0 14:07 pts/0 00:00:00 ps -ef
root 37151 1 0 May12 ? 00:00:00 [rprobe.sh] <defunct>
root 37376 1 0 May03 ? 00:00:00 [rprobe.sh] <defunct>
root 49359 1 0 May04 ? 00:00:00 [grep] <defunct>
root 53958 1 0 May06 ? 00:00:00 [tee] <defunct>
root 55185 1 0 Apr28 ? 00:00:00 [rprobe.sh] <defunct>
root 57663 1 0 Apr28 ? 00:00:00 [rprobe.sh] <defunct>
root 62584 1 0 May11 ? 00:00:00 [rprobe.sh] <defunct>
I ran ps from a shell started with crictl exec.
There are some defunct processes. rprobe.sh is a script that kubelet executes every 10 seconds, because we configured it as the readiness probe. I'm sure rprobe.sh does not fork nested copies of itself. As I understand it, the parent of rprobe.sh should be 0, just like my bash, but sometimes we find many defunct processes whose parent has become PID 1, which is not configured to reap child processes. We see this issue frequently, though not always, and a single node can accumulate thousands of defunct processes.
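For anyone who wants to check a node for the same symptom, here is a rough sketch that counts defunct processes per parent PID. It reads "PPID STAT" pairs on stdin, so it only assumes a ps that can print those two columns (procps syntax is shown in the comment). The function name count_zombies_by_parent is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: reads "PPID STAT" pairs on stdin and prints
# "<ppid> <count>" for every parent that has zombie (state Z) children.
count_zombies_by_parent() {
    awk '$2 ~ /^Z/ { n[$1]++ } END { for (p in n) print p, n[p] }' | sort -n
}

# Usage on a live system (procps ps assumed):
#   ps -e -o ppid= -o stat= | count_zombies_by_parent
```

A large count next to parent 1 inside the container would match the listing above.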
Describe the results you expected:
All exec processes should be reaped.
Output of containerd --version:
containerd github.com/containerd/containerd v1.2.7-10-g28d34be 28d34be6650cb60fac87a4b8a2132685867e0955
Any other relevant information:
Thanks for reporting. Could you provide more information, such as the pod description? Thanks
containers:
- args:
  - proxy
  - --customConfigFile
  - /etc/istio/proxy/envoy_static.json
  env:
  - name: OWNING_NS_NAME
    value: xxx
  image: xxx-sidecar
  imagePullPolicy: IfNotPresent
  name: istio-proxy
  ports:
  - containerPort: 80
    protocol: TCP
  - containerPort: 443
    protocol: TCP
  readinessProbe:
    exec:
      command:
      - /opt/rprobe.sh
    failureThreshold: 1
    initialDelaySeconds: 30
    periodSeconds: 10
    successThreshold: 1
    timeoutSeconds: 1
  resources:
    limits:
      cpu: 100m
      memory: 1G
    requests:
      cpu: 10m
      memory: 100M
@fuweid it is a sidecar container.
If the script /opt/rprobe.sh runs longer than timeoutSeconds and forks child processes, it can cause this issue.
For example, create a script like the one below and set timeoutSeconds to 1s:
#!/bin/bash
sleep 10
It will spawn two processes:
4 S root 36218 0 1 80 0 - 4514 wait 08:08 ? 00:00:00 /bin/bash /opt/rprobe.sh
0 S root 36234 36218 0 80 0 - 1094 hrtime 08:08 ? 00:00:00 sleep 10
When the readiness probe times out, the bash process is killed directly, and the sleep process is reparented to PID 1:
0 S root 36234 1 0 80 0 - 1094 hrtime 08:08 ? 00:00:00 sleep 10
The PID 1 process doesn't wait for this process, so nothing reaps it. When the sleep process exits, it stays in the defunct state.
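The reparenting can be observed outside Kubernetes as well. This is a minimal sketch, not the exact code path in containerd: a bash process standing in for rprobe.sh forks a sleep, the bash process is then killed with SIGKILL, and the parent of the sleep is printed before and after. It assumes pgrep and a procps-style ps are installed. On a regular host the new parent may be a subreaper rather than PID 1.

```shell
#!/bin/bash
# Stand-in for rprobe.sh: a bash process that forks a sleep and waits for it.
bash -c 'sleep 5 & wait' &
script_pid=$!
sleep 1                                    # give the script time to fork

child_pid=$(pgrep -P "$script_pid" sleep)  # the forked sleep
ppid_before=$(ps -o ppid= -p "$child_pid" | tr -d ' ')

kill -KILL "$script_pid"                   # kill only the script, not its child
wait "$script_pid" 2>/dev/null || true     # collect the killed script
sleep 1

ppid_after=$(ps -o ppid= -p "$child_pid" | tr -d ' ')
echo "sleep $child_pid: parent $ppid_before -> $ppid_after"
```

Whether the orphan later shows up as defunct depends on whether its new parent calls wait. In the container above, pilot-agent as PID 1 does not.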
Thanks for pointing that out. So it's not a bug, but misuse.
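One possible way to avoid the misuse, as a sketch only: let the probe script enforce its own deadline, shorter than timeoutSeconds, so that the child is stopped and collected before the script itself gets killed. This assumes the timeout command from GNU coreutils is present in the image. The wrapper name run_with_deadline and the 1 second value are illustrative, not taken from the original rprobe.sh.

```shell
#!/bin/bash
# Hypothetical wrapper for the body of a probe script.
# timeout(1) sends SIGTERM to the command when the deadline passes and then
# exits with status 124, so the script can return on its own.
run_with_deadline() {
    local deadline=$1
    shift
    timeout "$deadline" "$@"
}

# Example: a check that would otherwise hang is cut off after 1 second.
rc=0
run_with_deadline 1 sleep 10 || rc=$?
echo "exit status: $rc"
```

If the script is only a wrapper around a single command, ending it with exec of that command avoids the extra process altogether, since the command then replaces the shell instead of running as its child.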