------------- FEATURE REQUEST TEMPLATE --------------------
Describe IN DETAIL the feature/behavior/change you would like to see.
My team wanted to use Docker's multi-stage builds, which are available in Docker >= 17.05. We tried Docker 17.09 by specifying it in the kops cluster config, but after the switch we started seeing pods stuck in the Terminating state because their containers could not be killed. The behavior is similar to https://github.com/docker/for-linux/issues/1. The only workaround was to restart the Docker service on the affected node. It got worse after we created a Kubernetes CronJob resource that runs every 5 minutes.
We have been testing Docker 18.03 using our modified nodeup executable based on kops 1.9.0, and are also testing another cluster with Docker 17.03. So far we haven't seen the problem reappear in any of the clusters, but we will continue to monitor them. We understand that the validated Docker version for k8s 1.9 is 17.03.x, but we still want to support multi-stage builds on our cluster. We'd be happy to open a PR with our nodeup change.
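For anyone who wants to monitor for the same symptom, here is a minimal sketch (not part of kops or our nodeup change) that lists pods that have been Terminating for longer than expected. It assumes the input is the parsed JSON from `kubectl get pods --all-namespaces -o json`; the function name `stuck_terminating` and the 5-minute default are illustrative choices, not anything from our setup.

```python
from datetime import datetime, timedelta, timezone


def stuck_terminating(pod_list, now, grace=timedelta(minutes=5)):
    """Return (namespace, name) for pods whose deletionTimestamp is more
    than `grace` in the past.

    `pod_list` is the dict parsed from `kubectl get pods -o json`.
    The helper name and the default threshold are hypothetical.
    """
    stuck = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        stamp = meta.get("deletionTimestamp")
        if stamp is None:
            continue  # pod is not being deleted
        # Kubernetes serializes timestamps as RFC 3339 in UTC,
        # e.g. 2018-04-20T23:37:08Z
        deleted_at = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
        deleted_at = deleted_at.replace(tzinfo=timezone.utc)
        if now - deleted_at > grace:
            stuck.append((meta.get("namespace"), meta.get("name")))
    return stuck
```

To use it, load the kubectl output with `json.load` and pass `datetime.now(timezone.utc)` as `now`. A pod that shows up here repeatedly is a candidate for the unkillable-container problem described above, though other causes (finalizers, for example) can also keep a pod in Terminating.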
Versions:
Feel free to provide a design supporting your feature request.
This is the cronjob spec that we use to reproduce the problem.
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: hello
  namespace: default
spec:
  startingDeadlineSeconds: 30
  concurrencyPolicy: Forbid
  schedule: "*/2 * * * *"
  jobTemplate:
    spec:
      parallelism: 1
      completions: 1
      backoffLimit: 1
      activeDeadlineSeconds: 60
      template:
        spec:
          containers:
          - name: hello
            image: hello-world
          restartPolicy: Never
Update: I reproduced the issue again, this time with a Debian image, kope.io/k8s-1.8-debian-jessie-amd64-hvm-ebs-2018-02-08, after running the CronJob for 4 hours. I also enabled debug logging for the Docker daemon, and this is what I got from grepping the log for the container ID of the first zombie pod:
Apr 20 23:36:08 ip-xxxxxx dockerd[2692]: time="2018-04-20T23:36:08.852603579Z" level=debug msg="Calling POST /v1.31/containers/ec2f5f3d9816912ec02445a51da0c696010b8ba3d6dd8aa5929a20c22f1128d0/start"
Apr 20 23:36:08 ip-xxxxxx dockerd[2692]: time="2018-04-20T23:36:08.921334436Z" level=warning msg="Unknown healthcheck type 'NONE' (expected 'CMD') in container ec2f5f3d9816912ec02445a51da0c696010b8ba3d6dd8aa5929a20c22f1128d0"
Apr 20 23:36:08 ip-xxxxxx dockerd[2692]: time="2018-04-20T23:36:08.927344196Z" level=debug msg="Calling GET /v1.31/containers/ec2f5f3d9816912ec02445a51da0c696010b8ba3d6dd8aa5929a20c22f1128d0/json"
Apr 20 23:37:08 ip-xxxxxx dockerd[2692]: time="2018-04-20T23:37:08.966488414Z" level=debug msg="Calling GET /v1.31/containers/ec2f5f3d9816912ec02445a51da0c696010b8ba3d6dd8aa5929a20c22f1128d0/json"
Apr 20 23:37:08 ip-xxxxxx dockerd[2692]: time="2018-04-20T23:37:08.969425649Z" level=debug msg="Calling POST /v1.31/containers/ec2f5f3d9816912ec02445a51da0c696010b8ba3d6dd8aa5929a20c22f1128d0/stop?t=30"
Apr 20 23:37:08 ip-xxxxxx dockerd[2692]: time="2018-04-20T23:37:08.969462609Z" level=debug msg="Sending kill signal 15 to container ec2f5f3d9816912ec02445a51da0c696010b8ba3d6dd8aa5929a20c22f1128d0"
Apr 20 23:37:08 ip-xxxxxx dockerd[2692]: time="2018-04-20T23:37:08.971371712Z" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container ec2f5f3d9816912ec02445a51da0c696010b8ba3d6dd8aa5929a20c22f1128d0: rpc error: code = Unknown desc = containerd: container not found"
Apr 20 23:37:38 ip-xxxxxx dockerd[2692]: time="2018-04-20T23:37:38.975168187Z" level=info msg="Container ec2f5f3d9816912ec02445a51da0c696010b8ba3d6dd8aa5929a20c22f1128d0 failed to exit within 30 seconds of signal 15 - using the force"
Apr 20 23:37:38 ip-xxxxxx dockerd[2692]: time="2018-04-20T23:37:38.975539946Z" level=debug msg="Sending kill signal 9 to container ec2f5f3d9816912ec02445a51da0c696010b8ba3d6dd8aa5929a20c22f1128d0"
Apr 20 23:37:38 ip-xxxxxx dockerd[2692]: time="2018-04-20T23:37:38.985650139Z" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container ec2f5f3d9816912ec02445a51da0c696010b8ba3d6dd8aa5929a20c22f1128d0: rpc error: code = Unknown desc = containerd: container not found"
Apr 20 23:38:10 ip-xxxxxx dockerd[2692]: time="2018-04-20T23:38:10.995119918Z" level=debug msg="Calling GET /v1.31/containers/ec2f5f3d9816912ec02445a51da0c696010b8ba3d6dd8aa5929a20c22f1128d0/json"
Apr 20 23:38:10 ip-xxxxxx dockerd[2692]: time="2018-04-20T23:38:10.997696415Z" level=debug msg="Calling POST /v1.31/containers/ec2f5f3d9816912ec02445a51da0c696010b8ba3d6dd8aa5929a20c22f1128d0/stop?t=30"
Apr 20 23:38:10 ip-xxxxxx dockerd[2692]: time="2018-04-20T23:38:10.997992274Z" level=debug msg="Sending kill signal 15 to container ec2f5f3d9816912ec02445a51da0c696010b8ba3d6dd8aa5929a20c22f1128d0"
Apr 20 23:38:10 ip-xxxxxx dockerd[2692]: time="2018-04-20T23:38:10.998585925Z" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container ec2f5f3d9816912ec02445a51da0c696010b8ba3d6dd8aa5929a20c22f1128d0: rpc error: code = Unknown desc = containerd: container not found"
Apr 20 23:38:40 ip-xxxxxx dockerd[2692]: time="2018-04-20T23:38:40.999039585Z" level=info msg="Container ec2f5f3d9816912ec02445a51da0c696010b8ba3d6dd8aa5929a20c22f1128d0 failed to exit within 30 seconds of signal 15 - using the force"
Apr 20 23:38:40 ip-xxxxxx dockerd[2692]: time="2018-04-20T23:38:40.999307052Z" level=debug msg="Sending kill signal 9 to container ec2f5f3d9816912ec02445a51da0c696010b8ba3d6dd8aa5929a20c22f1128d0"
Apr 20 23:38:41 ip-xxxxxx dockerd[2692]: time="2018-04-20T23:38:40.999881607Z" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container ec2f5f3d9816912ec02445a51da0c696010b8ba3d6dd8aa5929a20c22f1128d0: rpc error: code = Unknown desc = containerd: container not found"
Apr 20 23:39:12 ip-xxxxxx dockerd[2692]: time="2018-04-20T23:39:12.957926317Z" level=debug msg="Calling GET /v1.31/containers/ec2f5f3d9816912ec02445a51da0c696010b8ba3d6dd8aa5929a20c22f1128d0/json"
Apr 20 23:39:12 ip-xxxxxx dockerd[2692]: time="2018-04-20T23:39:12.972695460Z" level=debug msg="Calling GET /v1.31/containers/ec2f5f3d9816912ec02445a51da0c696010b8ba3d6dd8aa5929a20c22f1128d0/json"
Apr 20 23:39:12 ip-xxxxxx dockerd[2692]: time="2018-04-20T23:39:12.976839162Z" level=debug msg="Calling POST /v1.31/containers/ec2f5f3d9816912ec02445a51da0c696010b8ba3d6dd8aa5929a20c22f1128d0/stop?t=30"
Apr 20 23:39:12 ip-xxxxxx dockerd[2692]: time="2018-04-20T23:39:12.977147551Z" level=debug msg="Sending kill signal 15 to container ec2f5f3d9816912ec02445a51da0c696010b8ba3d6dd8aa5929a20c22f1128d0"
Apr 20 23:39:12 ip-xxxxxx dockerd[2692]: time="2018-04-20T23:39:12.977789931Z" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container ec2f5f3d9816912ec02445a51da0c696010b8ba3d6dd8aa5929a20c22f1128d0: rpc error: code = Unknown desc = containerd: container not found"
Apr 20 23:39:42 ip-xxxxxx dockerd[2692]: time="2018-04-20T23:39:42.979087621Z" level=info msg="Container ec2f5f3d9816912ec02445a51da0c696010b8ba3d6dd8aa5929a20c22f1128d0 failed to exit within 30 seconds of signal 15 - using the force"
I found that the handling of these error messages has changed as of Docker 17.12, specifically in moby/moby#35809, which is hopefully what fixed the problem on our cluster with Docker 18.03.
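The pattern in the log above (the same container ID failing both SIGTERM and SIGKILL with "container not found", over and over) can be picked out of a full daemon log mechanically. The following is a rough sketch, assuming log lines in the same format as the excerpt; the function names and the threshold of 2 are my own and purely illustrative.

```python
import re
from collections import defaultdict

# Captures the 64-hex-character container ID that follows
# "Cannot kill container" in the dockerd warning shown above.
KILL_FAILED = re.compile(r"Cannot kill container ([0-9a-f]{64}):")


def count_kill_failures(lines):
    """Return {container_id: number of 'container kill failed' warnings}."""
    counts = defaultdict(int)
    for line in lines:
        if "container kill failed" not in line:
            continue
        match = KILL_FAILED.search(line)
        if match:
            counts[match.group(1)] += 1
    return dict(counts)


def suspected_zombies(lines, threshold=2):
    """Return sorted container IDs whose kill failed at least `threshold` times.

    The threshold is a hypothetical cutoff: a single failure can be a
    harmless race, while repeated failures match the behavior in this issue.
    """
    counts = count_kill_failures(lines)
    return sorted(cid for cid, n in counts.items() if n >= threshold)
```

Any iterable of lines works, for example an open file object holding the daemon log. Each ID returned can then be matched against the pods stuck in Terminating on that node.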
This would be good to have, as we are currently blocked from using Kubernetes clusters to do multi-stage builds, specifically with OpenFaaS and a CI/CD pipeline.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
It would be good to be able to use multi-stage builds. Waiting for a new version with support for Docker 17.05 or later.
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.