BUG REPORT
kubeadm version (use kubeadm version):
kubeadm version: &version.Info{Major:"1", Minor:"11+", GitVersion:"v1.11.0-beta.2", GitCommit:"be2cfcf9e44b5162a294e977329d6c8194748c4e", GitTreeState:"clean", BuildDate:"2018-06-07T16:19:15Z", GoVersion:"go1.10.2", Compiler:"gc", Platform:"linux/amd64"}
Environment:
[reset] cleaning up running containers using crictl with socket /var/run/containerd/containerd.sock
[reset] failed to stop the running containers using crictl: exit status 1. Trying to use docker instead
[reset] docker doesn't seem to be running. Skipping the removal of running Kubernetes containers
The containerd logs show requests for some oddly named pods:
Jun 15 16:14:59 srv-4pp5n containerd[2416]: time="2018-06-15T16:14:59Z" level=info msg="StopPodSandbox for "W0615""
Jun 15 16:14:59 srv-4pp5n containerd[2416]: time="2018-06-15T16:14:59Z" level=error msg="StopPodSandbox for "W0615" failed" error="an error occurred when try to find sandbox "W0615": does not exist"
kubeadm reset ought to read the config file originally used with init (so it picks up the cri-socket) and then correctly remove the pods.
kubeadm init --config=kubeadm.conf with
apiVersion: kubeadm.k8s.io/v1alpha1
kind: MasterConfiguration
api:
advertiseAddress: 0.0.0.0
networking:
serviceSubnet: fd00:1234::/110
kubernetesVersion: 1.11.0-beta.2
cloudProvider: external
featureGates:
"CoreDNS": false
criSocket: /var/run/containerd/containerd.sock
then
kubeadm reset --cri-socket=/var/run/containerd/containerd.sock
If you run kubeadm reset without --cri-socket it still tries to talk to docker and misses the containers completely. Isn't the choice to use containerd recorded in the k8s databases?
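The suggestion above, that reset could recover the socket recorded at init time instead of defaulting to docker, can be sketched as follows. A real implementation would decode the kubeadm API types with a YAML library; the line-scan and the function name `criSocketFromConfig` here are purely illustrative:

```go
package main

import (
	"fmt"
	"strings"
)

// criSocketFromConfig pulls the criSocket value out of a kubeadm
// MasterConfiguration document. This is only a sketch of the idea that
// `kubeadm reset` could reuse the socket chosen at `kubeadm init` time
// rather than assuming the docker default.
func criSocketFromConfig(conf string) string {
	for _, line := range strings.Split(conf, "\n") {
		line = strings.TrimSpace(line)
		if strings.HasPrefix(line, "criSocket:") {
			return strings.TrimSpace(strings.TrimPrefix(line, "criSocket:"))
		}
	}
	// Fall back to the current default when the config says nothing.
	return "/var/run/dockershim.sock"
}

func main() {
	conf := "kind: MasterConfiguration\ncriSocket: /var/run/containerd/containerd.sock\n"
	fmt.Println(criSocketFromConfig(conf))
}
```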
crictl is required for non-docker runtimes. If you add crictl to your path then this should work 👍
This should throw an error if crisocket is set to something other than the default and crictl is not found.
I already have crictl on the path - it is returning exit status 1. "Not found" is usually error 127.
As I mentioned above, containerd is responding with an error to an incorrect pod name:
Jun 15 16:14:59 srv-4pp5n containerd[2416]: time="2018-06-15T16:14:59Z" level=info msg="StopPodSandbox for "W0615""
Jun 15 16:14:59 srv-4pp5n containerd[2416]: time="2018-06-15T16:14:59Z" level=error msg="StopPodSandbox for "W0615" failed" error="an error occurred when try to find sandbox "W0615": does not exist"
😚 missed that line, my fault.
@luxas i'm thinking we have two choices here:
https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/cmd/reset.go#L226
Check the error for "does not exist" and continue the loop if found. wdyt?
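The "continue the loop" option can be sketched like this. `stopSandbox` is a hypothetical stand-in for running `crictl stopp <pod-id>`; only the error-matching pattern is the point:

```go
package main

import (
	"fmt"
	"strings"
)

// stopSandbox is a hypothetical stand-in for `crictl stopp <pod-id>`.
// Here it fails for unknown IDs the way containerd does in the logs above.
func stopSandbox(id string) error {
	if !strings.HasPrefix(id, "pod-") {
		return fmt.Errorf("an error occurred when try to find sandbox %q: does not exist", id)
	}
	return nil
}

// stopAll keeps going when a sandbox is already gone instead of aborting
// the whole reset, returning only the IDs that failed for other reasons.
func stopAll(ids []string) []string {
	var failed []string
	for _, id := range ids {
		if err := stopSandbox(id); err != nil {
			if strings.Contains(err.Error(), "does not exist") {
				continue // sandbox already gone; nothing left to clean up
			}
			failed = append(failed, id)
		}
	}
	return failed
}

func main() {
	fmt.Println(stopAll([]string{"pod-1", "W0615", "pod-2"}))
}
```

Matching on the error string is fragile, but crictl only surfaces an exit status, so there is little else to key on.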
@NeilW Can you run this command and show the output here?
sudo crictl -r /var/run/containerd/containerd.sock pods --quiet
I suspect that crictl complains about the cri socket format and that causes the next crictl command (crictl stopp) to fail.
It would also help if you add -v=1 to the 'kubeadm reset' command line. It will show you the full crictl command lines used.
@chuckha, @luxas: would it make sense to proceed with PR #64611 further?
ubuntu@srv-lp627:~$ sudo crictl -r /var/run/containerd/containerd.sock pods --quiet
W0618 08:46:27.243792 4261 util_unix.go:75] Using "/var/run/containerd/containerd.sock" as endpoint is deprecated, please consider using full url format "unix:///var/run/containerd/containerd.sock".
371242003d03680603d1e39c7d9af94fbe452558c2a471a207cbbdcfe62a5690
18bbee4d3328498d50eda42fc7e2b3418e7a291b31f355164850d99e9b6b4b41
bf28e32a00f4af622a2dd9b3895a41f0f148cf8e41050a725a465410677800c0
7180b58e4b556b894202a4473d42982507a3e027330ea01fa07ce3bbbdc149e2
3fcac0caed4230e5a5c279629e9e309d122bebd7ebc942a0b7577995631123e4
eb4c32c736677700e361ce60925139e1b740b13e8ebcc18e38fe81f990fc903e
ubuntu@srv-lp627:~$ echo $?
0
ubuntu@srv-lp627:~$ sudo crictl -r /var/run/containerd/containerd.sock pods --quiet 2>/dev/null
371242003d03680603d1e39c7d9af94fbe452558c2a471a207cbbdcfe62a5690
18bbee4d3328498d50eda42fc7e2b3418e7a291b31f355164850d99e9b6b4b41
bf28e32a00f4af622a2dd9b3895a41f0f148cf8e41050a725a465410677800c0
7180b58e4b556b894202a4473d42982507a3e027330ea01fa07ce3bbbdc149e2
3fcac0caed4230e5a5c279629e9e309d122bebd7ebc942a0b7577995631123e4
eb4c32c736677700e361ce60925139e1b740b13e8ebcc18e38fe81f990fc903e
ubuntu@srv-lp627:~$ echo $?
0
Looks like stderr and stdout are combined
ubuntu@srv-lp627:~$ sudo kubeadm reset -v=1 --cri-socket=/var/run/containerd/containerd.sock
[reset] WARNING: changes made to this host by 'kubeadm init' or 'kubeadm join' will be reverted.
[reset] are you sure you want to proceed? [y/N]: y
[preflight] running pre-flight checks
I0618 08:52:02.077846 9015 reset.go:127] [reset] getting init system
[reset] stopping the kubelet service
[reset] unmounting mounted directories in "/var/lib/kubelet"
I0618 08:52:02.122465 9015 reset.go:144] [reset] executing command "awk '$2 ~ path {print $2}' path=/var/lib/kubelet /proc/mounts | xargs -r umount"
[reset] removing kubernetes-managed containers
[reset] cleaning up running containers using crictl with socket /var/run/containerd/containerd.sock
I0618 08:52:02.164741 9015 reset.go:208] [reset] listing running pods using crictl
I0618 08:52:02.164758 9015 reset.go:211] [reset] Executing command /usr/local/bin/crictl -r /var/run/containerd/containerd.sock pods --quiet
I0618 08:52:02.174779 9015 reset.go:219] [reset] Stopping and removing running containers using crictl
I0618 08:52:02.174812 9015 reset.go:225] [reset] Executing command /usr/local/bin/crictl -r /var/run/containerd/containerd.sock stopp W0618
[reset] failed to stop the running containers using crictl: exit status 1. Trying to use docker instead
I0618 08:52:02.182168 9015 checks.go:141] validating if the service is enabled and active
[reset] docker doesn't seem to be running. Skipping the removal of running Kubernetes containers
I0618 08:52:02.197064 9015 reset.go:161] [reset] checking for etcd manifest
I0618 08:52:02.197099 9015 reset.go:163] Found one at /etc/kubernetes/manifests/etcd.yaml
[reset] deleting contents of stateful directories: [/var/lib/kubelet /etc/cni/net.d /var/lib/dockershim /var/run/kubernetes /var/lib/etcd]
I0618 08:52:02.197132 9015 reset.go:172] [reset] deleting content of /var/lib/kubelet
I0618 08:52:02.200177 9015 reset.go:172] [reset] deleting content of /etc/cni/net.d
I0618 08:52:02.200237 9015 reset.go:172] [reset] deleting content of /var/lib/dockershim
I0618 08:52:02.200258 9015 reset.go:172] [reset] deleting content of /var/run/kubernetes
I0618 08:52:02.200303 9015 reset.go:172] [reset] deleting content of /var/lib/etcd
I0618 08:52:02.200553 9015 reset.go:177] [reset] removing contents from the config and pki directories
[reset] deleting contents of config directories: [/etc/kubernetes/manifests /etc/kubernetes/pki]
[reset] deleting files: [/etc/kubernetes/admin.conf /etc/kubernetes/kubelet.conf /etc/kubernetes/bootstrap-kubelet.conf /etc/kubernetes/controller-manager.conf /etc/kubernetes/scheduler.conf]
Fails with the path for the socket, but switching to using the URL works fine
it seems to be related to this issue
Looks like stderr and stdout are combined
Yes, they are. The reason seems to be that the code is bent to work around the absence of fakeexec.Output in k8s.io/utils/exec/testing/fake_exec.go
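The symptom above (the `W0618` deprecation warning becoming a "pod ID") follows directly from combining the streams. A minimal sketch of the fix is to capture stdout alone; the `sh -c` command here just fakes a program that, like crictl, writes a warning to stderr and IDs to stdout:

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// listPods captures only stdout, so a deprecation warning on stderr can
// no longer corrupt the pod-ID list the way CombinedOutput-style capture does.
func listPods() ([]string, error) {
	cmd := exec.Command("sh", "-c",
		`echo "W0618 deprecated endpoint" 1>&2; echo pod-a; echo pod-b`)
	out, err := cmd.Output() // stdout only; stderr stays separate
	if err != nil {
		return nil, err
	}
	return strings.Fields(string(out)), nil
}

func main() {
	pods, err := listPods()
	if err != nil {
		panic(err)
	}
	fmt.Println(pods)
}
```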
@luxas What would be a good way to change DefaultCRISocket to "unix:///var/run/dockershim.sock"? We can't just change it in app/apis/kubeadm/, can we?
Proposed solution:
This solution would avoid breaking older configs (with a proper deprecation policy) and unblock crictl people.
We would also have to add this change into the upgrade path as well 🤔
ignore all errors here and remove the part where we try to fall back to docker. Falling back is different behavior than the rest of the crictl/docker interactions in the codebase and may be surprising.
+1
i don't think it's very optimal to have this fallback.
but i also do not understand the exact reason it was added:
https://github.com/kubernetes/kubernetes/pull/55717
Proposed solution:
yep, and we need to prefix the default with tcp:// for Windows too, like you mention in the other issue.
change the validation to accept a full path or a unix domain socket. Raise a deprecation warning on accepting a full path.
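That validation change can be sketched as below. The function name `validateCRISocket` is hypothetical; the behavior shown — accept `unix://`/`tcp://` URLs, upgrade a bare absolute path with a deprecation warning, reject everything else — is just the proposal above:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// validateCRISocket accepts a full URL or a bare unix-socket path,
// warning on the latter and upgrading it to the unix:// form.
func validateCRISocket(ep string) (string, error) {
	u, err := url.Parse(ep)
	if err != nil {
		return "", err
	}
	switch u.Scheme {
	case "unix", "tcp":
		return ep, nil
	case "":
		if !strings.HasPrefix(ep, "/") {
			return "", fmt.Errorf("invalid CRI socket %q: not an absolute path or URL", ep)
		}
		// Deprecation warning, matching the one crictl itself prints.
		fmt.Printf("WARNING: socket path %q is deprecated, use %q instead\n", ep, "unix://"+ep)
		return "unix://" + ep, nil
	default:
		return "", fmt.Errorf("unsupported CRI socket scheme %q", u.Scheme)
	}
}

func main() {
	s, _ := validateCRISocket("/var/run/containerd/containerd.sock")
	fmt.Println(s)
}
```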
wouldn't it be OK to just start using the full URIs (unix vs tcp sockets) without the deprecation policy - i.e. like doing an [action required] change? might be too breaking, but upgrades complicate this for us.
So I'm firmly in the camp that we can address this in 1.12, but let's be clear.
Folks are advertising 1st class CRIs with absolutely 0 upstream CI-signal. Until that core issue is addressed, multiple CRIs are best effort on break-fix.
/cc @runcom @countspongebob @kubernetes/sig-node-feature-requests @Random-Liu @BenTheElder
Check the error for does not exist and continue the loop if found
we can temporarily filter out the stderr output, as @NeilW showed above.
+1 for continuing the loop. Even if we can't stop one pod, it's worth trying to stop the rest.
ignore all errors here and remove the part where we try to fallback to docker.
+1 for removing fallback to docker. Not sure about ignoring errors. We should at least show warnings if pod can't be stopped or removed.
BTW, some of this is done in this PR, so I'd suggest continuing with it. Especially if this work is scheduled for 1.12.
I'm late to the discussion, so excuse me if I missed details in the long thread of discussion.
The issue seems like a generic problem with switching to using CRI to replace docker CLI in kubeadm. I think we can add a general kubeadm test to cover this code path by configuring criSocket to unix:///var/run/docker.sock with dockershim. By not making docker a special case, we'd be able to capture more issues. If folks can point me to kubeadm integration or e2e testing, I can help when I have spare cycles.
Folks are advertising 1st class CRIs with absolutely 0 upstream CI-signal. Until that core issue is addressed, multiple CRIs are best effort on break-fix.
I'm guessing the "0 upstream CI-signal" refers to the kubeadm integration testing specifically... I am not aware of any presubmit, blocking kubeadm test. If there is one (or sig-cluster-lifecycle wants to add one), we should make sure it has sufficient test coverage for generic CRI runtimes.
A little bit of clarification on CRI testing overall: There are no classes for CRI implementations today. Each CRI implementation is maintained independently. sig-node has been working on defining a better testing policy to track conformance & features for these runtimes/OSs. Some runtimes also run more tests than the minimal requirements (available in the sig-node test dashboard).
So, where are we with this issue?
@NeilW Can you check if the current master works for you? There have been quite a few changes in this area recently, so this issue may already be fixed.
Using 1.12.0-alpha.1 kubeadm
Still have to specify the socket to use to the reset command
kubeadm reset --cri-socket=/var/run/containerd/containerd.sock
Without it, the command uses docker
Reset doesn't remove the containers at all now. Doesn't even try.
Additionally, reset doesn't seem to put the kubelet back into a condition where it can be bootstrapped again. Running init after a reset gives:
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.
Aug 03 16:12:09 srv-9jksa systemd[1]: kubelet.service: Failed with result 'exit-code'.
Aug 03 16:12:09 srv-9jksa kubelet[14516]: F0803 16:12:09.002871 14516 server.go:190] failed to load Kubelet config file /var/lib/kubelet/kubelet-config.yaml, error failed to read kubelet config file "/var/lib/kubelet/kubelet-config.yaml", error: open /var/lib/kubelet/kubelet-config.yaml: no such file or directory
That's after removing the pods by hand using crictl stopp
@NeilW thank you for testing it. That's very strange that kubeadm reset doesn't remove containers. Can you show the output of kubeadm reset --cri-socket=/var/run/containerd/containerd.sock -v2 ?
Reconfirmed the original fault this morning with 1.11.0.
On a fresh kubeadm 1.11.0 install, upgraded to v1.12.0-alpha.1 and ran the reset.
The output was:
$ sudo kubeadm reset --cri-socket=/var/run/containerd/containerd.sock -v2
[reset] WARNING: changes made to this host by 'kubeadm init' or 'kubeadm join' will be reverted.
[reset] are you sure you want to proceed? [y/N]: y
[preflight] running pre-flight checks
I0806 10:08:03.045581 15490 reset.go:128] [reset] getting init system
[reset] stopping the kubelet service
[reset] unmounting mounted directories in "/var/lib/kubelet"
I0806 10:08:03.112743 15490 reset.go:145] [reset] executing command "awk '$2 ~ path {print $2}' path=/var/lib/kubelet /proc/mounts | xargs -r umount"
I0806 10:08:03.218546 15490 reset.go:151] [reset] removing kubernetes-managed containers
I0806 10:08:03.277552 15490 reset.go:160] [reset] checking for etcd manifest
I0806 10:08:03.277596 15490 reset.go:162] Found one at /etc/kubernetes/manifests/etcd.yaml
[reset] deleting contents of stateful directories: [/var/lib/kubelet /etc/cni/net.d /var/lib/dockershim /var/run/kubernetes /var/lib/etcd]
I0806 10:08:03.277649 15490 reset.go:172] [reset] deleting content of /var/lib/kubelet
I0806 10:08:03.289725 15490 reset.go:172] [reset] deleting content of /etc/cni/net.d
I0806 10:08:03.294055 15490 reset.go:172] [reset] deleting content of /var/lib/dockershim
I0806 10:08:03.294101 15490 reset.go:172] [reset] deleting content of /var/run/kubernetes
I0806 10:08:03.294256 15490 reset.go:172] [reset] deleting content of /var/lib/etcd
I0806 10:08:03.294685 15490 reset.go:177] [reset] removing contents from the config and pki directories
[reset] deleting contents of config directories: [/etc/kubernetes/manifests /etc/kubernetes/pki]
[reset] deleting files: [/etc/kubernetes/admin.conf /etc/kubernetes/kubelet.conf /etc/kubernetes/bootstrap-kubelet.conf /etc/kubernetes/controller-manager.conf /etc/kubernetes/scheduler.conf]
Version details
ubuntu@srv-dgk3c:~$ kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.0-alpha.1", GitCommit:"94c2c6c8423d722f436305cd67ef515a8800d723", GitTreeState:"clean", BuildDate:"2018-08-01T23:29:53Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
Crictl output after reset
ubuntu@srv-dgk3c:~$ sudo crictl ps
CONTAINER ID IMAGE CREATED STATE NAME ATTEMPT
a889944e7fbcc sha256:1d3d7afd77d133d0e2ae37b06de8a2c5919a27fc0c45c4cdaa2f3e75cfcddb11 8 minutes ago Running kube-proxy 0
0870f4d992d0c sha256:db5802f13c2c334d83caf090ed3e31cb3beaa3a4704d94e0e0a4ba33b79fbd71 8 minutes ago Running cloud-controller-manager 0
901bc60ce02de sha256:55b70b420785d146c77d2752a2acb23336da3bda6fb0edbcd6e4e8d368e16f96 9 minutes ago Running kube-controller-manager 0
ab030f52e6858 sha256:0e4a34a3b0e6f3956a131a4f3e2244bc6bfc2c3ac5e6f384a2094d39bc207094 9 minutes ago Running kube-scheduler 0
82aa92ae53f60 sha256:214c48e87f58fbc373cdc610e51fa41dc05877865ae1b328f0e8d9afe3c98bf7 9 minutes ago Running kube-apiserver 0
Running again gets rid of some more containers
$ sudo crictl ps
CONTAINER ID IMAGE CREATED STATE NAME ATTEMPT
a889944e7fbcc sha256:1d3d7afd77d133d0e2ae37b06de8a2c5919a27fc0c45c4cdaa2f3e75cfcddb11 16 minutes ago Running kube-proxy 0
82aa92ae53f60 sha256:214c48e87f58fbc373cdc610e51fa41dc05877865ae1b328f0e8d9afe3c98bf7 17 minutes ago Running kube-apiserver 0
Repeated re-running doesn't get rid of any more by the look of it.
@NeilW Very interesting! Looks like kubeadm either can't get the list of running pods, or crictl doesn't return an error when it fails to remove pods. Can you show the output of sudo crictl -r /var/run/containerd/containerd.sock pods -q? This is what kubeadm runs to get the list of running pods.
Sure
$ containerd -v
containerd github.com/containerd/containerd v1.1.2 468a545b9edcd5932818eb9de8e72413e616e86e
$ sudo crictl -v
crictl version 1.11.1
$ sudo crictl -r /var/run/containerd/containerd.sock pods -q
W0806 12:32:05.691629 11784 util_unix.go:75] Using "/var/run/containerd/containerd.sock" as endpoint is deprecated, please consider using full url format "unix:///var/run/containerd/containerd.sock".
e22ed3dc5178078c3f2eb8038345389a83ed350aaa317406515a872262bd9bed
761a7c6f189ef6a6d4093ec07b22cae691371c85e1253b38d974d2a680fd3c51
a096cacfedb225cf9d6780aa04f5b505fc40d0f380576c72b66676af446fba21
42e74b45387b39fdbba2ac472f6cdfd3bf57740d5fc6a1ad56d1ce411b3261ef
1e38b39d22ac183a76e29e28905984efbdc62b84706119f218842f310ee11d59
305f7d7a9bf7b96330f77410d4b6e0a52631e88496b0c578b489d4dddac90091
The normal view is
$ sudo crictl -r /var/run/containerd/containerd.sock pods
W0806 12:34:43.130669 13428 util_unix.go:75] Using "/var/run/containerd/containerd.sock" as endpoint is deprecated, please consider using full url format "unix:///var/run/containerd/containerd.sock".
POD ID CREATED STATE NAME NAMESPACE ATTEMPT
e22ed3dc51780 5 minutes ago Ready kube-proxy-fmc7h kube-system 0
761a7c6f189ef 5 minutes ago Ready cloud-controller-manager-9tsvl kube-system 0
a096cacfedb22 5 minutes ago Ready etcd-srv-1jc04 kube-system 0
42e74b45387b3 5 minutes ago Ready kube-apiserver-srv-1jc04 kube-system 0
1e38b39d22ac1 5 minutes ago Ready kube-scheduler-srv-1jc04 kube-system 0
305f7d7a9bf7b 5 minutes ago Ready kube-controller-manager-srv-1jc04 kube-system 0
There's a magic glyph in the pods filters for both docker and crictl
https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/util/runtime/runtime.go#L111
That looks out of sync with the way the creation side names and stamps pods.
N.B. crictl has a --namespace filter now, which may allow a bit of refactoring here.
sudo crictl -r /var/run/containerd/containerd.sock pods --namespace 'kube-system' -q
(Getting 'kube-system' from whatever universal constant name holds the system namespace name)
The same magic name glyph is on the docker side as well
https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/util/runtime/runtime.go#L120
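The refactoring hinted at above — filtering server-side by namespace instead of matching the `k8s_` name glyph — might look like this. `listSystemPodsCmd` is a hypothetical helper; the `kube-system` literal is inlined where kubeadm could use the metav1.NamespaceSystem constant:

```go
package main

import (
	"fmt"
	"os/exec"
)

// listSystemPodsCmd builds the crictl invocation that lets the runtime
// filter pods by namespace, rather than relying on the "k8s_" docker
// naming convention that CRI runtimes don't follow.
func listSystemPodsCmd(socket string) *exec.Cmd {
	const namespaceSystem = "kube-system" // could come from metav1.NamespaceSystem
	return exec.Command("crictl", "-r", socket,
		"pods", "--namespace", namespaceSystem, "-q")
}

func main() {
	cmd := listSystemPodsCmd("unix:///var/run/containerd/containerd.sock")
	fmt.Println(cmd.Args)
}
```

The sketch only constructs the command; it doesn't require a running containerd to inspect.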
@NeilW Thank you for pointing that out! I've submitted a PR to fix that.
The same magic name glyph is on the docker side as well https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/util/runtime/runtime.go#L120
It works for docker, but doesn't work for CRI, as CRI pod names don't start with k8s_:
$ docker ps -a --filter name=k8s_ -q
b2cd8ca143a8
320eb64247e2
fc9f53c503b0
1516e0564f82
056a449c09e0
f3b308c2021f
09d037505e39
8de57119ff3d
4419b359f26d
1261b649f1a9