Cluster-api: Fix the identified flake in CAPD

Created on 30 Jan 2020 · 6 Comments · Source: kubernetes-sigs/cluster-api

What steps did you take and what happened:
The CAPD e2e flake that I'm seeing most is that kubeadm init can't run because docker isn't available on the system yet. This error can be reproduced with this minimal bash script:

#!/usr/bin/env sh

set -o xtrace

id=$(docker run --detach --tty --privileged --security-opt seccomp=unconfined --tmpfs /tmp --tmpfs /run --volume /var --volume /lib/modules:/lib/modules kindest/node:v1.16.3)

docker exec -it ${id} kubeadm init --ignore-preflight-errors=all

This is exactly what CAPD does, with a few more steps in between. Most of the time it works because kindest/node containers are very fast to start, but sometimes it results in this error:

I0130 14:51:01.444642      70 version.go:251] remote version is much newer: v1.17.2; falling back to: stable-1.16
[init] Using Kubernetes version: v1.16.6
[preflight] Running pre-flight checks
[preflight] WARNING: Couldn't create the interface used for talking to the container runtime: docker is required for container runtime: exec: "docker": executable file not found in $PATH
    [WARNING FileContent--proc-sys-net-bridge-bridge-nf-call-iptables]: /proc/sys/net/bridge/bridge-nf-call-iptables does not exist
    [WARNING Swap]: running with swap on is not supported. Please disable swap
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
error execution phase preflight: docker is required for container runtime: exec: "docker": executable file not found in $PATH
To see the stack trace of this error execute with --v=5 or higher

This can be fixed by waiting a little before kubeadm init. Simply running crictl ps inside the node is a long enough pause to allow the node to start fully. To test this, update the bash script to include this line before kubeadm init:

docker exec -it ${id} crictl ps

My plan is to have the docker machine controller run docker exec crictl ps in a loop until it passes, with a very short timeout (2 seconds at most); in practice it should succeed in well under that. In the grand scheme of things this won't be noticeable to anyone, since the tests take ~100 seconds locally and ~200 seconds on prow.
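The retry loop described above could be sketched in shell roughly as follows. This is only an illustration of the idea, not the controller's code (the actual change would live in CAPD's Go code). The function name wait_for_runtime, the 0.1-second poll interval, and the whole-second deadline check are all assumptions made for this example.

```shell
#!/usr/bin/env sh
# Sketch only: poll `crictl ps` inside the node container until it succeeds
# or a short deadline passes. wait_for_runtime is a hypothetical helper name.

wait_for_runtime() {
    container_id=$1
    timeout_seconds=${2:-2}   # assumed default, matching the 2-second budget above
    deadline=$(( $(date +%s) + timeout_seconds ))

    # `crictl ps` fails while the container runtime is not yet answering.
    until docker exec "${container_id}" crictl ps >/dev/null 2>&1; do
        # date +%s has one-second resolution, so the real wait may run
        # up to a second past the requested timeout.
        if [ "$(date +%s)" -ge "${deadline}" ]; then
            echo "container runtime in ${container_id} not ready after ${timeout_seconds}s" >&2
            return 1
        fi
        # Fractional sleep works with GNU, BSD and busybox sleep,
        # but is not guaranteed by POSIX.
        sleep 0.1
    done
}
```

In the reproduction script, this would sit between docker run and kubeadm init, for example: wait_for_runtime "${id}" && docker exec -it "${id}" kubeadm init --ignore-preflight-errors=all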

/kind bug
/assign
/lifecycle active
/milestone v0.3.0

chuckha

Most helpful comment

i was under the impression that "wait for containerd" was not required. we did have that in kind though for docker.

at least in kind the startup time is _incredibly_ small, especially compared to docker, however we _also_ are able to pre-load the images so we wouldn't need to wait anyhow, vs with docker where we need to docker load once it's ready

BenTheElder on 30 Jan 2020

All 6 comments

The CAPD e2e flake that I'm seeing most is that kubeadm init can't run because docker isn't available on the system yet.

hm, the kindest nodes have containerd / crictl on them, no?
i don't understand if this is an issue with docker on the host or with containerd on the nodes.

docker exec -it ${id} crictl ps

this seems like a workaround for waiting for containerd to be ready on the nodes, which AFAIK should not be required. when kind moved to containerd, one of the discussed benefits was that containerd starts right away, but docker does not.

neolit123 on 30 Jan 2020

@neolit123 A minor conflation. Kubeadm will say that docker is required on the system (see the error log) when no container runtime is available.

containerd does start right away, but not right-away-enough. I haven't nailed down the timing because I don't think it's important, but there is a very small amount of time between starting the node and having a container runtime available. The problem is that, since we do nothing between docker ps and kubeadm init, we hit that very tiny sliver of time when the runtime is not available. I suspect that window is much larger when using docker.

chuckha on 30 Jan 2020

A minor conflation. Kubeadm will say that docker is required on the system (see the error log) when no container runtime is available.

i see, this error might be a bit misleading because the actual problem is that it's not finding any container runtime (CR).
cc @rosti

containerd does start right away, but not right-away-enough. I haven't nailed down the timing because I don't think it's important, but there is a very small amount of time between starting the node and having a container runtime available. The problem is that, since we do nothing between docker ps and kubeadm init, we hit that very tiny sliver of time when the runtime is not available. I suspect that window is much larger when using docker.

if you have confirmed that this is truly required, the change seems fine.
i was under the impression that "wait for containerd" was not required. we did have that in kind though for docker.
@BenTheElder might have better context on this topic.

neolit123 on 30 Jan 2020

Try running the bash script provided. It, at least on my system, fails very reliably.

chuckha on 30 Jan 2020

i have added LGTM to https://github.com/kubernetes-sigs/cluster-api/pull/2224 which will hopefully solve the flakes. but more discussion on this would be interesting, in case we see the same issues for kind, kinder and other users.

neolit123 on 30 Jan 2020