Amazon-vpc-cni-k8s: Issues with Weave and AWS CNI

Created on 15 Oct 2020 · 11 Comments · Source: aws/amazon-vpc-cni-k8s

What happened:

Pods are taking a long time to come up. Based on the logs below, it appears to occur when using both aws-cni and Weave on Amazon Linux 2.

This has also been seen with kernel 5.x.x.

... Container runtime is set to not ready as it cannot find the CNI config file
4711 docker_service.go:260] Docker Info: &{ID:ULND:...D5KO:6O4F Containers:0 ContainersRunning:0 ContainersPaused:0 ContainersStopped:0 Images:2 Driver:overlay2 DriverStatus....
4711 docker_service.go:255] Docker cri networking managed by cni
4711 cni.go:237] Unable to update cni config: no networks found in /etc/cni/net.d
4711 kubelet.go:2193] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
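
The "no networks found in /etc/cni/net.d" message means the kubelet found no CNI network config in that directory at that moment; the CNI plugin normally writes its config there once its own pod is running. A minimal sketch, assuming Python 3 is available on the node, for checking whether a config file has appeared yet. The function name `find_cni_configs` is hypothetical; the suffix list is the set of extensions the CNI config loader accepts.

```python
from pathlib import Path

# File extensions that the CNI config loader treats as network configs.
CNI_CONFIG_SUFFIXES = {".conf", ".conflist", ".json"}


def find_cni_configs(conf_dir="/etc/cni/net.d"):
    """Return the sorted names of CNI config files in conf_dir.

    Returns an empty list when the directory is missing or holds no
    config file, which is the state the kubelet log above complains about.
    """
    directory = Path(conf_dir)
    if not directory.is_dir():
        return []
    return sorted(
        entry.name
        for entry in directory.iterdir()
        if entry.is_file() and entry.suffix in CNI_CONFIG_SUFFIXES
    )


if __name__ == "__main__":
    configs = find_cni_configs()
    if configs:
        print("CNI configs found:", ", ".join(configs))
    else:
        print("no CNI config found yet in /etc/cni/net.d")
```

Running this repeatedly while a node boots would show roughly when the CNI pod managed to write its config, which helps separate "CNI pod started late" from "CNI pod could not start at all".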

... Kubelet having issues finding the node
Oct 14 12:07:59 1.2.3.4 kubelet: E1014 4711 kubelet.go:2273] node "1.2.3.4.eu-central-1.compute.internal" not found
Oct 14 12:07:59 1.2.3.4 kubelet: I1014 4711 kubelet_node_status.go:73] Successfully registered node 1.2.3.4.eu-central-1.compute.internal

... Docker has issues pulling an image
Oct 14 12:08:17 1.2.3.4 kubelet: E1014 4711 kuberuntime_manager.go:803] container start failed: ErrImagePull: rpc error: code = Unknown desc = Error response from daemon: Get https://123456789.dkr.ecr.region.amazonaws.com/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

... CNI Error again
Oct 14 12:08:35 1.2.3.4 kubelet: W1014 4711 cni.go:237] Unable to update cni config: no networks found in /etc/cni/net.d
Oct 14 12:08:36 1.2.3.4 kubelet: E1014 4711 kubelet.go:2193] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized

... At this point, Open vSwitch initializes on the node, which then appears to apply the network settings
Oct 14 12:22:30 1.2.3.4 kernel: openvswitch: Open vSwitch switching datapath
Oct 14 12:22:30 1.2.3.4 kernel: device datapath entered promiscuous mode
Oct 14 12:22:30 1.2.3.4 kernel: weave: port 1(vethwedu) entered blocking state
Oct 14 12:22:30 1.2.3.4 kernel: weave: port 1(vethwedu) entered disabled state
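
To put a number on "a long time", the delay can be read off the syslog timestamps in the excerpts above. A minimal sketch in Python: the helper names are hypothetical, the year is an assumption (syslog lines carry none; 2020 is taken from the issue date), and month parsing assumes an English locale.

```python
from datetime import datetime


def parse_syslog_time(line, year=2020):
    """Parse the leading 'Mon DD HH:MM:SS' timestamp of a syslog line.

    The year is not part of the line, so it is supplied by the caller
    (2020 is an assumption based on the issue date).
    """
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def seconds_between(first_line, last_line, year=2020):
    """Return the whole seconds elapsed between two syslog lines."""
    delta = parse_syslog_time(last_line, year) - parse_syslog_time(first_line, year)
    return int(delta.total_seconds())


# Two lines copied from the log excerpts above.
FIRST = 'Oct 14 12:07:59 1.2.3.4 kubelet: E1014 4711 kubelet.go:2273] node "1.2.3.4.eu-central-1.compute.internal" not found'
LAST = "Oct 14 12:22:30 1.2.3.4 kernel: openvswitch: Open vSwitch switching datapath"

if __name__ == "__main__":
    print(seconds_between(FIRST, LAST))  # prints 871
```

For the lines shown here that is 871 seconds, about 14.5 minutes between the kubelet registering the node and the Open vSwitch datapath coming up.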

What you expected to happen:
Expected that pods on the machine wouldn't take so long to go into a ready state

Issue appears to happen using aws-cni and weave

How to reproduce it (as minimally and precisely as possible):
N/A

Anything else we need to know?:

Environment: EKS

Kubernetes version (use kubectl version):
CNI Version
weave 0.3.0
aws-cni: N/A
OS (e.g: cat /etc/os-release): amzn2
Kernel (e.g. uname -a): 4.14.198

bug

mclzn

All 11 comments

Hey @mclzn,

Could you also check the CNI (aws-node) version you have running? Also if you can run the script

sudo bash /opt/cni/bin/aws-cni-support.sh

to collect logs and share with us, that could help us to find the cause.

Thanks

haouc on 16 Oct 2020

Hey @mclzn ,

Did you mean using aws-node (AWS VPC CNI plugin) with Weave Net? It is recommended to remove aws-node if you are using Weave Net. This is a resource you could refer to. https://www.weave.works/docs/net/latest/kubernetes/kube-addon/#-installing-on-eks

Thanks

haouc on 16 Oct 2020

Hi @haouc ,
we were able to reproduce the issue using different CNI plugins (Weave and the AWS CNI plugin, as mentioned) in different EKS 1.17 clusters. In both cases only one of the CNIs was deployed.
Do you have an email address I can send the logs from the support script to?

After some troubleshooting and going through different issues, we tried setting up a cluster on the 1.16 control plane, where we don't see these issues.

Thanks

gietschess on 16 Oct 2020

Hey @gietschess ,
Thanks for providing more information. Yes, please send the logs to [email protected]. We will investigate and keep you updated.

Thanks

haouc on 16 Oct 2020

+1 This is happening on aws-eks with k8s 1.14 and we were able to reproduce it with k8s 1.18 with aws vpc cni 1.75. Please do take a look at this.

bharath-12345 on 19 Oct 2020

Hi there,
we found the reason for the issues in our environment.
We accidentally pulled the kube-proxy image from eu-west-1 instead of eu-central-1. Due to internal IP address restrictions we weren't able to pull from all IP addresses behind the 602401143452.dkr.ecr.eu-central-1.amazonaws.com DNS entry.
After changing to eu-central-1 everything works as expected.

Maybe this helps others resolve this.
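
For anyone who wants to check for the same mistake, here is a rough sketch in Python that flags ECR image references whose region differs from the node's region. The function names are hypothetical, and the repository names and tags in the example are placeholders; only the registry host format and the account ID come from this thread.

```python
import re

# Private ECR registry hosts look like <account>.dkr.ecr.<region>.amazonaws.com
ECR_HOST = re.compile(
    r"^(?P<account>\d+)\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com(?:\.cn)?$"
)


def ecr_region(image):
    """Return the region encoded in an ECR image reference, or None if not ECR."""
    host = image.split("/", 1)[0]
    match = ECR_HOST.match(host)
    return match.group("region") if match else None


def images_from_other_regions(images, node_region):
    """Return the ECR images whose registry region differs from node_region."""
    return [
        image
        for image in images
        if ecr_region(image) not in (None, node_region)
    ]


if __name__ == "__main__":
    # Placeholder repositories and tags, for illustration only.
    images = [
        "602401143452.dkr.ecr.eu-west-1.amazonaws.com/kube-proxy:example-tag",
        "602401143452.dkr.ecr.eu-central-1.amazonaws.com/example-repo:example-tag",
    ]
    for image in images_from_other_regions(images, "eu-central-1"):
        print("cross-region image:", image)
```

The image list could be fed from the pod specs of the DaemonSets in kube-system; how it is collected is left out here on purpose.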

gietschess on 19 Oct 2020

@gietschess thanks for updating with the root cause. Since this is an environmental issue, I will close this issue for now.

haouc on 19 Oct 2020

Sorry, I should keep this issue open, since the root causes for the others haven't been located yet and I am still looking into them. Thanks.

haouc on 19 Oct 2020

@bharath-12345 and @mclzn , if you are still facing this issue, could you collect the logs and send them to me ([email protected])? Thanks.

haouc on 21 Oct 2020

@bharath-12345 and @mclzn , Can you please share the logs if you are still facing this issue?

Thank you.