What happened:
Pods are failing to launch in my cluster. All pods are stuck in the ContainerCreating state.
Attach logs
Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "9742b380336c57443151508a577ace541c2b6c44b278f73f9afc7a2c23053220" network for pod "cluster-autoscaler-7f8f7cdb4d-64lt4": networkPlugin cni failed to set up pod "cluster-autoscaler-7f8f7cdb4d-64lt4_kube-system" network: add cmd: Error received from AddNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container "9742b380336c57443151508a577ace541c2b6c44b278f73f9afc7a2c23053220" network for pod "cluster-autoscaler-7f8f7cdb4d-64lt4": networkPlugin cni failed to teardown pod "cluster-autoscaler-7f8f7cdb4d-64lt4_kube-system" network: del cmd: error received from DelNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"]
Environment:
Kubernetes version (kubectl version): 1.17
OS (cat /etc/os-release): NAME="Amazon Linux" VERSION="2"
Kernel (uname -a): 4.14.186-146.268.amzn2.x86_64
Hi @Arsen-Uulu
Can you please share the logs collected by running this script on the node: sudo bash /opt/cni/bin/aws-cni-support.sh. Also, please share the output for the aws-node DaemonSet.
Is this with custom networking, or have you enabled security groups per pod?
Thanks.
@jayanthvn I synced up with @Arsen-Uulu offline. It looks like they are not using custom networking or the security groups per pod feature.
Error messages in their log:
{"level":"error","ts":"2020-12-10T17:09:31.682Z","caller":"aws-k8s-agent/main.go:28","msg":"Initialization failure: open /var/run/aws-node/ipam.json.tmp156063755: no such file or directory"}
{"level":"error","ts":"2020-12-10T17:23:28.745Z","caller":"aws-k8s-agent/main.go:28","msg":"Initialization failure: open /var/run/aws-node/ipam.json.tmp135855914: no such file or directory"}
I asked @Arsen-Uulu to run kubectl apply -f https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/master/config/v1.7/aws-k8s-cni.yaml, and it looks like that mitigated the issue.
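For anyone who wants to check their own logs for the same initialization failures quoted above, here is a minimal sketch in Python. It assumes the log is one JSON object per line, as in the two lines quoted; the function name and the idea of passing the log path on the command line are illustrative, not part of the CNI tooling.

```python
import json
import sys


def initialization_failures(lines):
    """Yield (timestamp, message) for error-level entries whose message
    starts with "Initialization failure". Non-JSON lines are skipped."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a structured log line
        if not isinstance(entry, dict):
            continue
        msg = entry.get("msg", "")
        if entry.get("level") == "error" and msg.startswith("Initialization failure"):
            yield entry.get("ts"), msg


if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumption: the first argument is the path to a collected log file.
    with open(sys.argv[1], encoding="utf-8") as handle:
        for ts, msg in initialization_failures(handle):
            print(ts, msg)
```

If this prints entries that mention ipam.json.tmp, the node is probably hitting the same failure reported in this thread.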
@SaranBalaji90 I am also having this issue. Using 1.7.7 on EKS 1.18 (managed nodes). The only way I could get the CNI to stop restarting was to remove the liveness and readiness probes, but that doesn't resolve the underlying issue. The CNI is failing to respond on port 50051.
Later: It looks like 1.7.7 is NOT working on managed node groups (or SGs for pods). 1.7.7 does appear to work on self-managed nodes. No issues with the liveness or readiness probes on self-managed workers.
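A quick way to confirm whether anything is accepting connections on the port named in the error (127.0.0.1:50051) is a plain TCP connect, run on the affected worker node itself. This is only a sketch: the function name and the 2-second timeout are arbitrary choices, and a successful connect shows only that the port is open, not that the gRPC service behind it is healthy.

```python
import socket


def grpc_port_open(host: str = "127.0.0.1", port: int = 50051, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds, False otherwise."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    state = "accepting connections" if grpc_port_open() else "refusing connections or unreachable"
    print(f"127.0.0.1:50051 is {state}")
```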
It looks like @jicowan doesn't currently have a managed node group on the affected cluster. Once he spins up a new node group tomorrow, we can investigate why the CNI is timing out on his cluster.
I created a new cluster [1.18] and upgraded the CNI to 1.7.7 and it worked. I can't explain why the probes failed or why they would only work on self-managed workers. Perhaps iptables got corrupted on my managed nodes.
I also tried to reproduce the issue on a v1.18 cluster with CNI v1.7.7 on a managed node group, and it is working as expected.
I tested both v1.7.7 and v1.7.6 and was not able to reproduce the issue.
I did encounter "Liveness probe failed" when I deleted my managed add-on and applied CNI v1.7.7 manually. It was because the worker role didn't have AmazonEKS_CNI_Policy attached. After attaching the policy, everything worked as expected.
I am not sure if @jicowan and @Arsen-Uulu ran into similar issues, but since they have deleted the clusters, we will not be able to investigate the root cause.
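The missing-policy case described above can be checked from the AWS CLI output. The sketch below assumes the aws CLI is installed and configured, and that you know the worker role name (the role name passed in is a placeholder you must supply). It uses aws iam list-attached-role-policies, whose JSON output has an AttachedPolicies list with PolicyName fields; the helper names are illustrative.

```python
import json
import subprocess

CNI_POLICY = "AmazonEKS_CNI_Policy"


def has_cni_policy(attached: dict) -> bool:
    """Return True if the parsed output of `aws iam list-attached-role-policies`
    lists AmazonEKS_CNI_Policy among the attached managed policies."""
    return any(
        policy.get("PolicyName") == CNI_POLICY
        for policy in attached.get("AttachedPolicies", [])
    )


def fetch_attached_policies(role_name: str) -> dict:
    """Call the AWS CLI for the given role (requires credentials; not run on import)."""
    result = subprocess.run(
        ["aws", "iam", "list-attached-role-policies",
         "--role-name", role_name, "--output", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

Usage would be something like has_cni_policy(fetch_attached_policies(role_name)) with your own worker role name. Note that this only looks at managed policies attached directly to that role.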
I synced up with @SaranBalaji90 and am closing the issue, given that everything is working as expected for both. Please reopen if you run into this again.
I am able to reproduce this now, @abhipth and @SaranBalaji90. Create a cluster with a managed node group. Add the VPC CNI as a managed add-on. The aws-node pods will start failing the probes. The switch to managed add-ons overwrites the service account, but doesn't configure the service account with the role selected from the drop-down.
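To see whether the service account ended up without a role after the add-on was applied, one option is to inspect its annotations. This sketch assumes the role is expected to be wired up through the eks.amazonaws.com/role-arn annotation (IAM roles for service accounts) on the aws-node service account in kube-system, and that kubectl is configured for the cluster. The helper names are illustrative.

```python
import json
import subprocess

ROLE_ANNOTATION = "eks.amazonaws.com/role-arn"


def service_account_role(service_account: dict):
    """Return the role ARN annotation from a parsed ServiceAccount object,
    or None if the annotation is absent."""
    annotations = service_account.get("metadata", {}).get("annotations") or {}
    return annotations.get(ROLE_ANNOTATION)


def fetch_aws_node_service_account() -> dict:
    """Fetch the aws-node ServiceAccount as JSON (requires cluster access; not run on import)."""
    result = subprocess.run(
        ["kubectl", "get", "serviceaccount", "aws-node",
         "-n", "kube-system", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

If service_account_role(fetch_aws_node_service_account()) returns None after the add-on is installed, that is consistent with the behavior described above, though it does not by itself prove the cause.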
Thanks for reporting @jicowan, tracking this issue here - https://github.com/aws/amazon-vpc-cni-k8s/issues/1338
@abhipth I have fixed my issue.
I had updated the CNI plugin version directly on an existing EKS cluster. After applying https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/master/config/v1.7/aws-k8s-cni.yaml, the issue was resolved.