What happened:
New pods started failing to come up after upgrading to eks CNI v1.7.0 from v1.6.0. I was able to upgrade to v1.6.3 without any issue. I started to see the errors when I upgraded to 1.7.0. I also tried to upgrade to other version ( v1.7.2 and v1.7.5) but I am seeing the same issue.
Here is the error I am seeing.
Warning FailedCreatePodSandBox 28s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "7e3423d27fc6f36276de03aa7f41ef6b6f02121f800b65b64b8073c6a207b696" network for pod "spinnaker-get-resource-type-3fc73e4e3611d9f4-ps4b7": networkPlugin cni failed to set up pod "spinnaker-get-resource-type-3fc73e4e3611d9f4-ps4b7_default" network: invalidcharacter '{' after top-level value
Here is the cni log
Anything else we need to know?:
Environment:
v1.17.9-eks-4c69765.4.58-27.104.amzn2.x86_64Hi @YesemKebede
Can you please confirm if you have set AWS_VPC_K8S_PLUGIN_LOG_FILE to stdout?
I checked IPAMD logs and I see IP allocation seems fine on the first look. We will further investigate the issue.
Thanks.
@jayanthvn AWS_VPC_K8S_PLUGIN_LOG_FILE is set to /var/log/aws-routed-eni/plugin.log
Thanks @YesemKebede . We will look into it asap.
Hi @YesemKebede
Can also please confirm how you upgraded from 1.6.3 to 1.7.X?
Thank you!
@jayanthvn I followed this Doc
I upgraded from 1.6.3 to 1.7.5 and had the same problem. No new pod could be started and they had that same error. I had both AWS_VPC_K8S_CNI_LOG_FILE and AWS_VPC_K8S_PLUGIN_LOG_FILE set to stdout and had this same problem. Removing them so they get sent to files as per their default config solved the issue for me.
Google led me here: https://github.com/Azure/azure-container-networking/issues/195#issuecomment-408990663
@sophomeric Yes setting AWS_VPC_K8S_PLUGIN_LOG_FILE to stdout will cause a similar issue(#1251). But here it wasn't set.
We are experiencing the same issue on newly provisioned clusters with the following difference in versions:
v1.17.11-eks-cfdc40v1.9.0-rc2 in chaining modeIf it helps, although I am not 100% sure about the Kubernetes version being exactly the same patch version, that same configuration was working last week on a different cluster with the same characteristics.
Hi @Aggouri
Can you please confirm the CNI version for the two clusters?
kubectl describe daemonset aws-node -n kube-system | grep Image | cut -d "/" -f 2
Thanks.
@jayanthvn
Can you please confirm the CNI version for the two clusters?
The cluster was provisioned a few hours ago:
$ kubectl describe daemonset aws-node --namespace kube-system | grep Image | cut -d "/" -f 2
amazon-k8s-cni-init:v1.7.5-eksbuild.1
amazon-k8s-cni:v1.7.5-eksbuild.1
Sadly, I am unable to provide the CNI plugin version of the previous cluster as it was already torn down. If it helps, I know it was provisioned at the beginning of last week and used the defaults EKS came with for Kubernetes version 1.17.x.
Thanks for conforming @Aggouri . We are actively looking into the issue. Will update asap.
@jayanthvn upgraded from 1.6.3 to 1.7.5 having a problem.
{"level":"error","ts":"2020-10-27T10:44:04.889-0400","caller":"routed-eni-cni-plugin/cni.go:249","msg":"Error received from DelNetwork gRPC call for container ba592f75d2b25963c4bd64f218ae0930917fa39e3efffd2231b313f8eb42d344: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:50051
Hi,
We have found the RC, for now please add pluginLogFile and pluginLogLevel in 05-cilium.conflist. We will fix this issue in the next release.
cat /etc/cni/net.d/05-cilium.conflist
{
"cniVersion": "0.3.1",
"name": "aws-cni",
"plugins": [
{
"name": "aws-cni",
"type": "aws-cni",
"vethPrefix": "eni",
"mtu": "9001",
"pluginLogFile": "/var/log/aws-routed-eni/plugin.log",
"pluginLogLevel": "Debug"
},
{
"name": "cilium",
"type": "cilium-cni",
"enable-debug": false
}
]
}
I was able to repro and below is the o/p after fixing the conflist -
dev-dsk-varavaj-2b-72f02457 % kubectl describe daemonset aws-node -n kube-system | grep 1.7.5
Image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-init:v1.7.5
Image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.7.5
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
my-nginx-86b7cfc89-jvzvw 1/1 Running 0 18h 192.168.10.206 ip-192-168-0-43.us-west-2.compute.internal <none> <none>
my-nginx-86b7cfc89-p4q2t 1/1 Running 0 18m 192.168.67.156 ip-192-168-81-109.us-west-2.compute.internal <none> <none>
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
aws-node-95jtw 1/1 Running 0 23m 192.168.0.43 ip-192-168-0-43.us-west-2.compute.internal <none> <none>
aws-node-cnrkq 1/1 Running 0 24m 192.168.81.109 ip-192-168-81-109.us-west-2.compute.internal <none> <none>
aws-node-j64z5 1/1 Running 0 23m 192.168.51.208 ip-192-168-51-208.us-west-2.compute.internal <none> <none>
cilium-5gr4s 1/1 Running 0 18h 192.168.51.208 ip-192-168-51-208.us-west-2.compute.internal <none> <none>
cilium-d4nff 1/1 Running 0 18h 192.168.0.43 ip-192-168-0-43.us-west-2.compute.internal <none> <none>
cilium-node-init-kwsj6 1/1 Running 0 18h 192.168.0.43 ip-192-168-0-43.us-west-2.compute.internal <none> <none>
cilium-node-init-pv4jw 1/1 Running 0 18h 192.168.51.208 ip-192-168-51-208.us-west-2.compute.internal <none> <none>
cilium-node-init-pxdfv 1/1 Running 0 18h 192.168.81.109 ip-192-168-81-109.us-west-2.compute.internal <none> <none>
cilium-operator-6554b44b9d-f88zj 1/1 Running 0 18h 192.168.51.208 ip-192-168-51-208.us-west-2.compute.internal <none> <none>
cilium-operator-6554b44b9d-j8tlb 1/1 Running 0 18h 192.168.0.43 ip-192-168-0-43.us-west-2.compute.internal <none> <none>
cilium-qg6tf 1/1 Running 0 18h 192.168.81.109 ip-192-168-81-109.us-west-2.compute.internal <none> <none>
coredns-5c97f79574-9nnkk 1/1 Running 0 18h 192.168.68.203 ip-192-168-81-109.us-west-2.compute.internal <none> <none>
coredns-5c97f79574-jnsm2 1/1 Running 0 18h 100.64.95.97 ip-192-168-51-208.us-west-2.compute.internal <none> <none>
kube-proxy-bmv86 1/1 Running 0 18h 192.168.81.109 ip-192-168-81-109.us-west-2.compute.internal <none> <none>
kube-proxy-j7c8f 1/1 Running 0 18h 192.168.0.43 ip-192-168-0-43.us-west-2.compute.internal <none> <none>
kube-proxy-ss98z 1/1 Running 0 18h 192.168.51.208 ip-192-168-51-208.us-west-2.compute.internal <none> <none>
Thank you!
is there any ETA for a new release including this fix?
Thanks!
Most helpful comment
Hi,
We have found the RC, for now please add
pluginLogFileandpluginLogLevelin05-cilium.conflist. We will fix this issue in the next release.I was able to repro and below is the o/p after fixing the conflist -
Thank you!