Amazon-vpc-cni-k8s: New pods failing to start with `FailedCreatePodSandBox` warning for CNI versions 1.7.x with Cilium

Created on 19 Oct 2020  路  15Comments  路  Source: aws/amazon-vpc-cni-k8s

What happened:

New pods started failing to come up after upgrading to eks CNI v1.7.0 from v1.6.0. I was able to upgrade to v1.6.3 without any issue. I started to see the errors when I upgraded to 1.7.0. I also tried to upgrade to other version ( v1.7.2 and v1.7.5) but I am seeing the same issue.

Here is the error I am seeing.

 Warning  FailedCreatePodSandBox  28s                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "7e3423d27fc6f36276de03aa7f41ef6b6f02121f800b65b64b8073c6a207b696" network for pod "spinnaker-get-resource-type-3fc73e4e3611d9f4-ps4b7": networkPlugin cni failed to set up pod "spinnaker-get-resource-type-3fc73e4e3611d9f4-ps4b7_default" network: invalidcharacter '{' after top-level value

Here is the cni log

Anything else we need to know?:

  • We have Cilium running in chaining mode (v1.8.4)

Environment:

  • Kubernetes version :v1.17.9-eks-4c6976
  • CNI Version: Tried different versions but seeing same issue for (1.7.0, 1.7.2, 1.7.5)
  • Kernel: 5.4.58-27.104.amzn2.x86_64
bug prioritP1

Most helpful comment

Hi,

We have found the RC, for now please add pluginLogFile and pluginLogLevel in 05-cilium.conflist. We will fix this issue in the next release.

cat /etc/cni/net.d/05-cilium.conflist
{
  "cniVersion": "0.3.1",
  "name": "aws-cni",
  "plugins": [
    {
      "name": "aws-cni",
      "type": "aws-cni",
      "vethPrefix": "eni",
      "mtu": "9001",
      "pluginLogFile": "/var/log/aws-routed-eni/plugin.log",
      "pluginLogLevel": "Debug"
    },
    {
       "name": "cilium",
       "type": "cilium-cni",
       "enable-debug": false
    }
  ]
}

I was able to repro and below is the o/p after fixing the conflist -

dev-dsk-varavaj-2b-72f02457 % kubectl describe daemonset aws-node -n kube-system | grep 1.7.5
    Image:      602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-init:v1.7.5
    Image:      602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.7.5

NAME                       READY   STATUS    RESTARTS   AGE   IP               NODE                                           NOMINATED NODE   READINESS GATES
my-nginx-86b7cfc89-jvzvw   1/1     Running   0          18h   192.168.10.206   ip-192-168-0-43.us-west-2.compute.internal     <none>           <none>
my-nginx-86b7cfc89-p4q2t   1/1     Running   0          18m   192.168.67.156   ip-192-168-81-109.us-west-2.compute.internal   <none>           <none>

NAME                               READY   STATUS    RESTARTS   AGE   IP               NODE                                           NOMINATED NODE   READINESS GATES
aws-node-95jtw                     1/1     Running   0          23m   192.168.0.43     ip-192-168-0-43.us-west-2.compute.internal     <none>           <none>
aws-node-cnrkq                     1/1     Running   0          24m   192.168.81.109   ip-192-168-81-109.us-west-2.compute.internal   <none>           <none>
aws-node-j64z5                     1/1     Running   0          23m   192.168.51.208   ip-192-168-51-208.us-west-2.compute.internal   <none>           <none>
cilium-5gr4s                       1/1     Running   0          18h   192.168.51.208   ip-192-168-51-208.us-west-2.compute.internal   <none>           <none>
cilium-d4nff                       1/1     Running   0          18h   192.168.0.43     ip-192-168-0-43.us-west-2.compute.internal     <none>           <none>
cilium-node-init-kwsj6             1/1     Running   0          18h   192.168.0.43     ip-192-168-0-43.us-west-2.compute.internal     <none>           <none>
cilium-node-init-pv4jw             1/1     Running   0          18h   192.168.51.208   ip-192-168-51-208.us-west-2.compute.internal   <none>           <none>
cilium-node-init-pxdfv             1/1     Running   0          18h   192.168.81.109   ip-192-168-81-109.us-west-2.compute.internal   <none>           <none>
cilium-operator-6554b44b9d-f88zj   1/1     Running   0          18h   192.168.51.208   ip-192-168-51-208.us-west-2.compute.internal   <none>           <none>
cilium-operator-6554b44b9d-j8tlb   1/1     Running   0          18h   192.168.0.43     ip-192-168-0-43.us-west-2.compute.internal     <none>           <none>
cilium-qg6tf                       1/1     Running   0          18h   192.168.81.109   ip-192-168-81-109.us-west-2.compute.internal   <none>           <none>
coredns-5c97f79574-9nnkk           1/1     Running   0          18h   192.168.68.203   ip-192-168-81-109.us-west-2.compute.internal   <none>           <none>
coredns-5c97f79574-jnsm2           1/1     Running   0          18h   100.64.95.97     ip-192-168-51-208.us-west-2.compute.internal   <none>           <none>
kube-proxy-bmv86                   1/1     Running   0          18h   192.168.81.109   ip-192-168-81-109.us-west-2.compute.internal   <none>           <none>
kube-proxy-j7c8f                   1/1     Running   0          18h   192.168.0.43     ip-192-168-0-43.us-west-2.compute.internal     <none>           <none>
kube-proxy-ss98z                   1/1     Running   0          18h   192.168.51.208   ip-192-168-51-208.us-west-2.compute.internal   <none>           <none>

Thank you!

All 15 comments

Hi @YesemKebede

Can you please confirm if you have set AWS_VPC_K8S_PLUGIN_LOG_FILE to stdout?

I checked IPAMD logs and I see IP allocation seems fine on the first look. We will further investigate the issue.

Thanks.

@jayanthvn AWS_VPC_K8S_PLUGIN_LOG_FILE is set to /var/log/aws-routed-eni/plugin.log

Thanks @YesemKebede . We will look into it asap.

Hi @YesemKebede

Can also please confirm how you upgraded from 1.6.3 to 1.7.X?

Thank you!

@jayanthvn I followed this Doc

I upgraded from 1.6.3 to 1.7.5 and had the same problem. No new pod could be started and they had that same error. I had both AWS_VPC_K8S_CNI_LOG_FILE and AWS_VPC_K8S_PLUGIN_LOG_FILE set to stdout and had this same problem. Removing them so they get sent to files as per their default config solved the issue for me.

Google led me here: https://github.com/Azure/azure-container-networking/issues/195#issuecomment-408990663

@sophomeric Yes setting AWS_VPC_K8S_PLUGIN_LOG_FILE to stdout will cause a similar issue(#1251). But here it wasn't set.

We are experiencing the same issue on newly provisioned clusters with the following difference in versions:

  • Kubernetes version: v1.17.11-eks-cfdc40
  • Cilium v1.9.0-rc2 in chaining mode

If it helps, although I am not 100% sure about the Kubernetes version being exactly the same patch version, that same configuration was working last week on a different cluster with the same characteristics.

Hi @Aggouri

Can you please confirm the CNI version for the two clusters?

kubectl describe daemonset aws-node -n kube-system | grep Image | cut -d "/" -f 2

Thanks.

@jayanthvn

Can you please confirm the CNI version for the two clusters?

The cluster was provisioned a few hours ago:

$ kubectl describe daemonset aws-node --namespace kube-system | grep Image | cut -d "/" -f 2

amazon-k8s-cni-init:v1.7.5-eksbuild.1
amazon-k8s-cni:v1.7.5-eksbuild.1

Sadly, I am unable to provide the CNI plugin version of the previous cluster as it was already torn down. If it helps, I know it was provisioned at the beginning of last week and used the defaults EKS came with for Kubernetes version 1.17.x.

Thanks for conforming @Aggouri . We are actively looking into the issue. Will update asap.

@jayanthvn upgraded from 1.6.3 to 1.7.5 having a problem.

{"level":"error","ts":"2020-10-27T10:44:04.889-0400","caller":"routed-eni-cni-plugin/cni.go:249","msg":"Error received from DelNetwork gRPC call for container ba592f75d2b25963c4bd64f218ae0930917fa39e3efffd2231b313f8eb42d344: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:50051

Hi,

We have found the RC, for now please add pluginLogFile and pluginLogLevel in 05-cilium.conflist. We will fix this issue in the next release.

cat /etc/cni/net.d/05-cilium.conflist
{
  "cniVersion": "0.3.1",
  "name": "aws-cni",
  "plugins": [
    {
      "name": "aws-cni",
      "type": "aws-cni",
      "vethPrefix": "eni",
      "mtu": "9001",
      "pluginLogFile": "/var/log/aws-routed-eni/plugin.log",
      "pluginLogLevel": "Debug"
    },
    {
       "name": "cilium",
       "type": "cilium-cni",
       "enable-debug": false
    }
  ]
}

I was able to repro and below is the o/p after fixing the conflist -

dev-dsk-varavaj-2b-72f02457 % kubectl describe daemonset aws-node -n kube-system | grep 1.7.5
    Image:      602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-init:v1.7.5
    Image:      602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.7.5

NAME                       READY   STATUS    RESTARTS   AGE   IP               NODE                                           NOMINATED NODE   READINESS GATES
my-nginx-86b7cfc89-jvzvw   1/1     Running   0          18h   192.168.10.206   ip-192-168-0-43.us-west-2.compute.internal     <none>           <none>
my-nginx-86b7cfc89-p4q2t   1/1     Running   0          18m   192.168.67.156   ip-192-168-81-109.us-west-2.compute.internal   <none>           <none>

NAME                               READY   STATUS    RESTARTS   AGE   IP               NODE                                           NOMINATED NODE   READINESS GATES
aws-node-95jtw                     1/1     Running   0          23m   192.168.0.43     ip-192-168-0-43.us-west-2.compute.internal     <none>           <none>
aws-node-cnrkq                     1/1     Running   0          24m   192.168.81.109   ip-192-168-81-109.us-west-2.compute.internal   <none>           <none>
aws-node-j64z5                     1/1     Running   0          23m   192.168.51.208   ip-192-168-51-208.us-west-2.compute.internal   <none>           <none>
cilium-5gr4s                       1/1     Running   0          18h   192.168.51.208   ip-192-168-51-208.us-west-2.compute.internal   <none>           <none>
cilium-d4nff                       1/1     Running   0          18h   192.168.0.43     ip-192-168-0-43.us-west-2.compute.internal     <none>           <none>
cilium-node-init-kwsj6             1/1     Running   0          18h   192.168.0.43     ip-192-168-0-43.us-west-2.compute.internal     <none>           <none>
cilium-node-init-pv4jw             1/1     Running   0          18h   192.168.51.208   ip-192-168-51-208.us-west-2.compute.internal   <none>           <none>
cilium-node-init-pxdfv             1/1     Running   0          18h   192.168.81.109   ip-192-168-81-109.us-west-2.compute.internal   <none>           <none>
cilium-operator-6554b44b9d-f88zj   1/1     Running   0          18h   192.168.51.208   ip-192-168-51-208.us-west-2.compute.internal   <none>           <none>
cilium-operator-6554b44b9d-j8tlb   1/1     Running   0          18h   192.168.0.43     ip-192-168-0-43.us-west-2.compute.internal     <none>           <none>
cilium-qg6tf                       1/1     Running   0          18h   192.168.81.109   ip-192-168-81-109.us-west-2.compute.internal   <none>           <none>
coredns-5c97f79574-9nnkk           1/1     Running   0          18h   192.168.68.203   ip-192-168-81-109.us-west-2.compute.internal   <none>           <none>
coredns-5c97f79574-jnsm2           1/1     Running   0          18h   100.64.95.97     ip-192-168-51-208.us-west-2.compute.internal   <none>           <none>
kube-proxy-bmv86                   1/1     Running   0          18h   192.168.81.109   ip-192-168-81-109.us-west-2.compute.internal   <none>           <none>
kube-proxy-j7c8f                   1/1     Running   0          18h   192.168.0.43     ip-192-168-0-43.us-west-2.compute.internal     <none>           <none>
kube-proxy-ss98z                   1/1     Running   0          18h   192.168.51.208   ip-192-168-51-208.us-west-2.compute.internal   <none>           <none>

Thank you!

1275 is merged so closing this issue.

is there any ETA for a new release including this fix?
Thanks!

Was this page helpful?
0 / 5 - 0 ratings