Amazon-vpc-cni-k8s: CoreDNS stuck on ContainerCreating with `FailedCreatePodSandBox` warning with CNI version 1.7.6 and Cilium 1.9.1

Created on 7 Dec 2020  ·  17 Comments  ·  Source: aws/amazon-vpc-cni-k8s

What happened:
New cluster with nodes restarted.
coredns stuck on ContainerCreating when using CNI v1.7.6 and Cilium 1.9.1.
Other pods show the same behavior (stuck on ContainerCreating).

coredns:v1.6.6-eksbuild.1

Attach logs

Warning  FailedCreatePodSandBox  29s  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "112861d5995ca8f44c1dc17f00c947d72a44cf69c9deda34fbaf56b204742874" network for pod "coredns-6d857998c6-gxsd7": networkPlugin cni failed to set up pod "coredns-6d857998c6-gxsd7_kube-system" network: invalid character '{' after top-level value

What you expected to happen:
I expected coredns and other pods to be in running state

How to reproduce it (as minimally and precisely as possible):
Deploy cni version 1.7.6 and cilium 1.9.1 on EKS 1.17

Anything else we need to know?:
We have Cilium running in chaining mode (v1.9.1)
(https://docs.cilium.io/en/v1.9/gettingstarted/cni-chaining-aws-cni/)

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:50:19Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"17+", GitVersion:"v1.17.12-eks-7684af", GitCommit:"7684af4ac41370dd109ac13817023cb8063e3d45", GitTreeState:"clean", BuildDate:"2020-10-20T22:57:40Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
  • CNI Version
amazon-k8s-cni-init:v1.7.6
amazon-k8s-cni:v1.7.6
  • OS (e.g: cat /etc/os-release):
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
  • Kernel (e.g. uname -a):
    Linux REDACTED.compute.internal 4.14.203-156.332.amzn2.x86_64 #1 SMP Fri Oct 30 19:19:33 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
bug

All 17 comments

Hi @mmochan ,

Can you please check whether you are hitting this issue: https://github.com/aws/amazon-vpc-cni-k8s/issues/1265. The root cause is described here: https://github.com/aws/amazon-vpc-cni-k8s/issues/1265#issuecomment-717349630.

Thanks.
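For readers unfamiliar with this error text: "invalid character '{' after top-level value" is the message Go's encoding/json produces when a second JSON value follows a complete one in the same input. The sketch below reproduces the same class of failure in Python, where the analogous message is "Extra data". It assumes, without confirmation from this thread, that something extra (for example a log line) ends up in front of the CNI result; the sample strings are hypothetical and are not real plugin output.

```python
import json


def describe_parse(raw: str) -> str:
    """Return "ok" if raw is exactly one JSON value, else the parser's message."""
    try:
        json.loads(raw)
        return "ok"
    except json.JSONDecodeError as err:
        return err.msg


# Hypothetical CNI-style result (illustration only).
clean = '{"cniVersion": "0.3.1", "ips": []}'

# Hypothetical log object printed before the result on the same stream.
polluted = '{"level": "info", "msg": "hypothetical log line"}\n' + clean

print(describe_parse(clean))
print(describe_parse(polluted))
```

A parser that expects a single top-level value rejects the second input even though each object is valid on its own, which matches the shape of the kubelet error in the report.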

Hi jayanthvn

AWS_VPC_K8S_PLUGIN_LOG_FILE is being set as expected

    Environment:
      ADDITIONAL_ENI_TAGS:                 {}
      AWS_VPC_CNI_NODE_PORT_SUPPORT:       true
      AWS_VPC_ENI_MTU:                     9001
      AWS_VPC_K8S_CNI_CONFIGURE_RPFILTER:  false
      AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG:  false
      AWS_VPC_K8S_CNI_EXTERNALSNAT:        false
      AWS_VPC_K8S_CNI_LOGLEVEL:            DEBUG
      AWS_VPC_K8S_CNI_LOG_FILE:            /host/var/log/aws-routed-eni/ipamd.log
      AWS_VPC_K8S_CNI_RANDOMIZESNAT:       prng
      AWS_VPC_K8S_CNI_VETHPREFIX:          eni
      AWS_VPC_K8S_PLUGIN_LOG_FILE:         /var/log/aws-routed-eni/plugin.log
      AWS_VPC_K8S_PLUGIN_LOG_LEVEL:        DEBUG
      DISABLE_INTROSPECTION:               false
      DISABLE_METRICS:                     false
      ENABLE_POD_ENI:                      false
      MY_NODE_NAME:                         (v1:spec.nodeName)
      WARM_ENI_TARGET:                     1

But /host/etc/cni/net.d/05-cilium.conflist doesn't match the 05-cilium.conflist in issue #1265:

{
  "cniVersion": "0.3.1",
  "name": "aws-cni",
  "plugins": [
    {
      "name": "aws-cni",
      "type": "aws-cni",
      "vethPrefix": "eni"
    },
    {
      "type": "portmap",
      "capabilities": {"portMappings": true},
      "snat": true
    },
    {
       "name": "cilium",
       "type": "cilium-cni",
       "enable-debug": false
    }
  ]
}

Thanks

Hi @mmochan

Yes you will have to add these 2 lines -

"pluginLogFile": "/var/log/aws-routed-eni/plugin.log",
 "pluginLogLevel": "Debug"

Something like this -

{
  "cniVersion": "0.3.1",
  "name": "aws-cni",
  "plugins": [
    {
      "name": "aws-cni",
      "type": "aws-cni",
      "vethPrefix": "eni",
      "pluginLogFile": "/var/log/aws-routed-eni/plugin.log",
      "pluginLogLevel": "Debug"
    },
    {
      "type": "portmap",
      "capabilities": {"portMappings": true},
      "snat": true
    },
    {
       "name": "cilium",
       "type": "cilium-cni",
       "enable-debug": false
    }
  ]
}
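The manual edit above can also be sketched as a small script. This is a minimal illustration, not an official tool: the function name `add_plugin_log_settings` and the embedded sample are made up for the example, and it only adds the two keys from this comment to plugin entries whose type is aws-cni, leaving existing values untouched.

```python
import json

# Values taken from the workaround in this thread.
PLUGIN_LOG_FILE = "/var/log/aws-routed-eni/plugin.log"
PLUGIN_LOG_LEVEL = "Debug"


def add_plugin_log_settings(conflist_text: str) -> str:
    """Return the conflist with the two log keys added to aws-cni entries."""
    conf = json.loads(conflist_text)
    for plugin in conf.get("plugins", []):
        if plugin.get("type") == "aws-cni":
            # setdefault keeps any value that is already configured.
            plugin.setdefault("pluginLogFile", PLUGIN_LOG_FILE)
            plugin.setdefault("pluginLogLevel", PLUGIN_LOG_LEVEL)
    return json.dumps(conf, indent=2)


# Sample input: the 05-cilium.conflist posted earlier in the thread.
EXAMPLE = """
{
  "cniVersion": "0.3.1",
  "name": "aws-cni",
  "plugins": [
    {"name": "aws-cni", "type": "aws-cni", "vethPrefix": "eni"},
    {"type": "portmap", "capabilities": {"portMappings": true}, "snat": true},
    {"name": "cilium", "type": "cilium-cni", "enable-debug": false}
  ]
}
"""

print(add_plugin_log_settings(EXAMPLE))
```

On a node, the input would be read from /etc/cni/net.d/05-cilium.conflist and the result written back by whatever mechanism manages that file. Going through a JSON parser rather than editing by hand also avoids comma mistakes in the edited entry.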

Hi @jayanthvn

Great that works, all pods now running.

Are you able to give an ETA for a permanent fix?

Thanks for your help.

Mike

Good to know it works, #1275 is merged and we are planning for the next release, I will provide you the dates in a week or so.

Thanks again @jayanthvn

Hi @mmochan

Yes you will have to add these 2 lines -

"pluginLogFile": "/var/log/aws-routed-eni/plugin.log",
 "pluginLogLevel": "Debug"

Something like this -

{
  "cniVersion": "0.3.1",
  "name": "aws-cni",
  "plugins": [
    {
      "name": "aws-cni",
      "type": "aws-cni",
      "vethPrefix": "eni",
      "pluginLogFile": "/var/log/aws-routed-eni/plugin.log",
      "pluginLogLevel": "Debug"
    },
    {
      "type": "portmap",
      "capabilities": {"portMappings": true},
      "snat": true
    },
    {
       "name": "cilium",
       "type": "cilium-cni",
       "enable-debug": false
    }
  ]
}

Unfortunately does not work for me.

Containers stuck with another error like:
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "bcea7cf2eac9dc94fed5316d1cde99c999102f732c3e32708bf5e1e05c666086" network for pod "coredns-59458dc98-7fqnj": networkPlugin cni failed to set up pod "coredns-59458dc98-7fqnj_kube-system" network: unable to create endpoint: Cilium API client timeout exceeded

And there are errors with stack traces in cilium-agent on the node like:
2020-12-14T16:17:03.756239102Z level=warning msg="Error fetching program/map!" subsys=datapath-loader
2020-12-14T16:17:03.756242896Z level=warning msg="Unable to load program" subsys=datapath-loader
2020-12-14T16:17:03.756820568Z level=warning msg="JoinEP: Failed to load program" containerID= datapathPolicyRevision=0 desiredPolicyRevision=17 endpointID=529 error="Failed to load tc filter: exit status 1" file-path=529_next/bpf_lxc.o identity=16387 ipv4= ipv6= k8sPodName=/ subsys=datapath-loader veth=eni684e9679747
2020-12-14T16:17:03.756830693Z level=error msg="Error while rewriting endpoint BPF program" containerID= datapathPolicyRevision=0 desiredPolicyRevision=17 endpointID=529 error="Failed to load tc filter: exit status 1" identity=16387 ipv4= ipv6= k8sPodName=/ subsys=endpoint
2020-12-14T16:17:03.756870133Z level=warning msg="generating BPF for endpoint failed, keeping stale directory." containerID= datapathPolicyRevision=0 desiredPolicyRevision=17 endpointID=529 file-path=529_next_fail identity=16387 ipv4= ipv6= k8sPodName=/ subsys=endpoint
2020-12-14T16:17:03.757081515Z level=warning msg="Regeneration of endpoint failed" bpfCompilation=0s bpfLoadProg=6.262089438s bpfWaitForELF="4.951µs" bpfWriteELF="150.352µs" containerID= datapathPolicyRevision=0 desiredPolicyRevision=17 endpointID=529 error="Failed to load tc filter: exit status 1" identity=16387 ipv4= ipv6= k8sPodName=/ mapSync="6.33µs" policyCalculation="9.073µs" prepareBuild="304.815µs" proxyConfiguration="12.346µs" proxyPolicyCalculation="25.738µs" proxyWaitForAck=0s reason="retrying regeneration" subsys=endpoint total=6.263766628s waitingForCTClean="465.631µs" waitingForLock="4.574µs"
2020-12-14T16:17:03.757217667Z level=error msg="endpoint regeneration failed" containerID= datapathPolicyRevision=0 desiredPolicyRevision=17 endpointID=529 error="Failed to load tc filter: exit status 1" identity=16387 ipv4= ipv6= k8sPodName=/ subsys=endpoint
2020-12-14T16:17:11.932298540Z level=error msg="Command execution failed" cmd="[tc filter replace dev eni684e9679747 egress prio 1 handle 1 bpf da obj 529_next/bpf_lxc.o sec to-container]" error="exit status 1" subsys=datapath-loader

Could you please help?

By the way, is there a Docker image with the fix above that I could use to check it?

Thanks!

p.s. this is my issue https://github.com/cilium/cilium/issues/14379#issuecomment-743364211
I upgraded to:
Kubernetes version | 1.18
Amazon VPC CNI plug-in | 1.7.5
DNS (CoreDNS) | 1.7.0
KubeProxy | 1.18.9

CNI Plugin v1.7 does not work with Cilium 1.9!
I've tested on an EKS cluster created from scratch.
The workaround above does not help.

@jayanthvn @mmochan Could you please re-check?
Thanks!

Hi @kovalyukm,

I was just able to run Cilium 1.9 in chaining mode with CNI v1.7.5. I added the following lines to /etc/cni/net.d/05-cilium.conflist (as mentioned in this comment).

"mtu": "9001",
"pluginLogFile": "/var/log/aws-routed-eni/plugin.log",
"pluginLogLevel": "Debug"

Can you make sure:

  • you correctly updated /etc/cni/net.d/05-cilium.conflist on your instances
  • you installed Cilium in chaining mode, as described here

Please let me know if that works.
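The first check in the list above can be partly automated. The following is a hedged sketch, assuming the only things worth verifying are that the file parses as JSON and that the aws-cni entry carries the two log keys from the workaround; `check_conflist` and `REQUIRED_KEYS` are names invented for this example, and passing the check does not prove that chaining works.

```python
import json

# Keys the workaround in this thread adds to the aws-cni entry.
REQUIRED_KEYS = ("pluginLogFile", "pluginLogLevel")


def check_conflist(text: str) -> list:
    """Return a list of problems found; an empty list means the checks passed."""
    try:
        conf = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err.msg} at line {err.lineno}"]
    if not isinstance(conf, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    aws_entries = [p for p in conf.get("plugins", []) if p.get("type") == "aws-cni"]
    if not aws_entries:
        problems.append("no aws-cni plugin entry")
    for entry in aws_entries:
        for key in REQUIRED_KEYS:
            if key not in entry:
                problems.append(f"aws-cni entry is missing {key}")
    return problems


# A hand-edited entry with a missing comma is reported as invalid JSON.
broken = '{"plugins": [{"type": "aws-cni", "vethPrefix": "eni" "pluginLogFile": "x"}]}'
print(check_conflist(broken))
```

On a node, the text would come from reading /etc/cni/net.d/05-cilium.conflist.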

Hi @couralex6 ,

  • Yes, I've updated /etc/cni/net.d/05-cilium.conflist as described:
cat /etc/cni/net.d/05-cilium.conflist
{
  "cniVersion": "0.3.1",
  "name": "aws-cni",
  "plugins": [
    {
      "name": "aws-cni",
      "type": "aws-cni",
      "vethPrefix": "eni",
      "mtu": "9001",
      "pluginLogFile": "/var/log/aws-routed-eni/plugin.log",
      "pluginLogLevel": "Debug"
    },
    {
      "type": "portmap",
      "capabilities": {"portMappings": true},
      "snat": true
    },
    {
       "name": "cilium",
       "type": "cilium-cni",
       "enable-debug": false
    }
  ]
}

and tried like in:

cat /etc/cni/net.d/10-aws.conflist
{
  "cniVersion": "0.3.1",
  "name": "aws-cni",
  "plugins": [
    {
      "name": "aws-cni",
      "type": "aws-cni",
      "vethPrefix": "eni",
      "mtu": "9001",
      "pluginLogFile": "/var/log/aws-routed-eni/plugin.log",
      "pluginLogLevel": "DEBUG"
    },
    {
      "type": "portmap",
      "capabilities": {"portMappings": true},
      "snat": true
    }
  ]
}
  • And installed Cilium in chaining mode like:
    cni:
#      customConf: true
      chainingMode: aws-cni
    masquerade: false

    tunnel: disabled
    nodeinit:
      # enables node initialization DaemonSet
      enabled: true

Maybe the issue is in the software versions. I use EKS Kubernetes 1.18 and Cilium 1.9.1.
Have you tried with these versions?

Thanks!

@kovalyukm

It was also an EKS 1.18 cluster.

Your /etc/cni/net.d/10-aws.conflist looks right. You shouldn't have to modify it though.

Did you install Cilium through Helm as described here: https://docs.cilium.io/en/v1.9/gettingstarted/cni-chaining-aws-cni/ ?

@couralex6

Seems you use Cilium 1.9.0.

Yes, I use Cilium doc to manage it.

The workaround works with Cilium 1.9.0, but doesn't work with Cilium 1.9.1. (Seems this version is broken - https://github.com/cilium/cilium/issues/14403#issuecomment-745500480)

Thanks, waiting for CNIv1.7.8 with fix.

@couralex6 @jayanthvn

CNI v1.7.8 does not work either; I get the same error: "invalid character '{' after top-level value".

Hi @kovalyukm

Sure, we will try Cilium 1.9.1 and get back to you. But @mmochan has tried Cilium 1.9.1 with the recommended workaround.

@jayanthvn what is the ETA for releasing the fix, so that no manual modification of the nodes is required? Two versions have already been released since this PR was merged, but neither of them includes the fix.

Hi @shaikatz

Sorry for the delay. Will take this up in rel 1.7.9 planned for January.

Hi @jayanthvn,

Can you give an ETA on 1.7.9 release date?

Thanks

Mike
