What happened:
New cluster with nodes restarted.
coredns is stuck in ContainerCreating when using CNI v1.7.6 and Cilium 1.9.1.
Other pods show the same behavior (ContainerCreating).
coredns:v1.6.6-eksbuild.1
Attach logs
Warning FailedCreatePodSandBox 29s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "112861d5995ca8f44c1dc17f00c947d72a44cf69c9deda34fbaf56b204742874" network for pod "coredns-6d857998c6-gxsd7": networkPlugin cni failed to set up pod "coredns-6d857998c6-gxsd7_kube-system" network: invalid character '{' after top-level value
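The wording "invalid character '{' after top-level value" is what Go's encoding/json reports when more data follows a complete JSON value. One way this can happen (an assumption here, not something the log line above confirms) is a plugin in the chain writing extra output next to its JSON result. The sketch below reproduces the same class of failure in Python; the polluted string is hypothetical, and Python words the error as "Extra data" rather than using Go's message.

```python
import json
from typing import Optional

# Hypothetical plugin stdout: a stray JSON log record followed by the real result.
POLLUTED_STDOUT = '{"level":"debug","msg":"log line"}{"cniVersion":"0.3.1","ips":[]}'
CLEAN_STDOUT = '{"cniVersion":"0.3.1","ips":[]}'


def first_parse_error(text: str) -> Optional[str]:
    """Return the decoder's error message, or None if text is exactly one JSON document."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return exc.msg
    return None


if __name__ == "__main__":
    print("clean:   ", first_parse_error(CLEAN_STDOUT))
    print("polluted:", first_parse_error(POLLUTED_STDOUT))
```

A strict single-document parser rejects the second top-level object, which is why the caller sees a parse error instead of a sandbox being created.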
What you expected to happen:
I expected coredns and other pods to be in running state
How to reproduce it (as minimally and precisely as possible):
Deploy cni version 1.7.6 and cilium 1.9.1 on EKS 1.17
Anything else we need to know?:
We have Cilium running in chaining mode (v1.9.1)
https://docs.cilium.io/en/v1.9/gettingstarted/cni-chaining-aws-cni/
Environment:
kubectl version:
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:50:19Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"17+", GitVersion:"v1.17.12-eks-7684af", GitCommit:"7684af4ac41370dd109ac13817023cb8063e3d45", GitTreeState:"clean", BuildDate:"2020-10-20T22:57:40Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
amazon-k8s-cni-init:v1.7.6
amazon-k8s-cni:v1.7.6
cat /etc/os-release:
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
uname -a:
Hi @mmochan,
Please check whether you are hitting this issue: https://github.com/aws/amazon-vpc-cni-k8s/issues/1265
The root cause is described here: https://github.com/aws/amazon-vpc-cni-k8s/issues/1265#issuecomment-717349630
Thanks.
Hi jayanthvn
AWS_VPC_K8S_PLUGIN_LOG_FILE is being set as expected
Environment:
ADDITIONAL_ENI_TAGS: {}
AWS_VPC_CNI_NODE_PORT_SUPPORT: true
AWS_VPC_ENI_MTU: 9001
AWS_VPC_K8S_CNI_CONFIGURE_RPFILTER: false
AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG: false
AWS_VPC_K8S_CNI_EXTERNALSNAT: false
AWS_VPC_K8S_CNI_LOGLEVEL: DEBUG
AWS_VPC_K8S_CNI_LOG_FILE: /host/var/log/aws-routed-eni/ipamd.log
AWS_VPC_K8S_CNI_RANDOMIZESNAT: prng
AWS_VPC_K8S_CNI_VETHPREFIX: eni
AWS_VPC_K8S_PLUGIN_LOG_FILE: /var/log/aws-routed-eni/plugin.log
AWS_VPC_K8S_PLUGIN_LOG_LEVEL: DEBUG
DISABLE_INTROSPECTION: false
DISABLE_METRICS: false
ENABLE_POD_ENI: false
MY_NODE_NAME: (v1:spec.nodeName)
WARM_ENI_TARGET: 1
But my /host/etc/cni/net.d/05-cilium.conflist doesn't match the 05-cilium.conflist shown in issue #1265:
{
"cniVersion": "0.3.1",
"name": "aws-cni",
"plugins": [
{
"name": "aws-cni",
"type": "aws-cni",
"vethPrefix": "eni"
},
{
"type": "portmap",
"capabilities": {"portMappings": true},
"snat": true
},
{
"name": "cilium",
"type": "cilium-cni",
"enable-debug": false
}
]
}
Thanks
Hi @mmochan
Yes, you will have to add these two lines to the aws-cni plugin entry:
"pluginLogFile": "/var/log/aws-routed-eni/plugin.log",
"pluginLogLevel": "Debug"
Something like this:
{
"cniVersion": "0.3.1",
"name": "aws-cni",
"plugins": [
{
"name": "aws-cni",
"type": "aws-cni",
"vethPrefix": "eni",
"pluginLogFile": "/var/log/aws-routed-eni/plugin.log",
"pluginLogLevel": "Debug"
},
{
"type": "portmap",
"capabilities": {"portMappings": true},
"snat": true
},
{
"name": "cilium",
"type": "cilium-cni",
"enable-debug": false
}
]
}
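For anyone applying this workaround to several nodes, the edit can be scripted. The following is a minimal sketch, not an official tool: the function name add_plugin_logging is hypothetical, and it assumes the conflist has already been read into a dict (for example with json.load). It only adds the two keys named above to entries whose type is aws-cni and leaves existing values alone.

```python
import copy
import json


def add_plugin_logging(conflist,
                       log_file="/var/log/aws-routed-eni/plugin.log",
                       log_level="Debug"):
    """Return a copy of conflist with log settings on every aws-cni plugin entry."""
    patched = copy.deepcopy(conflist)  # never mutate the caller's data
    for plugin in patched.get("plugins", []):
        if plugin.get("type") == "aws-cni":
            # setdefault keeps any value that is already configured
            plugin.setdefault("pluginLogFile", log_file)
            plugin.setdefault("pluginLogLevel", log_level)
    return patched


# The conflist as originally generated in chaining mode (from this thread).
ORIGINAL = {
    "cniVersion": "0.3.1",
    "name": "aws-cni",
    "plugins": [
        {"name": "aws-cni", "type": "aws-cni", "vethPrefix": "eni"},
        {"type": "portmap", "capabilities": {"portMappings": True}, "snat": True},
        {"name": "cilium", "type": "cilium-cni", "enable-debug": False},
    ],
}

if __name__ == "__main__":
    print(json.dumps(add_plugin_logging(ORIGINAL), indent=2))
```

Writing the result back to /etc/cni/net.d/05-cilium.conflist is left out on purpose; on a real node that step needs root and should be done atomically.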
Hi @jayanthvn
Great, that works; all pods are now running.
Are you able to give an ETA for a permanent fix?
Thanks for your help.
Mike
Good to know it works. #1275 is merged and we are planning the next release; I will give you the dates in a week or so.
Thanks again @jayanthvn
Hi @jayanthvn,
Unfortunately, the workaround above (adding "pluginLogFile" and "pluginLogLevel" to the aws-cni plugin entry) does not work for me.
Containers are now stuck with a different error:
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "bcea7cf2eac9dc94fed5316d1cde99c999102f732c3e32708bf5e1e05c666086" network for pod "coredns-59458dc98-7fqnj": networkPlugin cni failed to set up pod "coredns-59458dc98-7fqnj_kube-system" network: unable to create endpoint: Cilium API client timeout exceeded
And there are errors with stack traces in cilium-agent on the node like:
2020-12-14T16:17:03.756239102Z level=warning msg="Error fetching program/map!" subsys=datapath-loader
2020-12-14T16:17:03.756242896Z level=warning msg="Unable to load program" subsys=datapath-loader
2020-12-14T16:17:03.756820568Z level=warning msg="JoinEP: Failed to load program" containerID= datapathPolicyRevision=0 desiredPolicyRevision=17 endpointID=529 error="Failed to load tc filter: exit status 1" file-path=529_next/bpf_lxc.o identity=16387 ipv4= ipv6= k8sPodName=/ subsys=datapath-loader veth=eni684e9679747
2020-12-14T16:17:03.756830693Z level=error msg="Error while rewriting endpoint BPF program" containerID= datapathPolicyRevision=0 desiredPolicyRevision=17 endpointID=529 error="Failed to load tc filter: exit status 1" identity=16387 ipv4= ipv6= k8sPodName=/ subsys=endpoint
2020-12-14T16:17:03.756870133Z level=warning msg="generating BPF for endpoint failed, keeping stale directory." containerID= datapathPolicyRevision=0 desiredPolicyRevision=17 endpointID=529 file-path=529_next_fail identity=16387 ipv4= ipv6= k8sPodName=/ subsys=endpoint
2020-12-14T16:17:03.757081515Z level=warning msg="Regeneration of endpoint failed" bpfCompilation=0s bpfLoadProg=6.262089438s bpfWaitForELF="4.951µs" bpfWriteELF="150.352µs" containerID= datapathPolicyRevision=0 desiredPolicyRevision=17 endpointID=529 error="Failed to load tc filter: exit status 1" identity=16387 ipv4= ipv6= k8sPodName=/ mapSync="6.33µs" policyCalculation="9.073µs" prepareBuild="304.815µs" proxyConfiguration="12.346µs" proxyPolicyCalculation="25.738µs" proxyWaitForAck=0s reason="retrying regeneration" subsys=endpoint total=6.263766628s waitingForCTClean="465.631µs" waitingForLock="4.574µs"
2020-12-14T16:17:03.757217667Z level=error msg="endpoint regeneration failed" containerID= datapathPolicyRevision=0 desiredPolicyRevision=17 endpointID=529 error="Failed to load tc filter: exit status 1" identity=16387 ipv4= ipv6= k8sPodName=/ subsys=endpoint
2020-12-14T16:17:11.932298540Z level=error msg="Command execution failed" cmd="[tc filter replace dev eni684e9679747 egress prio 1 handle 1 bpf da obj 529_next/bpf_lxc.o sec to-container]" error="exit status 1" subsys=datapath-loader
Could you please help?
By the way, is there a Docker image with the fix above that I could test?
Thanks!
P.S. This is my issue: https://github.com/cilium/cilium/issues/14379#issuecomment-743364211
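The cilium-agent lines quoted above use a timestamp followed by key=value pairs, which makes them easy to filter when comparing nodes. Below is a rough sketch of pulling out the fields; parse_agent_line is a hypothetical helper, and it assumes values are either bare words or double-quoted strings, as in the lines shown in this thread.

```python
import shlex


def parse_agent_line(line):
    """Split 'TIMESTAMP key=value key="quoted value" ...' into (timestamp, fields)."""
    timestamp, _, rest = line.partition(" ")
    fields = {}
    # shlex.split honours double quotes, so msg="a b" stays a single token
    for token in shlex.split(rest):
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that are not key=value
            fields[key] = value
    return timestamp, fields


SAMPLE = (
    '2020-12-14T16:17:11.932298540Z level=error msg="Command execution failed" '
    'cmd="[tc filter replace dev eni684e9679747 egress prio 1 handle 1 bpf da '
    'obj 529_next/bpf_lxc.o sec to-container]" error="exit status 1" '
    "subsys=datapath-loader"
)

if __name__ == "__main__":
    ts, fields = parse_agent_line(SAMPLE)
    print(ts, fields["level"], fields["msg"], fields["error"])
```

Filtering on level=error and subsys=datapath-loader narrows the output to the failing tc filter command.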
I upgraded to:
Kubernetes version | 1.18
Amazon VPC CNI plug-in | 1.7.5
DNS (CoreDNS) | 1.7.0
KubeProxy | 1.18.9
CNI Plugin v1.7 does not work with Cilium 1.9!
I've tested this on an EKS cluster created from scratch.
The workaround above does not help.
@jayanthvn @mmochan Could you please re-check?
Thanks!
Hi @kovalyukm,
I was just able to run Cilium 1.9 in chaining mode with CNI v1.7.5. I added the following lines to /etc/cni/net.d/05-cilium.conflist (as mentioned in this comment):
"mtu": "9001",
"pluginLogFile": "/var/log/aws-routed-eni/plugin.log",
"pluginLogLevel": "Debug"
Can you check /etc/cni/net.d/05-cilium.conflist on your instances?
Please let me know if that works.
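A quick way to run that check on a node is to parse the conflist and report which of the three keys are absent from the aws-cni entry. This is a sketch under assumptions: missing_aws_cni_keys is a hypothetical helper, and the key list simply mirrors the three lines quoted above rather than any official schema.

```python
import json

# The three keys this comment asks for in the aws-cni plugin entry.
REQUIRED_KEYS = ("mtu", "pluginLogFile", "pluginLogLevel")


def missing_aws_cni_keys(conflist_text):
    """Return the required keys missing from the aws-cni entry of a conflist."""
    conflist = json.loads(conflist_text)
    entries = [p for p in conflist.get("plugins", []) if p.get("type") == "aws-cni"]
    if not entries:
        raise ValueError("no plugin entry with type aws-cni found")
    return [key for key in REQUIRED_KEYS if key not in entries[0]]


# The conflist originally reported in this thread, before the workaround.
UNPATCHED = """
{
  "cniVersion": "0.3.1",
  "name": "aws-cni",
  "plugins": [
    {"name": "aws-cni", "type": "aws-cni", "vethPrefix": "eni"},
    {"type": "portmap", "capabilities": {"portMappings": true}, "snat": true},
    {"name": "cilium", "type": "cilium-cni", "enable-debug": false}
  ]
}
"""

if __name__ == "__main__":
    print(missing_aws_cni_keys(UNPATCHED))
```

On a node, the text would come from reading /etc/cni/net.d/05-cilium.conflist; an empty list means all three keys are present.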
Hi @couralex6 ,
cat /etc/cni/net.d/05-cilium.conflist
{
"cniVersion": "0.3.1",
"name": "aws-cni",
"plugins": [
{
"name": "aws-cni",
"type": "aws-cni",
"vethPrefix": "eni",
"mtu": "9001",
"pluginLogFile": "/var/log/aws-routed-eni/plugin.log",
"pluginLogLevel": "Debug"
},
{
"type": "portmap",
"capabilities": {"portMappings": true},
"snat": true
},
{
"name": "cilium",
"type": "cilium-cni",
"enable-debug": false
}
]
}
I also tried matching the settings in:
cat /etc/cni/net.d/10-aws.conflist
{
"cniVersion": "0.3.1",
"name": "aws-cni",
"plugins": [
{
"name": "aws-cni",
"type": "aws-cni",
"vethPrefix": "eni",
"mtu": "9001",
"pluginLogFile": "/var/log/aws-routed-eni/plugin.log",
"pluginLogLevel": "DEBUG"
},
{
"type": "portmap",
"capabilities": {"portMappings": true},
"snat": true
}
]
}
cni:
  # customConf: true
  chainingMode: aws-cni
masquerade: false
tunnel: disabled
nodeinit:
  # enables node initialization DaemonSet
  enabled: true
Maybe the issue is in the software versions. I use EKS Kubernetes 1.18 and Cilium 1.9.1.
Have you tried with these versions?
Thanks!
@kovalyukm
It was also an EKS 1.18 cluster.
Your /etc/cni/net.d/10-aws.conflist looks right. You shouldn't have to modify it though.
Did you install Cilium through Helm as described here: https://docs.cilium.io/en/v1.9/gettingstarted/cni-chaining-aws-cni/ ?
@couralex6
It seems you are using Cilium 1.9.0.
Yes, I followed the Cilium docs to install it.
The workaround works with Cilium 1.9.0 but not with Cilium 1.9.1. (It seems this version is broken: https://github.com/cilium/cilium/issues/14403#issuecomment-745500480)
Thanks, I'm waiting for CNI v1.7.8 with the fix.
@couralex6 @jayanthvn
CNI v1.7.8 does not work either; it fails with the same error: "invalid character '{' after top-level value".
Hi @kovalyukm
Sure, we will try Cilium 1.9.1 and get back to you. Note that @mmochan has tried Cilium 1.9.1 with the recommended workaround.
@jayanthvn what is the ETA for releasing the fix, so that no manual modification of the nodes is required? Two versions have already been released since this PR was merged, but the fix was ignored in both of them.
Hi @shaikatz
Sorry for the delay. We will take this up in release 1.7.9, planned for January.
Hi @jayanthvn,
Can you give an ETA for the 1.7.9 release?
Thanks
Mike