Amazon-vpc-cni-k8s: CNI pod crashloopbackoff due to incorrect instance metadata while fetching ENI

Created on 5 Jun 2020 · 9Comments · Source: aws/amazon-vpc-cni-k8s

The agent in the program looks for all the attached ENIs in EC2 instance metadata before starting the pod. However, in very rare cases like the one which I observed yesterday, there was a discrepancy in the instance metadata to the actual ENIs attached. Due to this IPAMd fails and enters into a crashloopbackoff state. I have collected logs and can share the information in private.
@mogren

bug prioritP0

Source

cshivashankar

👍4

All 9 comments

@cshivashankar can you please share some of your log output. Does it resemble anything to #486 or #950 of which we're presently having the same issue and in-contact with AWS Support right now.

InAnimaTe on 9 Jun 2020

Hi @InAnimaTe Logs indicated following
Timed out waiting for IPAM daemon to start:
When I captured logs from eks log collectors and analyzed ipamd logs it looked something like this
(This can be found under var_log\aws-routed-eni)

{"level":"error","ts":"2020-06-04T08:26:43.180Z","caller":"aws-k8s-agent/main.go:30","msg":"Initialization failure: ipamd init: failed to retrieve attached ENIs info"}

{"level":"info","ts":"2020-06-06T00:39:58.944Z","caller":"runtime/asm_amd64.s:1357","msg":"Synced successfully with APIServer"}
{"level":"info","ts":"2020-06-06T00:39:58.944Z","caller":"k8sapi/discovery.go:216","msg":"Add/Update for CNI pod aws-node-glbkt"}
{"level":"error","ts":"2020-06-06T00:39:58.960Z","caller":"ipamd/ipamd.go:328","msg":"Failed to call ec2:DescribeNetworkInterfaces for [eni-043a9bb8856789abc eni-00000099 eni-00000088]: InvalidNetworkInterfaceID.NotFound: The networkInterface ID 'eni-00000099' does not exist\n\tstatus code: 400, request id: 192e237e-dbb4-46d3-bcc1-1e184ddd19cf"}
{"level":"debug","ts":"2020-06-06T00:39:58.960Z","caller":"ipamd/ipamd.go:328","msg":"Could not find interface: The networkInterface ID 'eni-00000099' does not exist, ID: eni-00000099"}
{"level":"error","ts":"2020-06-06T00:39:59.037Z","caller":"ipamd/ipamd.go:328","msg":"Failed to call ec2:DescribeNetworkInterfaces for [eni-043a9bb8856789abc eni-00000088]: InvalidNetworkInterfaceID.NotFound: The networkInterface ID 'eni-00000088' does not exist\n\tstatus code: 400, request id: 45415af0-835d-49c1-9b60-dc235eeeacd5"}
{"level":"debug","ts":"2020-06-06T00:39:59.037Z","caller":"ipamd/ipamd.go:328","msg":"Could not find interface: The networkInterface ID 'eni-00000088' does not exist, ID: eni-00000088"}
{"level":"debug","ts":"2020-06-06T00:39:59.114Z","caller":"ipamd/ipamd.go:314","msg":"DescribeAllENIs success: ENIs: 1, tagged: 0"}
{"level":"info","ts":"2020-06-06T00:39:59.114Z","caller":"ipamd/ipamd.go:360","msg":"Setting up host network... "}

I checked both the bugs you mentioned, it doesn't look like either of them

cshivashankar on 9 Jun 2020

Yeah your issue looks different. Thanks for sharing your logs so quickly. Hopefully this one can get nabbed quickly!

InAnimaTe on 9 Jun 2020

👍1

I am having similar issue (could be same as mentioned above). POD not getting scheduled on those nodes
```NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
orderer-0 1/1 Running 0 78m 100.116.130.31 ip-100-116-130-136.us-west-2.compute.internal
raft0-0 0/1 ContainerCreating 0 69m ip-100-116-135-44.us-west-2.compute.internal
raft1-0 1/1 Running 0 78m 100.116.138.243 ip-100-116-137-235.us-west-2.compute.internal
raft2-0 1/1 Running 0 78m 100.116.131.20 ip-100-116-130-136.us-west-2.compute.internal
raft3-0 1/1 Running 0 78m 100.116.134.95 ip-100-116-133-195.us-west-2.compute.internal


**can be traced back to AWS-CNI pod instance crash on that node**

```aws-node-42g8r                   0/1     Error     6          9m52s   100.116.135.44    ip-100-116-135-44.us-west-2.compute.internal    <none>           <none>```



**kubectl describe po aws-node-42g8r**

Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 13m default-scheduler Successfully assigned kube-system/aws-node-42g8r to ip-100-116-135-44.us-west-2.compute.internal
Warning Unhealthy 11m kubelet, ip-100-116-135-44.us-west-2.compute.internal Readiness probe errored: rpc error: code = Unknown desc = container not running (5035fe1baa42897f47183d48d6be83c73cc91aac71e30bfc5dde612f3d1472ab)
Normal Pulling 11m (x4 over 13m) kubelet, ip-100-116-135-44.us-west-2.compute.internal Pulling image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.6.1"
Normal Pulled 11m (x4 over 13m) kubelet, ip-100-116-135-44.us-west-2.compute.internal Successfully pulled image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.6.1"
Normal Created 11m (x4 over 13m) kubelet, ip-100-116-135-44.us-west-2.compute.internal Created container aws-node
Normal Started 11m (x4 over 13m) kubelet, ip-100-116-135-44.us-west-2.compute.internal Started container aws-node
Warning BackOff 3m53s (x32 over 12m) kubelet, ip-100-116-135-44.us-west-2.compute.internal Back-off restarting failed container



Crash pod logs

mk:~ sku $ kubectl logs -f aws-node-42g8r
Starting IPAM daemon in the background ... ok.
Checking for IPAM connectivity ... failed.
Timed out waiting for IPAM daemon to start:


System information :

Kernel Version: 4.14.177-139.254.amzn2.x86_64
OS Image: Amazon Linux 2
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://19.3.6
Kubelet Version: v1.16.8-eks-e16311
Kube-Proxy Version: v1.16.8-eks-e16311
ProviderID: aws:///us-west-2b/i-0260dbe1061
```

RaspiRepo on 12 Jun 2020

is any recommendation fix for this issue ?

RaspiRepo on 12 Jun 2020

Hi @RaspiRepo, unfortunately not yet. I added a work-around for the issue with the metadata in #1011, but that change has not yet made it into a new release.

mogren on 15 Jun 2020

@RaspiRepo, @cshivashankar: I just published a new release candidate, v1.6.3-rc1 that we are currently testing to prepare for the final release. Like all pre-releases, it has been made available in us-west-2, in case you would like to test it out.

mogren on 16 Jun 2020

Thanks @mogren I will test and update.

cshivashankar on 16 Jun 2020

👍1

We just released v1.6.3 with the fix from #1011.

mogren on 19 Jun 2020

Was this page helpful?

0 / 5 - 0 ratings