The agent in the program looks for all the attached ENIs in EC2 instance metadata before starting the pod. However, in very rare cases like the one which I observed yesterday, there was a discrepancy in the instance metadata to the actual ENIs attached. Due to this IPAMd fails and enters into a crashloopbackoff state. I have collected logs and can share the information in private.
@mogren
@cshivashankar can you please share some of your log output. Does it resemble anything to #486 or #950 of which we're presently having the same issue and in-contact with AWS Support right now.
Hi @InAnimaTe Logs indicated following
Timed out waiting for IPAM daemon to start:
When I captured logs from eks log collectors and analyzed ipamd logs it looked something like this
(This can be found under var_log\aws-routed-eni)
{"level":"error","ts":"2020-06-04T08:26:43.180Z","caller":"aws-k8s-agent/main.go:30","msg":"Initialization failure: ipamd init: failed to retrieve attached ENIs info"}
{"level":"info","ts":"2020-06-06T00:39:58.944Z","caller":"runtime/asm_amd64.s:1357","msg":"Synced successfully with APIServer"}
{"level":"info","ts":"2020-06-06T00:39:58.944Z","caller":"k8sapi/discovery.go:216","msg":"Add/Update for CNI pod aws-node-glbkt"}
{"level":"error","ts":"2020-06-06T00:39:58.960Z","caller":"ipamd/ipamd.go:328","msg":"Failed to call ec2:DescribeNetworkInterfaces for [eni-043a9bb8856789abc eni-00000099 eni-00000088]: InvalidNetworkInterfaceID.NotFound: The networkInterface ID 'eni-00000099' does not exist\n\tstatus code: 400, request id: 192e237e-dbb4-46d3-bcc1-1e184ddd19cf"}
{"level":"debug","ts":"2020-06-06T00:39:58.960Z","caller":"ipamd/ipamd.go:328","msg":"Could not find interface: The networkInterface ID 'eni-00000099' does not exist, ID: eni-00000099"}
{"level":"error","ts":"2020-06-06T00:39:59.037Z","caller":"ipamd/ipamd.go:328","msg":"Failed to call ec2:DescribeNetworkInterfaces for [eni-043a9bb8856789abc eni-00000088]: InvalidNetworkInterfaceID.NotFound: The networkInterface ID 'eni-00000088' does not exist\n\tstatus code: 400, request id: 45415af0-835d-49c1-9b60-dc235eeeacd5"}
{"level":"debug","ts":"2020-06-06T00:39:59.037Z","caller":"ipamd/ipamd.go:328","msg":"Could not find interface: The networkInterface ID 'eni-00000088' does not exist, ID: eni-00000088"}
{"level":"debug","ts":"2020-06-06T00:39:59.114Z","caller":"ipamd/ipamd.go:314","msg":"DescribeAllENIs success: ENIs: 1, tagged: 0"}
{"level":"info","ts":"2020-06-06T00:39:59.114Z","caller":"ipamd/ipamd.go:360","msg":"Setting up host network... "}
I checked both the bugs you mentioned, it doesn't look like either of them
Yeah your issue looks different. Thanks for sharing your logs so quickly. Hopefully this one can get nabbed quickly!
I am having similar issue (could be same as mentioned above). POD not getting scheduled on those nodes
```NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
orderer-0 1/1 Running 0 78m 100.116.130.31 ip-100-116-130-136.us-west-2.compute.internal
raft0-0 0/1 ContainerCreating 0 69m
raft1-0 1/1 Running 0 78m 100.116.138.243 ip-100-116-137-235.us-west-2.compute.internal
raft2-0 1/1 Running 0 78m 100.116.131.20 ip-100-116-130-136.us-west-2.compute.internal
raft3-0 1/1 Running 0 78m 100.116.134.95 ip-100-116-133-195.us-west-2.compute.internal
**can be traced back to AWS-CNI pod instance crash on that node**
```aws-node-42g8r 0/1 Error 6 9m52s 100.116.135.44 ip-100-116-135-44.us-west-2.compute.internal <none> <none>```
**kubectl describe po aws-node-42g8r**
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 13m default-scheduler Successfully assigned kube-system/aws-node-42g8r to ip-100-116-135-44.us-west-2.compute.internal
Warning Unhealthy 11m kubelet, ip-100-116-135-44.us-west-2.compute.internal Readiness probe errored: rpc error: code = Unknown desc = container not running (5035fe1baa42897f47183d48d6be83c73cc91aac71e30bfc5dde612f3d1472ab)
Normal Pulling 11m (x4 over 13m) kubelet, ip-100-116-135-44.us-west-2.compute.internal Pulling image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.6.1"
Normal Pulled 11m (x4 over 13m) kubelet, ip-100-116-135-44.us-west-2.compute.internal Successfully pulled image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.6.1"
Normal Created 11m (x4 over 13m) kubelet, ip-100-116-135-44.us-west-2.compute.internal Created container aws-node
Normal Started 11m (x4 over 13m) kubelet, ip-100-116-135-44.us-west-2.compute.internal Started container aws-node
Warning BackOff 3m53s (x32 over 12m) kubelet, ip-100-116-135-44.us-west-2.compute.internal Back-off restarting failed container
Crash pod logs
mk:~ sku $ kubectl logs -f aws-node-42g8r
Starting IPAM daemon in the background ... ok.
Checking for IPAM connectivity ... failed.
Timed out waiting for IPAM daemon to start:
System information :
Kernel Version: 4.14.177-139.254.amzn2.x86_64
OS Image: Amazon Linux 2
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://19.3.6
Kubelet Version: v1.16.8-eks-e16311
Kube-Proxy Version: v1.16.8-eks-e16311
ProviderID: aws:///us-west-2b/i-0260dbe1061
```
is any recommendation fix for this issue ?
Hi @RaspiRepo, unfortunately not yet. I added a work-around for the issue with the metadata in #1011, but that change has not yet made it into a new release.
@RaspiRepo, @cshivashankar: I just published a new release candidate, v1.6.3-rc1 that we are currently testing to prepare for the final release. Like all pre-releases, it has been made available in us-west-2, in case you would like to test it out.
Thanks @mogren I will test and update.
We just released v1.6.3 with the fix from #1011.