Amazon-vpc-cni-k8s: Pod restarting at node startup after upgrading to Kubernetes 1.15

Created on 13 Mar 2020 · 10 Comments · Source: aws/amazon-vpc-cni-k8s

aws-node pods restart once at every node startup, with this message in the logs:

starting IPAM daemon in background ... ok.
checking for IPAM connectivity ...  failed.
timed out waiting for IPAM daemon to start.

This behavior started right after upgrading the EKS cluster to 1.15.10. Specifically, I think the problem is in kube-proxy 1.15.10: I downgraded kube-proxy to 1.14.7 and the issue stopped, and after upgrading it again to 1.15.10, the restarts came back.

In the IPAMD logs on the node, I see the service always restarting at this point:

2020-03-13T11:16:54.321Z [INFO]     Setting up host network... 
2020-03-13T11:16:54.322Z [DEBUG]    Trying to find primary interface that has mac : xx:xx:xx:xx:xx:xx
2020-03-13T11:16:54.322Z [DEBUG]    Discovered interface: lo, mac: 
2020-03-13T11:16:54.322Z [DEBUG]    Discovered interface: eth0, mac: xx:xx:xx:xx:xx:xx
2020-03-13T11:16:54.322Z [INFO]     Discovered primary interface: eth0
2020-03-13T11:16:54.322Z [DEBUG]    Setting RPF for primary interface: /proc/sys/net/ipv4/conf/eth0/rp_filter
2020-03-13T11:16:54.323Z [DEBUG]    Found the Link that uses mac address xx:xx:xx:xx:xx:xx and its index is 2 (attempt 1/5)


2020-03-13T11:16:55.840Z [INFO]     Starting L-IPAMD v1.6.0  ...
2020-03-13T11:16:56.160Z [INFO]     Testing communication with server
2020-03-13T11:16:56.161Z [INFO]     Running with Kubernetes cluster version: v1.15+. git version: v1.15.10-eks-bac369. git tree state: clean. commit: bac3690554985327ae4d13e42169e8b1c2f37226. platform: linux/amd64

Anybody getting the same issue after upgrading? Thanks!

bug

nachomillangarcia

All 10 comments

Similar issue #872

nithu0115 on 17 Mar 2020

I'm wondering if the kube-proxy thing might be a red herring. Perhaps we just need to up the timeout for k8s API connectivity from 30 seconds to something like 1 minute?

That line:

checking for IPAM connectivity ...  failed.

occurs when IPAMd fails to connect to the Kubernetes API service after 30 seconds. kube-proxy is required in order for the Daemonset running IPAM-D to connect to the Kubernetes API service (because kube-proxy sets up the iptables rules on the host for the pod traffic). I'm wondering if kube-proxy 1.15.10 is taking a little longer (>30 seconds or so) to come up on the host after an upgrade and therefore a domino effect is happening with the IPAM-D timing out trying to connect to the k8s API server.
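A minimal sketch of what a connectivity check with a configurable timeout could look like. This is not the actual aws-node startup code; the function names `tcp_probe` and `wait_until_reachable` and their parameters are hypothetical, chosen only to illustrate raising the wait from 30 seconds to something longer.

```python
import socket
import time


def tcp_probe(host, port, connect_timeout=2.0):
    """Return True if a TCP connection to host:port succeeds (hypothetical helper)."""
    try:
        with socket.create_connection((host, port), timeout=connect_timeout):
            return True
    except OSError:
        return False


def wait_until_reachable(probe, timeout_s=30.0, interval_s=1.0,
                         clock=time.monotonic, sleep=time.sleep):
    """Call probe() repeatedly until it returns True or timeout_s has elapsed.

    clock and sleep are injectable so the loop can be exercised without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

With a helper like this, waiting up to one minute for the API service address would be `wait_until_reachable(lambda: tcp_probe(host, 443), timeout_s=60)`, where `host` is the cluster's Kubernetes service IP. If kube-proxy has not yet installed its iptables rules, the probe keeps failing until the rules appear or the deadline passes.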

jaypipes on 18 Mar 2020

That seems to be exactly it. I don't see the error when kube-proxy is fully working, only at startup.

It would be great to be able to customize that timeout.

nachomillangarcia on 18 Mar 2020

@nachomillangarcia did you update both kube-proxy and aws-node at the same time?

SaranBalaji90 on 18 Mar 2020

No, I had been running aws-node 1.6 for weeks before upgrading kube-proxy, with no errors until then.

nachomillangarcia on 18 Mar 2020

@nachomillangarcia thanks for confirming. I initially thought you noticed the behavior on existing nodes as well. But reading your issue description again, it seems like it happens on node startup only.

One thing we can do to confirm this quickly would be to look at when kube-proxy went into the running state on the worker node and compare that with the ipamd restart times.
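The comparison could be scripted roughly as follows. This is a sketch under assumptions: it treats every `Starting L-IPAMD` log line (the format shown in the issue description) as an ipamd start, and it expects the kube-proxy running time to be supplied by hand in the same timestamp format. The function names are hypothetical.

```python
from datetime import datetime, timezone


def parse_ts(ts):
    """Parse a timestamp such as 2020-03-13T11:16:55.840Z into an aware datetime."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)


def ipamd_start_times(log_lines):
    """Return the timestamps of log lines that announce an ipamd start."""
    return [
        parse_ts(line.split()[0])
        for line in log_lines
        if "Starting L-IPAMD" in line
    ]


def starts_before(start_times, kube_proxy_running_at):
    """Return the ipamd starts that happened before kube-proxy was running."""
    return [t for t in start_times if t < kube_proxy_running_at]
```

If ipamd starts show up before the time kube-proxy reached the running state, that would support the theory that ipamd is timing out while waiting for kube-proxy to set up the host's iptables rules.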

SaranBalaji90 on 19 Mar 2020

Hi @nachomillangarcia, have you tried with v1.6.1? Also, did this restart only happen when kube-proxy was updated?

mogren on 29 Apr 2020

We have made some changes to the master upgrade process that should mitigate this problem. Please open a new issue if there are any kube-proxy or CNI issues.

mogren on 19 May 2020

Facing the same issue: the aws-node pod restarts once at every node startup and works after that.
Error:

kubectl logs aws-node-f8tw6   --previous -n kube-system

Copying portmap binary ... Starting IPAM daemon in the background ... ok.
ERROR: logging before flag.Parse: E0904 13:53:37.150548       8 memcache.go:138] couldn't get current server API group list; will keep using cached value. (Get https://10.100.0.1:443/api?timeout=32s: dial tcp 10.100.0.1:443: i/o timeout)
Checking for IPAM connectivity ...  failed.
Timed out waiting for IPAM daemon to start:

EKS Version: 1.17
Platform version: eks.2
Kube-proxy: v1.17.9-eksbuild.1
aws-node: v1.6.3-eksbuild.1

I tried adding a sleep in aws-node to rule out that this happens because kube-proxy takes time to start, and verified that kube-proxy started before aws-node.

tibin-mfl on 4 Sep 2020

This happened intermittently after upgrading EKS from 1.14 to 1.15 with kube-proxy v1.14.9-eksbuild.1 and aws-node v1.6.3-eksbuild.1. When it happens, the node takes a very long time to register as healthy and aws-node restarts several times.