amazon-vpc-cni-k8s 🚀 - EKS 1.16 / v1.6.x: "couldn't get current server API group list; will keep using cached value"

There's quite a few similar issues but they aren't exactly the same error. Not sure if this is a duplicate issue or not...

https://github.com/aws/amazon-vpc-cni-k8s/issues/1049#issuecomment-649983328 (logs don't match)
https://github.com/aws/amazon-vpc-cni-k8s/issues/1055 (perpetual problem)
https://github.com/aws/amazon-vpc-cni-k8s/issues/1054 (different Reason for last state of pod)

max-rocket-internet on 10 Jul 2020

Same for v1.6.3 😐

max-rocket-internet on 15 Jul 2020

👍6

@max-rocket-internet Thanks for testing with v1.6.3 as well. I suspect the issue here is that kube-proxy have not yet set up the iptables rules to resolve 172.20.0.1, and the old client code we still use (See #522 for details) doesn't handle this well.

I guess a work-around to avoid the restarts would be to retry a few times before returning an error, instead of letting kubelet do the retries. That would at least hide these errors from the logs.

mogren on 15 Jul 2020

Makes sense but why wasn't this a problem with v1.5.x releases? Or is it related to moving to EKS/AMI 1.16?

max-rocket-internet on 17 Jul 2020

we had the same issue after upgrading from eks 1.15 to 1.16. We were just bumping the image version inside DaemonSet to 1.6.X. What solved our issue is to apply the full yaml provided by aws doc:
https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/release-1.6/config/v1.6/aws-k8s-cni.yaml

it did changes both to the DaemonSet and the ClusterRole.

Good luck !

hristov-hs on 24 Jul 2020

But there's no changes to that file between the v1.6.2 and v1.6.3 releases? https://github.com/aws/amazon-vpc-cni-k8s/commits/master/config/v1.6/aws-k8s-cni.yaml

max-rocket-internet on 29 Jul 2020

Any update @mogren?

max-rocket-internet on 4 Aug 2020

@max-rocket-internet Hey, sorry for the lack of updates on this. Been out for a bit without much network access, so haven't been able to track this one down. I agree that there is no config change between v1.6.2 and v1.6.3, but since v1.5.x, we have updated the readiness and liveness probe configs.

Between kubernetes 1.15 and 1.16 kube-proxy has changed, so that could be related. We have not yet been able to reproduce this yet doing master upgrades.

mogren on 4 Aug 2020

👍1

We had the same problem. Updating master and nodegroups from 1.15 to 1.16. We had to fix the version of kubeproxy
(kube-proxy:v1.16.13 -> kube-proxy:v1.16.12) and recreate nodes

schmitz-chris on 7 Aug 2020

👀1

^^

kubectl set image daemonset.apps/kube-proxy \
    -n kube-system \
    kube-proxy=602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/kube-proxy:v1.16.12

marcelbirkner on 7 Aug 2020

👍2

Just an FYI, we were encountering this issue on k8s 1.17, kube-proxy 1.16.13, aws cni 1.6.3 and 1.7.1. Turns out the issue was a bad PSP for kube-proxy that had readOnlyRootFilesystem: true. Kube proxy logs will show that it's unable to configure some iptables rules due to the readonly root fs. However this doesn't seem to crash kube-proxy. readOnlyRootFilesystem: false fixes things.

Niksko on 1 Sep 2020

So two suggestions now:

kube-proxy should have readOnlyRootFilesystem: false
Downgrade kube-proxy to kube-proxy:v1.16.12

Any confirmation from AWS about the issue and resolution?

max-rocket-internet on 2 Sep 2020

@max-rocket-internet @Niksko Could you please provide the kube-proxy configs that you have where this issue shows up? Is this on EKS clusters? What Kubernetes version? Do you have some custom PSP for these clusters? Are the worker nodes using a custom AMI or have some script locking iptables on startup?

Is there something you know that changed in 1.16.13 that makes starting kube-proxy take slightly longer time, triggering this issue?

mogren on 2 Sep 2020

@mogren this is on EKS, version 1.17. We discovered this as part of adding custom PSPs to all components. No scripts locking iptables on startup, using the standard EKS AMIs.

The behaviour we were seeing was that the aws-node pod _never_ became ready, and was crash-looping. Apologies if that caused any confusion. I think it's not unreasonable to conclude that:

kube-proxy sets up iptables rules that are required by aws-node
setting the filesystem to readonly on kube-proxy causes these rules to never be set up, so aws-node crash loops
a race condition between kube-proxy and aws-node could cause aws-node to come up before the iptables rules have been configured, causing an initial crash before working as normal (when kube-proxy creates the rules).

Niksko on 3 Sep 2020

👍2

I am facing a similar issue, aws-node pod restarts 1 time on every node startup, it will work after that.
Error:

kubectl logs aws-node-f8tw6   --previous -n kube-system

Copying portmap binary ... Starting IPAM daemon in the background ... ok.
ERROR: logging before flag.Parse: E0904 13:53:37.150548       8 memcache.go:138] couldn't get current server API group list; will keep using cached value. (Get https://10.100.0.1:443/api?timeout=32s: dial tcp 10.100.0.1:443: i/o timeout)
Checking for IPAM connectivity ...  failed.
Timed out waiting for IPAM daemon to start:

EKS Version: 1.17
Platform version: eks.2
Kube-proxy: v1.17.9-eksbuild.1
aws-node: v1.6.3-eksbuild.1

I tried adding sleep in aws-node to rule that this is happening because kube-proxy is taking time to start, verified that kube-proxy started before aws-node.

tibin-mfl on 4 Sep 2020

Hi @tibin-mfl

Can you please share cni logs - https://docs.aws.amazon.com/eks/latest/userguide/troubleshooting.html#troubleshoot-cni and also kube-proxy pod logs. It will help us verify why kube-proxy is having a delay start.

Thanks.

jayanthvn on 4 Sep 2020

Hi, just wanted to chime in and we're seeing the same thing. Like others have mentioned the pod seems to restart once when the node first starts up and it's fine after that. We're not using any custom PSPs.

EKS version: 1.17
AMI version: v1.17.9-eks-4c6976
kube-proxy version: 1.17.7
CNI version: 1.6.3

I can see these errors in kube-proxy logs on one of the nodes where aws-node restarted:

udpIdleTimeout: 250ms: v1alpha1.KubeProxyConfiguration.Conntrack: v1alpha1.KubeProxyConntrackConfiguration.ReadObject: found unknown field: max, error found in #10 byte of ...|ck":{"max":0,"maxPer|..., bigger context ...|":"","configSyncPeriod":"15m0s","conntrack":{"max":0,"maxPerCore":32768,"min":131072,"tcpCloseWaitTi|...
I0905 23:12:39.826265       7 feature_gate.go:243] feature gates: &{map[]}
E0905 23:12:40.388938       7 node.go:124] Failed to retrieve node info: nodes "ip-10-0-212-179" not found
E0905 23:12:41.516857       7 node.go:124] Failed to retrieve node info: nodes "ip-10-0-212-179" not found
E0905 23:12:43.567271       7 node.go:124] Failed to retrieve node info: nodes "ip-10-0-212-179" not found
E0905 23:12:48.167166       7 node.go:124] Failed to retrieve node info: nodes "ip-10-0-212-179" not found
E0905 23:12:56.325941       7 node.go:124] Failed to retrieve node info: nodes "ip-10-0-212-179" not found
E0905 23:13:14.684106       7 node.go:124] Failed to retrieve node info: nodes "ip-10-0-212-179" not found
I0905 23:13:14.684134       7 server_others.go:140] can't determine this node's IP, assuming 127.0.0.1; if this is incorrect, please set the --bind-address flag
I0905 23:13:14.684150       7 server_others.go:145] Using iptables Proxier.
W0905 23:13:14.684259       7 proxier.go:286] clusterCIDR not specified, unable to distinguish between internal and external traffic
I0905 23:13:14.684410       7 server.go:571] Version: v1.17.7
I0905 23:13:14.684773       7 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_max' to 131072
I0905 23:13:14.684803       7 conntrack.go:52] Setting nf_conntrack_max to 131072
I0905 23:13:14.684850       7 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400
I0905 23:13:14.684894       7 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
I0905 23:13:14.685092       7 config.go:313] Starting service config controller
I0905 23:13:14.685101       7 shared_informer.go:197] Waiting for caches to sync for service config
I0905 23:13:14.685139       7 config.go:131] Starting endpoints config controller
I0905 23:13:14.685149       7 shared_informer.go:197] Waiting for caches to sync for endpoints config
I0905 23:13:14.785879       7 shared_informer.go:204] Caches are synced for service config
I0905 23:13:14.785932       7 shared_informer.go:204] Caches are synced for endpoints config

And this is in the aws-node logs:

{"log":"Copying portmap binary ... Starting IPAM daemon in the background ... ok.\n","stream":"stdout","time":"2020-09-03T11:06:26.418457689Z"}
{"log":"Checking for IPAM connectivity ... ok.\n","stream":"stdout","time":"2020-09-03T11:06:46.458122639Z"}
{"log":"Copying additional CNI plugin binaries and config files ... ok.\n","stream":"stdout","time":"2020-09-03T11:06:46.474182395Z"}
{"log":"Foregrounding IPAM daemon ... \n","stream":"stdout","time":"2020-09-03T11:06:46.474202946Z"}
{"log":"ERROR: logging before flag.Parse: W0903 14:22:54.564615       9 reflector.go:341] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:99: watch of *v1.Pod ended with: too old resource version: 109452564 (109453466)\n","stream":"stderr","time":"2020-09-03T14:22:54.564769814Z"}
{"log":"ERROR: logging before flag.Parse: W0903 18:30:26.713005       9 reflector.go:341] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:99: watch of *v1.Pod ended with: too old resource version: 109555468 (109679596)\n","stream":"stderr","time":"2020-09-03T18:30:26.713161405Z"}
{"log":"ERROR: logging before flag.Parse: W0903 18:45:56.655601       9 reflector.go:341] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:99: watch of *v1.Pod ended with: too old resource version: 109679596 (109687399)\n","stream":"stderr","time":"2020-09-03T18:45:56.655715674Z"}
````

It seems like this has started happening for us as part of the 1.17 upgrade, we haven't restarted all our nodes since the upgrade and I can see that the pods that are still running (on AMI `v1.16.12-eks-904af05`) the `aws-node` pod didn't restart:

aws-node-26tfq aws-node-2pnwq aws-node-4f52v aws-node-5qsll aws-node-6z6wq aws-node-92hvs aws-node-c8srx aws-node-chkhb aws-node-djlkb aws-node-g7drp aws-node-g9rgn aws-node-gbdq5 aws-node-gc5zl aws-node-hc48d aws-node-hx9bl aws-node-j9dcn aws-node-jj4qs aws-node-kwbjl aws-node-ljcv8 aws-node-lv74f aws-node-q2w2w aws-node-s7qw4 aws-node-tck8w aws-node-tjhtf aws-node-tzpb2 aws-node-vm4nh aws-node-xnnj2 aws-node-zchs9 ``` 1/1 Running 1 10h
1/1 Running 1 3h33m
1/1 Running 1 4d22h
1/1 Running 1 5d22h
1/1 Running 0 40d
1/1 Running 0 40d
1/1 Running 1 5d22h
1/1 Running 1 5d4h
1/1 Running 0 40d
1/1 Running 1 5d5h
1/1 Running 0 40d
1/1 Running 1 2d22h
1/1 Running 1 2d22h
1/1 Running 1 5d22h
1/1 Running 1 24d
1/1 Running 1 39d
1/1 Running 1 2d22h
1/1 Running 1 153m
1/1 Running 1 39d
1/1 Running 1 12d
1/1 Running 1 2d22h
1/1 Running 1 2d22h
1/1 Running 1 5d4h
1/1 Running 1 2d22h
1/1 Running 0 40d
1/1 Running 1 2d22h
1/1 Running 2 153m
1/1 Running 1 2d22h

I'm happy to share the full logs if they're helpful, just give me an email address to send them!

mogggggg on 6 Sep 2020

👀1

Hi @mogggggg

Thanks for letting us know. Please kindly share the full logs from the log collector script https://docs.aws.amazon.com/eks/latest/userguide/troubleshooting.html#troubleshoot-cni and also kube-proxy pod logs. You can email it [email protected].

Thanks.

jayanthvn on 6 Sep 2020

👍1

Hi @mogggggg

Thanks for sharing the logs. Will review and get back on that asap.

jayanthvn on 7 Sep 2020

Could you please provide the kube-proxy configs that you have where this issue shows up?

It's default EKS

Is this on EKS clusters?

Yes

What Kubernetes version?

v1.17.9-eks-4c6976

Do you have some custom PSP for these clusters?

Nope

Are the worker nodes using a custom AMI or have some script locking iptables on startup?

AMI is v20200723 and no custom scripts or except adding some users in user-data.

max-rocket-internet on 7 Sep 2020

I am facing a similar issue, aws-node pod restarts 1 time on every node startup, it will work after that.
EKS Version: 1.17
Platform version: eks.2
Kube-proxy: v1.17.9-eksbuild.1
aws-node: v1.6.3-eksbuild.1

@jayanthvn Just wanted to know does aws-node pod depend on Kube-proxy or vice versa?

tibin-mfl on 7 Sep 2020

@tibin-mfl Yes, the CNI pod (aws-node) needs kube-proxy to set up the cluster IPs before it can start up.

mogren on 8 Sep 2020

👍2

Hi @mogggggg

Sorry for the delayed response, as you have mentioned it looks like kube-proxy is waiting to retrieve node info and during that time frame aws-node starts and is unable to communicate to the API Server because iptables isnt updated and hence it restarts. I will try to repro and we will see how to mitigate this issue.

Thanks for your patience.

jayanthvn on 10 Sep 2020

👍1

Currently, the workaround is adding a busybox as Init Container to wait for the kube-proxy start.

  initContainers:
  - name: init-kubernetes-api
    image: busybox:1.28
    command: ['sh', '-c', "until nslookup kubernetes.default.svc.cluster.local ${KUBE_DNS_PORT_53_TCP_ADDR}; do echo waiting for kubernetes Service endpoint; sleep 2; done"]

focaaby on 21 Sep 2020

👍5

I added the initContainer to my aws-node, as a temporary fix, the same was suggested from aws support as well. Now the problem is sometimes aws-node take more than 5 minutes to be up.

sum by(daemonset, namespace) (kube_daemonset_status_number_unavailable{job="kube-state-metrics",namespace=~"kube-system"}) >0

Screenshot 2020-09-25 at 7 33 07 AM

tibin-mfl on 25 Sep 2020

@tibin-mfl thanks for reporting this, that is definitely concerning. Do you have kube-proxy logs from any of these nodes? It would be very interesting to see why kube-proxy was taking that long to start up!

mogren on 25 Sep 2020

Hi @jayanthvn, just wondering if you've had any luck narrowing down what could be causing this - I can gather more data on my end if that's helpful?

We use prometheus-operator which has a built-in alert that triggers when a pod restarts X number of times - unfortunately this issue causes the alert to trigger when we scale up a number of nodes.

mogggggg on 28 Sep 2020

@tibin-mfl thanks for reporting this, that is definitely concerning. Do you have kube-proxy logs from any of these nodes? It would be very interesting to see why kube-proxy was taking that long to start up!

@mogren I attached Kube-proxy error log

{"log":"udpIdleTimeout: 250ms: v1alpha1.KubeProxyConfiguration.Conntrack: v1alpha1.KubeProxyConntrackConfiguration.ReadObject: found unknown field: max, error found in #10 byte of ...|ck\":{\"max\":0,\"maxPer|..., bigger context ...|\":\"\",\"configSyncPeriod\":\"15m0s\",\"conntrack\":{\"max\":0,\"maxPerCore\":32768,\"min\":131072,\"tcpCloseWaitTi|...\n","stream":"stderr","time":"2020-09-04T13:53:07.16323908Z"}
{"log":"I0904 13:53:07.163201       1 feature_gate.go:243] feature gates: \u0026{map[]}\n","stream":"stderr","time":"2020-09-04T13:53:07.164205369Z"}
{"log":"E0904 13:53:07.657821       1 node.go:124] Failed to retrieve node info: nodes \"ip-172-31-215-37.xx.com\" not found\n","stream":"stderr","time":"2020-09-04T13:53:07.657910856Z"}
{"log":"E0904 13:53:08.817715       1 node.go:124] Failed to retrieve node info: nodes \"ip-172-31-215-37.xx.com\" not found\n","stream":"stderr","time":"2020-09-04T13:53:08.817826081Z"}
{"log":"E0904 13:53:11.028140       1 node.go:124] Failed to retrieve node info: nodes \"ip-172-31-215-37.xx.com\" not found\n","stream":"stderr","time":"2020-09-04T13:53:11.028236757Z"}
{"log":"E0904 13:53:15.789086       1 node.go:124] Failed to retrieve node info: nodes \"ip-172-31-215-37.xx.com\" not found\n","stream":"stderr","time":"2020-09-04T13:53:15.789186713Z"}
{"log":"E0904 13:53:24.954484       1 node.go:124] Failed to retrieve node info: nodes \"ip-172-31-215-37.xx.com\" not found\n","stream":"stderr","time":"2020-09-04T13:53:24.954581764Z"}
{"log":"E0904 13:53:43.712021       1 node.go:124] Failed to retrieve node info: nodes \"ip-172-31-215-37.xx.com\" not found\n","stream":"stderr","time":"2020-09-04T13:53:43.712148589Z"}
{"log":"I0904 13:53:43.712044       1 server_others.go:140] can't determine this node's IP, assuming 127.0.0.1; if this is incorrect, please set the --bind-address flag\n","stream":"stderr","time":"2020-09-04T13:53:43.712191645Z"}
{"log":"I0904 13:53:43.712063       1 server_others.go:145] Using iptables Proxier.\n","stream":"stderr","time":"2020-09-04T13:53:43.712196787Z"}
{"log":"W0904 13:53:43.712165       1 proxier.go:286] clusterCIDR not specified, unable to distinguish between internal and external traffic\n","stream":"stderr","time":"2020-09-04T13:53:43.712204824Z"}
{"log":"I0904 13:53:43.712828       1 server.go:571] Version: v1.17.9\n","stream":"stderr","time":"2020-09-04T13:53:43.712928066Z"}
{"log":"I0904 13:53:43.713270       1 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_max' to 131072\n","stream":"stderr","time":"2020-09-04T13:53:43.713356276Z"}
{"log":"I0904 13:53:43.713305       1 conntrack.go:52] Setting nf_conntrack_max to 131072\n","stream":"stderr","time":"2020-09-04T13:53:43.713403074Z"}
{"log":"I0904 13:53:43.713538       1 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400\n","stream":"stderr","time":"2020-09-04T13:53:43.714641593Z"}
{"log":"I0904 13:53:43.713592       1 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600\n","stream":"stderr","time":"2020-09-04T13:53:43.714670464Z"} a

Complete kubeproxy log https://pastebin.pl/view/3d4cc276 attached here

tibin-mfl on 28 Sep 2020

👀1

Hi @mogggggg

Sorry I didn't get a chance to repro the issue with higher node scale. As you mentioned even we are suspecting kube-proxy is not able to get the node info when the node scale is high. I will try to repro it this week. Can you please let me know while scaling the nodes at what node scale do you see the alert and how many number nodes you scale at a time? Also are these spot instances? Next time when you repro, can you please increase the debug level of kube-proxy - --v=5

Thanks for your patience and help on this issue.

jayanthvn on 28 Sep 2020

Hi @jayanthvn,
I don't think the amount of nodes in the cluster is playing a particular role - our smallest cluster is 10 nodes and the restarting of the aws-node pod happens just the same as it does on one of our larger clusters (which is 40 nodes). We have four clusters total, two of which are 100% spot instances and two that are 100% on-demand this issue happens on all of them.

I've just emailed you a copy of the cni-troubleshoot script output with kube-proxy logging turned up - let me know if I can do anything else!

mogggggg on 29 Sep 2020

Hi @mogggggg

Thanks for the logs and the setup information. Will review the logs and update you asap.

jayanthvn on 29 Sep 2020

Hello everyone, I'm also having the same issue.
my K8s cluster is still on v1.15 and before upgrading to v1.16 I wanted to make sure all my controllers are on the recommended version by aws in this page.

My main controllers version now:
kube-proxy: v1.6.12 (works)
core-dns: 1.6.6 (works)
amazon-vpc-cni-k8s: 1.7.5 (doesn't work)

$ kubectl version --short
Client Version: v1.15.0
Server Version: v1.15.11-eks-065dce

The deployment is done exactly as in the release docs: https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.7.5
I tried all the solutions mentioned in this repo:
1- downgrading kube-proxy from 1.6.13 to 1.6.12 with/without adding this label. Check this answer
2- I tried the following versions of amazon-vpc-cni-k8s: v1.7.5,v1.7.4, v1.7.3, v1.7.2 and finally wanted to roll-back to 1.6.3 but the issue is always the same.


  Type     Reason     Age                   From                                                   Message
  ----     ------     ----                  ----                                                   -------
  Normal   Scheduled  3m56s                 default-scheduler                                      Successfully assigned kube-system/aws-node-98fnz to ip-some-ip..eu-west-1.compute.internal
  Normal   Pulling    3m54s                 kubelet, ip-some-ip..eu-west-1.compute.internal  Pulling image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-init:v1.7.5"
  Normal   Pulled     3m39s                 kubelet, ip-some-ip..eu-west-1.compute.internal  Successfully pulled image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-init:v1.7.5"
  Normal   Created    3m38s                 kubelet, ip-some-ip..eu-west-1.compute.internal  Created container aws-vpc-cni-init
  Normal   Started    3m38s                 kubelet, ip-some-ip..eu-west-1.compute.internal  Started container aws-vpc-cni-init
  Normal   Started    3m29s                 kubelet, ip-some-ip..eu-west-1.compute.internal  Started container aws-node
  Warning  Unhealthy  3m25s                 kubelet, ip-some-ip..eu-west-1.compute.internal  Readiness probe failed: {"level":"info","ts":"2020-10-20T13:57:04.975Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
  Warning  Unhealthy  3m15s                 kubelet, ip-some-ip..eu-west-1.compute.internal  Readiness probe failed: {"level":"info","ts":"2020-10-20T13:57:14.998Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
  Warning  Unhealthy  3m6s                  kubelet, ip-some-ip.eu-west-1.compute.internal  Readiness probe failed: {"level":"info","ts":"2020-10-20T13:57:24.966Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
  Warning  Unhealthy  2m55s                 kubelet, ip-some-ip..eu-west-1.compute.internal  Readiness probe failed: {"level":"info","ts":"2020-10-20T13:57:34.954Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
  Warning  Unhealthy  2m45s                 kubelet, ip-some-ip..eu-west-1.compute.internal  Readiness probe failed: {"level":"info","ts":"2020-10-20T13:57:45.006Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
  Warning  Unhealthy  2m36s                 kubelet, ip-some-ip..eu-west-1.compute.internal  Readiness probe failed: {"level":"info","ts":"2020-10-20T13:57:54.957Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
  Warning  Unhealthy  2m25s                 kubelet, ip-some-ip..eu-west-1.compute.internal  Readiness probe failed: {"level":"info","ts":"2020-10-20T13:58:04.959Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
  Warning  Unhealthy  2m25s                 kubelet, ip-some-ip..eu-west-1.compute.internal  Liveness probe failed: {"level":"info","ts":"2020-10-20T13:58:05.694Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
  Warning  Unhealthy  2m16s                 kubelet, ip-some-ip..eu-west-1.compute.internal  Readiness probe failed: {"level":"info","ts":"2020-10-20T13:58:14.960Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
  Normal   Killing    2m5s                  kubelet, ip-some-ip..eu-west-1.compute.internal  Container aws-node failed liveness probe, will be restarted
  Warning  Unhealthy  116s (x4 over 2m15s)  kubelet, ip-some-ip..eu-west-1.compute.internal  (combined from similar events): Readiness probe failed: {"level":"info","ts":"2020-10-20T13:58:34.963Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
  Normal   Pulling    115s (x2 over 3m37s)  kubelet, ip-some-ip.eu-west-1.compute.internal  Pulling image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.7.5"
  Normal   Created    113s (x2 over 3m29s)  kubelet, ip-some-ip..eu-west-1.compute.internal  Created container aws-node
  Normal   Pulled     113s (x2 over 3m30s)  kubelet, ip-some-ip..eu-west-1.compute.internal  Successfully pulled image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.7.5"

safaa-alnabulsi on 20 Oct 2020

👍1

Hi @safaa-alnabulsi

Just to confirm again, you are also seeing aws-node restart once and then there are no issues?

based on the logs you have attached -

 Normal   Killing    2m5s                  kubelet, ip-some-ip..eu-west-1.compute.internal  Container aws-node failed liveness probe, will be restarted
  Warning  Unhealthy  116s (x4 over 2m15s)  kubelet, ip-some-ip..eu-west-1.compute.internal  (combined from similar events): Readiness probe failed: {"level":"info","ts":"2020-10-20T13:58:34.963Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
  Normal   Pulling    115s (x2 over 3m37s)  kubelet, ip-some-ip.eu-west-1.compute.internal  Pulling image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.7.5"
  Normal   Created    113s (x2 over 3m29s)  kubelet, ip-some-ip..eu-west-1.compute.internal  Created container aws-node
  Normal   Pulled     113s (x2 over 3m30s)  kubelet, ip-some-ip..eu-west-1.compute.internal  Successfully pulled image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.7.5"

jayanthvn on 20 Oct 2020

@jayanthvn no, it keeps restarting after it's in "CrashLoopBackOff" state.

safaa-alnabulsi on 20 Oct 2020

Hi @safaa-alnabulsi

This looks to be a different issue, I feel it is better to be tracked as a new issue since this thread is mainly to track delay in kube-proxy to get the node information and aws-node waits for kube-proxy to come up. But eventually aws-node is up. Please open a new issue and share us the logs - sudo bash /opt/cni/bin/aws-cni-support.sh. You can email the logs to [email protected]. Thank you!

jayanthvn on 20 Oct 2020

👍1

Just checking in here to say I'm having the same issue with nodes taking forever to come online w/ CNI 1.7 on 1.17/EKS.3 and 1.18/EKS.1 on Botlerocket. I did file a support ticket as well (7461101551) that has a buncha logs attached to it.

patcable on 28 Oct 2020

I am facing this issue as well with 1.18 upgrade from 1.17. AWS CNI version 1.7.5

IPAMD is trying to connect to 172.20.0.1, but the iptable rules needed to make that happen is not there yet.

{"level":"info","ts":"2020-10-28T17:56:04.712Z","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD v1.7.5  ..."}
{"level":"info","ts":"2020-10-28T17:56:34.714Z","caller":"aws-k8s-agent/main.go:41","msg":"Testing communication with server"}
{"level":"error","ts":"2020-10-28T17:57:04.714Z","caller":"aws-k8s-agent/main.go:28","msg":"Failed to create client: error communicating with apiserver: Get https://172.20.0.1:443/version?timeout=32s: dial tcp 172.20.0.1:443: i/o timeout"}
{"level":"info","ts":"2020-10-28T17:57:44.699Z","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD v1.7.5  ..."}
{"level":"info","ts":"2020-10-28T17:58:14.701Z","caller":"aws-k8s-agent/main.go:41","msg":"Testing communication with server"}
{"level":"error","ts":"2020-10-28T17:58:44.702Z","caller":"aws-k8s-agent/main.go:28","msg":"Failed to create client: error communicating with apiserver: Get https://172.20.0.1:443/version?timeout=32s: dial tcp 172.20.0.1:443: i/o timeout"}

IP table rules on node (iptables -t nat -L -n)

Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination

Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination
KUBE-POSTROUTING  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes postrouting rules */

Chain DOCKER (0 references)
target     prot opt source               destination

Chain KUBE-KUBELET-CANARY (0 references)
target     prot opt source               destination

Chain KUBE-MARK-DROP (0 references)
target     prot opt source               destination
MARK       all  --  0.0.0.0/0            0.0.0.0/0            MARK or 0x8000

Chain KUBE-MARK-MASQ (0 references)
target     prot opt source               destination
MARK       all  --  0.0.0.0/0            0.0.0.0/0            MARK or 0x4000

Chain KUBE-POSTROUTING (1 references)
target     prot opt source               destination
RETURN     all  --  0.0.0.0/0            0.0.0.0/0            mark match ! 0x4000/0x4000
MARK       all  --  0.0.0.0/0            0.0.0.0/0            MARK xor 0x4000
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service traffic requiring SNAT */ random-fully

johnthedev97 on 28 Oct 2020

Would it make sense to add the approach that @focaaby mentioned here in the init container of aws-node?

giannisbetas on 28 Oct 2020

@giannisbetas I did that and had the same result (things still took a while to enter the ready state)

patcable on 29 Oct 2020

Has anyone seen this issue but do not have the liveness/readiness probes defined on the CNI daemonset?

We delete the pre-installed daemonset off of the cluster when starting in order to replace it with a Pulumi manifest we have. At some point, AWS appears to have added a liveness/readiness probe here (https://github.com/aws/amazon-vpc-cni-k8s/blob/master/config/v1.7/aws-k8s-cni.yaml#L158) that we were missing.

We 100% hit this issue every time a node startsup, where kubeproxy starts _after_ the CNI daemonset. Once we kill the CNI pod, it restarts and the node goes to Ready state. We were wondering why more AWS customers aren't running into this issue. I think many of them may have the probes defined, so the CNI is killed when it fails to talk to IPAMd, and then is restarted without them knowing.

@mogggggg @patcable would you mind attaching your daemonset manifest to the ticket? I'm just curious if you're using the release yaml's exactly or if you've modified them at all.

jeid64 on 29 Oct 2020

@jeid64 We just use the release YAMLs that are linked in the AWS docs, no customising on our end.

mogggggg on 29 Oct 2020

I got my issue resolved, in my case I am managing the entire EKS stack using terraform and the problem happened while updating the code base to support upgrade to 1.18 and AWS CNI 1.7.5 (from 1.6.4). Culprits in my case were

PSP (We use a specific PSP tailored for aws-node daemonset) and looks like 1.7.5 needs two additional privileges
a. NET_ADMIN
b. HostPath /var/run/aws-node
Terraform was updating the node groups to new AMI before CNI was updated, I had to change the order to update the CNI vesion before the node groups are updated.

Sharing in case it helps any one

johnthedev97 on 30 Oct 2020

👍1

I just wanted to give an update from our end - we've upgraded our dev cluster to 1.18 and aws-node to it's latest version and the issue looks like it's resolved!

mogggggg on 9 Nov 2020

my aws-node pod restarted with the following error on startup

Copying portmap binary ... Starting IPAM daemon in the background ... ok.
Checking for IPAM connectivity ... ok.
cp: cannot create regular file '/host/opt/cni/bin/aws-cni': Text file busy
Copying additional CNI plugin binaries and config files ... %

Is this a related issue? or should I create a new issue for this?

tibin-mfl on 9 Dec 2020

Hi @tibin-mfl

Did you upgrade your cluster or the just the CNI image? What is the CNI version? Is it bottlerocket?

Thanks.

jayanthvn on 9 Dec 2020

Amazon-vpc-cni-k8s: EKS 1.16 / v1.6.x: "couldn't get current server API group list; will keep using cached value"

Most helpful comment

All 45 comments

Related issues