EKS: v1.11.5
CNI: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:1.3.0
AMI: amazon-eks-node-1.11-v20181210 (ami-0a9006fb385703b54)
E.g:
$ kubectl -n kube-system get pods -l k8s-app=aws-node
NAME READY STATUS RESTARTS AGE
aws-node-2t4n4 1/1 Running 1 20h
aws-node-45j6l 1/1 Running 1 3h
aws-node-4mstw 1/1 Running 0 22h
aws-node-95smx 1/1 Running 0 1h
aws-node-9cz4c 1/1 Running 1 1h
aws-node-9nkzt 1/1 Running 0 3h
aws-node-9pfgq 1/1 Running 0 2h
aws-node-cr5ds 1/1 Running 1 1h
aws-node-hhtrt 1/1 Running 2 1h
aws-node-j8brm 1/1 Running 0 6d
aws-node-jvvgc 1/1 Running 1 1h
aws-node-kd7ld 1/1 Running 1 22h
aws-node-mr7dh 1/1 Running 1 1h
aws-node-ntn57 1/1 Running 1 1h
aws-node-tntxp 1/1 Running 1 2h
aws-node-vk6cz 1/1 Running 0 2h
aws-node-vtpz7 1/1 Running 1 4h
aws-node-xm9wz 1/1 Running 1 1h
Even if I describe pod aws-node-hhtrt there is no events. No interesting logs from the pod either. Or the previous pod. I looked in our logging system to get all logs from this pod and there is nothing beyond the normal startup messages. But I did see from pod aws-node-9cz4c this message:
Failed to communicate with K8S Server. Please check instance security groups or http proxy setting
I tried to run /opt/cni/bin/aws-cni-support.sh on the node with pod aws-node-hhtrt but I get this error:
[root@ip-10-0-25-4 ~]# /opt/cni/bin/aws-cni-support.sh
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 1223 100 1223 0 0 1223 0 0:00:01 --:--:-- 0:00:01 1194k
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 912 100 912 0 0 912 0 0:00:01 --:--:-- 0:00:01 890k
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 106 100 106 0 0 106 0 0:00:01 --:--:-- 0:00:01 103k
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 83 100 83 0 0 83 0 0:00:01 --:--:-- 0:00:01 83000
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 28 100 28 0 0 28 0 0:00:01 --:--:-- 0:00:01 28000
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 6268 100 6268 0 0 6268 0 0:00:01 --:--:-- 0:00:01 6121k
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (7) Failed to connect to localhost port 10255: Connection refused
Hey @liwenwu-amazon,
Do you have any idea about this?
We are now seeing it in a production cluster:
$ kubectl -n kube-system get pods
NAME READY STATUS RESTARTS AGE
aws-node-52mcx 1/1 Running 1 6m
$ kubectl -n kube-system describe pod aws-node-52mcx
Name: aws-node-52mcx
Namespace: kube-system
Node: ip-10-1-25-141.ec2.internal/10.1.25.141
Start Time: Tue, 15 Jan 2019 09:54:30 +0100
Labels: controller-revision-hash=3287759043
k8s-app=aws-node
pod-template-generation=2
Annotations: scheduler.alpha.kubernetes.io/critical-pod:
Status: Running
...
State: Running
Started: Tue, 15 Jan 2019 09:55:17 +0100
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Tue, 15 Jan 2019 09:54:45 +0100
Finished: Tue, 15 Jan 2019 09:55:15 +0100
Ready: True
Restart Count: 1
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulMountVolume 6m54s kubelet, ip-10-1-25-141.ec2.internal MountVolume.SetUp succeeded for volume "cni-bin-dir"
Normal SuccessfulMountVolume 6m54s kubelet, ip-10-1-25-141.ec2.internal MountVolume.SetUp succeeded for volume "cni-net-dir"
Normal SuccessfulMountVolume 6m54s kubelet, ip-10-1-25-141.ec2.internal MountVolume.SetUp succeeded for volume "dockersock"
Normal SuccessfulMountVolume 6m54s kubelet, ip-10-1-25-141.ec2.internal MountVolume.SetUp succeeded for volume "log-dir"
Normal SuccessfulMountVolume 6m53s kubelet, ip-10-1-25-141.ec2.internal MountVolume.SetUp succeeded for volume "aws-node-token-cnmp7"
Normal Pulling 6m8s (x2 over 6m52s) kubelet, ip-10-1-25-141.ec2.internal pulling image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:1.3.0"
Normal Pulled 6m7s (x2 over 6m41s) kubelet, ip-10-1-25-141.ec2.internal Successfully pulled image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:1.3.0"
Normal Created 6m7s (x2 over 6m39s) kubelet, ip-10-1-25-141.ec2.internal Created container
Normal Started 6m7s (x2 over 6m39s) kubelet, ip-10-1-25-141.ec2.internal Started container
(I'm sorry if you are waiting for @liwenwu-amazon's reply) but you should collect following logs:
$ kubectl -n kube-system logs aws-node-52mcx -p
$ kubectl -n kube-system get event
First kubectl logs -p collects previous container logs, so if the process in previous container exited with some reason, we can see the message.
Second kubectl event collects entire event logs in kube-system namespace. You collected describe pod but it contains only Running pod's event. It is better to check entire event in the namespace.
Note, kubectl logs -p cannot collect logs if previous containers are already GCed. Also, kubectl get event might be gone in 2 hours (depends on settings on the cluster).
Thanks for the reply @nak3
I already mentioned the logs in issue text. Again, this is the message from the previous pod:
Failed to communicate with K8S Server. Please check instance security groups or http proxy setting
Getting event data is a good idea but in this scenario it produces only Normal type events:
$ kubectl -n kube-system get event | grep aws-node
13m 13m 1 aws-node.157a0690cdccdc01 DaemonSet Normal SuccessfulCreate daemonset-controller Created pod: aws-node-b6jrg
13m 13m 1 aws-node-b6jrg.157a0690fe521337 Pod Normal SuccessfulMountVolume kubelet, ip-10-1-24-47.ec2.internal MountVolume.SetUp succeeded for volume "cni-bin-dir"
13m 13m 1 aws-node-b6jrg.157a0690fe53507e Pod Normal SuccessfulMountVolume kubelet, ip-10-1-24-47.ec2.internal MountVolume.SetUp succeeded for volume "log-dir"
13m 13m 1 aws-node-b6jrg.157a0690fe53085f Pod Normal SuccessfulMountVolume kubelet, ip-10-1-24-47.ec2.internal MountVolume.SetUp succeeded for volume "cni-net-dir"
13m 13m 1 aws-node-b6jrg.157a0690fe52e142 Pod Normal SuccessfulMountVolume kubelet, ip-10-1-24-47.ec2.internal MountVolume.SetUp succeeded for volume "dockersock"
13m 13m 1 aws-node-b6jrg.157a06914d3edae2 Pod Normal SuccessfulMountVolume kubelet, ip-10-1-24-47.ec2.internal MountVolume.SetUp succeeded for volume "aws-node-token-cnmp7"
13m 13m 2 aws-node-b6jrg.157a0691a62f4b04 Pod spec.containers{aws-node} Normal Pulling kubelet, ip-10-1-24-47.ec2.internal pulling image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:1.3.0"
13m 13m 2 aws-node-b6jrg.157a06945ded769c Pod spec.containers{aws-node} Normal Pulled kubelet, ip-10-1-24-47.ec2.internal Successfully pulled image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:1.3.0"
13m 13m 2 aws-node-b6jrg.157a0694d9e113d9 Pod spec.containers{aws-node} Normal Created kubelet, ip-10-1-24-47.ec2.internal Created container
13m 13m 2 aws-node-b6jrg.157a0694debaabe7 Pod spec.containers{aws-node} Normal Started kubelet, ip-10-1-24-47.ec2.internal Started container
I see. (I'm sorry. I missed you mentioned that have already checked previous pod.). How about /var/log/aws-routed-eni/{ipamd.log,plugin.log}? Have you already checked them?
All good!
Good idea. I had a look at these logs just now and there is a few errors.
For this pod:
Containers:
aws-node:
....
Image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:1.3.0
...
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Tue, 15 Jan 2019 11:19:46 +0100
Finished: Tue, 15 Jan 2019 11:20:16 +0100
Looking at /var/log/aws-routed-eni/ipamd.log around this time, nothing interesting:
2019-01-15T10:19:46Z [INFO] Starting L-IPAMD v1.3.0 ...
2019-01-15T10:19:46Z [INFO] Testing communication with server
2019-01-15T10:20:18Z [INFO] Starting L-IPAMD v1.3.0 ...
2019-01-15T10:20:18Z [INFO] Testing communication with server
2019-01-15T10:20:18Z [INFO] Running with Kubernetes cluster version: v1.11+. git version: v1.11.5-eks-6bad6d. git tree state: clean. commit: 6bad6d9c768dc0864dab48a11653aa53b5a47043. platform: linux/amd64
2019-01-15T10:20:18Z [INFO] Communication with server successful
2019-01-15T10:20:18Z [INFO] Starting Pod controller
But in /var/log/aws-routed-eni/plugin.log:
2019-01-15T10:19:48Z [INFO] Starting CNI Plugin v1.3.0 ...
2019-01-15T10:19:48Z [INFO] Received CNI add request: ContainerID(2192c36a267d314cac5d81635db4173ad79c802af65c5e3853147dc9bb1c0544) Netns(/proc/5870/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=eu01-stg01-icash-web-668bf5f99b-2k6zt;K8S_POD_INFRA_CONTAINER_ID=2192c36a267d314cac5d81635db4173ad79c802af65c5e3853147dc9bb1c0544) Path(/opt/cni/bin) argsStdinData({"cniVersion":"","name":"aws-cni","type":"aws-cni","vethPrefix":"eni"})
2019-01-15T10:19:49Z [INFO] Starting CNI Plugin v1.3.0 ...
2019-01-15T10:19:49Z [INFO] Received CNI del request: ContainerID(2192c36a267d314cac5d81635db4173ad79c802af65c5e3853147dc9bb1c0544) Netns(/proc/5870/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=eu01-stg01-icash-web-668bf5f99b-2k6zt;K8S_POD_INFRA_CONTAINER_ID=2192c36a267d314cac5d81635db4173ad79c802af65c5e3853147dc9bb1c0544) Path(/opt/cni/bin) argsStdinData({"cniVersion":"","name":"aws-cni","type":"aws-cni","vethPrefix":"eni"})
2019-01-15T10:19:49Z [INFO] Starting CNI Plugin v1.3.0 ...
2019-01-15T10:19:49Z [INFO] Received CNI add request: ContainerID(0d0d4165483b8565a2718326cfd02b593ddcd43e2234742d640ead8e90731935) Netns(/proc/6153/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=kube-system;K8S_POD_NAME=fluent-bit-np426;K8S_POD_INFRA_CONTAINER_ID=0d0d4165483b8565a2718326cfd02b593ddcd43e2234742d640ead8e90731935) Path(/opt/cni/bin) argsStdinData({"cniVersion":"","name":"aws-cni","type":"aws-cni","vethPrefix":"eni"})
2019-01-15T10:19:49Z [INFO] Starting CNI Plugin v1.3.0 ...
2019-01-15T10:19:49Z [INFO] Received CNI del request: ContainerID(0d0d4165483b8565a2718326cfd02b593ddcd43e2234742d640ead8e90731935) Netns(/proc/6153/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=kube-system;K8S_POD_NAME=fluent-bit-np426;K8S_POD_INFRA_CONTAINER_ID=0d0d4165483b8565a2718326cfd02b593ddcd43e2234742d640ead8e90731935) Path(/opt/cni/bin) argsStdinData({"cniVersion":"","name":"aws-cni","type":"aws-cni","vethPrefix":"eni"})
2019-01-15T10:19:50Z [INFO] Starting CNI Plugin v1.3.0 ...
2019-01-15T10:19:50Z [INFO] Received CNI del request: ContainerID(2192c36a267d314cac5d81635db4173ad79c802af65c5e3853147dc9bb1c0544) Netns() IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=eu01-stg01-icash-web-668bf5f99b-2k6zt;K8S_POD_INFRA_CONTAINER_ID=2192c36a267d314cac5d81635db4173ad79c802af65c5e3853147dc9bb1c0544) Path(/opt/cni/bin) argsStdinData({"cniVersion":"","name":"aws-cni","type":"aws-cni","vethPrefix":"eni"})
2019-01-15T10:19:50Z [ERROR] Error received from DelNetwork grpc call for pod eu01-stg01-icash-web-668bf5f99b-2k6zt namespace default container 2192c36a267d314cac5d81635db4173ad79c802af65c5e3853147dc9bb1c0544: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"
That last error repeats over and over until:
2019-01-15T10:20:20Z [INFO] Starting CNI Plugin v1.3.0 ...
2019-01-15T10:20:20Z [INFO] Received CNI del request: ContainerID(55ea31155c5f2f30c0f4eb71e0c2e4501526e5e79642683d3916434c7501a368) Netns() IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=ingress01-nginx-ingress-controller-tsqr6;K8S_POD_INFRA_CONTAINER_ID=55ea31155c5f2f30c0f4eb71e0c2e4501526e5e79642683d3916434c7501a368) Path(/opt/cni/bin) argsStdinData({"cniVersion":"","name":"aws-cni","type":"aws-cni","vethPrefix":"eni"})
2019-01-15T10:20:20Z [ERROR] Failed to process delete request for pod ingress01-nginx-ingress-controller-tsqr6 namespace default container 55ea31155c5f2f30c0f4eb71e0c2e4501526e5e79642683d3916434c7501a368: <nil>
2019-01-15T10:20:20Z [INFO] Starting CNI Plugin v1.3.0 ...
Both error repeat for a few more times and then there is no more in this log.
I hope that helps!
@max-rocket-internet
for Failed to communicate with K8S Server. Please check instance security groups or http proxy setting, you should open a support TT with AWS EKS service and see why your worker can NOT communicate with EKS master through EKS-Owned ENI at that timestamp
for error when running /opt/cni/bin/aws-cni-support.sh, please see #285
for Failed to process delete request for pod ingress01-nginx-ingress-controller-tsqr6 namespace default container 55ea31155c5f2f30c0f4eb71e0c2e4501526e5e79642683d3916434c7501a368: <nil>, please take a look at /var/log/aws-routed-eni/ipamd.log.xxx at that timestamp why it failed to process delete
for Failed to process delete request for pod ingress01-nginx-ingress-controller-tsqr6 namespace default container 55ea31155c5f2f30c0f4eb71e0c2e4501526e5e79642683d3916434c7501a368:
, please take a look at /var/log/aws-routed-eni/ipamd.log.xxx at that timestamp why it failed to process delete
OK, maybe it's unrelated to this problem then but here it is:
2019-01-15T10:20:20Z [WARN] UnassignIPv4Address: Failed to find pod ingress01-nginx-ingress-controller-tsqr6 namespace default Container 55ea31155c5f2f30c0f4eb71e0c2e4501526e5e79642683d3916434c7501a368
2019-01-15T10:20:20Z [WARN] UnassignIPv4Address: Failed to find pod ingress01-nginx-ingress-controller-tsqr6 namespace default Container
you should open a support TT with AWS EKS service and see why your worker can NOT communicate with EKS master through EKS-Owned ENI at that timestamp
I can do this but we have 4 different clusters in different regions and they all have the same problem. That's why I thought it was more likely a problem with the CNI or related config than an AWS service. But OK I'll open a support ticket also.
2 more things...
Doesn't my log output above show the CNI starting multiple times? Since its printing Starting CNI Plugin v1.3.0 ... over and over?
We see the message Failed to communicate with K8S Server a lot. 241 messages over the last week But only a few pod restarts. So I think perhaps the 2 are not related? Can you tell me if that message causes the pod to restart?
Yea this looks like a race condition where aws-node (with its kubernetes client code) comes up before kube-proxy (and possibly kube-dns/coredns) has set up the kubernetes.svc.default.cluster.local DNS address. If you run the following command on your EKS cluster, you'll see that kube-proxy uses the public endpoint.
$ kubectl get po \
-n kube-system \
-l k8s-app=kube-proxy \
-o jsonpath='{range .items[0].spec.containers[0].command[2]}{@}{"\n"}{end}'
We need to update CNI to have a similar flag so it isn't dependent on DNS or kube-proxy being up.
Cool, thanks for the update @micahhausler
Any ETA for a fix?
Duplicate of https://github.com/aws/amazon-vpc-cni-k8s/issues/282
We are targeting a fix as part of v1.4 (https://github.com/aws/amazon-vpc-cni-k8s/milestone/5)
I think #282 is a different issue, reopening this one. The digging so far there seems to be an open issue in Kubernetes where ENV variables are not always set. The operator-framework has a workaround for it that we are looking into.
Fixed by #364.
JFYI: I run into the issue and after upgrading aws-node (amazon-k8s-cni) to 1.4 got the error from the pod on the same node.
➜ kubectl logs -n kube-system aws-node-xm4dd
=====Starting installing AWS-CNI =========
=====Starting amazon-k8s-agent ===========
ERROR: logging before flag.Parse: E0418 16:22:29.501108 11 memcache.go:138] couldn't get current server API group list; will keep using cached value. (Get https://172.20.0.1:443/api?timeout=32s: dial tcp 172.20.0.1:443: connect: connection timed out)
Failed to communicate with K8S Server. Please check instance security groups or http proxy setting%
Pod restart didn't help.
All other nodes were ok and still are ok. I terminated the broken node to replace with another one.
@kivagant-ba Darn, thanks for letting us know. Any chance you ran https://github.com/aws/amazon-vpc-cni-k8s/blob/master/scripts/aws-cni-support.sh ?
The node is dead but if I get the problem again, I'll run the script, thank you for sharing.
I faced this again.
➜ k logs -n kube-system aws-node-mgf7v
=====Starting installing AWS-CNI =========
=====Starting amazon-k8s-agent ===========
ERROR: logging before flag.Parse: E0603 11:32:43.025918 11 memcache.go:138] couldn't get current server API group list; will keep using cached value. (Get https://172.20.0.1:443/api?timeout=32s: dial tcp 172.20.0.1:443: connect: connection timed out)
Failed to communicate with K8S Server. Please check instance security groups or http proxy setting%
Warning FailedCreatePodSandBox 8m15s kubelet, ip-10-51-31-71.ec2.internal Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "295d63fad14d812c84ddbda1c93be193bf2bab63ac8ef8a599c5190ad885b0e8" network for pod "tables-job-1559561271-sbjmq": NetworkPlugin cni failed to set up pod "tables-job-1559561271-sbjmq_consumer" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container "295d63fad14d812c84ddbda1c93be193bf2bab63ac8ef8a599c5190ad885b0e8" network for pod "tables-job-1559561271-sbjmq": NetworkPlugin cni failed to teardown pod "tables-job-1559561271-sbjmq_consumer" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"]
Normal SandboxChanged 3m (x25 over 8m15s) kubelet, ip-10-51-31-71.ec2.internal Pod sandbox changed, it will be killed and re-created.
# /opt/cni/bin/aws-cni-support.sh
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (7) Failed to connect to localhost port 61678: Connection refused
@kivagant-ba The Failed to communicate with K8S Server error is worrying, could the masters be overloaded? How many nodes and pods are you running?
It's EKS and I don't think the masters are overloaded. The cluster is almost empty and only random nodes periodically experience the issue.
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.0", GitCommit:"641856db18352033a0d96dbc99153fa3b27298e5", GitTreeState:"clean", BuildDate:"2019-03-26T00:05:06Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.6-eks-d69f1b", GitCommit:"d69f1bf3669bf00b7f4a758e978e0e7a1e3a68f7", GitTreeState:"clean", BuildDate:"2019-02-28T20:26:10Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}
Right now we have 6 nodes on the cluster.
I had this issue (or an extremely similar one) after using eksctl to upgrade aws-node, coredns and kube-proxy in an EKS cluster going from 1.11 to 1.12 and then 1.13.
I did 2 things which ultimately fixed this issue for me in all my clusters (unfortunately I haven't narrowed it down to 1 thing yet):
image field manually (eksctl changes several other fields)aws-node 1.5.0 which as of the past few days, AWS now recommend: https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html- image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.5.3
➜ k logs -n kube-system aws-node-c7kv6
====== Installing AWS-CNI ======
====== Starting amazon-k8s-agent ======
ERROR: logging before flag.Parse: E0822 20:13:30.043898 9 memcache.go:138] couldn't get current server API group list; will keep using cached value. (Get https://172.20.0.1:443/api?timeout=32s: dial tcp 172.20.0.1:443: connect: connection timed out)
Failed to communicate with K8S Server. Please check instance security groups or http proxy setting%
Added +7 nodes at once to the cluster and none of them can start while the older nodes work fine.
@tiffanyfay, where I can sent cni logs privately?
@kivagant-ba mogren[at]amazon.com and tfj[at]amazon.com please! Thanks.
Before I send this... I'm watching for the get pod command right now and I see how pods start one after another. ... aaand that's it. It took almost an hour for oldest of them to finally start.
@kivagant-ba Also, this PR that got merged into 1.14 seems relevant https://github.com/kubernetes/kubernetes/pull/70994.
@tiffanyfay can I contact you using an official AWS support channel?
UP: I created a support request to move the conversation to more official area.
We've run into a similar issue, and for us, /etc/cni/10-aws.conflist is missing. copying that from another node allows us to continue. We've engaged AWS support. Does anyone know how that file get's put on the node?
UPDATE: looks like cni version 1.5.3 causes the issue, but if i edit the daemonset to 1.5.1, all the nodes come up
If the issue is in CNI v1.5.3, the problem should be visible in the log. The change to not copy the config until ipamd is running happened in commit https://github.com/aws/amazon-vpc-cni-k8s/commit/f9acdeb62644b68cd2c94ef2725ffcbc64fbec76 and was done to avoid scheduling pods to that node until we had ENIs and IPs available. If ipamd can not talk to the kubernetes API server, it will not copy the plugin binary or the config file and not make the node Ready.
We've run into a similar issue, and for us, /etc/cni/10-aws.conflist is missing. copying that from another node allows us to continue. We've engaged AWS support. Does anyone know how that file get's put on the node?
UPDATE: looks like cni version 1.5.3 causes the issue, but if i edit the daemonset to 1.5.1, all the nodes come up
This solution worked for me as well!
@kivagant-ba Darn, thanks for letting us know. Any chance you ran https://github.com/aws/amazon-vpc-cni-k8s/blob/master/scripts/aws-cni-support.sh ?
I've ran into the same issue and running that script on all master didn't seem to resolve the issue:
====== Installing AWS-CNI ======
====== Starting amazon-k8s-agent ======
ERROR: logging before flag.Parse: E0108 17:11:43.934767 9 memcache.go:138] couldn't get current server API group list; will keep using cached value.
Hi @aweis89, the support script does not resolve any issues. Rather, it simply gathers information useful during troubleshooting.
Can you let us know what version of the CNI plugin you are deploying and what version of Kubernetes you are using? Thanks much!
Hey @jaypipes!
I've tried with a couple different version from 1.5.0 to the current one I'm using 1.5.5 but the issue is still persistent. The K8S version is 1.14.10. The full log I'm seeing is:
====== Installing AWS-CNI ======
====== Starting amazon-k8s-agent ======
ERROR: logging before flag.Parse: E0108 17:39:47.947983 9 memcache.go:138] couldn't get current server API group list; will keep using cached value. (Get https://100.66.122.254:443/api?timeout=32s: x509: certificate is valid for 100.64.0.1, 127.0.0.1, not 100.66.122.254)
Failed to communicate with K8S Server. Please check instance security groups or http proxy setting%
Notably 100.64.0.1 seems to be the correct IP for the kube-apiserver (as determined from the ClusterIP of the kubernetes service in the default namespace).
Okay while typing this I managed to find the root cause of the issue. In my case the issue was that there was a kubernetes SVC in the kube-system namespace and one in the default namespace. Apparently the CNI pod first looks for a kubernetes SVC in the kube-system namespace (which is not maintained or updated by the controller). (Additionally, even though the kubernetes SVC in the kube-system namespace was load-balancing to the correct IPs, the cert was only valid for the IP of the kubernetes SVC in the default namespace.) Simply deleting the SVC in the kube-system namespace fixed the issue.
I'm not sure if this is a bug or intended behavior. But it should probably be noted somewhere that an SVC with the name kubernetes in the kube-system namespace will cause the CNI to fail if it's not properly set up to talk to the apiserver (even if there's a properly configured SVC in the default namespace). I think it might make sense to reverse the lookup order for the kubernetes SVC and prioritize the default namespace since that's the one designed to talk to the apiserver per the K8S docs: https://kubernetes.io/docs/tasks/administer-cluster/access-cluster-api/#directly-accessing-the-rest-api-1
Most helpful comment
Any ETA for a fix?