Amazon-vpc-cni-k8s: aws-node pod restarts without any obvious errors

Created on 9 Jan 2019  Â·  33Comments  Â·  Source: aws/amazon-vpc-cni-k8s

EKS: v1.11.5
CNI: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:1.3.0
AMI: amazon-eks-node-1.11-v20181210 (ami-0a9006fb385703b54)

E.g:

$ kubectl -n kube-system get pods -l k8s-app=aws-node
NAME             READY   STATUS    RESTARTS   AGE
aws-node-2t4n4   1/1     Running   1          20h
aws-node-45j6l   1/1     Running   1          3h
aws-node-4mstw   1/1     Running   0          22h
aws-node-95smx   1/1     Running   0          1h
aws-node-9cz4c   1/1     Running   1          1h
aws-node-9nkzt   1/1     Running   0          3h
aws-node-9pfgq   1/1     Running   0          2h
aws-node-cr5ds   1/1     Running   1          1h
aws-node-hhtrt   1/1     Running   2          1h
aws-node-j8brm   1/1     Running   0          6d
aws-node-jvvgc   1/1     Running   1          1h
aws-node-kd7ld   1/1     Running   1          22h
aws-node-mr7dh   1/1     Running   1          1h
aws-node-ntn57   1/1     Running   1          1h
aws-node-tntxp   1/1     Running   1          2h
aws-node-vk6cz   1/1     Running   0          2h
aws-node-vtpz7   1/1     Running   1          4h
aws-node-xm9wz   1/1     Running   1          1h

Even if I describe pod aws-node-hhtrt there is no events. No interesting logs from the pod either. Or the previous pod. I looked in our logging system to get all logs from this pod and there is nothing beyond the normal startup messages. But I did see from pod aws-node-9cz4c this message:

Failed to communicate with K8S Server. Please check instance security groups or http proxy setting  

I tried to run /opt/cni/bin/aws-cni-support.sh on the node with pod aws-node-hhtrt but I get this error:

[root@ip-10-0-25-4 ~]# /opt/cni/bin/aws-cni-support.sh
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1223  100  1223    0     0   1223      0  0:00:01 --:--:--  0:00:01 1194k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   912  100   912    0     0    912      0  0:00:01 --:--:--  0:00:01  890k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   106  100   106    0     0    106      0  0:00:01 --:--:--  0:00:01  103k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    83  100    83    0     0     83      0  0:00:01 --:--:--  0:00:01 83000
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    28  100    28    0     0     28      0  0:00:01 --:--:--  0:00:01 28000
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  6268  100  6268    0     0   6268      0  0:00:01 --:--:--  0:00:01 6121k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed to connect to localhost port 10255: Connection refused
bug documentation

Most helpful comment

Any ETA for a fix?

All 33 comments

Hey @liwenwu-amazon,
Do you have any idea about this?

We are now seeing it in a production cluster:

$ kubectl -n kube-system get pods
NAME                                                              READY   STATUS    RESTARTS   AGE
aws-node-52mcx                                                    1/1     Running   1          6m

$ kubectl -n kube-system describe pod aws-node-52mcx
Name:           aws-node-52mcx
Namespace:      kube-system
Node:           ip-10-1-25-141.ec2.internal/10.1.25.141
Start Time:     Tue, 15 Jan 2019 09:54:30 +0100
Labels:         controller-revision-hash=3287759043
                k8s-app=aws-node
                pod-template-generation=2
Annotations:    scheduler.alpha.kubernetes.io/critical-pod:
Status:         Running
...
    State:          Running
      Started:      Tue, 15 Jan 2019 09:55:17 +0100
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 15 Jan 2019 09:54:45 +0100
      Finished:     Tue, 15 Jan 2019 09:55:15 +0100
    Ready:          True
    Restart Count:  1
   ...
Events:
  Type    Reason                 Age                   From                                  Message
  ----    ------                 ----                  ----                                  -------
  Normal  SuccessfulMountVolume  6m54s                 kubelet, ip-10-1-25-141.ec2.internal  MountVolume.SetUp succeeded for volume "cni-bin-dir"
  Normal  SuccessfulMountVolume  6m54s                 kubelet, ip-10-1-25-141.ec2.internal  MountVolume.SetUp succeeded for volume "cni-net-dir"
  Normal  SuccessfulMountVolume  6m54s                 kubelet, ip-10-1-25-141.ec2.internal  MountVolume.SetUp succeeded for volume "dockersock"
  Normal  SuccessfulMountVolume  6m54s                 kubelet, ip-10-1-25-141.ec2.internal  MountVolume.SetUp succeeded for volume "log-dir"
  Normal  SuccessfulMountVolume  6m53s                 kubelet, ip-10-1-25-141.ec2.internal  MountVolume.SetUp succeeded for volume "aws-node-token-cnmp7"
  Normal  Pulling                6m8s (x2 over 6m52s)  kubelet, ip-10-1-25-141.ec2.internal  pulling image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:1.3.0"
  Normal  Pulled                 6m7s (x2 over 6m41s)  kubelet, ip-10-1-25-141.ec2.internal  Successfully pulled image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:1.3.0"
  Normal  Created                6m7s (x2 over 6m39s)  kubelet, ip-10-1-25-141.ec2.internal  Created container
  Normal  Started                6m7s (x2 over 6m39s)  kubelet, ip-10-1-25-141.ec2.internal  Started container

(I'm sorry if you are waiting for @liwenwu-amazon's reply) but you should collect following logs:

  $ kubectl -n kube-system logs aws-node-52mcx -p
  $ kubectl -n kube-system get event

First kubectl logs -p collects previous container logs, so if the process in previous container exited with some reason, we can see the message.
Second kubectl event collects entire event logs in kube-system namespace. You collected describe pod but it contains only Running pod's event. It is better to check entire event in the namespace.

Note, kubectl logs -p cannot collect logs if previous containers are already GCed. Also, kubectl get event might be gone in 2 hours (depends on settings on the cluster).

Thanks for the reply @nak3

I already mentioned the logs in issue text. Again, this is the message from the previous pod:

Failed to communicate with K8S Server. Please check instance security groups or http proxy setting

Getting event data is a good idea but in this scenario it produces only Normal type events:

$ kubectl -n kube-system get event | grep aws-node
13m         13m          1       aws-node.157a0690cdccdc01                                    DaemonSet                                               Normal    SuccessfulCreate         daemonset-controller                  Created pod: aws-node-b6jrg
13m         13m          1       aws-node-b6jrg.157a0690fe521337                              Pod                                                     Normal    SuccessfulMountVolume    kubelet, ip-10-1-24-47.ec2.internal   MountVolume.SetUp succeeded for volume "cni-bin-dir"
13m         13m          1       aws-node-b6jrg.157a0690fe53507e                              Pod                                                     Normal    SuccessfulMountVolume    kubelet, ip-10-1-24-47.ec2.internal   MountVolume.SetUp succeeded for volume "log-dir"
13m         13m          1       aws-node-b6jrg.157a0690fe53085f                              Pod                                                     Normal    SuccessfulMountVolume    kubelet, ip-10-1-24-47.ec2.internal   MountVolume.SetUp succeeded for volume "cni-net-dir"
13m         13m          1       aws-node-b6jrg.157a0690fe52e142                              Pod                                                     Normal    SuccessfulMountVolume    kubelet, ip-10-1-24-47.ec2.internal   MountVolume.SetUp succeeded for volume "dockersock"
13m         13m          1       aws-node-b6jrg.157a06914d3edae2                              Pod                                                     Normal    SuccessfulMountVolume    kubelet, ip-10-1-24-47.ec2.internal   MountVolume.SetUp succeeded for volume "aws-node-token-cnmp7"
13m         13m          2       aws-node-b6jrg.157a0691a62f4b04                              Pod         spec.containers{aws-node}                   Normal    Pulling                  kubelet, ip-10-1-24-47.ec2.internal   pulling image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:1.3.0"
13m         13m          2       aws-node-b6jrg.157a06945ded769c                              Pod         spec.containers{aws-node}                   Normal    Pulled                   kubelet, ip-10-1-24-47.ec2.internal   Successfully pulled image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:1.3.0"
13m         13m          2       aws-node-b6jrg.157a0694d9e113d9                              Pod         spec.containers{aws-node}                   Normal    Created                  kubelet, ip-10-1-24-47.ec2.internal   Created container
13m         13m          2       aws-node-b6jrg.157a0694debaabe7                              Pod         spec.containers{aws-node}                   Normal    Started                  kubelet, ip-10-1-24-47.ec2.internal   Started container

I see. (I'm sorry. I missed you mentioned that have already checked previous pod.). How about /var/log/aws-routed-eni/{ipamd.log,plugin.log}? Have you already checked them?

All good!

Good idea. I had a look at these logs just now and there is a few errors.

For this pod:

Containers:
  aws-node:
....
    Image:          602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:1.3.0
...
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 15 Jan 2019 11:19:46 +0100
      Finished:     Tue, 15 Jan 2019 11:20:16 +0100

Looking at /var/log/aws-routed-eni/ipamd.log around this time, nothing interesting:

2019-01-15T10:19:46Z [INFO] Starting L-IPAMD v1.3.0  ...
2019-01-15T10:19:46Z [INFO] Testing communication with server
2019-01-15T10:20:18Z [INFO] Starting L-IPAMD v1.3.0  ...
2019-01-15T10:20:18Z [INFO] Testing communication with server
2019-01-15T10:20:18Z [INFO] Running with Kubernetes cluster version: v1.11+. git version: v1.11.5-eks-6bad6d. git tree state: clean. commit: 6bad6d9c768dc0864dab48a11653aa53b5a47043. platform: linux/amd64
2019-01-15T10:20:18Z [INFO] Communication with server successful
2019-01-15T10:20:18Z [INFO] Starting Pod controller

But in /var/log/aws-routed-eni/plugin.log:

2019-01-15T10:19:48Z [INFO] Starting CNI Plugin v1.3.0  ...
2019-01-15T10:19:48Z [INFO] Received CNI add request: ContainerID(2192c36a267d314cac5d81635db4173ad79c802af65c5e3853147dc9bb1c0544) Netns(/proc/5870/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=eu01-stg01-icash-web-668bf5f99b-2k6zt;K8S_POD_INFRA_CONTAINER_ID=2192c36a267d314cac5d81635db4173ad79c802af65c5e3853147dc9bb1c0544) Path(/opt/cni/bin) argsStdinData({"cniVersion":"","name":"aws-cni","type":"aws-cni","vethPrefix":"eni"})
2019-01-15T10:19:49Z [INFO] Starting CNI Plugin v1.3.0  ...
2019-01-15T10:19:49Z [INFO] Received CNI del request: ContainerID(2192c36a267d314cac5d81635db4173ad79c802af65c5e3853147dc9bb1c0544) Netns(/proc/5870/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=eu01-stg01-icash-web-668bf5f99b-2k6zt;K8S_POD_INFRA_CONTAINER_ID=2192c36a267d314cac5d81635db4173ad79c802af65c5e3853147dc9bb1c0544) Path(/opt/cni/bin) argsStdinData({"cniVersion":"","name":"aws-cni","type":"aws-cni","vethPrefix":"eni"})
2019-01-15T10:19:49Z [INFO] Starting CNI Plugin v1.3.0  ...
2019-01-15T10:19:49Z [INFO] Received CNI add request: ContainerID(0d0d4165483b8565a2718326cfd02b593ddcd43e2234742d640ead8e90731935) Netns(/proc/6153/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=kube-system;K8S_POD_NAME=fluent-bit-np426;K8S_POD_INFRA_CONTAINER_ID=0d0d4165483b8565a2718326cfd02b593ddcd43e2234742d640ead8e90731935) Path(/opt/cni/bin) argsStdinData({"cniVersion":"","name":"aws-cni","type":"aws-cni","vethPrefix":"eni"})
2019-01-15T10:19:49Z [INFO] Starting CNI Plugin v1.3.0  ...
2019-01-15T10:19:49Z [INFO] Received CNI del request: ContainerID(0d0d4165483b8565a2718326cfd02b593ddcd43e2234742d640ead8e90731935) Netns(/proc/6153/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=kube-system;K8S_POD_NAME=fluent-bit-np426;K8S_POD_INFRA_CONTAINER_ID=0d0d4165483b8565a2718326cfd02b593ddcd43e2234742d640ead8e90731935) Path(/opt/cni/bin) argsStdinData({"cniVersion":"","name":"aws-cni","type":"aws-cni","vethPrefix":"eni"})
2019-01-15T10:19:50Z [INFO] Starting CNI Plugin v1.3.0  ...
2019-01-15T10:19:50Z [INFO] Received CNI del request: ContainerID(2192c36a267d314cac5d81635db4173ad79c802af65c5e3853147dc9bb1c0544) Netns() IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=eu01-stg01-icash-web-668bf5f99b-2k6zt;K8S_POD_INFRA_CONTAINER_ID=2192c36a267d314cac5d81635db4173ad79c802af65c5e3853147dc9bb1c0544) Path(/opt/cni/bin) argsStdinData({"cniVersion":"","name":"aws-cni","type":"aws-cni","vethPrefix":"eni"})
2019-01-15T10:19:50Z [ERROR] Error received from DelNetwork grpc call for pod eu01-stg01-icash-web-668bf5f99b-2k6zt namespace default container 2192c36a267d314cac5d81635db4173ad79c802af65c5e3853147dc9bb1c0544: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"

That last error repeats over and over until:

2019-01-15T10:20:20Z [INFO] Starting CNI Plugin v1.3.0  ...
2019-01-15T10:20:20Z [INFO] Received CNI del request: ContainerID(55ea31155c5f2f30c0f4eb71e0c2e4501526e5e79642683d3916434c7501a368) Netns() IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=ingress01-nginx-ingress-controller-tsqr6;K8S_POD_INFRA_CONTAINER_ID=55ea31155c5f2f30c0f4eb71e0c2e4501526e5e79642683d3916434c7501a368) Path(/opt/cni/bin) argsStdinData({"cniVersion":"","name":"aws-cni","type":"aws-cni","vethPrefix":"eni"})
2019-01-15T10:20:20Z [ERROR] Failed to process delete request for pod ingress01-nginx-ingress-controller-tsqr6 namespace default container 55ea31155c5f2f30c0f4eb71e0c2e4501526e5e79642683d3916434c7501a368: <nil>
2019-01-15T10:20:20Z [INFO] Starting CNI Plugin v1.3.0  ...

Both error repeat for a few more times and then there is no more in this log.

I hope that helps!

@max-rocket-internet
for Failed to communicate with K8S Server. Please check instance security groups or http proxy setting, you should open a support TT with AWS EKS service and see why your worker can NOT communicate with EKS master through EKS-Owned ENI at that timestamp

for error when running /opt/cni/bin/aws-cni-support.sh, please see #285

for Failed to process delete request for pod ingress01-nginx-ingress-controller-tsqr6 namespace default container 55ea31155c5f2f30c0f4eb71e0c2e4501526e5e79642683d3916434c7501a368: <nil>, please take a look at /var/log/aws-routed-eni/ipamd.log.xxx at that timestamp why it failed to process delete

for Failed to process delete request for pod ingress01-nginx-ingress-controller-tsqr6 namespace default container 55ea31155c5f2f30c0f4eb71e0c2e4501526e5e79642683d3916434c7501a368: , please take a look at /var/log/aws-routed-eni/ipamd.log.xxx at that timestamp why it failed to process delete

OK, maybe it's unrelated to this problem then but here it is:

2019-01-15T10:20:20Z [WARN] UnassignIPv4Address: Failed to find pod ingress01-nginx-ingress-controller-tsqr6 namespace default Container 55ea31155c5f2f30c0f4eb71e0c2e4501526e5e79642683d3916434c7501a368
2019-01-15T10:20:20Z [WARN] UnassignIPv4Address: Failed to find pod ingress01-nginx-ingress-controller-tsqr6 namespace default Container

you should open a support TT with AWS EKS service and see why your worker can NOT communicate with EKS master through EKS-Owned ENI at that timestamp

I can do this but we have 4 different clusters in different regions and they all have the same problem. That's why I thought it was more likely a problem with the CNI or related config than an AWS service. But OK I'll open a support ticket also.

2 more things...

  1. Doesn't my log output above show the CNI starting multiple times? Since its printing Starting CNI Plugin v1.3.0 ... over and over?

  2. We see the message Failed to communicate with K8S Server a lot. 241 messages over the last week But only a few pod restarts. So I think perhaps the 2 are not related? Can you tell me if that message causes the pod to restart?

Yea this looks like a race condition where aws-node (with its kubernetes client code) comes up before kube-proxy (and possibly kube-dns/coredns) has set up the kubernetes.svc.default.cluster.local DNS address. If you run the following command on your EKS cluster, you'll see that kube-proxy uses the public endpoint.

$ kubectl get po \
  -n kube-system \
  -l k8s-app=kube-proxy \
  -o jsonpath='{range .items[0].spec.containers[0].command[2]}{@}{"\n"}{end}'

We need to update CNI to have a similar flag so it isn't dependent on DNS or kube-proxy being up.

Cool, thanks for the update @micahhausler

Any ETA for a fix?

Duplicate of https://github.com/aws/amazon-vpc-cni-k8s/issues/282
We are targeting a fix as part of v1.4 (https://github.com/aws/amazon-vpc-cni-k8s/milestone/5)

I think #282 is a different issue, reopening this one. The digging so far there seems to be an open issue in Kubernetes where ENV variables are not always set. The operator-framework has a workaround for it that we are looking into.

Fixed by #364.

JFYI: I run into the issue and after upgrading aws-node (amazon-k8s-cni) to 1.4 got the error from the pod on the same node.

➜ kubectl logs -n kube-system aws-node-xm4dd
=====Starting installing AWS-CNI =========
=====Starting amazon-k8s-agent ===========
ERROR: logging before flag.Parse: E0418 16:22:29.501108      11 memcache.go:138] couldn't get current server API group list; will keep using cached value. (Get https://172.20.0.1:443/api?timeout=32s: dial tcp 172.20.0.1:443: connect: connection timed out)
Failed to communicate with K8S Server. Please check instance security groups or http proxy setting%

Pod restart didn't help.

All other nodes were ok and still are ok. I terminated the broken node to replace with another one.

@kivagant-ba Darn, thanks for letting us know. Any chance you ran https://github.com/aws/amazon-vpc-cni-k8s/blob/master/scripts/aws-cni-support.sh ?

The node is dead but if I get the problem again, I'll run the script, thank you for sharing.

I faced this again.

➜ k logs -n kube-system aws-node-mgf7v
=====Starting installing AWS-CNI =========
=====Starting amazon-k8s-agent ===========
ERROR: logging before flag.Parse: E0603 11:32:43.025918      11 memcache.go:138] couldn't get current server API group list; will keep using cached value. (Get https://172.20.0.1:443/api?timeout=32s: dial tcp 172.20.0.1:443: connect: connection timed out)
Failed to communicate with K8S Server. Please check instance security groups or http proxy setting%
  Warning  FailedCreatePodSandBox  8m15s                kubelet, ip-10-51-31-71.ec2.internal  Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "295d63fad14d812c84ddbda1c93be193bf2bab63ac8ef8a599c5190ad885b0e8" network for pod "tables-job-1559561271-sbjmq": NetworkPlugin cni failed to set up pod "tables-job-1559561271-sbjmq_consumer" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container "295d63fad14d812c84ddbda1c93be193bf2bab63ac8ef8a599c5190ad885b0e8" network for pod "tables-job-1559561271-sbjmq": NetworkPlugin cni failed to teardown pod "tables-job-1559561271-sbjmq_consumer" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"]
  Normal   SandboxChanged          3m (x25 over 8m15s)  kubelet, ip-10-51-31-71.ec2.internal  Pod sandbox changed, it will be killed and re-created.
# /opt/cni/bin/aws-cni-support.sh
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed to connect to localhost port 61678: Connection refused

@kivagant-ba The Failed to communicate with K8S Server error is worrying, could the masters be overloaded? How many nodes and pods are you running?

It's EKS and I don't think the masters are overloaded. The cluster is almost empty and only random nodes periodically experience the issue.

Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.0", GitCommit:"641856db18352033a0d96dbc99153fa3b27298e5", GitTreeState:"clean", BuildDate:"2019-03-26T00:05:06Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.6-eks-d69f1b", GitCommit:"d69f1bf3669bf00b7f4a758e978e0e7a1e3a68f7", GitTreeState:"clean", BuildDate:"2019-02-28T20:26:10Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}

Right now we have 6 nodes on the cluster.

I had this issue (or an extremely similar one) after using eksctl to upgrade aws-node, coredns and kube-proxy in an EKS cluster going from 1.11 to 1.12 and then 1.13.

I did 2 things which ultimately fixed this issue for me in all my clusters (unfortunately I haven't narrowed it down to 1 thing yet):

  1. Restored my manifests for the above 3 Deployments/Daemonsets and then just patched the image field manually (eksctl changes several other fields)
  2. Upgraded to aws-node 1.5.0 which as of the past few days, AWS now recommend: https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html
- image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.5.3
➜ k logs -n kube-system aws-node-c7kv6
====== Installing AWS-CNI ======
====== Starting amazon-k8s-agent ======
ERROR: logging before flag.Parse: E0822 20:13:30.043898       9 memcache.go:138] couldn't get current server API group list; will keep using cached value. (Get https://172.20.0.1:443/api?timeout=32s: dial tcp 172.20.0.1:443: connect: connection timed out)
Failed to communicate with K8S Server. Please check instance security groups or http proxy setting%

Added +7 nodes at once to the cluster and none of them can start while the older nodes work fine.

@tiffanyfay, where I can sent cni logs privately?

@kivagant-ba mogren[at]amazon.com and tfj[at]amazon.com please! Thanks.

Before I send this... I'm watching for the get pod command right now and I see how pods start one after another. ... aaand that's it. It took almost an hour for oldest of them to finally start.

@kivagant-ba Also, this PR that got merged into 1.14 seems relevant https://github.com/kubernetes/kubernetes/pull/70994.

@tiffanyfay can I contact you using an official AWS support channel?

UP: I created a support request to move the conversation to more official area.

We've run into a similar issue, and for us, /etc/cni/10-aws.conflist is missing. copying that from another node allows us to continue. We've engaged AWS support. Does anyone know how that file get's put on the node?

UPDATE: looks like cni version 1.5.3 causes the issue, but if i edit the daemonset to 1.5.1, all the nodes come up

If the issue is in CNI v1.5.3, the problem should be visible in the log. The change to not copy the config until ipamd is running happened in commit https://github.com/aws/amazon-vpc-cni-k8s/commit/f9acdeb62644b68cd2c94ef2725ffcbc64fbec76 and was done to avoid scheduling pods to that node until we had ENIs and IPs available. If ipamd can not talk to the kubernetes API server, it will not copy the plugin binary or the config file and not make the node Ready.

We've run into a similar issue, and for us, /etc/cni/10-aws.conflist is missing. copying that from another node allows us to continue. We've engaged AWS support. Does anyone know how that file get's put on the node?

UPDATE: looks like cni version 1.5.3 causes the issue, but if i edit the daemonset to 1.5.1, all the nodes come up

This solution worked for me as well!

@kivagant-ba Darn, thanks for letting us know. Any chance you ran https://github.com/aws/amazon-vpc-cni-k8s/blob/master/scripts/aws-cni-support.sh ?

I've ran into the same issue and running that script on all master didn't seem to resolve the issue:
====== Installing AWS-CNI ====== ====== Starting amazon-k8s-agent ====== ERROR: logging before flag.Parse: E0108 17:11:43.934767 9 memcache.go:138] couldn't get current server API group list; will keep using cached value.

Hi @aweis89, the support script does not resolve any issues. Rather, it simply gathers information useful during troubleshooting.

Can you let us know what version of the CNI plugin you are deploying and what version of Kubernetes you are using? Thanks much!

Hey @jaypipes!

I've tried with a couple different version from 1.5.0 to the current one I'm using 1.5.5 but the issue is still persistent. The K8S version is 1.14.10. The full log I'm seeing is:
====== Installing AWS-CNI ====== ====== Starting amazon-k8s-agent ====== ERROR: logging before flag.Parse: E0108 17:39:47.947983 9 memcache.go:138] couldn't get current server API group list; will keep using cached value. (Get https://100.66.122.254:443/api?timeout=32s: x509: certificate is valid for 100.64.0.1, 127.0.0.1, not 100.66.122.254) Failed to communicate with K8S Server. Please check instance security groups or http proxy setting%
Notably 100.64.0.1 seems to be the correct IP for the kube-apiserver (as determined from the ClusterIP of the kubernetes service in the default namespace).

Okay while typing this I managed to find the root cause of the issue. In my case the issue was that there was a kubernetes SVC in the kube-system namespace and one in the default namespace. Apparently the CNI pod first looks for a kubernetes SVC in the kube-system namespace (which is not maintained or updated by the controller). (Additionally, even though the kubernetes SVC in the kube-system namespace was load-balancing to the correct IPs, the cert was only valid for the IP of the kubernetes SVC in the default namespace.) Simply deleting the SVC in the kube-system namespace fixed the issue.

I'm not sure if this is a bug or intended behavior. But it should probably be noted somewhere that an SVC with the name kubernetes in the kube-system namespace will cause the CNI to fail if it's not properly set up to talk to the apiserver (even if there's a properly configured SVC in the default namespace). I think it might make sense to reverse the lookup order for the kubernetes SVC and prioritize the default namespace since that's the one designed to talk to the apiserver per the K8S docs: https://kubernetes.io/docs/tasks/administer-cluster/access-cluster-api/#directly-accessing-the-rest-api-1

Was this page helpful?
0 / 5 - 0 ratings