I updated my EKS cluster to 1.15.10 and that worked.
Then I tried to update the CNI from v1.5.5 to v1.6.0 on my two test nodes. Since it is a DaemonSet, one aws-node pod came up fine and the other failed with the following error:
kubectl logs -f pod/aws-node-cjqwm -nkube-system
starting IPAM daemon in background ... ok.
checking for IPAM connectivity ... failed.
timed out waiting for IPAM daemon to start.
I deleted the pod, but the replacement pod hits the same error:
kubectl get po --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system aws-node-22mnl 1/1 Running 0 15m
kube-system aws-node-h6nrx 0/1 Running 3 3m9s
More details:
kubectl describe po aws-node-h6nrx -nkube-system
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m49s default-scheduler Successfully assigned kube-system/aws-node-h6nrx to ip-10-1-46-183.eu-central-1.compute.internal
Warning Unhealthy 3m33s kubelet, ip-10-1-46-183.eu-central-1.compute.internal Readiness probe errored: rpc error: code = Unknown desc = container not running (c542f67fbf22592a6840faa98cd3e9f1c774efeead2a6068319b0488570a903f)
Warning Unhealthy 2m39s kubelet, ip-10-1-46-183.eu-central-1.compute.internal Liveness probe failed: timeout: failed to connect service ":50051" within 1s
Normal Pulling 2m18s (x4 over 4m48s) kubelet, ip-10-1-46-183.eu-central-1.compute.internal Pulling image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.6.0"
Normal Pulled 2m17s (x4 over 4m47s) kubelet, ip-10-1-46-183.eu-central-1.compute.internal Successfully pulled image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.6.0"
Normal Created 2m17s (x4 over 4m47s) kubelet, ip-10-1-46-183.eu-central-1.compute.internal Created container aws-node
Normal Started 2m17s (x4 over 4m47s) kubelet, ip-10-1-46-183.eu-central-1.compute.internal Started container aws-node
Warning Unhealthy 100s kubelet, ip-10-1-46-183.eu-central-1.compute.internal Liveness probe errored: rpc error: code = Unknown desc = container not running (a51a934a7d0867d500c7f9533d995ae7605ba7f80ed19186a513dd2fe62b0d88)
Warning BackOff 90s (x6 over 3m32s) kubelet, ip-10-1-46-183.eu-central-1.compute.internal Back-off restarting failed container
@laghao How was the CNI updated? If only the image tag was updated, the issue could be that the required /var/run/dockershim.sock was not mounted?
In your logs I see
Liveness probe failed: timeout: failed to connect service ":50051" within 1s
What is your initialDelaySeconds setting? The initial startup can take quite a while, since ipamd is first trying to talk to the API server, then to EC2 API. If any throttling is happening, or some retry, this might delay the initialization long enough for the liveness probe to fail.
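Both questions above (is /var/run/dockershim.sock still mounted, and what is initialDelaySeconds set to) can be answered from the DaemonSet manifest. A minimal sketch, assuming kubectl access to the cluster; the helper names mounts_dockershim and probe_delays are made up for illustration and are not part of any tool:

```shell
# Hypothetical helper: reads a manifest on stdin and succeeds if it
# mentions the dockershim socket path anywhere.
mounts_dockershim() {
  grep -qF '/var/run/dockershim.sock'
}

# Hypothetical helper: prints every initialDelaySeconds value found in a
# YAML manifest on stdin (liveness and readiness probes alike).
probe_delays() {
  awk '$1 == "initialDelaySeconds:" { print $2 }'
}

# Usage against a live cluster:
#   kubectl get ds aws-node -n kube-system -o yaml | mounts_dockershim \
#     && echo "dockershim socket is mounted"
#   kubectl get ds aws-node -n kube-system -o yaml | probe_delays
```

Keeping the helpers as stdin filters means they can be tried on a saved manifest before running them against a cluster.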
Tried to increase initialDelaySeconds but it didn't work. Container fails with the following status:
Last State: Terminated
Reason: Error
Exit Code: 1
So the container is not being killed because the liveness probe failed; it exits on its own.
I updated the CNI using the helm chart aws-vpc-cni.
In parallel, I spun up another EKS cluster with Terraform, directly on 1.15.10 and CNI v1.6.0, and that worked smoothly.
The direct upgrade looks broken somehow.
I tried using the helm chart to upgrade from v1.5.5 to v1.6.0 and it took my aws-node pods around 40 to 45 seconds to become ready, no restarts. Will keep trying to reproduce this issue.
@laghao what's your kubelet version on worker nodes? If you are using EKS AMI to launch your worker nodes, can you give us the AMI ID as well?
Hi everyone, I am upgrading from v1.5.5 to v1.6.0 and the CNI pod fails to start:
starting IPAM daemon in background ... ok.
checking for IPAM connectivity ... failed.
timed out waiting for IPAM daemon to start.
The dockershim mount exists on the pods, and I verified that /var/run/dockershim.sock exists on the hosts.
Setting initialDelaySeconds to 90 didn't help.
I also checked the dockershim mount with docker inspect aws-host on the hosts:
{
"Type": "bind",
"Source": "/var/run/dockershim.sock",
"Destination": "/var/run/dockershim.sock",
"Mode": "",
"RW": true,
"Propagation": "rprivate"
}
Hi @hahasheminejad! Is there any chance you might be able to run the aws-cni-support.sh script before and after the upgrade and send the results to one of us? Either mogren@ or jaypipes@ amazon...
I got the same problem today after updating using eksctl:
eksctl update cluster
eksctl utils update-kube-proxy
eksctl utils update-aws-node (updated to 1.6)
eksctl utils update-coredns
Suddenly new containers could not start (timeout from cni). So I created a new nodegroup. Nodes would not become Ready and the AWS CNI was logging this error "timed out waiting for IPAM daemon to start."
~Rolling back to aws node 1.5.7 seems to fix the issue for now.~
EDIT:
I can't seem to get any pods running (except for aws-node, kube-proxy and calico-node) due to not being able to be assigned IP addresses on this cluster anymore, even after rolling back to 1.5.5. There aren't any obvious errors in the logs for aws-node either.
I figured out my issue; hopefully this will help someone else who finds this via Google. The aws-node serviceaccount was using a service account IAM role to provide access to the ENI EC2 API (as described in https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts-cni-walkthrough.html) instead of attaching the AmazonEKS_CNI_Policy to the node role.
Upgrading aws-node via eksctl overwrote the serviceaccount definition and removed the role annotation.
I fixed this by removing and re-adding the iamserviceaccount using eksctl:
eksctl delete iamserviceaccount -f eksctl-cluster.yml --include kube-system/aws-node --approve
eksctl create iamserviceaccount -f eksctl-cluster.yml --approve
I have reported this to eksctl here
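To check whether the service account still carries its role annotation after an upgrade, something like the following may help. This is only a sketch: it assumes the annotation key is eks.amazonaws.com/role-arn (the key used by IAM roles for service accounts), and has_role_annotation is a hypothetical helper name.

```shell
# Hypothetical helper: reads a ServiceAccount manifest on stdin and
# succeeds if it contains the IAM role annotation key.
has_role_annotation() {
  grep -qF 'eks.amazonaws.com/role-arn:'
}

# Usage against a live cluster:
#   kubectl get serviceaccount aws-node -n kube-system -o yaml \
#     | has_role_annotation || echo "role annotation is missing"
```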
@hahasheminejad I noticed in your logs that your worker node was completely overwhelmed and pods constantly getting OOM killed:
grep "killed as a result of limit" messages | wc -l
19684
Did you see this issue on other nodes as well?
Hi, we are facing the same issue while upgrading from Kubernetes 1.13 (eks.9) to Kubernetes 1.14 (eks.9) and using CNI v1.6.1 (from CNI v1.5.5). The dockershim socket is mounted.
We tried following steps:
Removed and recreated the service account - ( initially SA is created by eksctl)
Removed the annotation in service account and readded it manually.
Restarted the aws-node pods.
Manual kubectl apply from 1.5.5 to 1.6.1
Logs:
Starting IPAM daemon in the background ... ok.
Checking for IPAM connectivity ... failed.
Timed out waiting for IPAM daemon to start
Please let us know if there is any workaround, or when a fix can be expected.
Hello, as with the comment above, we are also seeing the same issue updating vpc-cni from v1.5.5 to v1.6.1.
We have 4 clusters (which are theoretically all configured the same way).
All on v1.15.11-eks-af3caf.
All worker nodes on the same AMI: 1.15.10-20200228.
DNS and kube-proxy versions are up to date, in line with the table in the official AWS guide, across all 4 clusters:
https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html
CNI VPC plugin has been updated successfully across 3 clusters.
In the last cluster the DaemonSet rolled out successfully to 6/7 nodes.
On the last node the pod crash looped due to failing health checks. I bounced it and it crash looped again.
I consistently get this error in the pod logs, as others have pointed out:
Starting IPAM daemon in the background ... ok.
Checking for IPAM connectivity ... failed.
Timed out waiting for IPAM daemon to start:
There are other workloads scheduled already on this node.
This meant I needed to roll back to v1.5.5 in this cluster only.
I'm looking at resources and attempting to triage, and may raise this with AWS Support separately, but I'm adding it here as another data point and to keep the issue fresh.
Thanks for reporting the issue @njgibbon! Did you run the aws-cni-support.sh script on the node to gather the log data? It would be great if we could see why the pod failed to start. The logs should be in /var/log/aws-routed-eni/ on the worker node. We have seen issues related to kube-proxy before.
Also, if rolling back, would v1.5.7 be an option?
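When going through the logs in /var/log/aws-routed-eni/ by hand, filtering for error-level entries can narrow things down. A rough sketch, with assumptions: the file name ipamd.log and the two log formats matched here (JSON lines with a "level" field, or a plain-text [ERROR] marker) may not match every CNI version, and error_entries is a hypothetical helper name.

```shell
# Hypothetical helper: keeps only lines on stdin that look like
# error-level log entries, in either JSON or bracketed plain-text form.
error_entries() {
  grep -Ei '"level": ?"error"|\[ERROR\]'
}

# Usage on the worker node (file name is an assumption):
#   sudo cat /var/log/aws-routed-eni/ipamd.log | error_entries | tail -n 20
```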
I've faced similar issues after upgrading to EKS 1.16 and upgrading VPC CNI plugin to 1.6.1 and the latest kube proxy 1.16.8.
After troubleshooting this with AWS Support, what worked for me on EKS 1.16 was rolling back to the component versions from our previous EKS 1.15 configuration, i.e. AWS VPC CNI plugin 1.5.7 and kube-proxy 1.15.11.
Please note that terminating your existing EC2 instances might (or will?) be needed in order to get back to a running state.
Out of the 1.16 upgrade "prerequisites", the only mandatory one, if you were already on 1.15, is to make sure you have all yaml files converted to the new API (v1) version. No more betas. https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html#1-16-prequisites
You might want to hold off any other changes for now until AWS further communicates on this issue.
For kube-proxy on 1.16, make sure that --resource-container is not in the spec. See Kubernetes 1.16 for details.
This was very hard to track down, but @mogren's comment was what solved it for me. My cluster was created ~2 years ago and kube-proxy was still using the --resource-container flag. After the 1.16 upgrade I started seeing this "cni config uninitialized" error and all the nodes got stuck in the NotReady state.
I tried to downgrade the CNI plugin back to 1.5.x, but that also didn't solve the problem. I had to manually edit my kube-proxy daemonset ($ kubectl edit ds kube-proxy -n kube-system) to remove the flag.
I think it'd be great to mention that in the upgrade guide.
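For anyone wanting to check for the flag before upgrading, a quick test along these lines might work. It is a sketch that only detects the flag; removing it is still done by hand with kubectl edit as described above. The helper name uses_resource_container is hypothetical.

```shell
# Hypothetical helper: reads a DaemonSet manifest on stdin and succeeds
# if the kube-proxy command line still contains --resource-container.
uses_resource_container() {
  grep -q -- '--resource-container'
}

# Usage against a live cluster:
#   kubectl get ds kube-proxy -n kube-system -o yaml | uses_resource_container \
#     && echo "flag still present, edit the kube-proxy daemonset to remove it"
```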
@brianstorti We've updated the doc here https://github.com/awsdocs/amazon-eks-user-guide/pull/125, should go live soon.
@mogren Hi, I experienced almost the same issue as @njgibbon.
I am running multiple clusters, but the upgrade failed on only one node in one cluster.
I rolled aws-node back to v1.5.7 after I found that the upgrade had failed.
I sent the output of aws-cni-support.sh from the affected node to mogren's email.
Hope it helps.
@spacebarley Hi! Thanks for the logs, they made it clear that you ran into another issue:
{
"level": "error",
"ts": "2020-05-18T08:52:22.632Z",
"caller": "aws-k8s-agent/main.go:30",
"msg": "Initialization failure: failed to allocate one IP addresses on ENI eni-0aaaafcedcb7b0940e,
err: allocate IP address: failed to allocate a private IP address:
InsufficientFreeAddressesInSubnet: The specified subnet does not have enough free addresses to satisfy the request.
status code: 400,
request id: 0xxxxxx-a5e4-4a47-b76a-0360e364d5f1"
}
The subnet is out of IPs. First, since you were running the v1.5.x CNI earlier, check for leaked ENIs in your account. They will be marked as Available (blue dot) in the AWS Console, and have a tag, node.k8s.amazonaws.com/instance_id, that shows what instance they once belonged to.
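The same check can be done from the command line instead of the console. A sketch, assuming the AWS CLI is configured for the affected account and region; list_leaked_enis and count_enis are hypothetical helper names, and the filter simply selects unattached interfaces that carry the tag key mentioned above:

```shell
# Hypothetical helper: lists IDs of network interfaces that are in the
# "available" (unattached) state and have the CNI instance tag.
list_leaked_enis() {
  aws ec2 describe-network-interfaces \
    --filters Name=status,Values=available \
              Name=tag-key,Values=node.k8s.amazonaws.com/instance_id \
    --query 'NetworkInterfaces[].NetworkInterfaceId' \
    --output text
}

# Hypothetical helper: counts ENI IDs in whitespace-separated text on stdin.
count_enis() {
  tr -s ' \t' '\n' | grep -c '^eni-'
}

# Usage:
#   list_leaked_enis | count_enis
```

Review the listed interfaces before deleting anything, since an available ENI may also belong to a node that is still starting up.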
Closing this issue since it has turned into a bucket of multiple upgrade issues. The things we have seen so far:
eksctl: upgrading aws-node overwrote the service account definition and removed the role annotation
kube-proxy on Kubernetes 1.16 no longer supports the --resource-container flag
Please open a new issue if you find any new problem.