Amazon-vpc-cni-k8s: Not releasing old ENIs

Created on 3 Jul 2018 · 51 Comments · Source: aws/amazon-vpc-cni-k8s

It looks like old ENIs are not being deleted by IPAM. It keeps creating ENIs until it hits the account limit.

New pods then cannot start and fail with the error "Failed create pod sandbox." Manually deleting "Available" ENIs allows the controller to create a new ENI and schedule the pod as normal.

Number of nodes is 25. Not the highest we have had.

Potentially related to #18

I am happy to help with debugging/finding the root cause.

priority/P1

All 51 comments

Additionally it looks like the aws-node controller hits API request ratelimits.

Error code: Client.RequestLimitExceeded

Deleting aws-node on this node resolved the issue. Something is definitely wonky.

I'm seeing this issue, sporadically, with some nodes. I agree deleting aws-node resolves the issue. Related: https://github.com/aws/amazon-vpc-cni-k8s/issues/59

@vpm-bradleyhession, the troubleshooting guide provides guidelines on how to troubleshoot CNI issues at the cluster level.

ipamD should release an old ENI if the number of Pods running on the node drops below the threshold.

You can use cni-metrics-helper to view aggregated ENIs and IPs information at the cluster level.

kubectl apply -f cni_metrics_helper.yaml

You can then view cluster-level ipamD statistics:

kubectl logs cni-metrics-helper-xxxxxx -n kube-system
...
I0705 15:52:30.539478       7 metrics.go:250] Processing metric: total_ip_addresses
I0705 15:52:30.539482       7 metrics.go:350] Produce GAUGE metrics: assignIPAddresses, value: 23.000000
I0705 15:52:30.539491       7 metrics.go:350] Produce GAUGE metrics: eniAllocated, value: 8.000000
I0705 15:52:30.539496       7 metrics.go:350] Produce GAUGE metrics: eniMaxAvailable, value: 15.000000
I0705 15:52:30.539500       7 metrics.go:350] Produce GAUGE metrics: maxIPAddresses, value: 735.000000
I0705 15:52:30.539506       7 metrics.go:350] Produce GAUGE metrics: totalIPAddresses, value: 387.000000
I0705 15:52:30.539510       7 metrics.go:340] Produce COUNTER metrics: ipamdErr, value: 0.000000
I0705 15:52:30.539516       7 metrics.go:350] Produce GAUGE metrics: ipamdActionInProgress, value: 0.000000
I0705 15:52:30.539521       7 metrics.go:340] Produce COUNTER metrics: reconcileCount, value: 0.000000
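The gauges in this output can be reduced to a single utilization figure, which makes it easier to spot nodes holding far more IPs than they use. The following is only a sketch: the function names are made up for illustration, and the regular expression assumes the "Produce GAUGE/COUNTER metrics" line format shown above.

```python
import re

# Matches lines such as:
#   I0705 ... metrics.go:350] Produce GAUGE metrics: assignIPAddresses, value: 23.000000
METRIC_RE = re.compile(r"Produce (GAUGE|COUNTER) metrics: (\w+), value: ([0-9.]+)")


def parse_metrics(log_text):
    """Return {metric_name: value} for every 'Produce ... metrics' line."""
    metrics = {}
    for line in log_text.splitlines():
        match = METRIC_RE.search(line)
        if match:
            metrics[match.group(2)] = float(match.group(3))
    return metrics


def ip_utilization(metrics):
    """Fraction of allocated IPs that are actually assigned to pods."""
    total = metrics.get("totalIPAddresses", 0.0)
    return metrics.get("assignIPAddresses", 0.0) / total if total else 0.0


# Three lines copied from the cni-metrics-helper output above.
SAMPLE = """\
I0705 15:52:30.539482       7 metrics.go:350] Produce GAUGE metrics: assignIPAddresses, value: 23.000000
I0705 15:52:30.539491       7 metrics.go:350] Produce GAUGE metrics: eniAllocated, value: 8.000000
I0705 15:52:30.539506       7 metrics.go:350] Produce GAUGE metrics: totalIPAddresses, value: 387.000000
"""

if __name__ == "__main__":
    m = parse_metrics(SAMPLE)
    print(f"{m['assignIPAddresses']:.0f} of {m['totalIPAddresses']:.0f} IPs assigned "
          f"({ip_utilization(m):.1%})")
```

With the sample values, 23 of the 387 allocated addresses are in use by pods, so most of the pool is warm capacity rather than active assignments.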

@vpm-bradleyhession , @tomfotherby if you see issue #18 or #59 on some nodes, can you run /opt/cni/bin/aws-cni-support.sh and collect node level debugging information? You can send the node level debugging info directly to me ([email protected])

The problem is that 90% of the requests to the AWS API are rate limited. The number of ENIs that are hanging around far outweighs the number that are _actually_ deleted.

@vpm-bradleyhession is it possible that your "DeleteNetworkInterface" API throttling is caused by some other tool, or by someone manually deleting ENIs at the same time as ipamD?

If ipamD is being throttled, it currently relies on the AWS SDK's exponential backoff, and you should see ipamdActionInProgress at a non-zero value:

kubectl logs cni-metrics-helper-xxxxxx -n kube-system
...
I0705 15:52:30.539482       7 metrics.go:350] Produce GAUGE metrics: assignIPAddresses, value: 23.000000
I0705 15:52:30.539491       7 metrics.go:350] Produce GAUGE metrics: eniAllocated, value: 8.000000
I0705 15:52:30.539496       7 metrics.go:350] Produce GAUGE metrics: eniMaxAvailable, value: 15.000000
I0705 15:52:30.539500       7 metrics.go:350] Produce GAUGE metrics: maxIPAddresses, value: 735.000000
I0705 15:52:30.539506       7 metrics.go:350] Produce GAUGE metrics: totalIPAddresses, value: 387.000000
I0705 15:52:30.539510       7 metrics.go:340] Produce COUNTER metrics: ipamdErr, value: 0.000000
I0705 15:52:30.539516       7 metrics.go:350] Produce GAUGE metrics: ipamdActionInProgress, value: 0.000000 <---- would be non-zero while ipamD is throttled
I0705 15:52:30.539521       7 metrics.go:340] Produce COUNTER metrics: reconcileCount, value: 0.000000
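For readers unfamiliar with the backoff behaviour mentioned above, here is a minimal sketch of capped exponential backoff with jitter. It is not the AWS SDK's implementation: `ThrottledError`, the delay values and the function names are assumptions chosen for illustration.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for an EC2 RequestLimitExceeded response."""


def backoff_delays(base=1.0, cap=10.0, attempts=5):
    """Yield capped exponential delays: base, 2*base, 4*base, ... up to cap."""
    for attempt in range(attempts):
        yield min(cap, base * (2 ** attempt))


def call_with_backoff(fn, base=1.0, cap=10.0, attempts=5, sleep=time.sleep):
    """Call fn(), retrying on ThrottledError with randomized backoff."""
    for delay in backoff_delays(base, cap, attempts):
        try:
            return fn()
        except ThrottledError:
            # Sleep a random amount up to the current delay so that many
            # clients do not retry in lockstep.
            sleep(random.uniform(0, delay))
    # Final attempt: let any exception propagate to the caller.
    return fn()
```

While a caller is sleeping between retries the operation is still in flight, which is why a throttled ipamD would be expected to report an action in progress.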

Hm, I don't think so. They were being deleted up until a few days ago (with no changes that would affect it). Then one day it hit the ENI limit for our account.

I'm going to monitor it to catch when this happens again. Although, currently I'm getting the following when using the metrics helper:

cni_metrics.go:102] Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-h99r5:61678)
E0705 16:08:07.082577 6 metrics.go:407] Failed to getMetricsFromTarget: the server is currently unable to handle the request (get pods aws-node-h99r5:61678)

@vpm-bradleyhession is it possible you can deploy cni_metrics_helper and share the output of kubectl logs cni-metrics-helper-xxxxx -n kube-system?

@vpm-bradleyhession , what version is your CNI? It needs to be 1.0.0

kubectl get ds aws-node -n kube-system -o yaml | grep image
        image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:1.0.0

Ah, this might be it. The image is currently:
602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:0.1.1

I should mention this is a kops cluster, so the CNI was provisioned by kops. Is this safe to upgrade?

We should open an issue w/ Kops to get the version bumped upstream also. Good spot on the version.

@vpm-bradleyhession Please open an issue w/ Kops. 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:1.0.0 has fixed a few issues and improved how it uses the AWS APIs.

So is upgrading from 0.1.1 -> 1.0.0 safe to do?

@vpm-bradleyhession Yes, it is safe to upgrade from 0.1.1 -> 1.0.0. The only thing you need to make sure of is that your worker node can reach the K8S apiServer on port 443. In 1.0.0, ipamD needs to communicate with the K8S apiServer. Some deployments have an HTTP proxy enabled that blocks aws-node from communicating with the K8S apiServer (e.g. #104).

You can use telnet to confirm whether the worker node can reach its apiServer:

telnet 10.100.0.1 443  <-- kubernetes SVC IP, port 443
Trying 10.100.0.1...
Connected to 10.100.0.1.
Escape character is '^]'.
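If telnet is not installed on the worker node, the same TCP reachability check can be sketched with Python's standard library. The helper name `can_reach` is made up for this example, and 10.100.0.1 is simply the service IP from the telnet session above; substitute your own cluster's kubernetes service IP.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


if __name__ == "__main__":
    # 10.100.0.1 is the kubernetes service IP used in the telnet example.
    print(can_reach("10.100.0.1", 443))
```

This only proves that a TCP handshake completes. It does not validate TLS or authentication against the API server.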

@liwenwu-amazon thanks so much for your help! Yeah this seems to work, I will bump all nodes over to the CNI 1.0.0 version and see if this fixes our issue with rate limiting. I will update this when I have more information for you.

Thanks!

For reference, I'm seeing the issue using EKS with CNI 1.0.0:

$ kubectl get ds aws-node -n kube-system -o yaml | grep image
image: 602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni:1.0.0

(I'll provide more info when it next happens)

We're experiencing this issue on a cluster using m4.xlarge instances (currently running 2 instances, only seen the issue on one of them so far). Cycling the aws-node pods as mentioned in another issue had no effect.

We use Helm and the failed upgrade was rolled back, but the pod count on either cluster shouldn't have gone above 20 during the process. Currently the node we saw the issue on is running 17 pods. While I've been investigating the issue it appears that one ENI on each of the instances was finally released, but they both still have far too many IPs associated for the number of pods running (17 pods across all namespaces on one instance, with 3 ENIs each with 15 addresses).

$ kubectl get ds aws-node -n kube-system -o yaml | grep image
image: 602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni:1.0.0
$ kubectl logs cni-metrics-helper-xxxxxxxxx -n kube-system
...
I0705 22:41:49.173918       7 metrics.go:101] Label: [name:"error" value:"free ENI: no ENI can be deleted at this time"  name:"fn" value:"decreaseIPPoolFreeENIFailed" ], Value: 167
I0705 22:41:49.173955       7 metrics.go:350] Produce GAUGE metrics: eniMaxAvailable, value: 8.000000
I0705 22:41:49.173969       7 metrics.go:350] Produce GAUGE metrics: maxIPAddresses, value: 112.000000
I0705 22:41:49.173977       7 metrics.go:340] Produce COUNTER metrics: ipamdErr, value: 167.000000
I0705 22:41:49.173983       7 metrics.go:350] Produce GAUGE metrics: assignIPAddresses, value: 18.000000
I0705 22:41:49.173996       7 metrics.go:350] Produce GAUGE metrics: totalIPAddresses, value: 70.000000
I0705 22:41:49.174000       7 metrics.go:350] Produce GAUGE metrics: eniAllocated, value: 5.000000
I0705 22:41:49.174004       7 metrics.go:350] Produce GAUGE metrics: ipamdActionInProgress, value: 0.000000

@jlogsdon, today ipamD will NOT free an ENI if any running Pod is using it. In other words, if any Pod is using a secondary IP address from this ENI, ipamD will not free that ENI, and it will increment this particular error count:

I0705 22:41:49.173918       7 metrics.go:101] Label: [name:"error" value:"free ENI: no ENI can be deleted at this time"  name:"fn" value:"decreaseIPPoolFreeENIFailed" ], Value: 167

Would the IP addresses still assigned to that ENI be unavailable for new pods? It seems like there should have been plenty of capacity for more pods given our instance size and running pod count.

You can inspect the pod IP allocation with:

curl http://localhost:61678/v1/enis | python -m json.tool

Thanks. Next time we see this I'll look at that output.

Any IP address on an ENI that is NOT assigned to a Pod is available for new Pods. In other words, all the IP addresses with "Assigned": false are available for new Pods. Here is an example from my setup:

curl http://localhost:61678/v1/enis | python -m json.tool
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   941  100   941    0     0   165k      0 --:--:-- --:--:-- --:--:--  183k
{
    "AssignedIPs": 8,
    "ENIIPPools": {
        "eni-00d33be72e3648649": {
            "AssignedIPv4Addresses": 5,
            "DeviceNumber": 2,
            "ID": "eni-00d33be72e3648649",
            "IPv4Addresses": {
                "10.0.101.97": {
                    "Assigned": true
                },
                "10.0.106.145": {
                    "Assigned": true
                },
                "10.0.107.149": {
                    "Assigned": true
                },
                "10.0.126.201": {
                    "Assigned": true
                },
                "10.0.127.191": {
                    "Assigned": true
                }
            },
            "IsPrimary": false
        },
        "eni-0353b1dce00ab1b51": {
            "AssignedIPv4Addresses": 3,
            "DeviceNumber": 0,
            "ID": "eni-0353b1dce00ab1b51",
            "IPv4Addresses": {
                "10.0.110.85": {
                    "Assigned": false  <---- this one is available for new pods
                },
                "10.0.118.236": {
                    "Assigned": false
                },

For us, because we're running pods with small resource requests, the limiting factor for a node is the number of IP addresses. I did some calculations and the best 3 types are t2.medium, t2.large and c5.large. I put the price per IP into a blog post.

I'm seeing a possibly related problem in our cluster: in some cases the worker nodes don't release pod IP addresses, which ends up filling the worker node with unusable slots until the worker node is no longer able to start new pods.

When this problem appears kubernetes thinks that the node can accept pods, so they are scheduled but stuck with this error: "Failed create pod sandbox: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "xxxxxx" network: add cmd: failed to assign an IP address to container"

Currently the only workaround I know is to ssh into the worker nodes and delete all docker containers which are in Exited-state, for example with this oneliner: "docker ps -a | grep Exited | awk '{print $1}' | xargs -n 1 docker rm"

Today we were seeing Pods fail to start in EKS:

k describe pod my-pod-fb756

Failed create pod sandbox: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "my-pod-fb756_default" network: add cmd: failed to assign an IP address to container"

It turned out our subnet had only 256 IPs and it had run out of spare IPs.
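A quick back-of-the-envelope check shows how fast a small subnet runs dry. The sketch below assumes a /24 subnet (256 addresses, as in the comment above), the five addresses AWS reserves in every subnet, and ENIs that each hold 15 addresses as on the m4.xlarge nodes described earlier in the thread. The CIDR itself and the function names are made up for illustration.

```python
import ipaddress

# AWS reserves the first four addresses and the last address of each subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_addresses(cidr):
    """Addresses in the subnet that can actually be handed out to ENIs."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def full_enis_that_fit(cidr, ips_per_eni):
    """How many fully populated ENIs the subnet can hold, ignoring other users."""
    return usable_addresses(cidr) // ips_per_eni


if __name__ == "__main__":
    cidr = "10.0.0.0/24"  # hypothetical subnet with 256 addresses
    print(usable_addresses(cidr), "usable addresses")
    print(full_enis_that_fit(cidr, 15), "ENIs with 15 addresses each")
```

Under these assumptions the subnet holds 251 usable addresses, which is 16 fully populated ENIs, or about four m4.xlarge nodes with all four ENIs attached, regardless of how many pods are actually running.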

For reference, to debug it, on Amazon Linux 2, official AMI:

  1. ssh to the EKS worker with the issue and check the ipamd logs.
  2. Use ls -ltrh /var/log/aws-routed-eni/ to see what ipamd logs have been written.
  3. tail the latest log:
$ tail /var/log/aws-routed-eni/ipamd.log.2018-12-11-14
2018-12-11T14:53:32Z [DEBUG] Found eni eni-06443b854 that have less IP address allocated: cur=0, max=14
2018-12-11T14:53:32Z [DEBUG] Attempt again to allocate IP address for eni :eni-06e104f443b854
2018-12-11T14:53:32Z [INFO] Trying to allocate 14 IP address on eni eni-06e10243b854
2018-12-11T14:53:32Z [ERROR] Failed to allocate a private IP address InsufficientFreeAddressesInSubnet: The specified subnet does not have enough free addresses to satisfy the request.
    status code: 400, request id: 236be01a9-82d9-dc815adf8c7e
2018-12-11T14:53:32Z [WARN] During eni repair: error encountered on allocate IP addressallocate ip address: failed to allocate a private IP address: InsufficientFreeAddressesInSubnet: The specified subnet does not have enough free addresses to satisfy the request.
    status code: 400, request id: 236b77dd-41a9-82d9-dc815adf8c7e
2018-12-11T14:53:32Z [DEBUG] IP pool stats: total = 0, used = 0, c.currentMaxAddrsPerENI = 14, c.maxAddrsPerENI = 14
2018-12-11T14:53:32Z [DEBUG] Start increasing IP Pool size
2018-12-11T14:53:32Z [INFO] createENI: use primary interface's config, [0xc428700 0xc4204d8730], subnet-d358a5
2018-12-11T14:53:32Z [ERROR] Failed to CreateNetworkInterface InsufficientFreeAddressesInSubnet: The specified subnet does not have enough free addresses to satisfy the request.
    status code: 400, request id: 59abf6c-b54a9a-0ea02639904a
2018-12-11T14:53:32Z [ERROR] Failed to increase pool size due to not able to allocate ENI allocate eni: failed to create eni: failed to create network interface: InsufficientFreeAddressesInSubnet: The specified subnet does not have enough free addresses to satisfy the request.
    status code: 400, request id: 59abfa6c-b1-8a9a-0ea02639904a

Aha: InsufficientFreeAddressesInSubnet.

We run a few CronJobs every minute - this burns through IPs :(

Following up on @tomfotherby's comment, it seems that we had a significant number of ENIs that were not in use but kept IPs attached, so our EKS workers' subnets were starved.

We also hit ENI limits _only_ on cronjobs. Since these only access external resources, they technically don't really need a CNI Addr (aka one from the VPC).

Some more information for context - the kubernetes scheduler is not aware of a node that's hit its ENI / Addr limits. For example:

Instance with 18 Addresses -> 18 Pods assigned -> 0 Free addresses -> 75% CPU Requested -> 75% Memory requested

In this scenario we get pods that have < 25% CPU OR Memory trying to schedule on this node when there are no addresses left.

The Kubernetes scheduler will _always_ try to assign a pod to this node (since the scheduler is not aware IP Limits are a thing).

@vpm-bradleyhession Hello, I was wondering if you ever got the cni-metrics-helper to push logs to cloudwatch? It seems that we are both using kops for our kubernetes clusters, so we might have a similar setup. I deployed the metrics helper to the cluster, created the IAM policy for cloudwatch, and attached that policy to the master/worker roles for my cluster. However, I am still seeing this error in the cni-metrics-helper pod: 'Unable to publish cloudwatch metrics: NoCredentialProviders: no valid providers in chain. Deprecated.'

Does anyone here have a reliable solution for these issues? We run into them repeatedly, and our workaround so far is to find nodes with more than X ENIs and cordon them. Eventually the cluster autoscaler takes care of removing the cordoned instances.

This is causing some real issues for us. All of our clusters are affected by this as the scheduler gets busier throughout the day. Can we acknowledge this as a bug?

@liwenwu-amazon do you have any suggestions on how to mitigate the problem? It might not affect the majority of the use-cases but when you're relying 100% on k8s+aws-cni and it happens, the disturbance on the cluster's stability is quite big. For us, it's happening literally every day now.

Our solution is to clean terminated pods, in our case every terminated pod older than one hour is deleted. This seems to prevent this problem from manifesting.
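The selection step of such a cleanup could look roughly like the sketch below. It is not the commenter's actual tooling: the function names are hypothetical, and it assumes the input is the parsed JSON of a pod list (for example from `kubectl get pods --all-namespaces -o json`), using the standard `status.phase` and `containerStatuses[].state.terminated.finishedAt` fields. Deleting the selected pods is left out.

```python
from datetime import datetime, timedelta, timezone

TERMINAL_PHASES = {"Succeeded", "Failed"}


def parse_time(value):
    """Parse a Kubernetes timestamp such as 2019-01-21T09:33:19Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def finished_at(pod):
    """Latest container finish time for the pod, or None if unknown."""
    times = [
        parse_time(status["state"]["terminated"]["finishedAt"])
        for status in pod.get("status", {}).get("containerStatuses", [])
        if "terminated" in status.get("state", {})
    ]
    return max(times) if times else None


def pods_to_delete(pod_list, now, max_age=timedelta(hours=1)):
    """Return (namespace, name) for terminated pods older than max_age."""
    stale = []
    for pod in pod_list["items"]:
        if pod["status"].get("phase") not in TERMINAL_PHASES:
            continue  # still pending or running
        done = finished_at(pod)
        if done is not None and now - done > max_age:
            stale.append((pod["metadata"]["namespace"], pod["metadata"]["name"]))
    return stale
```

Passing `now` in explicitly keeps the function easy to test; a real run would use `datetime.now(timezone.utc)`.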

Yeah, we have set successfulJobsHistoryLimit (and failedJobsHistoryLimit, too) to 1 on most of our jobs and it seems to mitigate the issue for us. However, this still needs to be acknowledged with an upstream fix.

@garo Could you please explain "clean terminated pods" ? This could be a good solution for us as well

To be clear again, our problem is that the node does not have many pods if you do a describe node or even run ifconfig on the host :

~
Non-terminated Pods: (7 in total)
  Namespace    Name                                                    CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------    ----                                                    ------------  ----------  ---------------  -------------  ---
  default      some-service-http-59dbc5678-t7rpl                       50m (1%)      0 (0%)      100Mi (0%)       0 (0%)         21d
  kube-system  aws-node-b6xpw                                          10m (0%)      0 (0%)      0 (0%)           0 (0%)         29d
  kube-system  kube-proxy-ip-172-23-113-91.eu-west-1.compute.internal  100m (2%)     0 (0%)      0 (0%)           0 (0%)         29d
  logging      kube-gelf-d4ztv                                         0 (0%)        0 (0%)      0 (0%)           0 (0%)         29d
  monitoring   kube-prometheus-prometheus-node-exporter-v8kqz          0 (0%)        0 (0%)      0 (0%)           0 (0%)         29d
  tracing      jaeger-agent-daemonset-zmsqb                            0 (0%)        0 (0%)      0 (0%)           0 (0%)         3d23h
  utility      kiam-agent-l88bs                                        0 (0%)        0 (0%)      0 (0%)           0 (0%)         29d
~

But looking at the node from aws console:

[screenshot of the node in the AWS console]

So how can we clean up these secondary IPs ?

@harshal-shah I run a slightly modified version of https://github.com/hjacobs/kube-job-cleaner which just does the cleanup in an infinite loop every five minutes.

If you ssh into your affected worker node and type "docker ps -a" you should see a large number of Exited containers, and these seem to be what is holding the cni plugin back from releasing the IPs. It seems that deleting terminated pods will eventually clean these away and thus provide a workaround.

I checked on the node mentioned above and there were a few dead containers. Even after deleting those containers, the IPs are not released.

Update: We're now setting --max-pods on the Kubelet so that it matches the number of ENIs (and addresses available). We are setting this per node group (using kops).

For kops specifically -

spec:
  kubelet:
    maxPods: <number>
https://gist.github.com/vpm-bradleyhession/b43a61a94e724f040dbf822a1a206f0d

This seems like the only mitigation that is holding strong for us at the moment.
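For picking the `<number>` above, a common rule of thumb is the calculation behind the per-instance-type max-pods values shipped with the EKS worker AMI, to the best of my knowledge. Treat it as a sketch: the ENI and address counts per instance type below are from the EC2 documentation as I recall it and should be checked against current limits.

```python
def max_pods(enis, ips_per_eni):
    """Pods a node can host when every pod needs its own VPC address.

    Each ENI's primary address is not handed to pods, hence the -1.
    The +2 allows for pods on the host network (such as aws-node and
    kube-proxy), which do not consume a secondary address.
    """
    return enis * (ips_per_eni - 1) + 2


if __name__ == "__main__":
    # (ENIs, IPv4 addresses per ENI); verify against current EC2 limits.
    for name, (enis, ips) in {
        "t2.medium": (3, 6),
        "c5.large": (3, 10),
        "m4.xlarge": (4, 15),
    }.items():
        print(name, max_pods(enis, ips))
```

Setting the kubelet's pod limit this way makes the scheduler stop placing pods on a node once its addresses are exhausted, which is the gap described earlier in the thread.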

Hello @mogren, could you comment on this issue please? we're getting affected on a daily basis, would be nice to have an official recommendation from AWS here. Thank you.

Hi @RTodorov!

What version of the plugin are you using? Is this cluster running on EKS? We have a known issue with the CNI: if ipamd is killed while it is in the middle of expanding ENIs (e.g. after having created a new ENI but before attaching it), the new ENI might not be cleaned up.

Could you run /opt/cni/bin/aws-cni-support.sh (comment out line 34 if you get the error reported in #285) and send it to [email protected]?

Hello @mogren
We are using the following image 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:1.3.0

The cluster is created via kops (version 1.11.0)

I have sent you the output of /opt/cni/bin/aws-cni-support.sh over email.
The node being checked here has 31 Non terminated pods.
We have also set our MaxPods limit to 55.

Thanks @harshal-shah, I discussed this with Liwen and basically it comes down to an optimization trade-off in the CNI plugin itself.

The way the CNI works is that it allocates IPs in bulk, and adds new ENIs when needed. It un-assigns IPs when the pod gets deleted, but keeps them allocated to the ENI as long as there are pods active on that ENI. (Check the eni.output file for details.) When a new pod gets scheduled to the worker, it will randomly pick any existing IP address that is not assigned and allocate it to the new pod.

As soon as an ENI has no assigned IPs, meaning all pods using that ENI have been deleted, the ENI will be released and all the IPs with it. The reason for this behavior is both to make scheduling new pods a lot faster, since there is no need to call another service, and to prevent throttling from calling EC2 too much. (Throttling can still happen when scheduling a lot of pods quickly on new workers, not configuring WARM_IP_TARGET.)

If you run out of IP addresses in your Subnet, one solution is to use CNI custom network config and create separate subnets for the pods. Would that work for you?
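The release rule described in this comment explains why a node with few pods can still hold many addresses. The toy model below is only an illustration of that rule, not ipamd's code: the pool layout, ENI names and helper functions are invented.

```python
def releasable_enis(pool):
    """ENIs with no assigned IPs; per the rule above, only these can be freed."""
    return [eni for eni, ips in pool.items() if not any(ips.values())]


def ips_held(pool):
    """Addresses kept on ENIs that cannot be released yet."""
    return sum(len(ips) for ips in pool.values() if any(ips.values()))


# Hypothetical node: 3 ENIs with 4 IPs each, True = assigned to a pod.
# Two surviving pods happen to sit on different ENIs.
POOL = {
    "eni-a": {"ip1": True, "ip2": False, "ip3": False, "ip4": False},
    "eni-b": {"ip1": False, "ip2": True, "ip3": False, "ip4": False},
    "eni-c": {"ip1": False, "ip2": False, "ip3": False, "ip4": False},
}

if __name__ == "__main__":
    print("releasable:", releasable_enis(POOL))
    print("addresses still held:", ips_held(POOL))
```

In this example two pods pin eight addresses, because IPs are picked at random and a single remaining pod is enough to keep its whole ENI attached.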

Hi @mogren
We ran into this issue today as well.
I'm sending you further details over email.

One thing that we see in ipamd logs is

~
2019-01-21T09:33:19Z [DEBUG] UnAssignIPv4Address: IP address pool stats: total:56, assigned 27, pod(Name: masked-service-projector-6ffc8b489c-pcm2d, Namespace: default, Container 403ceda1c2b1665fa89745f36755d0621c2813584e409036389488420f1860e3)
2019-01-21T09:33:19Z [WARN] UnassignIPv4Address: Failed to find pod masked-service-projector-6ffc8b489c-pcm2d namespace default Container 403ceda1c2b1665fa89745f36755d0621c2813584e409036389488420f1860e3
2019-01-21T09:33:19Z [DEBUG] UnAssignIPv4Address: IP address pool stats: total:56, assigned 27, pod(Name: masked-service-projector-6ffc8b489c-pcm2d, Namespace: default, Container )
2019-01-21T09:33:19Z [WARN] UnassignIPv4Address: Failed to find pod masked-service-projector-6ffc8b489c-pcm2d namespace default Container
2019-01-21T09:33:19Z [INFO] Send DelNetworkReply: IPv4Addr , DeviceNumber: 0, err: datastore: unknown pod
2019-01-21T09:33:19Z [INFO] Received DelNetwork for IP , Pod masked2-fragment-9f98b6ddf-92vsj, Namespace default, Container 1a159e21d3e7063f477f05d6cd6837c08edd6f7f772f3ded635d97cfe1c7d5f2
2019-01-21T09:33:19Z [DEBUG] UnAssignIPv4Address: IP address pool stats: total:56, assigned 27, pod(Name: masked2-fragment-9f98b6ddf-92vsj, Namespace: default, Container 1a159e21d3e7063f477f05d6cd6837c08edd6f7f772f3ded635d97cfe1c7d5f2)
~

Hi @mogren, we faced this issue again today on a very new node of our cluster. It had fewer than the prescribed maximum of 55 pods. Following is the event trail for a pod that did not get an IP address for a few minutes and then started working fine.

~
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 11m default-scheduler Successfully assigned default/masked-68c4cf5845-62ngd to ip-172-23-69-148.eu-west-1.compute.internal
Warning FailedCreatePodSandBox 10m kubelet, ip-172-23-69-148.eu-west-1.compute.internal Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "160c971186b055c2cf3dacc3d76765123f9b97c5e91998d1e086306146888583" network for pod "masked-68c4cf5845-62ngd": NetworkPlugin cni failed to set up pod "masked-68c4cf5845-62ngd_default" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container "160c971186b055c2cf3dacc3d76765123f9b97c5e91998d1e086306146888583" network for pod "masked-68c4cf5845-62ngd": NetworkPlugin cni failed to teardown pod "masked-68c4cf5845-62ngd_default" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"]
Warning FailedCreatePodSandBox 10m kubelet, ip-172-23-69-148.eu-west-1.compute.internal Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "40bb89f933f46ac5a08267bc091e3572e137adeaf721d221b4bac55c86b98c66" network for pod "masked-68c4cf5845-62ngd": NetworkPlugin cni failed to set up pod "masked-68c4cf5845-62ngd_default" network: add cmd: failed to assign an IP address to container
Warning FailedCreatePodSandBox 10m kubelet, ip-172-23-69-148.eu-west-1.compute.internal Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "ace11f03d5f908014d45b75efd2fe4105f208f42ef3d84e6cd97d7648b62990a" network for pod "masked-68c4cf5845-62ngd": NetworkPlugin cni failed to set up pod "masked-68c4cf5845-62ngd_default" network: add cmd: failed to assign an IP address to container
Normal SandboxChanged 10m (x3 over 10m) kubelet, ip-172-23-69-148.eu-west-1.compute.internal Pod sandbox changed, it will be killed and re-created.
Normal Pulling 10m kubelet, ip-172-23-69-148.eu-west-1.compute.internal pulling image "quay.io/hellofresh/state-machine:0.86.0"
Normal Pulled 8m16s kubelet, ip-172-23-69-148.eu-west-1.compute.internal Successfully pulled image "quay.io/hellofresh/state-machine:0.86.0"
Normal Created 8m15s kubelet, ip-172-23-69-148.eu-west-1.compute.internal Created container
Normal Started 8m14s kubelet, ip-172-23-69-148.eu-west-1.compute.internal Started container
~

I am also sending the tarball with logs to you separately. Hope this helps.

@harshal-shah Thanks for reporting this. We have gone through the logs and found some minor issues, but have not managed to determine the root cause yet. The previous logs from the 22nd had a lot more data and I'll keep going through them to see if we can figure out why ipamd gets restarted.

I'm facing the same issue. In my case I just followed the tutorial and created a VPC, a cluster and 1 worker node. Now I cannot create any pods :(.

Here's what I get from kubectl get events:

7s 7s 1 my-podd-1.158172f02248703c Pod Normal Scheduled default-scheduler Successfully assigned default/my-podd-1 to ip-192-168-145-72.eu-west-1.compute.internal

6s 6s 1 my-podd-1.158172f04749da90 Pod Warning FailedCreatePodSandBox kubelet, ip-192-168-145-72.eu-west-1.compute.internal Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "4e42f734b8575d5c2554f55eef1cc215727809913ace8073564a3a4fc34cfd14" network for pod "my-podd-1": NetworkPlugin cni failed to set up pod "my-podd-1_default" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container "4e42f734b8575d5c2554f55eef1cc215727809913ace8073564a3a4fc34cfd14" network for pod "my-podd-1": NetworkPlugin cni failed to teardown pod "my-podd-1_default" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"]

6s 6s 1 my-podd-1.158172f078d9e780 Pod Normal SandboxChanged kubelet, ip-192-168-145-72.eu-west-1.compute.internal Pod sandbox changed, it will be killed and re-created.

@paksv - I encountered the same issue.

It looks like the pods that are created (including the aws-node CNI pod) aren't able to communicate with the API server via the cross-account ENIs.

Upon checking the ingress rules of the security groups associated with the cluster, I found that the 443 rule from the worker nodes' security group was missing. I manually added an ingress rule allowing 443 from the worker nodes to my cluster security group, which solved my issue.

Running into the same issue here in a deployment. If I delete the pod the next one created usually starts up fine.

Resolving since v1.5.0 is released. This version contains a lot of changes in how we allocate ENIs, and we have stopped force detaching them.

If using too many ENIs is an issue, consider setting the WARM_IP_TARGET environment variable.
