Calico: Assigned ipBlocks are not released

Created on 9 Jul 2019 · 21 comments · Source: projectcalico/calico

Assigned blocks of IP addresses (etcd location /calico/ipam/v2/assignment/ipv4/block/<ip-block>) are not released even when they are no longer assigned to any node.

Expected Behavior

When a block is no longer assigned to any node (there is no entry for it at /calico/ipam/v2/host/<host>/<block>), I expect it to be released from /calico/ipam/v2/assignment/ipv4/block/<ip-block>.

Current Behavior

Many blocks that are not assigned to any node are still kept in assignments.

Possible Solution

When a block is not assigned to any host, release it from assignments.

Steps to Reproduce (for bugs)

Create a Kubernetes cluster with Calico as the CNI plugin.
Create a number of deployments/replicasets/jobs that spawn enough pods to have as many blocks assigned to hosts as possible. To make this easier, use a small IPAM pool (e.g. /27) with a small block size (e.g. /29).

Context

In an old cluster with an IPAM pool of size /18 and the default block size of /26, we got into a situation where there were 16 blocks assigned to hosts (/calico/ipam/v2/host/<host>/<block>) but ~240 blocks in assignments (/calico/ipam/v2/assignment/). That led us to errors like:

  Warning  FailedCreatePodSandBox  11m (x815 over 44h)    kubelet, worker-x53vn-7f9964b764-2rlb6  (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "495cffc4ae7e0e717a62ca01f8249a42928997e9d37e604e2bf2440d108ff0f4" network for pod "test-report-service-7c7f8486bd-qkbl8": NetworkPlugin cni failed to set up pod "test-report-service-7c7f8486bd-qkbl8_default" network: failed to request 1 IPv4 addresses. IPAM allocated only 0

After manually cleaning etcd of all the blocks that were not assigned to hosts, the issues were resolved. E.g.:

# get IP blocks marked as assigned
etcdctl get /calico/ipam/v2/assignment/ipv4/block --prefix --keys-only | grep block | awk -F "/" '{print $NF}' > assigned-by-blocks
# get IP blocks actually used by nodes
etcdctl get /calico/ipam/v2/host/ --prefix --keys-only | grep block | awk -F "/" '{print $NF}' > assigned-by-hosts
# delete the blocks that are in the first list but not in the second (whole-line match)
for block in $(grep -Fxvf assigned-by-hosts assigned-by-blocks); do etcdctl del "/calico/ipam/v2/assignment/ipv4/block/${block}"; done
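
Before deleting anything, it can help to compute the orphaned set as a dry run. The following is a minimal sketch in Python, not part of Calico: it assumes the two key listings have already been fetched (for example with the etcdctl commands above), and the names block_name and orphaned_blocks as well as the sample keys are made up for illustration. It compares whole block names rather than substrings.

```python
def block_name(key: str) -> str:
    """Return the last path component of an etcd key, e.g. '10.0.0.0-26'."""
    return key.strip().rstrip("/").rsplit("/", 1)[-1]


def orphaned_blocks(assignment_keys, host_keys):
    """Blocks present under assignment/ but not claimed by any host.

    Both arguments are iterables of etcd key strings; blank lines in the
    listings are ignored.
    """
    claimed = {block_name(k) for k in host_keys if k.strip()}
    assigned = {block_name(k) for k in assignment_keys if k.strip()}
    return sorted(assigned - claimed)


if __name__ == "__main__":
    # Hypothetical sample keys, for illustration only.
    assignments = [
        "/calico/ipam/v2/assignment/ipv4/block/10.0.0.0-26",
        "/calico/ipam/v2/assignment/ipv4/block/10.0.0.64-26",
        "/calico/ipam/v2/assignment/ipv4/block/110.0.0.0-26",
    ]
    hosts = ["/calico/ipam/v2/host/node-a/ipv4/block/110.0.0.0-26"]
    for name in orphaned_blocks(assignments, hosts):
        print(name)  # candidates only; nothing is deleted here
```

Note that a block in this list may still contain allocated addresses, so each candidate should be inspected before it is removed.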

Your Environment

Calico version: 3.7.2
Orchestrator version (e.g. kubernetes, mesos, rkt): Kubernetes 1.14.3
Operating System and version: CoreOS 2191.0.0
Link to your project (optional):

kind/bug

corest

👍7

All 21 comments

Are you running calico-kube-controllers? That's supposed to clean up blocks when the node is deleted.

fasaxc on 9 Jul 2019

@fasaxc yes, we are running calico-kube-controllers, and it didn't clean up the IP blocks.

corest on 9 Jul 2019

Hmm, maybe we're missing that logic for etcd mode. Looks like it might have been added to Kubernetes datastore mode only.

fasaxc on 9 Jul 2019

We clean up IPAM blocks in etcd mode as well, but through the standard node deletion process.

Do you see non-existent nodes hanging around? calicoctl get nodes should only show real nodes that haven't been torn down.

caseydavenport on 9 Jul 2019

IP blocks won't get deleted unless:

They no longer have affinity to a node, AND
There are no IPs assigned within the block.

Can you check to see if the unexpected blocks meet those criteria?

caseydavenport on 9 Jul 2019
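
The two deletion criteria in the comment above can be checked against a block's JSON value. This is a minimal sketch, assuming the block is stored in the JSON shape posted later in this thread (an "affinity" field and an "allocations" array where null means free); the function name block_is_releasable and the sample blocks are hypothetical and not part of Calico.

```python
import json


def block_is_releasable(block_json: str) -> bool:
    """True when the block has no node affinity and no allocated ordinal."""
    block = json.loads(block_json)
    has_affinity = block.get("affinity") is not None
    # An allocated entry can be the integer 0, which is falsy, so compare
    # against None explicitly instead of relying on truthiness.
    has_allocations = any(a is not None for a in block.get("allocations", []))
    return not has_affinity and not has_allocations


if __name__ == "__main__":
    # Hypothetical /30 blocks in the same JSON shape, for illustration only.
    empty = '{"cidr":"10.0.0.0/30","affinity":null,"allocations":[null,null,null,null]}'
    leaked = '{"cidr":"10.0.0.4/30","affinity":null,"allocations":[0,null,null,null]}'
    print(block_is_releasable(empty))   # True
    print(block_is_releasable(leaked))  # False
```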

What I've tried: draining a node and restarting it, so that all the pods were unscheduled from that node. After the node was completely deleted from Kubernetes, it was also no longer visible in the list of Calico peers, so it was deleted from /calico/ipam/v2/host/<host>/<block>. But the released blocks were not cleaned up from assignments.

corest on 9 Jul 2019

@corest could you paste in the contents of one of the blocks that you think should be deleted?

When a node is deleted, we free any pod IP addresses that we think should be removed here: https://github.com/projectcalico/libcalico-go/blob/master/lib/clientv3/node.go#L118

Then, we remove that host's tree here: https://github.com/projectcalico/libcalico-go/blob/master/lib/ipam/ipam.go#L1053-L1110

If the /calico/ipam/v2/host entry is gone for that node, we're likely getting that far.

So long as all the IP addresses are successfully released, then I would expect executing this line to also delete the blocks: https://github.com/projectcalico/libcalico-go/blob/master/lib/ipam/ipam.go#L1065

So, my suspicion is that there is still an address remaining in the block which is preventing it from being deleted. My (further) suspicion is that it is the IPIP tunnel address that is being left around, since I don't see that it gets released in that code snippet above!

caseydavenport on 11 Jul 2019

Thanks @corest for raising this.

In our production cluster running on k8s v1.10.11 and Calico v3.2.3, we managed to hit the limit for IP blocks.

spec:
    blockSize: 26
    cidr: 100.96.0.0/11

/11 gives us a total of 2097150 addresses
block size of /26 is 64 addresses

We use cluster autoscaling, so nodes are scaled down very often, multiple times a day. In the past few weeks we have been hitting https://github.com/projectcalico/libcalico-go/blob/release-v3.2/lib/ipam/ipam.go#L329
because of which, whenever a new node is spun up by the autoscaler, calico-node keeps retrying and goes into a crash loop.
Our /calico/ipam/v2/assignment/ipv4/block has 32671 blocks that are not used by current hosts.

Looking at one particular node that does not exist in the cluster today as an example to trace the IPAM activities, I see

02/07/2019 04:46:20.524 ipam.go 910: Releasing IPAM affinities for host host=\"ip-xx-xx-xx-xx\

02/07/2019 04:46:20.524  ipam.go 917: Querying IPAM host tree in data store host=\"ip-xx-xx-xx-xx\

02/07/2019 04:46:20.524  ipam.go 921: Failed to get IPAM host error=resource does not exist: IPAMHostKey(host=ip-xx-xx-xx-xx) with error: <nil> host=\"ip-xx-xx-xx-xx\"

We never reach https://github.com/projectcalico/libcalico-go/blob/release-v3.2/lib/ipam/ipam.go#L926

renilthomas on 11 Jul 2019
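
To put the numbers in the comment above in perspective, the pool capacity can be worked out directly. A minimal sketch using Python's standard ipaddress module; the helper name blocks_in_pool is made up for illustration. A /11 pool split into /26 blocks holds 2^(26-11) = 32768 blocks, so 32671 unused-but-retained blocks leave almost nothing for new nodes. The same formula gives 256 blocks for the /18 pool in the original report, which is consistent with 16 assigned plus ~240 leaked.

```python
import ipaddress


def blocks_in_pool(pool_cidr: str, block_size: int) -> int:
    """Number of blocks of prefix length block_size that fit in the pool."""
    pool = ipaddress.ip_network(pool_cidr)
    if block_size < pool.prefixlen:
        raise ValueError("block is larger than the pool")
    return 2 ** (block_size - pool.prefixlen)


if __name__ == "__main__":
    total = blocks_in_pool("100.96.0.0/11", 26)
    leaked = 32671  # unused blocks reported in the comment above
    print(total, total - leaked)  # 32768 97
```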

Here is the content of an unreleased block (it doesn't even contain any information about the node):
https://gist.github.com/corest/5863287f36f59ac80a36f57aad42b62a

All those handles are not cleaned up from /calico/ipam/v2/handle/<handle-id>.

corest on 11 Jul 2019

👍1

@corest thanks for that. It definitely looks like there are still addresses assigned within the block.

Have you checked to see if those addresses are actually in-use within the cluster?

It would be useful to know, for example, if any workload has some of these addresses (which appear to be allocated in the output you showed)

172.18.159.195
172.18.159.196
172.18.159.199
172.18.159.201

caseydavenport on 11 Jul 2019

Just checked - there are no pods with IP addresses from those handles' subnets.

corest on 12 Jul 2019

Ok, this definitely sounds like a bug. I'm not sure how we got into this state though. We might be able to add a backstop to kube-controllers to reconcile these away when it spots unused IP addresses, but I'd like to understand the root cause of this anyway.

I think it would be good to see the full kube-controllers log (probably set to debug level logging) and if possible, the full CNI plugin logs (from the container runtime) on a node that has been removed from the cluster.

kube-controllers is responsible for identifying when nodes are removed and cleaning this up, so that's the more important of the two.

caseydavenport on 17 Jul 2019

We are having the same issue. We are using calico 3.3 on kubernetes 1.11.

Containers even get stuck in the creation phase because of these unreleased affinities.
In our case it happens because the IPAM host cannot be retrieved when calico-kube-controllers tries to delete the node: https://github.com/projectcalico/libcalico-go/blob/4346117ce592eedcc83269c09fbc4a1e652d0b76/lib/ipam/ipam.go#L1081

After a while our etcd is full of this kind of data:

{"cidr":"100.100.10.0/26","affinity":null,"strictAffinity":false,"allocations":[0,null,null,null,null,0,0,null,null,0,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,0,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null],"unallocated":[18,20,19,24,21,12,26,23,28,27,30,31,34,32,33,4,36,39,62,35,37,38,41,42,40,44,47,45,46,49,48,51,50,52,55,53,54,2,56,58,57,60,43,22,63,1,3,61,7,11,13,10,14,16,15,17,59,8,29],"attributes":[{"handle_id":null,"secondary":null}]}

cvs77 on 22 Jul 2019

👍1

> We are using calico 3.3 on kubernetes 1.11.

We've made a number of improvements to Calico's IPAM since that release, including several bug fixes in this area. I recommend trying on a cluster using the latest Calico release to see if you are still affected.

That said, it does seem like there is at least one issue that still exists in v3.7+ - it's just that it might not be the same issue you are encountering on v3.3

caseydavenport on 31 Aug 2019

I've spent a fair bit of time trying to break a Calico v3.8 cluster in the way you guys seem to be experiencing it above, unfortunately with no luck.

Do you happen to have the kube-controllers logs from a cluster that is in this state? And perhaps the logs from one of the nodes?

caseydavenport on 20 Sep 2019

{"cidr":"100.100.10.0/26","affinity":null,"strictAffinity":false,"allocations":[0,null,null,null,null,0,0,null,null,0,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,0,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null],"unallocated":[18,20,19,24,21,12,26,23,28,27,30,31,34,32,33,4,36,39,62,35,37,38,41,42,40,44,47,45,46,49,48,51,50,52,55,53,54,2,56,58,57,60,43,22,63,1,3,61,7,11,13,10,14,16,15,17,59,8,29],"attributes":[{"handle_id":null,"secondary":null}]}

@cvs77 I think the block you posted (quoted above) indicates a different issue from the one @corest is experiencing. The allocated entries all point to the same attribute index (0), which I think means other nodes are borrowing tunnel addresses from this block.

I suspect that you're encountering the issue that was fixed by this PR in Calico v3.9: https://github.com/projectcalico/libcalico-go/pull/1111

Upgrading to v3.9 will be part of the fix, but you'll also need to delete those blocks which aren't being used so they can become available again. You will need to make sure you either only delete blocks which don't have active IP allocations in them, or restart the pods / nodes which have IPs allocated from the blocks after deleting them.

caseydavenport on 20 Sep 2019
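
To decide which blocks are safe to delete, or which pods and nodes need a restart afterwards, it helps to list the addresses a block still holds. This is a minimal sketch, not Calico code: it assumes the block JSON shape shown above and that ordinal i in "allocations" corresponds to the block's network address plus i. The function name allocated_ips and the sample /29 block are hypothetical.

```python
import ipaddress
import json


def allocated_ips(block_json: str):
    """Map each allocated ordinal in a block to its IP address.

    Returns a list of (ip, attribute_index) tuples; the attribute index is
    the value stored in "allocations" and points into "attributes".
    """
    block = json.loads(block_json)
    base = ipaddress.ip_network(block["cidr"]).network_address
    return [
        (str(base + ordinal), attr_index)
        for ordinal, attr_index in enumerate(block["allocations"])
        if attr_index is not None
    ]


if __name__ == "__main__":
    # Hypothetical /29 block in the same shape as the one quoted above.
    sample = (
        '{"cidr":"10.0.0.8/29","affinity":null,'
        '"allocations":[0,null,null,0,null,null,null,null],'
        '"attributes":[{"handle_id":null,"secondary":null}]}'
    )
    print(allocated_ips(sample))  # [('10.0.0.8', 0), ('10.0.0.11', 0)]
```

The resulting addresses can then be compared against the pod and tunnel IPs that are actually in use in the cluster.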

> network: failed to request 1 IPv4 addresses. IPAM allocated only 0

For completeness, it would also be useful if you could download the latest calicoctl tool and run calicoctl ipam show --show-blocks

caseydavenport on 20 Sep 2019

I don't think I am experiencing the exact issue described here, but the solution provided in the original description saved our bacon in our production cluster, so I wanted to say thanks here and leave the details for anyone in the future who may benefit from it as we have.

We are on k8s 1.10 and Calico 2.6.7, however we upgraded to Calico 2.6.12 and still saw this issue, although we are still using 1.03 of the kube-controller which may be where the bug still is.

Our cluster had been running fine for more than a year, and then all of a sudden, when new nodes came up, the Calico pod and kube-proxy pod would start, but any pod that required IP assignment would fail after some time with the error:

network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized]Failed create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded

Effectively the node would come up and claim it was healthy, but no pods could actually run on it. When this first started appearing, the node would "fix" itself after about 20 minutes, so it was just a nuisance that slowed down our scaling, but it didn't kill us. As time went on, that delay slowly grew from 20 minutes to hours, to the point where only 20% of our nodes would ever come up. In retrospect it appears that some IP blocks were getting released while others weren't, so we were slowly choking ourselves to the point where we wouldn't be able to bring up anything.

Luckily after many nights of digging into this we were able to piece it together and through this bug report and the remedy provided by @corest we were able to determine that clearing the allocated blocks that were not actually allocated to a node immediately fixed the problem. Once we did that everything started working again and we are in a good state now. It does appear this is likely fixed in newer versions, but for anyone still stuck on older versions like we are this may be something you run into as well.

Another item of note: as we got into a worse state, in addition to the pods not coming up with that error, we also saw our master nodes getting hit incredibly hard, with high CPU and constant network and IO traffic. Before we knew what was going on, we thought our etcd or master nodes were the problem and scaled those up, but to no avail. Now that we know what the issue is, it turns out that Calico was pounding etcd trying to find an available block to hand out. Once we cleared the blocks, the CPU and load on etcd went back to normal and we were able to scale our master nodes back down.

We also use spot nodes heavily, which cycle quite a bit, as well as cron jobs, so it is possible that the way we are shutting down isn't giving Calico time to do what it needs to in some cases.

I know this likely doesn't help in triaging the current issue since we are so far behind on our version, but thanks community! If by some way this does help, please let me know if there is any more info I can provide to help diagnose the issue.

enoren on 21 Sep 2019

Wanted to check up on this one - has anyone had any breakthroughs or been able to gather any additional information?

caseydavenport on 8 Jan 2020

I've been looking at this a bit more recently. We have this existing controller function that handles cleaning up orphaned IPAM blocks (and any addresses within) when using the Kubernetes API / CRD backed data store: https://github.com/projectcalico/kube-controllers/blob/master/pkg/controllers/node/kdd.go#L36

However, it's not run when in etcd mode. I think this is because we had an existing solution for etcd mode. However, the main difference between these two is that the etcd implementation looks at workload endpoints in the data store to determine which addresses need to be cleaned up, whereas the kubernetes variant looks at addresses assigned within blocks and compares them to pods that exist in the k8s API. (When we implemented the k8s version, we also tagged IP allocations with new metadata that previously didn't exist and so wouldn't have been possible for etcd).

This means that in etcd mode, it's possible a workload endpoint could be deleted without the IP allocation being freed, and then we'd effectively leak that address.

Doing a code read, I'm not sure there's any reason we _can't_ now enable this IPAM cleanup logic for etcd mode as well, which I believe will help catch issues like this one. Unless we also make other changes, it will only do it when a node is removed from the cluster, but we could also consider adding a periodic sync.

The thing to note about the existing implementation is that it will only clean up an IPAM block when:

The node is no longer present in the k8s API
All the allocations within that block are no longer in use by any pod in the k8s API.

That means it won't, without modification, catch cases where a single IP has been leaked on a node that is still around. I think this is probably OK, because:

We want to be extra safe here anyway. Accidentally cleaning up an IP address that is still in use would be bad.
All the reported instances of this seem to have to do with node auto-scaling, which leads me to believe the leaked addresses are related to nodes being deleted anyway.

So, my proposal would be to start by enabling the existing cleanup logic on etcd, and see if that resolves the issues listed here.

caseydavenport on 13 Feb 2020
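
The two cleanup conditions described in the comment above can be sketched as a single predicate. This is only an illustration of the decision, not the kube-controllers implementation: the function name block_can_be_cleaned, its parameters, and the sample data are all hypothetical, and the inputs are assumed to have been collected from the block and from the Kubernetes API beforehand.

```python
def block_can_be_cleaned(node, allocated_ips, live_nodes, pod_ips):
    """Apply both conditions: node gone, and no allocation still in use.

    node          -- name of the node the block belonged to
    allocated_ips -- IPs currently marked as allocated in the block
    live_nodes    -- node names present in the Kubernetes API
    pod_ips       -- IPs of pods present in the Kubernetes API
    """
    node_gone = node not in live_nodes
    still_used = set(allocated_ips) & set(pod_ips)
    return node_gone and not still_used


if __name__ == "__main__":
    # Hypothetical cluster state, for illustration only.
    live_nodes = {"node-a"}
    pod_ips = {"10.0.0.5"}
    print(block_can_be_cleaned("node-b", ["10.0.0.70"], live_nodes, pod_ips))  # True
    print(block_can_be_cleaned("node-b", ["10.0.0.5"], live_nodes, pod_ips))   # False
    print(block_can_be_cleaned("node-a", [], live_nodes, pod_ips))             # False
```

As the last case shows, a block on a node that still exists is never cleaned up by this rule, which matches the conservative behaviour described above.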

I'm going to close this since we've fixed a number of IP address leaking issues in the last couple of releases.

I think if users still encounter issues with the latest release we should open new issues for further investigation. Thanks all.

caseydavenport on 24 Apr 2020
