Assigned blocks of IP addresses (etcd location /calico/ipam/v2/assignment/ipv4/block/<ip-block>) are not released even when they are no longer assigned to any node.
When a block is no longer assigned to any node (there is no entry for it at /calico/ipam/v2/host/<host>/<block>), I expect it to be released from /calico/ipam/v2/assignment/ipv4/block/<ip-block>.
Many blocks that are not assigned to any node are still kept in assignments.
When a block is no longer assigned to a host, release it from assignments.
/27) with small subnet size (e.g. /29).
In the old cluster with IPAM network size /18 and subnet block size /26 (default) we got into a situation where there were 16 subnets assigned to hosts (/calico/ipam/v2/host/<host>/<block>), but ~240 subnets in assignments (/calico/ipam/v2/assignment/). That led us to errors like
Warning FailedCreatePodSandBox 11m (x815 over 44h) kubelet, worker-x53vn-7f9964b764-2rlb6 (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "495cffc4ae7e0e717a62ca01f8249a42928997e9d37e604e2bf2440d108ff0f4" network for pod "test-report-service-7c7f8486bd-qkbl8": NetworkPlugin cni failed to set up pod "test-report-service-7c7f8486bd-qkbl8_default" network: failed to request 1 IPv4 addresses. IPAM allocated only 0
After manually cleaning etcd of all the subnets that were not assigned to hosts, the issues were resolved. E.g.
# get ipblocks, marked as assigned
etcdctl get /calico/ipam/v2/assignment/ipv4/block --prefix --keys-only | grep block | awk -F "/" '{print $NF}' > assigned-by-blocks
# get ipblocks, actually used by nodes
etcdctl get /calico/ipam/v2/host/ --prefix --keys-only | grep block | awk -F "/" '{print $NF}' > assigned-by-nodes
# delete the blocks that are marked as assigned but are not used by any node
for block in $(grep -Fvxf assigned-by-nodes assigned-by-blocks); do etcdctl del "/calico/ipam/v2/assignment/ipv4/block/${block}"; done
3.7.2
Are you running calico-kube-controllers? That's supposed to clean up blocks when the node is deleted.
@fasaxc yes, we are running kube-controllers, and it didn't clean up the IP blocks.
Hmm, maybe we're missing that logic for etcd mode. Looks like it might have been added to Kubernetes datastore mode only.
We clean up IPAM blocks in etcd mode as well, but through the standard node deletion process.
Do you see non-existent nodes hanging around? calicoctl get nodes should only show real nodes that haven't been torn down.
IP blocks won't get deleted unless:
Can you check to see if the unexpected blocks meet those criteria?
What I've tried: draining a node and restarting it, so all the pods were unscheduled from that node. After that node was completely deleted from Kubernetes, it also wasn't visible in the list of Calico peers. It was therefore deleted from /calico/ipam/v2/host/<host>/<block>, but none of the released blocks were cleaned up from assignments.
@corest could you paste in the contents of one of the blocks that you think should be deleted?
When a node is deleted, we free any pod IP addresses that we think should be removed here: https://github.com/projectcalico/libcalico-go/blob/master/lib/clientv3/node.go#L118
Then, we remove that host's tree here: https://github.com/projectcalico/libcalico-go/blob/master/lib/ipam/ipam.go#L1053-L1110
If the /calico/ipam/v2/host entry is gone for that node, we're likely getting that far.
So long as all the IP addresses are successfully released, then I would expect executing this line to also delete the blocks: https://github.com/projectcalico/libcalico-go/blob/master/lib/ipam/ipam.go#L1065
So, my suspicion is that there is still an address remaining in the block which is preventing it from being deleted. My (further) suspicion is that it is the IPIP tunnel address that is being left around, since I don't see that it gets released in that code snippet above!
Thanks @corest for raising this.
In our production cluster running on k8s v1.10.11 and Calico v3.2.3, we managed to hit the limit for IP blocks.
spec:
  blockSize: 26
  cidr: 100.96.0.0/11
/11 gives us a total of 2097150 addresses
block size of /26 is 64 addresses
We use cluster autoscaling, so nodes are scaled down frequently, often multiple times a day. For the past few weeks we have been hitting https://github.com/projectcalico/libcalico-go/blob/release-v3.2/lib/ipam/ipam.go#L329
As a result, whenever a new node is spun up by the autoscaler, calico-node keeps retrying and goes into a crash loop.
Our /calico/ipam/v2/assignment/ipv4/block has 32671 blocks that were not used by current hosts.
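For a sense of scale, the pool arithmetic above can be checked with a few lines of Python. This is only a sketch: `block_capacity` is a name made up for illustration and is not part of Calico. Note that a /11 contains 2^21 = 2097152 addresses; the 2097150 quoted above presumably leaves out the network and broadcast addresses.

```python
import ipaddress


def block_capacity(pool_cidr: str, block_size: int) -> tuple:
    """Return (total addresses, addresses per block, number of blocks)."""
    pool = ipaddress.ip_network(pool_cidr)
    if block_size < pool.prefixlen:
        raise ValueError("a block cannot be larger than the pool it is cut from")
    # A /26 block in a 32-bit address space holds 2^(32-26) addresses.
    addresses_per_block = 2 ** (pool.max_prefixlen - block_size)
    # Each extra prefix bit between the pool and the block doubles the block count.
    block_count = 2 ** (block_size - pool.prefixlen)
    return pool.num_addresses, addresses_per_block, block_count


total, per_block, blocks = block_capacity("100.96.0.0/11", 26)
print(total, per_block, blocks)  # 2097152 64 32768

orphaned = 32671  # figure reported above
print(f"{orphaned / blocks:.1%} of the pool's blocks are orphaned")  # 99.7%
```

With 32768 blocks in the pool and 32671 of them orphaned, fewer than a hundred blocks remain for live nodes, which fits the allocation failures described here.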
Looking at one particular node that does not exist in the cluster today as an example to trace the IPAM activities, I see
02/07/2019 04:46:20.524 ipam.go 910: Releasing IPAM affinities for host host="ip-xx-xx-xx-xx"
02/07/2019 04:46:20.524 ipam.go 917: Querying IPAM host tree in data store host="ip-xx-xx-xx-xx"
02/07/2019 04:46:20.524 ipam.go 921: Failed to get IPAM host error=resource does not exist: IPAMHostKey(host=ip-xx-xx-xx-xx) with error: <nil> host="ip-xx-xx-xx-xx"
We never reach https://github.com/projectcalico/libcalico-go/blob/release-v3.2/lib/ipam/ipam.go#L926
Here is the content of an unreleased block (there is not even any information about the node):
https://gist.github.com/corest/5863287f36f59ac80a36f57aad42b62a
None of those handles are cleaned up from /calico/ipam/v2/handle/<handle-id>.
@corest thanks for that. It definitely looks like there are still addresses assigned within the block.
Have you checked to see if those addresses are actually in-use within the cluster?
It would be useful to know, for example, if any workload has some of these addresses (which appear to be allocated in the output you showed)
Just checked: there are no pods with IP addresses from those handles' subnets.
Ok, this definitely sounds like a bug. I'm not sure how we got into this state though. We might be able to add a backstop to kube-controllers to reconcile these away when it spots unused IP addresses, but I'd like to understand the root cause of this anyway.
I think it would be good to see the full kube-controllers log (probably set to debug level logging) and if possible, the full CNI plugin logs (from the container runtime) on a node that has been removed from the cluster.
kube-controllers is responsible for identifying when nodes are removed and cleaning this up, so that's the more important of the two.
We are having the same issue. We are using calico 3.3 on kubernetes 1.11.
Containers even get stuck in the creation phase because of these unreleased affinities.
In our case it happens because the IPAM host cannot be found when calico-kube-controllers tries to delete the node: https://github.com/projectcalico/libcalico-go/blob/4346117ce592eedcc83269c09fbc4a1e652d0b76/lib/ipam/ipam.go#L1081
After a while our etcd is full of this kind of data:
{"cidr":"100.100.10.0/26","affinity":null,"strictAffinity":false,"allocations":[0,null,null,null,null,0,0,null,null,0,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,0,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null],"unallocated":[18,20,19,24,21,12,26,23,28,27,30,31,34,32,33,4,36,39,62,35,37,38,41,42,40,44,47,45,46,49,48,51,50,52,55,53,54,2,56,58,57,60,43,22,63,1,3,61,7,11,13,10,14,16,15,17,59,8,29],"attributes":[{"handle_id":null,"secondary":null}]}
> We are using calico 3.3 on kubernetes 1.11.
We've made a number of improvements to Calico's IPAM since that release, including several bug fixes in this area. I recommend trying on a cluster using the latest Calico release to see if you are still affected.
That said, it does seem like there is at least one issue that still exists in v3.7+ - it's just that it might not be the same issue you are encountering on v3.3
I've spent a fair bit of time trying to break a Calico v3.8 cluster in the way you guys seem to be experiencing it above, unfortunately with no luck.
Do you happen to have the kube-controllers logs from a cluster that is in this state? And perhaps the logs from one of the nodes?
{"cidr":"100.100.10.0/26","affinity":null,"strictAffinity":false,"allocations":[0,null,null,null,null,0,0,null,null,0,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,0,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null],"unallocated":[18,20,19,24,21,12,26,23,28,27,30,31,34,32,33,4,36,39,62,35,37,38,41,42,40,44,47,45,46,49,48,51,50,52,55,53,54,2,56,58,57,60,43,22,63,1,3,61,7,11,13,10,14,16,15,17,59,8,29],"attributes":[{"handle_id":null,"secondary":null}]}
@cvs77 I think this block indicates it's a different issue than the one @corest is experiencing. You can see the allocations are all using the same index (0), which I think means other nodes are borrowing tunnel addresses from this block.
I suspect that you're encountering the issue that was fixed by this PR in Calico v3.9: https://github.com/projectcalico/libcalico-go/pull/1111
Upgrading to v3.9 will be part of the fix, but you'll also need to delete those blocks which aren't being used so they can become available again. You will need to make sure you either only delete blocks which don't have active IP allocations in them, or restart the pods / nodes which have IPs allocated from the blocks after deleting them.
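Before deleting anything, it may help to check which addresses a block still marks as allocated. The sketch below assumes the block document has the shape pasted earlier in this thread: position i in "allocations" stands for the i-th address of "cidr", null means free, and a number is an index into "attributes". The helper names `allocated_ips` and `safe_to_delete` are hypothetical, and the sample is a rebuilt stand-in for that block, not data read from etcd.

```python
import ipaddress
import json


def allocated_ips(block_json: str) -> list:
    """Return the addresses that a block document still marks as allocated."""
    block = json.loads(block_json)
    network = ipaddress.ip_network(block["cidr"])
    # Assumption: a non-null entry at position i means the i-th address is taken.
    return [
        str(network[ordinal])
        for ordinal, attr_index in enumerate(block["allocations"])
        if attr_index is not None
    ]


def safe_to_delete(block_json: str) -> bool:
    """Treat a block as a deletion candidate only when nothing is allocated."""
    return not allocated_ips(block_json)


# Stand-in for the block shown earlier: ordinals 0, 5, 6, 9 and 25 are allocated.
allocations = [None] * 64
for ordinal in (0, 5, 6, 9, 25):
    allocations[ordinal] = 0
sample = json.dumps({
    "cidr": "100.100.10.0/26",
    "affinity": None,
    "allocations": allocations,
    "attributes": [{"handle_id": None, "secondary": None}],
})

print(allocated_ips(sample))
print(safe_to_delete(sample))
```

A block that reports any allocated address would still need the cross-check described above: confirm that no pod or tunnel uses those addresses, or plan to restart whatever does.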
> network: failed to request 1 IPv4 addresses. IPAM allocated only 0
For completeness, it would also be useful if you could download the latest calicoctl tool and run calicoctl ipam show --show-blocks
I don't think I am experiencing the exact issue that is here, but the solution provided in original description saved our bacon in our production cluster so I wanted to note that thanks here as well as the details for any future person who may benefit from it as we have.
We are on k8s 1.10 and Calico 2.6.7, however we upgraded to Calico 2.6.12 and still saw this issue, although we are still using 1.03 of the kube-controller which may be where the bug still is.
Our cluster had been running fine for more than a year, and then all of a sudden, when new nodes came up, the Calico pod and kube-proxy pod would start, but any pod that required IP assignment would fail after some time with the error:
network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized]Failed create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Effectively the node would come up and claim it was healthy, but no pods could actually run on it. When this first started appearing, the node would "fix" itself after about 20 minutes, so it was just a nuisance that slowed down our scaling, but it didn't kill us. As time went on, that delay slowly grew from 20 minutes to hours, to the point where only 20% of our nodes would ever come up. In retrospect it appears that some IP blocks were getting released while others weren't, so we were slowly choking ourselves to the point where we wouldn't be able to bring up anything.
Luckily after many nights of digging into this we were able to piece it together and through this bug report and the remedy provided by @corest we were able to determine that clearing the allocated blocks that were not actually allocated to a node immediately fixed the problem. Once we did that everything started working again and we are in a good state now. It does appear this is likely fixed in newer versions, but for anyone still stuck on older versions like we are this may be something you run into as well.
Another item of note, as we got in a worse state in addition to the pods not coming up with that error we also saw our Master nodes getting hit incredibly hard with high CPU and constant network and IO traffic. Before we knew what was going on we thought our etcd or Master nodes were the problem and scaled those up but to no avail. Of course now that we know what the issue is it ended up being that Calico was pounding the heck out of etcd trying to find an available block to hand out. Once we cleared the blocks the CPU and load on etcd went back to normal and we were able to scale back down on our Master nodes.
We also make heavy use of spot nodes, which cycle quite a bit, as well as cron jobs, so it is possible that the way we are shutting down isn't giving Calico time to do what it needs to in some cases.
I know this likely doesn't help in triaging the current issue since we are so far behind on our version, but thanks community! If by some way this does help, please let me know if there is any more info I can provide to help diagnose the issue.
Wanted to check up on this one - has anyone had any breakthroughs or been able to gather any additional information?
I've been looking at this a bit more recently. We have this existing controller function that handles cleaning up orphaned IPAM blocks (and any addresses within) when using the Kubernetes API / CRD backed data store: https://github.com/projectcalico/kube-controllers/blob/master/pkg/controllers/node/kdd.go#L36
However, it's not run when in etcd mode. I think this is because we had an existing solution for etcd mode. However, the main difference between these two is that the etcd implementation looks at workload endpoints in the data store to determine which addresses need to be cleaned up, whereas the kubernetes variant looks at addresses assigned within blocks and compares them to pods that exist in the k8s API. (When we implemented the k8s version, we also tagged IP allocations with new metadata that previously didn't exist and so wouldn't have been possible for etcd).
This means that in etcd mode, it's possible a workload endpoint could be deleted without the IP allocation being freed, and then we'd effectively leak that address.
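The difference described above can be illustrated with a toy model. This is not the kube-controllers code: `Allocation`, its fields and `leaked_allocations` are hypothetical names, and the sketch simply assumes each allocation carries the namespace and pod name it was made for. Starting from the allocations rather than from workload endpoints means an address whose endpoint has already vanished is still noticed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Allocation:
    ip: str
    namespace: str
    pod: str


def leaked_allocations(allocations, live_pods):
    """Return allocations whose (namespace, pod) no longer exists in the API."""
    return [a for a in allocations if (a.namespace, a.pod) not in live_pods]


allocations = [
    Allocation("10.0.0.1", "default", "web-1"),
    Allocation("10.0.0.2", "default", "web-2"),  # pod and endpoint already gone
]
live_pods = {("default", "web-1")}

for leak in leaked_allocations(allocations, live_pods):
    print(f"{leak.ip} is still allocated to missing pod {leak.namespace}/{leak.pod}")
```

An endpoint-driven cleanup would never visit 10.0.0.2 here, because there is no longer an endpoint that points at it.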
Doing a code read, I'm not sure there's any reason we _can't_ now enable this IPAM cleanup logic for etcd mode as well, which I believe will help catch issues like this one. Unless we also make other changes, it will only do it when a node is removed from the cluster, but we could also consider adding a periodic sync.
The thing to note about the existing implementation is that it will only clean up an IPAM block when:
That means it won't, without modification, catch cases where a single IP has been leaked on a node that is still around. I think this is probably OK, because:
So, my proposal would be to start by enabling the existing cleanup logic on etcd, and see if that resolves the issues listed here.
I'm going to close this since we've fixed a number of IP address leaking issues in the last couple of releases.
I think if users still encounter issues with the latest release we should open new issues for further investigation. Thanks all.