Amazon-vpc-cni-k8s: VPC CIDR is cached forever causing pod routing issues when secondary VPC CIDR ranges are created afterwards

Created on 2 Dec 2019 · 10 Comments · Source: aws/amazon-vpc-cni-k8s


Problem statement:
As of today, VPC CIDR ranges are cached during initialization. If new CIDR ranges are added afterwards to address an IP space issue, the CNI has to be restarted so that it fetches the new CIDR ranges, updates the cache, and adds the ip rules/routes needed to reach other pods in the cluster that sit in the new subnet IP range.

Solution:
Refresh VPC CIDR ranges cache every 2 seconds to avoid staleness.

Steps to replicate the issue:
1) Create an EKS cluster in a VPC which has just one CIDR range (10.10.0.0/16)
2) Create worker nodes in the above VPC CIDR range
3) Add a secondary CIDR range (100.10.0.0/16) to the existing VPC
4) Launch new worker nodes in the subnet which has the 100.10.0.0/24 CIDR
5) Pod 1, with IP 100.10.12.13 from the secondary VPC CIDR subnet, cannot talk to the coreDNS pod with IP 10.10.23.45, which is part of the primary VPC CIDR subnet.

enhancement priority/P1

All 10 comments

Every two seconds may be a bit excessive. Certainly we should refresh the VPC CIDR ranges periodically, but I think 15 or 30 seconds might be a better interval. Alternately, is there a way we can be notified of VPC subnets being created, deleted or modified instead of periodically refreshing our view?

> Every two seconds may be a bit excessive. Certainly we should refresh the VPC CIDR ranges periodically, but I think 15 or 30 seconds might be a better interval.

Agreed! I will go with 15 seconds then.

> Alternately, is there a way we can be notified of VPC subnets being created, deleted or modified instead of periodically refreshing our view?

I did some research around adding a watch/poll to get notifications about VPC changes without making AWS API calls, but did not find a way.

Until the fix rolls out in one of the next couple of versions, I would like to post a couple of suggestions/workarounds for users who hit the cache issue after adding a secondary VPC CIDR:

  • Restarting the CNI plugin should remedy this issue. This can be done by adding a label to the CNI plugin or updating its ENV, either of which triggers a restart.
  • For production use, it is better to launch new worker nodes, so that the aws-node Pods started on them are aware of the secondary VPC CIDR, and then replace the old nodes, so that the existing environment is not impacted.

Just to add a note: simply refreshing the cache every 2 seconds or every 15 seconds won't solve this problem.
It will only fix new pods, not old pods.

@M00nF1sh, I was planning to add a ticker function that updates the cache every x seconds and, if there are any updates such as additional CIDRs, re-configures the rules/routes, which should fix both old and new pods.

We have also been facing this problem recently and have temporarily solved it using https://github.com/giantswarm/aws-cni-restarter. We can help test this once PR #903 has been merged.

@paurosello #903 has been merged, along with some follow-up PRs, so in order to test these changes you would have to use a build of the latest master branch. Be sure to use the configs in /config/master, since the configs have changed quite a bit, including the addition of an init container. Not sure yet when we will have a v1.7.0-rc1 ready for testing.

Has this been fixed with 1.7.0?

Ah yes, I see it in the release notes. https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.7.0

Then I recommend closing this issue.

Yes @marians, resolving.
