Calico: Calico in eBPF mode doesn't forward traffic from AWS NLBs properly

Created on 6 Nov 2020 · 18 Comments · Source: projectcalico/calico

Motivating Scenario

After creating a Kubernetes _Service_ of type "LoadBalancer" annotated with "service.beta.kubernetes.io/aws-load-balancer-type" set to "nlb", the Kubernetes AWS cloud provider creates an EC2 Network Load Balancer. My _Service_ has an external traffic policy of "Cluster" in a Kubernetes cluster with three worker machines spanning three subnets in three availability zones, so in this simple case the NLB winds up with a listener per _Service_ port, each forwarding to a target group with three targets (arriving at the corresponding node port). Since our external traffic policy is "Cluster," all these worker machines are eligible targets, regardless of whether a pod selected by the _Service_ is running there.
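A _Service_ like the one described could be sketched as the manifest below. It is built as a Python dictionary and serialized to JSON, which `kubectl apply -f` accepts. Only the annotation, the type, and the external traffic policy come from the scenario above; the name `echo`, the selector label, and the port numbers are hypothetical placeholders.

```python
import json

# Hypothetical name, selector, and ports. Only the annotation, the type,
# and the external traffic policy are taken from the scenario above.
service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {
        "name": "echo",
        "annotations": {
            "service.beta.kubernetes.io/aws-load-balancer-type": "nlb",
        },
    },
    "spec": {
        "type": "LoadBalancer",
        "externalTrafficPolicy": "Cluster",
        "selector": {"app": "echo"},
        "ports": [
            {"name": "http", "protocol": "TCP", "port": 80, "targetPort": 8080},
        ],
    },
}

# Serialize so the result can be saved to a file and applied.
manifest = json.dumps(service, indent=2)
print(manifest)
```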

I have one pod running that's selected by this _Service_. Calico is in eBPF mode, with BIRD as the backend (for now, per _kops_ issue kubernetes/kops#10168), and my lone IP pool is using VXLAN in cross-subnet mode.

The network ACLs in play are all wide open, allowing everything, and the security group rules protecting the worker machines are also now wide open, allowing all traffic on all protocols from everywhere, for the sake of discerning what could be responsible for the blocking here.

Expected Behavior

Connections established with and traffic sent through the NLB should arrive at the target pod, whether the NLB is allowing cross-zone load balancing or not. Even with cross-zone load balancing disabled, each NLB listener should contact worker machine targets in its same availability zone on the node port, and that machine should forward the packets if necessary to a different worker machine that's hosting a target pod.

Current Behavior

My pod only receives this traffic in two cases:

  1. With cross-zone load balancing disabled, requests landing at the NLB listener sitting in the same zone as the target pod's hosting machine reach the target pod.
    That is, contacting a listener in a different zone fails, even though the AWS API reports the target group as having three healthy targets—again, one per zone—and includes all three listeners' IP addresses in the NLB's DNS A record.
  2. With cross-zone load balancing enabled, approximately 1/3 of the requests landing at any of the NLB listeners reach the target pod.
    Since each listener can contact any target, there's an approximately one in three chance of the listener choosing the machine hosting the target pod.

Presented differently, consider machines _M1_, _M2_, and _M3_ in availability zones _Z1_, _Z2_, and _Z3_, with one pod running on machine _M2_ in zone _Z2_ (denoted by 🎯 below).

Machine | Zone | Success with CZB Enabled | Success with CZB Disabled
--- | --- | --- | ---
_M1_ | _Z1_ | ~33% | :no_entry_sign: 0%
_M2_ 🎯 | _Z2_ | ~33% | ✅ 100%
_M3_ | _Z3_ | ~33% | :no_entry_sign: 0%
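The percentages in the table match a simple model, sketched below under two assumptions that are not confirmed by AWS documentation here: the NLB picks uniformly among its eligible targets, and a connection succeeds only when the chosen node hosts a target pod (the failure mode observed). The function and variable names are made up for illustration.

```python
from fractions import Fraction


def success_rate(listener_zone, node_zones, pod_nodes, cross_zone):
    """Fraction of connections through one listener that reach a pod.

    Assumes a connection succeeds only when the NLB picks a node hosting
    a target pod, and that the NLB picks uniformly among eligible targets.
    """
    eligible = [
        node
        for node, zone in node_zones.items()
        if cross_zone or zone == listener_zone
    ]
    if not eligible:
        return Fraction(0)
    hits = sum(1 for node in eligible if node in pod_nodes)
    return Fraction(hits, len(eligible))


node_zones = {"M1": "Z1", "M2": "Z2", "M3": "Z3"}
pod_nodes = {"M2"}  # the single pod runs on M2 in Z2

for zone in ("Z1", "Z2", "Z3"):
    enabled = success_rate(zone, node_zones, pod_nodes, cross_zone=True)
    disabled = success_rate(zone, node_zones, pod_nodes, cross_zone=False)
    print(zone, enabled, disabled)
```

If node port forwarding between machines worked as expected, every cell in the table would instead read 100%.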

I visited each of these three machines and ran _tcpdump_, isolating different network interfaces, looking to see which telltale traffic arrives. The successful requests' HTTP content appeared on one of the _veth_-prefixed interfaces, originating from the original client IP address on a random port and destined for the pod IP address on the container port.

Past that, I tried (and I mean _really_ tried) to find evidence of this traffic arriving and getting blocked on other interfaces, but to no avail. With cross-zone balancing disabled, I figured that a given NLB listener could only send traffic to one worker machine. Running _tcpdump_ on one of the two worker machines not hosting the target pod, I expected to see traffic arriving and either getting blocked before leaving that machine, or getting forwarded to the machine hosting the pod. There were lots of packets arriving from the NLB listeners (probably the health checks), all of small size with no payload, but I couldn't see any of the HTTP requests arriving.

The VPC flow logs indicate no rejected traffic involving the service port, node port, or container port. I see most—if not all—of my connection attempts present in the flow logs with an "ACCEPT" outcome. If some are not being accepted, though, they're not present in the logs with a "REJECT" outcome.
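A check like this one can be scripted. The sketch below tallies ACCEPT and REJECT outcomes for records that touch a set of ports, assuming the flow logs use the default (version 2) record format. The sample records, the account and interface IDs, and the port numbers are made up for illustration.

```python
from collections import Counter

# Field order of the default (version 2) VPC flow log record format.
FIELDS = (
    "version account_id interface_id srcaddr dstaddr srcport dstport "
    "protocol packets bytes start end action log_status"
).split()


def tally_actions(lines, ports):
    """Count ACCEPT/REJECT outcomes for records touching any of `ports`."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != len(FIELDS):
            continue  # skip headers, custom formats, or truncated lines
        record = dict(zip(FIELDS, parts))
        if record["log_status"] != "OK":
            continue  # NODATA and SKIPDATA records carry "-" placeholders
        if int(record["srcport"]) in ports or int(record["dstport"]) in ports:
            counts[record["action"]] += 1
    return counts


# Made-up sample records; 31380 stands in for a hypothetical node port.
sample = [
    "2 123456789012 eni-0a1b2c3d 203.0.113.7 10.0.1.15 49152 31380 6 3 180 1604620800 1604620860 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d 203.0.113.8 10.0.1.15 49153 22 6 1 60 1604620800 1604620860 REJECT OK",
    "2 123456789012 eni-0a1b2c3d - - - - - - - 1604620800 1604620860 - NODATA",
]
print(tally_actions(sample, ports={80, 8080, 31380}))
```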

Variation and Comparison

If I switch from eBPF to using _kube-proxy_ with either _iptables_ or _ipvs_, with no other changes, all requests succeed.
If I switch from an NLB to a Classic Load Balancer, with no other changes, all requests succeed.
If I switch from cross-subnet encapsulation to always use encapsulation, the outcome doesn't change.
If I switch from VXLAN to IP-in-IP encapsulation, the outcome doesn't change. (I _think_. I may need to test that again.)

Steps to Reproduce

  1. Set up Calico using eBPF mode in a Kubernetes cluster with machines spread across multiple subnets, and probably across multiple availability zones.
    Unfortunately, it is difficult for me to add more subnets in a single zone to see whether it's the zonal or subnet-level separation that makes the difference here.
  2. Create a Kubernetes _Service_ of type "LoadBalancer," annotated to summon an NLB.
    I'm not certain that there's anything special about Kubernetes here; this same problem would probably apply to a hand-crafted NLB, and probably outside of a Kubernetes cluster.
  3. Look up the IP addresses of the NLB's listeners using its DNS A record.
  4. Create a pod with a server, such as NGINX running an "echo" handler.
    Note which node hosts the pod, and in which availability zone that node's machine sits.
  5. Issue requests against each of the NLB listeners with cross-zone load balancing disabled.
    Observe that requests against only one of the listeners succeed.
  6. Issue requests against each of the NLB listeners with cross-zone load balancing enabled.
    Observe that all listeners behave similarly, only succeeding approximately _1 / listener count_ of the time.
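Steps 3, 5, and 6 can be sketched with the Python standard library alone. The helper names below are hypothetical, and the sketch only checks whether the TCP connection is established, since the failing requests time out during connection setup rather than at the HTTP layer.

```python
import socket


def resolve_listeners(hostname):
    """Return the distinct IPv4 addresses behind the NLB's DNS name."""
    infos = socket.getaddrinfo(hostname, None, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def probe(address, port, timeout=3.0):
    """Return True if a TCP connection to (address, port) is established."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:  # covers timeouts, refusals, and unreachable hosts
        return False


def success_ratio(address, port, attempts=30, timeout=3.0):
    """Fraction of connection attempts to one listener that succeed."""
    successes = sum(probe(address, port, timeout) for _ in range(attempts))
    return successes / attempts
```

For example, `for address in resolve_listeners(name): print(address, success_ratio(address, 80))`, where `name` is the NLB's DNS name and 80 is an assumed service port, prints one ratio per listener. With the behavior described here, one would expect roughly 1.0, 0.0, 0.0 with cross-zone load balancing disabled and about 0.33 for each listener with it enabled.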

Context

We would like to have the option of using NLBs for our Kubernetes _Services_ together with Calico in eBPF mode, partly to take advantage of being able to preserve client IP addresses with external traffic policy "Cluster" without using the PROXY protocol.

Your Environment

  • Cloud Provider: AWS EC2 ("us-east-2" region, VPC networking)
  • Calico version: 3.16.4 (as deployed by _kops_ 1.19)
  • Orchestrator version: Kubernetes 1.19.3
  • Operating System and version: Flatcar Container Linux

    /etc/os-release

        NAME="Flatcar Container Linux by Kinvolk"
        ID=flatcar
        ID_LIKE=coreos
        VERSION=2643.1.0
        VERSION_ID=2643.1.0
        BUILD_ID=2020-10-13-1801
        PRETTY_NAME="Flatcar Container Linux by Kinvolk 2643.1.0 (Oklo)"
        ANSI_COLOR="38;5;75"
        HOME_URL="https://flatcar-linux.org/"
        BUG_REPORT_URL="https://issues.flatcar-linux.org"

impact/high kind/bug likelihood/high

All 18 comments

@hakman, @rifelpet, and @johngmyers, I thought you might be interested in this issue at the intersection of _kops_, Calico, and NLBs.

Where is the traffic originating from, within the cluster or outside the cluster?

Is the NLB internal or internet-facing?

What error message does the client report when requests fail?

I know you mention wanting to preserve client IP addresses but I would be curious to see if this also happens with targets registered by IP address rather than instance. The built-in AWS cloud controller manager doesn't support this though. The aws-load-balancer-controller supports registering _pod IPs_ as targets but I don't believe it supports registering _node IPs_, so if your pod IPs aren't routable within the VPC then this may not be an option. I could see that being a reasonable feature request though.

I'm wondering if you're experiencing this NLB limitation.

> Where is the traffic originating from, within the cluster or outside the cluster?

From outside the cluster. I tried these experiments both from outside the VPC and from inside on a bastion machine, though, with equivalent results.

> Is the NLB internal or internet-facing?

It's internet-facing (public).

> What error message does the client report when requests fail?

It times out trying to establish the TCP connection.

> I know you mention wanting to preserve client IP addresses but I would be curious to see if this also happens with targets registered by IP address rather than instance. The built-in AWS cloud controller manager doesn't support this though.

Right, these targets are registered by instance ID, which makes them ineligible for hairpin connections, as you noted below.

> I'm wondering if you're experiencing this NLB limitation.

I don't think so. Considering a client outside of AWS's network, the connection attempt lands at the NLB listener, which should in turn connect to any of my Kubernetes worker machines. Each of those might in turn establish a connection to a sibling machine, but none of them should go back out to the NLB as a client for any reason.

This issue is quite specific to AWS: it occurs when the traffic from the backing pods ends up being routed through a different ENI than the ENI the incoming traffic arrives on. We are aware of the issue and we have a fix in progress, https://github.com/projectcalico/felix/pull/2549, that should land soon.

Thank you, Tomas. I read that issue, but was left uncertain as to whether it pertains to EC2 instances with only one ENI attached. Are the multiple interfaces mentioned there counting the VXLAN interfaces?

The behavior I've observed could also be induced if we were treating our external traffic policy incorrectly, where the health check was acting as if the policy was "Cluster," but the proxying was acting as if the policy was "Local." Is it possible to pass through this portion of (*Syncer).applyDerived for a _Service_ of type "LoadBalancer?" Should that block also apply for the "LoadBalancer" case?

Also, Tomas, am I understanding your description correctly that when you say "traffic from the backing pods," that suggests that the target pod received packets and is now responding? Since I don't see evidence of the HTTP server acting on these requests, are you referring perhaps to packets that arrive ahead of the HTTP request, such that the connection negotiation never completes?

Is there a test I could run with _tcpdump_ to confirm that this is what's happening?

> Also, Tomas, am I understanding your description correctly that when you say "traffic from the backing pods," that suggests that the target pod received packets and is now responding? Since I don't see evidence of the HTTP server acting on these requests, are you referring perhaps to packets that arrive ahead of the HTTP request, such that the connection negotiation never completes?

Yes, the connection would not be set up. If this issue is the same one we are trying to fix, not even the TCP SYN-ACK would return to the client.

> Is there a test I could run with _tcpdump_ to confirm that this is what's happening?

You can tcpdump within the backing pod and see whether you get the TCP SYN, and you should see the SYN-ACK response.

You can tcpdump on the node that hosts the backing pod, and you should see a VXLAN packet (port 4789, VNI 0xca11c0) leaving the pod that carries the SYN-ACK, but likely being routed to a different ENI than eth0.
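To recognize such packets in a capture, the VNI can be read from the first 8 bytes of the UDP payload. The following is a minimal sketch of the RFC 7348 header layout; the function and constant names are hypothetical, and the port and VNI values are the ones mentioned above.

```python
import struct

CALICO_VNI = 0xCA11C0  # VNI mentioned above
VXLAN_PORT = 4789      # UDP destination port mentioned above


def parse_vxlan_header(udp_payload):
    """Return the VNI from an RFC 7348 VXLAN header.

    Returns None if the payload is shorter than the 8-byte header or the
    "VNI present" flag (0x08) is not set.
    """
    if len(udp_payload) < 8:
        return None
    # Layout: 1 byte of flags, 3 reserved bytes, 3 bytes of VNI, 1 reserved byte.
    flags, vni_and_reserved = struct.unpack("!B3xI", udp_payload[:8])
    if not flags & 0x08:
        return None
    return vni_and_reserved >> 8


def carries_calico_vni(udp_payload):
    """True if the UDP payload starts with a VXLAN header for CALICO_VNI."""
    return parse_vxlan_header(udp_payload) == CALICO_VNI


# Build a header by hand to exercise the parser.
header = struct.pack("!B3xI", 0x08, CALICO_VNI << 8)
print(hex(parse_vxlan_header(header)))  # 0xca11c0
```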

Just to be sure, are you using aws cni?

> Just to be sure, are you using aws cni?

No, we're not. We're using Calico as the CNI, for both IPAM and policy. Does that change your assumption about what's wrong here?

By the way, _0xca11c0_ is a clever choice.

> > Just to be sure, are you using aws cni?
>
> No, we're not. We're using Calico as the CNI, for both IPAM and policy. Does that change your assumption about what's wrong here?

It does, it is still likely related, but I will need to look more closely into this. I will let you know once I do.

I did confirm that changing my _Service_ using an external traffic policy of "Local" alleviates this problem.

This is still consistent with what I described :arrow_double_up:, as that does not forward packets between nodes and thus avoids the issue that we are trying to tackle.

An embarrassing but still puzzling update: Calico 3.16.4's _calico-node_ image does work fine—after I've replaced all of the machines in my cluster, even when I've told _calico-node_ to clean up any _iptables_ rules.

In my upgrade scenario, I switch _calico-node_ into eBPF mode, wait for all the replacement pods to become healthy, then stop _kube-proxy_ on all the nodes, then reconfigure _calico-node_ to clean up the _iptables_ rules, and wait for the replacement pods to become healthy again. At this point, I suffer the behavior described above.

Once I replace the worker nodes (or, rather, all the machines registered as targets for the _Service_ load balancer), though, the node port routing works as expected. Perhaps the _iptables_ cleanup procedure is leaving behind something that precludes this routing from working as intended.

Few ideas:

  • As you said, incorrect cleanup.
  • Problem with health ports when both kube-proxy and calico are running. We'll both try to open the health ports. Perhaps we fail silently without retrying to open the port and then, once kube-proxy is removed, there's no-one listening for health ports?
  • Some subtle BPF routing issue.

If you can get the cluster into the bad state again, you can try tcpdump as Tomas described to see if the request or response traffic is lost. Then we can binary chop from there to find out if it's lost on the ingress node or the backing node, then if it's dropped by a BPF program or iptables, etc.

I'm going to try removing those rules using the _iptables_ tool to see whether it restores proper routing behavior to this machine.

Well, either that wasn't it, or it was only part of what's wrong.

Going machine by machine, I tried

  • removing each of those extra _iptables_ rules,
  • restarting the _calico-node_ container, and
  • rebooting the machine.

After each of these steps, I tried my HTTP client test. None of those steps changed the behavior.

Feeling despondent, I then destroyed and replaced each of the three worker machines. After that, my test succeeds.

I don't understand yet what could have persisted through the rebooting of the machine, preventing success, but then behaves differently with a new machine.

> You can tcpdump within the backing pod and see whether you get the TCP SYN and you should see the SYN-ACK response

Within the serving pod, I see the following pattern play out repeatedly when issuing HTTP requests through a node port on a different machine:

  • source address → pod address
    TCP flag: _SYN_
  • pod address → source address
    TCP flag: _SYN-ACK_
  • source address → pod address
    TCP flag: _RESET_

None of the HTTP requests make it across. That suggests that the other machine receiving the request on its node port _is_ trying to connect to the pod, but it looks like the other machine then closes the connection instead of sending the _ACK_ packet.
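The pattern above can be told apart from a completed handshake mechanically. The sketch below classifies a sequence of segments as seen from the serving pod; the function name, the tuple representation, and the result strings are all hypothetical choices for illustration, not output from any tool.

```python
def classify_handshake(segments):
    """Classify an attempted TCP handshake seen from the server side.

    `segments` is a list of (direction, flags) tuples in capture order,
    where direction is "in" or "out" relative to the pod and flags is a
    set such as {"SYN"} or {"SYN", "ACK"}.
    """
    saw_syn = saw_syn_ack = False
    for direction, flags in segments:
        if direction == "in" and flags == {"SYN"}:
            saw_syn = True
        elif direction == "out" and flags == {"SYN", "ACK"} and saw_syn:
            saw_syn_ack = True
        elif direction == "in" and "RST" in flags and saw_syn_ack:
            return "reset after SYN-ACK"
        elif direction == "in" and flags == {"ACK"} and saw_syn_ack:
            return "established"
    if saw_syn_ack:
        return "SYN-ACK unanswered"
    return "SYN seen only" if saw_syn else "no handshake"


# The pattern observed within the serving pod.
observed = [("in", {"SYN"}), ("out", {"SYN", "ACK"}), ("in", {"RST"})]
print(classify_handshake(observed))
```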

If I run _watch -n 1 netstat -a_ in the serving pod, I can see lots of connections appearing and disappearing frequently, all in state _LAST_ACK_ or _SYN_RECV_.

> You can tcpdump on the node that hosts the backing pod and you should see a VXLAN packet port 4789 VNI 0xca11c0 leaving the pod that carries the SYN-ACK but likely being routed to a different eni than eth0.

I'll try this next.

> I'll try this next.

I can't figure out how to see _any_ of this traffic occurring on the host machine. I started out probing narrowly, and kept backing out until I'm looking at all traffic on all interfaces, and I don't see anything that corresponds to these requests.

I performed another experiment tonight, where after migrating from Flannel to Calico, I replaced all the machines in the cluster, and _then_ transitioned Calico to use eBPF mode. These new machines started out running _kube-proxy_; I stopped it on all the machines, then restarted _calico-node_ in eBPF mode, telling it to clean up _kube-proxy_'s leftover _iptables_ rules.

Once _calico-node_ had restarted again, I ran my experiment with my _Service_ of type "LoadBalancer" using an NLB, with an external traffic policy of "Cluster." I was unable to reproduce the problem reported here. This suggests that there's some artifact left over from the Flannel-to-Calico migration that prevents eBPF mode from working properly for this scenario. I did try using the _ip link delete_ command to delete the network interfaces that were present after migration, but not present on new machines, to see if those interfaces were the problem. Deleting them didn't change the outcome, though. I am left to wonder what else is left over from migration that could interfere here.

Since I've interpreted my experiments incorrectly before, I'm not trusting this first test yet. I'll report back tomorrow after a second test.

My testing confirms that this problem is fixed by projectcalico/felix#2589 (and projectcalico/felix#2588).
Thank you, @tomastigera and @fasaxc!
