Amazon-vpc-cni-k8s: curl does not pass through when the source node and target node are the same

Created on 12 Feb 2020 · 5 Comments · Source: aws/amazon-vpc-cni-k8s

Steps to reproduce the issue:

Create an EKS cluster (Killer-eks in our case) with secondary IP support.
Create an Auto Scaling group with one node and attach it to the cluster.
Create a Network Load Balancer (NLB), listeners, and target groups.
Create a hosted zone in Route 53. Add a record set of type CNAME mapped to the load balancer.
Deploy the nginx ingress controller in the cluster as a NodePort service.
All of the above are connected as NLB -> listeners -> target groups -> Auto Scaling group nodes.
Deploy a simple nginx deployment (my-nginx) in the cluster, and map it to the nginx service (my-nginx) and to the ingress (my-nginx-ingress).
Deploy another simple nginx pod (my-nginx1) in the cluster.
Enter one of the pods, say my-nginx1.
curl the ingress endpoint. The curl fails.
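
For anyone who wants to tell a timeout apart from a reset or refusal in the last step, here is a minimal sketch of a TCP probe in Python. It is not part of the original report: the function name `probe` and its return strings are made up for illustration, and it assumes a Python interpreter is available in the pod or in a debug container.

```python
import socket


def probe(host: str, port: int, timeout: float = 5.0) -> str:
    """Try a plain TCP connect and report how it ended.

    Hypothetical helper for illustration only; it checks the TCP
    handshake and does not send an HTTP request like curl would.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except socket.timeout:
        # No SYN-ACK was accepted within the timeout.
        return "timeout"
    except ConnectionRefusedError:
        # The peer answered with a reset (nothing listening, or rejected).
        return "refused"
    except OSError as exc:
        # DNS failures, unreachable networks, and similar errors.
        return f"error: {exc}"
```

Calling `probe("<ingress-hostname>", 80)` from inside my-nginx1, and then from a machine outside the cluster, should show whether the failure only appears when the request comes from a pod on a target node.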

My environment is set up slightly differently; the test cases are demonstrated in detail in the attached file:
eks-nlb-issue-aws-support.txt

documentation

ravicm

Most helpful comment

@ravicm @jonathan-mothership

This is expected behavior with an NLB when the source and target nodes are the same. Internal NLBs do not support loopback or hairpinning. Because an NLB in instance mode preserves the source IP of the packet, the response (say, the SYN-ACK) goes directly to the local client pod. The SYN has the client pod IP as its source IP (SIP) and the NLB IP as its destination IP (DIP). When both pods are on the same node, however, the SYN-ACK has the server pod IP (not the NLB IP) as its SIP and the client pod IP as its DIP. The reply does not match the connection the client opened, so the client pod terminates/resets the TCP session.
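
To make the address mismatch concrete, here is a minimal sketch in Python. It is a simplified model, not the actual kernel or CNI logic: the IP addresses are hypothetical, ports are left out, and `client_accepts` only captures the idea that a client accepts a SYN-ACK whose addresses mirror those of its SYN.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    src_ip: str
    dst_ip: str
    flags: str


def client_accepts(syn: Packet, reply: Packet) -> bool:
    """Simplified check: the reply must come from the address the SYN was sent to."""
    return reply.src_ip == syn.dst_ip and reply.dst_ip == syn.src_ip


# Hypothetical addresses, chosen only for illustration.
CLIENT_POD_IP = "10.0.1.11"
SERVER_POD_IP = "10.0.1.12"
NLB_IP = "10.0.0.50"

# The client pod sends its SYN to the NLB.
syn = Packet(CLIENT_POD_IP, NLB_IP, "SYN")

# Same node: the server pod answers the preserved client IP directly,
# so the reply carries the server pod IP instead of the NLB IP.
same_node_reply = Packet(SERVER_POD_IP, CLIENT_POD_IP, "SYN-ACK")

# What the client would need to see for the handshake to complete.
expected_reply = Packet(NLB_IP, CLIENT_POD_IP, "SYN-ACK")

print(client_accepts(syn, same_node_reply))  # False
print(client_accepts(syn, expected_reply))   # True
```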

Refer to the “Connections time out for requests from a target to its load balancer” section in the guide below. It also lists possible options. An NLB in IP mode should help in this scenario.

https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-troubleshooting.html

achevuru on 2 Sep 2020

All 5 comments

We are facing the same issue. Did you find a solution?

paurosello on 15 Mar 2020

I'm facing the same issue!
CNI v1.6.0
We have a different ingress controller (Traefik) and are seeing the same issue. In a 3-node ASG, when a pod on node A makes a request to an NLB and that request is served by the same backend node that runs the requesting pod, the connection times out. We have two listeners, and this happens on both the TLS and TCP listeners.

My suspicion is that this is because the NLB preserves the source IP of the request when connecting to the backend and the CNI iptables rules are getting tripped up with that connection.

jonathan-mothership on 9 Apr 2020

I am seeing this issue too.

But when I call from outside the cluster nodes, it works fine.
I guess that from inside the same EKS cluster I would use the Kubernetes Service rather than the NLB, so I am fine for now, but I am interested to see how this is happening.

bhaveshph on 13 Jul 2020

Hi,

Since this is expected behavior, I will be closing this issue for now.

Thank you!