NGINX Ingress controller version: 0.34.1
Kubernetes version (use kubectl version): v1.16.13-gke.401
Environment:
service.type: LoadBalancer
service.externalTrafficPolicy: Local
lifecycle.preStop.exec.command: ["sh", "-c", "sleep 60 && /wait-shutdown"]
kind: Deployment
What happened:
Incoming connections to the HTTP/HTTPS ports from the Load Balancer start timing out immediately upon the start of the termination process if the pod is the last copy on a Node. As a result, downscaling via the HPA periodically causes almost 30 seconds of service disruption, because the Load Balancer continues to send traffic to the Node until failing health checks remove it as a backend.
What you expected to happen:
Regardless of whether externalTrafficPolicy is Local or Cluster, the preStop hook should be honored so that the Load Balancer has time to remove the empty Node.
This appears to be a result of the same issue causing https://github.com/kubernetes/kubernetes/issues/85643. Inside the Node, the HTTP(S) NodePort continues to work correctly during the termination process until the app actually stops, at which point the NGINX pod has already been safely removed as an Endpoint. However, the moment the termination process begins, that port is closed to external traffic, meaning there is no grace period for the Load Balancer to remove the Node from its backend pool, and any traffic sent to the Node is silently lost.
How to reproduce it:
1) Deploy ingress-nginx in GKE using externalTrafficPolicy: Local and a delaying preStop hook (a configuration sketch follows this list).
2) Run no more than one NGINX pod per Node.
3) Send traffic directly to the HTTP NodePort on each Node and observe that it reaches the Default Backend.
4) Remove one pod from the Deployment.
5) Observe that, immediately, the HTTP NodePort is closed to outside traffic despite the pod itself continuing to run NGINX.
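For reference, step 1 can be approximated on an existing install with something like the following. This is only a sketch: it assumes the standard Helm chart names (namespace ingress-nginx, Service and Deployment both called ingress-nginx-controller), so adjust them to your environment.
# Switch the controller Service to Local traffic policy.
kubectl -n ingress-nginx patch svc ingress-nginx-controller \
  -p '{"spec":{"externalTrafficPolicy":"Local"}}'
# Add the delaying preStop hook to the controller container (index 0).
kubectl -n ingress-nginx patch deployment ingress-nginx-controller --type=json \
  -p '[{"op":"add","path":"/spec/template/spec/containers/0/lifecycle","value":{"preStop":{"exec":{"command":["sh","-c","sleep 60 && /wait-shutdown"]}}}}]'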
Anything else we need to know: I'm trying to find any solution around this problem (however hacky) that preserves the remote IP and, ideally, allows me to continue using ingress-nginx.
Already attempted:
- service.type: Both to ensure at least one NGINX pod on every Node through the DaemonSet -- this helps a bit, but, in addition to overprovisioning the service, it has its own problems around the Node itself being terminated.
- externalTrafficPolicy: Cluster -- I could find no configuration in GKE where this preserved the remote IP, which is a hard requirement.
/kind bug
Please set the following settings in the configuration ConfigMap:
proxy-real-ip-cidr: XXX.XXX.XXX/XX -> Your VPC/LB address/range
use-forwarded-headers: "true"
https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/configmap/#proxy-real-ip-cidr
https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/configmap/#use-forwarded-headers
By default, ingress-nginx does not trust X-Forwarded-For; the proxy-real-ip-cidr setting limits the scope of that trust.
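As an illustration only (the ConfigMap name and namespace below assume a standard Helm install, and the CIDR is a placeholder for your VPC/LB range):
# Sketch: replace the CIDR placeholder with your VPC/LB range.
kubectl -n ingress-nginx patch configmap ingress-nginx-controller --type=merge \
  -p '{"data":{"use-forwarded-headers":"true","proxy-real-ip-cidr":"<YOUR-VPC-OR-LB-CIDR>"}}'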
proxy-real-ip-cidr: XXX.XXX.XXX/XX -> Your VPC/LB address/range
use-forwarded-headers: "true"
Unfortunately it looks like this doesn't do the right thing on the GCP TCP LB unless we also use externalTrafficPolicy: Local. Even setting the accepted range to 0.0.0.0/0 doesn't help:
data:
  proxy-real-ip-cidr: 0.0.0.0/0
  use-forwarded-headers: "true"
10.150.128.27 - - [06/Oct/2020:03:03:51 +0000] "GET /some/endpoint HTTP/2.0" 200 6718 "-" "curl/7.64.1" 56 0.019 [my-app-http] [] 10.150.0.83:3000 6731 0.018 200 fc32c20714bb6219d03592175723892e
With externalTrafficPolicy: Local, with or without proxy-real-ip-cidr or use-forwarded-headers, that first field is my real remote IP.
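For anyone wanting to reproduce the comparison, a rough way to check (again assuming the standard controller names) is to curl through the LB from a machine whose public IP you know and look at the first field of the resulting access log line:
# Sketch: <your-lb-hostname> is a placeholder for a DNS name pointing at the LB.
curl -s https://<your-lb-hostname>/some/endpoint > /dev/null
kubectl -n ingress-nginx logs --tail=5 deploy/ingress-nginx-controller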
lifecycle.preStop.exec.command: ["sh", "-c", "sleep 60 && /wait-shutdown"]
- Deploy ingress-nginx in GKE using externalTrafficPolicy: Local and a delaying preStop hook.
- Run no more than one NGINX pod per Node.
- Send traffic directly to the HTTP NodePort on each Node and observe that it reaches the Default Backend.
- Remove one pod from the Deployment.
- Observe that, immediately, the HTTP NodePort is closed to outside traffic despite the pod itself continuing to run NGINX.
Are you sure the preStop hook is working as intended? I don't see why you have to modify the preStop hook at all, since the default preStop hook will gracefully shut down the master/workers. See here: https://github.com/kubernetes/ingress-nginx/issues/6034
Regarding your described steps, step 5 is what I would expect to happen. When you issue SIGTERM to the nginx pod, it will catch it and send SIGQUIT to the nginx _process_, which allows nginx to complete in-flight requests and close idle keepalive sessions. New TCP sessions will be denied, as expected, since SIGQUIT gracefully shuts down nginx.
That's why I'm asking whether the preStop hook is working at all.
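One way to check (a sketch, assuming the standard controller names) is to confirm the hook is actually present on the running Deployment and then watch how long a deleted pod stays in Terminating:
# Show the configured preStop hook, if any.
kubectl -n ingress-nginx get deploy ingress-nginx-controller \
  -o jsonpath='{.spec.template.spec.containers[0].lifecycle.preStop}'
# Delete a controller pod and watch it; with the 60s sleep it should stay in
# Terminating for roughly that long. <controller-pod> is a placeholder.
kubectl -n ingress-nginx delete pod <controller-pod> --wait=false
kubectl -n ingress-nginx get pod <controller-pod> -w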
Regarding your IP issue, I have the same setup as you and confirm it does work. You have to remember that the Network Load Balancer from Google is not a proxy, so nginx would see the real IP from clients:
The network load balancers are not proxies.
Responses from the backend VMs go directly to the clients, not back through the load balancer. The industry term for this is direct server return.
The load balancer preserves the source IP addresses of packets.
The destination IP address for packets is the regional external IP address associated with the load balancer's forwarding rule.
This is easy to replicate with the custom preStop hook removed. In the following example, hello.sample is a DNS record pointing at my GCP LB, going to a simple hello-world backend. There are 2 ingress-nginx pods running (on different nodes), and this is the only traffic going to that LB. I deleted one of the pods at 16:39:21.
% while : ; do curl --connect-timeout 2 -s https://hello.sample > /dev/null && echo `date` ok || echo `date` fail ; sleep 0.2 ; done
Wed Oct 7 16:39:19 EDT 2020 ok
Wed Oct 7 16:39:19 EDT 2020 ok
Wed Oct 7 16:39:19 EDT 2020 ok
Wed Oct 7 16:39:20 EDT 2020 ok
Wed Oct 7 16:39:20 EDT 2020 ok
Wed Oct 7 16:39:21 EDT 2020 ok
Wed Oct 7 16:39:21 EDT 2020 ok
Wed Oct 7 16:39:21 EDT 2020 ok
Wed Oct 7 16:39:24 EDT 2020 fail
Wed Oct 7 16:39:24 EDT 2020 ok
Wed Oct 7 16:39:24 EDT 2020 ok
Wed Oct 7 16:39:25 EDT 2020 ok
Wed Oct 7 16:39:27 EDT 2020 fail
Wed Oct 7 16:39:27 EDT 2020 ok
Wed Oct 7 16:39:30 EDT 2020 fail
Wed Oct 7 16:39:30 EDT 2020 ok
Wed Oct 7 16:39:30 EDT 2020 ok
Wed Oct 7 16:39:32 EDT 2020 fail
Wed Oct 7 16:39:33 EDT 2020 ok
Wed Oct 7 16:39:33 EDT 2020 ok
Wed Oct 7 16:39:34 EDT 2020 ok
Wed Oct 7 16:39:34 EDT 2020 ok
Wed Oct 7 16:39:34 EDT 2020 ok
Wed Oct 7 16:39:37 EDT 2020 fail
Wed Oct 7 16:39:39 EDT 2020 fail
Wed Oct 7 16:39:41 EDT 2020 fail
Wed Oct 7 16:39:43 EDT 2020 fail
Wed Oct 7 16:39:45 EDT 2020 fail
Wed Oct 7 16:39:48 EDT 2020 fail
Wed Oct 7 16:39:48 EDT 2020 ok
Wed Oct 7 16:39:48 EDT 2020 ok
Wed Oct 7 16:39:49 EDT 2020 ok
Wed Oct 7 16:39:49 EDT 2020 ok
Wed Oct 7 16:39:49 EDT 2020 ok
Wed Oct 7 16:39:50 EDT 2020 ok
Wed Oct 7 16:39:50 EDT 2020 ok
The moment the termination signal arrives, requests start timing out because the LB is sending traffic to a node that has closed its NodePort. Exactly 24 seconds later, that node is finally removed from the LB and the service is healthy again.
We have the exact same behavior on our GKE clusters when using nginx-ingress with externalTrafficPolicy: Local behind a GCP L4 LB.
@jeisen I can confirm the issue you're raising, and can replicate it.
This issue is related to: https://github.com/kubernetes/kubernetes/issues/85643 (as you already mentioned)
TL;DR: When an nginx pod is instructed to terminate, the NodePort on that host is closed at the same time. The result is that _new_ TCP connections will fail, while already established connections will continue to work.
This is not an issue for established connections, as the GCP L4 LB will not terminate any connections (it really can't, either, since it is not a proxy).
When the Service object is initially created, it uses the health check path defined in the controller's Deployment/DaemonSet as the source for the GCP LB health check:
https://github.com/kubernetes/ingress-nginx/blob/master/charts/ingress-nginx/templates/controller-deployment.yaml#L150-L159
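To see what the LB health check is mirroring, something like the following can help (again just a sketch with the standard names). With externalTrafficPolicy: Local, the Service also gets a dedicated spec.healthCheckNodePort:
# Path/port of the controller's liveness probe, as defined by the chart.
kubectl -n ingress-nginx get deploy ingress-nginx-controller \
  -o jsonpath='{.spec.template.spec.containers[0].livenessProbe.httpGet}'
# Health-check NodePort allocated for externalTrafficPolicy: Local.
kubectl -n ingress-nginx get svc ingress-nginx-controller \
  -o jsonpath='{.spec.healthCheckNodePort}'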
My GCP LB uses a health check interval of 8s and an unhealthy threshold of 3. I assume this is the default; it matches @jeisen's findings, and I was able to replicate it.
So:
As this is a k8s issue, specifically when using NodePort, ingress-nginx cannot fix this.
What I've done is to manually change the health check for GCP L4 LB to the following values:
% gcloud compute http-health-checks describe X
[...]
checkIntervalSec: 1
healthyThreshold: 1
requestPath: /healthz
timeoutSec: 1
unhealthyThreshold: 1
Notice I've changed checkIntervalSec from 8 to 1 and unhealthyThreshold from 3 to 1.
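For reference, that change can be applied with gcloud roughly like this (X is the health check name from the describe output above; flags may differ slightly depending on the health check type your LB uses):
# Sketch: tighten the LB health check so a draining node is detected faster.
gcloud compute http-health-checks update X \
  --check-interval=1s --timeout=1s \
  --healthy-threshold=1 --unhealthy-threshold=1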
If a health check fails from the LB's perspective, _new_ connections won't be directed to that instance; existing connections will continue to work. When a NodePort is closed / a pod is asked to terminate, there is still a possibility that some connections fail to establish. But considering GCP uses multiple source probes, it should take less than a second for the GCP LB to detect that a pod is draining/not available.
I don't see another workaround being possible at the current time.
@aledbf
Do you consider this a valid problem? If yes, I could create a PR to enhance the documentation for deploying to GCP behind an L4 Load Balancer.
Other sources pointing to the same issue:
https://medium.com/flant-com/kubernetes-graceful-shutdown-nginx-php-fpm-d5ab266963c2
https://philpearl.github.io/post/k8s_ingress/
checkIntervalSec: 1
healthyThreshold: 1
unhealthyThreshold: 1
Those values are too aggressive. On nodes with a high load, this could lead to frequent healthy/unhealthy switches.
If this workaround works until kubernetes/kubernetes#85643 is fixed, use it, but it has side effects.
For that reason, I don't think it is a good idea to add it to the docs.
@toredash thank you for taking the time to debug the issue
Those values are too aggressive. On nodes with a high load, this could lead to frequent healthy/unhealthy switches.
I agree, but for our environment I don't see any other workaround. Mind you, we are using CloudFlare in front of nginx, so we will always have a pool of keepalive connections open at all times. So _if_ we have frequent switches, it only affects new connections.
If we don't alter our health check, _any_ reconfiguration that requires pods to be re-deployed will cause Origin timeouts for us.
The pod's readiness check does not change, so the pod will continue to serve already established connections. If the liveness probe does not fail, no shutdown signal is sent either.
If all nodes running nginx pods are under high load, I assume they would have issues accepting new connections anyway. So not accepting new connections and only serving existing connections seems like an acceptable tradeoff.
Other sources pointing to the same issue:
https://medium.com/flant-com/kubernetes-graceful-shutdown-nginx-php-fpm-d5ab266963c2
https://philpearl.github.io/post/k8s_ingress/
I don't believe this is quite describing the same problem -- the scenario they describe would be solved by sleeping after receiving a SIGTERM so that the endpoint can be removed. Instead, this problem exists because the external LB doesn't work the same way as kube-proxy; we could only solve it either on the LB side or by somehow delaying the NodePort close.
I don't believe this is quite describing the same problem
Agree, after reading them once more I see they are not related.
Have you been able to try the suggested workaround? We have not seen any issues with pod maintenance since this change was introduced.
After consulting with Google, we're pursuing a near-term solution of running a dedicated node pool for NGINX with the DaemonSet+Deployment configuration. Unfortunately, this doesn't resolve the issue of a node itself shutting down, so we will need to turn off Cluster Autoscaling for that pool and make sure it cannot auto-upgrade. I may also try the health check workaround, but the risk it still leaves open and the management overhead it requires are more than I'd like.
An improvement to this plan might be to set up the new node pool as our "baseline" static set, which would allow any pods to run on its nodes as long as NGINX is prioritized, with a non-NGINX node pool acting as a fully dynamic, scalable pool that could even scale to 0 if everything fits inside the baseline.
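For completeness, a rough sketch of creating such a static pool with gcloud (the pool name, cluster, zone, and node count are placeholders; the key parts are disabling auto-upgrade and not enabling autoscaling, which is off by default):
# Sketch only: adjust cluster name, zone, machine type, and node count.
gcloud container node-pools create nginx-baseline \
  --cluster=<your-cluster> --zone=<your-zone> \
  --num-nodes=3 --no-enable-autoupgrade --no-enable-autorepair
# Optionally reserve or prioritize the pool for the controller via node labels/taints,
# with a matching nodeSelector/toleration on the controller pods.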
We have this issue with EKS and NLBs, except the default health check timeout is 30s * 3 and not configurable (until 1.19?), so it's pretty rough.