Ingress-nginx: Lost client requests when updating ingress resource

Created on 23 Jan 2020 · 16 comments · Source: kubernetes/ingress-nginx

NGINX Ingress controller version: 0.27.1

Kubernetes version (use kubectl version): 1.12.8

Environment:

  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): Debian GNU/Linux 9 (stretch)
  • Kernel (e.g. uname -a): Linux ip-10-8-3-187 4.9.0-9-amd64 #1 SMP Debian 4.9.168-1 (2019-04-12) x86_64 GNU/Linux
  • Install tools: terraform (0.11.14) / kops (1.11.1)

What happened:

Exact same issue as https://github.com/kubernetes/ingress-nginx/issues/4168
When updating the ingress resources, the ingress controller returns 502s for half a second to a second, with the exact same log message about the 0.0.0.1:80 upstream being unavailable.

What you expected to happen:

Being able to update the ingress without an outage.

How to reproduce it:

Same steps as https://github.com/kubernetes/ingress-nginx/issues/4168, but with the ingress controller set to 0.27.1.

Anything else we need to know:

It appears this bug has been around since at least 0.24, maybe longer.

/kind bug


All 16 comments

@austintp did you follow the comment https://github.com/kubernetes/ingress-nginx/issues/4168#issuecomment-500049415 ?

Edit: also, do you have more than one replica of your application, and does it have readiness and liveness probes?

I'll try with proxy-next-upstream: "error timeout http_502 http_503 http_504"; however, my hunch is that the result will be the same because the balancer attempts to use 0.0.0.1:80 instead of another service. I'll report back after using proxy-next-upstream.

We currently only have one replica via HPA for the one application I was testing. We're only running one replica at the moment because we currently need sticky sessions and are having an issue with them. That's a different issue though (tested only on 0.25), and one we put on the back burner without diving in deeper.

We have readiness and liveness probes for our applications.
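For reference, a minimal probe block in a container spec (the /healthz path and the timings are illustrative placeholders, not the exact manifest from this issue):

# Illustrative readiness/liveness probe fragment for an application container
# (the path and timings are placeholders)
readinessProbe:
  httpGet:
    path: /healthz          # hypothetical health endpoint
    port: 80
  initialDelaySeconds: 5
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /healthz
    port: 80
  initialDelaySeconds: 15
  periodSeconds: 20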

I added proxy-next-upstream to the nginx-configuration configmap but the problem did not go away.

apiVersion: v1
data:
  custom-http-errors: 502,503,504
  proxy-body-size: 2000m
  proxy-next-upstream: error timeout http_502 http_503 http_504
  use-proxy-protocol: "true"
kind: ConfigMap
metadata:
  labels:
    app.kubernetes.io/name: ingress-nginx
    app.kubernetes.io/part-of: ingress-nginx
  name: nginx-configuration
  namespace: ingress-nginx
  selfLink: /api/v1/namespaces/ingress-nginx/configmaps/nginx-configuration

kubectl logs -n ingress-nginx -l "app.kubernetes.io/name=ingress-nginx" | grep 502

IP - - [23/Jan/2020:17:12:39 +0000] "GET / HTTP/1.1" 200 714 "-" "Ruby" 160 0.011 [tenant-1-myapp-443] [] 100.67.154.1:443 720 0.012 200 12b9e6a502bb4cfc7607010ac806e89c
W0123 17:12:46.350297       8 controller.go:810] Error obtaining Endpoints for Service "ingress-nginx/myapp": no object matching key "ingress-nginx/myapp" in local store
IP - - [23/Jan/2020:17:12:50 +0000] "GET / HTTP/1.1" 502 15951 "-" "Ruby" 160 0.004 [upstream-default-backend] [] 0.0.0.1:80 : 100.115.198.83:80 0 : 15951 0.000 : 0.004 502 : 502 364514bf8e5b90886aac4bcd9eac87cd
IP - - [23/Jan/2020:17:12:50 +0000] "GET / HTTP/1.1" 502 15951 "-" "Ruby" 160 0.003 [upstream-default-backend] [] 0.0.0.1:80 : 100.115.198.83:80 0 : 15951 0.000 : 0.004 502 : 502 96b2af11088ea37acc85ec91c8d8c4b0
IP - - [23/Jan/2020:17:12:50 +0000] "GET / HTTP/1.1" 502 15951 "-" "Ruby" 160 0.004 [upstream-default-backend] [] 0.0.0.1:80 : 100.115.198.83:80 0 : 15951 0.000 : 0.004 502 : 502 89716fb0e458f2d00538116fc51f81e4

"ingress-nginx/myapp": no object matching key

Your service has no endpoints; was it removed?

[upstream-default-backend]

default backend

100.115.198.83:80

only has one endpoint but is returning 502

"ingress-nginx/myapp": no object matching key

That error occurs when adding new ingress rules. The service works fine before applying the ingress rules and again very shortly after, roughly a second later. The service itself (the deployment and the service) is not modified, and the readiness/liveness probes are healthy.

The default backend returns 502 because we provide a custom error page, but it returns the original HTTP status code from the ingress (502).

e.g.

      # HTTP:80 listener used by ingress
      server {
        listen       80;
        server_name  cluster.local;

        access_log /var/log/nginx/access.log origin;
        error_log /var/log/nginx/error.log debug;

        default_type text/plain;

        root /www;
        error_page 502 /502.html;
        error_page 503 /503.html;
        error_page 504 /504.html;

        location / {
          if ($http_x_code = "502") {
            return 502;
          }
          if ($http_x_code = "503") {
            return 503;
          }
          if ($http_x_code = "504") {
            return 504;
          }
        }

        location = /502.html {
          default_type text/plain;
          internal;
        }

        location = /503.html {
          default_type text/plain;
          internal;
        }

        location = /504.html {
          default_type text/plain;
          internal;
        }
      }

I made a project that reproduces this issue: https://github.com/austintp/ingress-nginx-502-example

helm template -f values.yaml init/main | kubectl apply -f -
helm template -f values.yaml init/ingress-controller | kubectl apply -f -
helm template -f values.yaml platform-services/myapp | kubectl apply -f -
helm template -f values.yaml platform-services/myapp-b | kubectl apply -f -
helm template -f values.yaml platform-services/myapp-c | kubectl apply -f -
helm template -f values.yaml platform-services/myapp-d | kubectl apply -f -
helm template -f values.yaml platform-services/subdomain-provisioner | kubectl apply -f -

A new Ingress rule can be added by running:

kubectl exec -it -n myapp subdomain-provisioner... -- sh
./provision-subdomain.sh test1

You can then set up a basic watch monitor on test1.myapp.com (I added myapp.com, test1.myapp.com, testn+1.myapp.com to my hosts file).

Then use the provision-subdomain.sh script to provision two or three more tenants/subdomains and watch test1.myapp.com briefly return 502s.

It appears the problem occurs when the Service used as the Ingress backend is of type ExternalName. These routes temporarily return a 502 when new Ingress rules are applied.

https://github.com/austintp/ingress-nginx-502-example/blob/master/platform-services/subdomain-provisioner/files/helm/templates/tenant._namespace.yaml#L9-L43
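The pattern in that template boils down to roughly the following (a simplified sketch of the linked manifests; the tenant namespace and host names are illustrative):

# Simplified sketch of the tenant template linked above: an ExternalName
# Service in the tenant namespace that points back at the real Service in
# the myapp namespace, plus an Ingress rule that uses it as a backend.
apiVersion: v1
kind: Service
metadata:
  name: myapp
  namespace: test1               # illustrative tenant namespace
spec:
  type: ExternalName
  externalName: myapp.myapp.svc.cluster.local
---
apiVersion: extensions/v1beta1   # Ingress API group on Kubernetes 1.12
kind: Ingress
metadata:
  name: myapp
  namespace: test1
spec:
  rules:
  - host: test1.myapp.com
    http:
      paths:
      - path: /
        backend:
          serviceName: myapp
          servicePort: 80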

@austintp thank you for the repository, but this is not an "example".

  1. Why are you using

type: ExternalName
externalName: myapp-d.myapp.svc.cluster.local

  2. tls sections without secretName will never work:

  tls:
  - hosts:
    - {{ .Values.tenant }}.{{ .Values.host }}

  3. Please use https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/annotations/#backend-certificate-authentication for all of this (a sketch using those annotations follows this list), and don't use a volume to mount a file in the ingress controller.

      proxy_ssl_verify on;
      proxy_ssl_session_reuse on;
      proxy_ssl_name *.myapp.svc.cluster.local;
      proxy_ssl_trusted_certificate /etc/nginx/ssl/myapp-internal-ca-root-public.crt;

  4. If you use type: ExternalName, annotations like nginx.ingress.kubernetes.io/affinity: "cookie" will not work. With that approach you have only one thing, the FQDN.

  5. The most important part: please don't use type: ExternalName, because that introduces the biggest source of errors in a k8s cluster, DNS.
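For reference, the annotation-based equivalent of the proxy_ssl_* snippet in point 3 looks roughly like this (a sketch; "myapp/internal-ca" is a placeholder Secret, in namespace/name form, which must hold the trusted CA bundle under the ca.crt key):

# Sketch of Backend Certificate Authentication via annotations instead of a
# volume-mounted CA file; the Secret reference is a placeholder.
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: myapp
  namespace: test1
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
    nginx.ingress.kubernetes.io/proxy-ssl-secret: "myapp/internal-ca"
    nginx.ingress.kubernetes.io/proxy-ssl-verify: "on"
spec:
  rules:
  - host: test1.myapp.com
    http:
      paths:
      - path: /
        backend:
          serviceName: myapp
          servicePort: 443

The sketch deliberately shows no proxy_ssl_name equivalent; pinning the expected certificate CN is exactly what point 3 of the reply below asks about.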

  1. We are using ExternalName so that an ingress rule in one namespace can reference a service in another namespace. The end goal is to have test1.myapp.com point to myapp.com/, and test1.myapp.com/myapp-b/ point to myapp.com/myapp-b/, for example. We pass the subdomain along to our multi-tenant services in the myapp namespace. If there is a better way to accomplish this, we're fine with doing it differently, but we didn't see any documentation describing this as unsupported or bad practice.

  2. This was left over from a larger project; I missed it while scaling the example down while still preserving the issue. In this "example" I wasn't intending to provide SSL at the ingress; the ingress service only specified port 80.

  3. Thanks for the suggestion. We are currently on ingress 0.24; I'm not sure when this was introduced or how we missed this option. One thing I don't see, however, is how to specify the expected SSL certificate CN, as proxy_ssl_name does in my "example". I suppose the verify option, ensuring the certificate was signed by the expected CA, is probably good enough, but in case matching the CN is a hard requirement, how would the certificate CN be validated using Backend Certificate Authentication? We would need to be able to specify the CN we want to match, because the hostname won't match the certificate.

  4. Understood. An oversight on my behalf.

  5. See 1. Is there any official Kubernetes documentation outlining this? Is using ExternalName Service backends an unsupported configuration for ingress-nginx? Setting aside the sticky-session issue: if the use case were completely different, say we had to proxy to something outside the cluster, what would the suggested method be?
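(For the proxy-outside-the-cluster case, one commonly suggested alternative to ExternalName is a selector-less Service with a manually managed Endpoints object; a sketch follows, with placeholder names and IP. Note it only accepts IP addresses, not hostnames, so it does not cover every ExternalName use case.)

# Sketch: selector-less Service plus a manually managed Endpoints object,
# a common alternative to ExternalName when the external target is a fixed IP.
# "external-backend" and 203.0.113.10 are placeholders.
apiVersion: v1
kind: Service
metadata:
  name: external-backend
  namespace: myapp
spec:
  ports:
  - port: 80
    targetPort: 80
---
apiVersion: v1
kind: Endpoints
metadata:
  name: external-backend    # must match the Service name
  namespace: myapp
subsets:
- addresses:
  - ip: 203.0.113.10        # placeholder external IP
  ports:
  - port: 80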

I'd definitely like to follow up on this one; I'm seeing pretty much the exact same issue with an ExternalName (we're using it like @austintp mentioned, proxying outside the cluster).

Is ExternalName unsupported? If it really won't work, then we might have to deploy a proxy service instead of using ExternalName.

Just as a side note, we removed ExternalName use due to https://github.com/kubernetes/ingress-nginx/issues/4641.

We've mostly removed our use of ExternalName in our project as well, but we have a couple of instances where we cannot: some tenant/subdomain routes must be proxied through the ingress to an endpoint outside the cluster.

I made a new branch of my example, considerably simplified, that reproduces the issue. I believe all of the issues @aledbf brought up have been addressed, except the use of ExternalName.

https://github.com/austintp/ingress-nginx-502-example/tree/simplified
https://github.com/austintp/ingress-nginx-502-example/blob/simplified/README.md

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
