Linkerd2: Injecting linkerd-proxy into nginx-ingress-controller causes proxy errors and high ELB RPS

Created on 28 Aug 2018 · 7 Comments · Source: linkerd/linkerd2

I'm having some trouble when injecting linkerd into the nginx-ingress-controller. I'm using the following two annotations on each Ingress to prevent the ingress-controller from rewriting the Host header used by Linkerd for routing:

    nginx.ingress.kubernetes.io/service-upstream: "true"
    nginx.ingress.kubernetes.io/upstream-vhost: <<service_name>>.<<service_namespace>>.svc.cluster.local
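
For context, here is a minimal sketch of where those annotations sit in a complete Ingress. The resource name, namespace, host, and port (my-service, my-namespace, example.com, 80) are placeholders assumed for illustration and are not taken from the cluster described here; the API version is the one served for Ingress on Kubernetes 1.9.

```yaml
# Hypothetical Ingress; all names, the host, and the port are placeholders.
apiVersion: extensions/v1beta1   # Ingress API group available on Kubernetes 1.9
kind: Ingress
metadata:
  name: my-service
  namespace: my-namespace
  annotations:
    # Proxy to the Service's cluster IP instead of individual pod endpoints.
    nginx.ingress.kubernetes.io/service-upstream: "true"
    # Send the Service's DNS name as the Host header so Linkerd can route on it.
    nginx.ingress.kubernetes.io/upstream-vhost: my-service.my-namespace.svc.cluster.local
spec:
  rules:
    - host: example.com
      http:
        paths:
          - path: /
            backend:
              serviceName: my-service
              servicePort: 80
```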

Requests are working, and I can see that traffic between the nginx-ingress-controller and the backend service pods is encrypted (yay!), but the linkerd-proxy seems to be hammering the ELB associated with the LoadBalancer service used by the nginx-ingress-controller (~15k requests per minute). Removing linkerd from the ingress-controller deployment drops the ELB traffic back down to normal.

The linkerd-proxy log on the nginx-ingress-controller pod is full of the following errors:

ERR! proxy={server=out listen=127.0.0.1:4140 remote=10.18.22.28:52842} linkerd2_proxy turning operation timed out after 10s into 500
ERR! proxy={server=out listen=127.0.0.1:4140 remote=10.18.22.28:52848} linkerd2_proxy turning operation timed out after 10s into 500
ERR! proxy={server=out listen=127.0.0.1:4140 remote=10.18.22.28:52854} linkerd2_proxy turning operation timed out after 10s into 500
ERR! proxy={server=out listen=127.0.0.1:4140 remote=10.18.22.28:52860} linkerd2_proxy turning operation timed out after 10s into 500
ERR! proxy={server=out listen=127.0.0.1:4140 remote=10.18.22.28:52866} linkerd2_proxy turning operation timed out after 10s into 500
ERR! proxy={server=out listen=127.0.0.1:4140 remote=10.18.22.28:52872} linkerd2_proxy turning operation timed out after 10s into 500
ERR! proxy={server=out listen=127.0.0.1:4140 remote=10.18.22.28:52878} linkerd2_proxy turning operation timed out after 10s into 500

Any ideas on why the linkerd-proxy would be hitting the ELB provisioned by the LoadBalancer service?

Environment:
- K8s v1.9.3+coreos.0
- Linkerd v18.8.2
- nginx-ingress-controller v0.18.0 (on AWS, using L4 config)

area/proxy bug priority/triage

Capitrium

All 7 comments

I'm also seeing the following warnings in the linkerd-proxy log:

WARN admin={bg=resolver} linkerd2_proxy::control::destination::background::destination_set Destination.Get stream errored for DnsNameAndPort { host: DnsName(DNSName("_")), port: 80 }: Grpc(Status { code: Unknown }, {"content-type": "application/grpc", "grpc-status": "2", "grpc-message": "resolver [&{k8sDNSZoneLabels:[] endpointsWatcher:0xc42051c7d0}] found error resolving host [_.ingress-nginx.svc.cluster.local] port[80]: DNS name cannot only contain digits and hyphens: _.ingress-nginx.svc.cluster.local"})
WARN admin={bg=resolver} linkerd2_proxy::control::destination::background::destination_set Destination.Get stream errored for DnsNameAndPort { host: DnsName(DNSName("_")), port: 80 }: Grpc(Status { code: Unknown }, {"content-type": "application/grpc", "grpc-status": "2", "grpc-message": "resolver [&{k8sDNSZoneLabels:[] endpointsWatcher:0xc42051c7d0}] found error resolving host [_.ingress-nginx.svc.cluster.local] port[80]: DNS name cannot only contain digits and hyphens: _.ingress-nginx.svc.cluster.local"})
WARN admin={bg=resolver} linkerd2_proxy::control::destination::background::destination_set Destination.Get stream errored for DnsNameAndPort { host: DnsName(DNSName("_")), port: 80 }: Grpc(Status { code: Unknown }, {"content-type": "application/grpc", "grpc-status": "2", "grpc-message": "resolver [&{k8sDNSZoneLabels:[] endpointsWatcher:0xc42051c7d0}] found error resolving host [_.ingress-nginx.svc.cluster.local] port[80]: DNS name cannot only contain digits and hyphens: _.ingress-nginx.svc.cluster.local"})
WARN admin={bg=resolver} linkerd2_proxy::control::destination::background::destination_set Destination.Get stream errored for DnsNameAndPort { host: DnsName(DNSName("_")), port: 80 }: Grpc(Status { code: Unknown }, {"content-type": "application/grpc", "grpc-status": "2", "grpc-message": "resolver [&{k8sDNSZoneLabels:[] endpointsWatcher:0xc42051c7d0}] found error resolving host [_.ingress-nginx.svc.cluster.local] port[80]: DNS name cannot only contain digits and hyphens: _.ingress-nginx.svc.cluster.local"})
WARN admin={bg=resolver} linkerd2_proxy::control::destination::background::destination_set Destination.Get stream errored for DnsNameAndPort { host: DnsName(DNSName("_")), port: 80 }: Grpc(Status { code: Unknown }, {"content-type": "application/grpc", "grpc-status": "2", "grpc-message": "resolver [&{k8sDNSZoneLabels:[] endpointsWatcher:0xc42051c7d0}] found error resolving host [_.ingress-nginx.svc.cluster.local] port[80]: DNS name cannot only contain digits and hyphens: _.ingress-nginx.svc.cluster.local"})

Capitrium on 28 Aug 2018

I may have the cause of this one figured out:

The upstream-vhost annotation is required on ingresses so that the nginx-ingress-controller sets the right Host header on its requests; otherwise Linkerd won't route them properly. However, there's no ingress for the default-http-backend service, so what does nginx set the Host header to for requests that match no ingress?

Turns out it's this line in the nginx-ingress-controller's templated config file:

{{ $proxySetHeader }} Host $best_http_host;

The $best_http_host variable is derived from request headers (e.g. X-Forwarded-Host). Looking at the linkerd-proxy debug logs on the nginx-ingress-controller pod, we can see the values of those headers (edited for clarity):

1) DBUG proxy={client=out dst=<<elb-eni-ip>>:80 proto=Http1 { host: Authority(<<elb-eni-ip>>), is_h1_upgrade: false, was_absolute_form: false }} linkerd2_proxy::proxy::http::client client request: method=GET uri=http://<<elb-eni-ip>>/ version=HTTP/1.1 headers={"host": "<<elb-eni-ip>>", "x-scheme": "http", "x-request-id": "ec3dde97b1e3f0d9f6c94c25c939bf32", "x-real-ip": "<<original-client-ip>>", "x-forwarded-for": "<<original-client-ip>>", "x-forwarded-host": "<<elb-eni-ip>>", "x-forwarded-port": "80", "x-forwarded-proto": "http", "x-original-uri": "/"}
2) DBUG proxy={client=out dst=<<elb-eni-ip>>:80 proto=Http1 { host: Authority(<<elb-eni-ip>>), is_h1_upgrade: false, was_absolute_form: false }} linkerd2_proxy::proxy::http::client client request: method=GET uri=http://<<elb-eni-ip>>/ version=HTTP/1.1 headers={"host": "<<elb-eni-ip>>", "x-original-forwarded-for": "<<original-client-ip>>", "x-request-id": "ec3dde97b1e3f0d9f6c94c25c939bf32", "x-real-ip": "<<elb-public-ip>>", "x-forwarded-for": "<<elb-public-ip>>", "x-forwarded-host": "<<elb-eni-ip>>", "x-forwarded-port": "80", "x-forwarded-proto": "http", "x-original-uri": "/", "x-scheme": "http"}
3) DBUG proxy={client=out dst=<<elb-eni-ip>>:80 proto=Http1 { host: Authority(<<elb-eni-ip>>), is_h1_upgrade: false, was_absolute_form: false }} linkerd2_proxy::proxy::http::client client request: method=GET uri=http://<<elb-eni-ip>>/ version=HTTP/1.1 headers={"host": "<<elb-eni-ip>>", "x-original-forwarded-for": "<<elb-public-ip>>", "x-request-id": "ec3dde97b1e3f0d9f6c94c25c939bf32", "x-real-ip": "<<elb-public-ip>>", "x-forwarded-for": "<<elb-public-ip>>", "x-forwarded-host": "<<elb-eni-ip>>", "x-forwarded-port": "80", "x-forwarded-proto": "http", "x-original-uri": "/", "x-scheme": "http"}
4) DBUG proxy={client=out dst=<<elb-eni-ip>>:80 proto=Http1 { host: Authority(<<elb-eni-ip>>), is_h1_upgrade: false, was_absolute_form: false }} linkerd2_proxy::proxy::http::client client request: method=GET uri=http://<<elb-eni-ip>>/ version=HTTP/1.1 headers={"host": "<<elb-eni-ip>>", "x-original-forwarded-for": "<<elb-public-ip>>", "x-request-id": "ec3dde97b1e3f0d9f6c94c25c939bf32", "x-real-ip": "<<elb-public-ip>>", "x-forwarded-for": "<<elb-public-ip>>", "x-forwarded-host": "<<elb-eni-ip>>", "x-forwarded-port": "80", "x-forwarded-proto": "http", "x-original-uri": "/", "x-scheme": "http"}

Note that requests (3) and (4) are identical; these get repeated at ~15K RPM for several minutes before eventually stopping, causing the high RPS issue noted above.

My current hacky fix is to replace the use of the $best_http_host variable with the DNS name of the default backend service, i.e.:

{{ $proxySetHeader }} Host "default-http-backend.ingress-nginx.svc.cluster.local";

Capitrium on 5 Oct 2018

@Capitrium More specifically _how_ do you do this?

My current hacky fix is to replace the use of the $best_http_host variable with the DNS name of the default backend service, i.e.:
{{ $proxySetHeader }} Host "default-http-backend.ingress-nginx.svc.cluster.local";

tomasaschan on 1 Nov 2018

@tomasaschan:

1. Configure the ingress controller to use a custom config template.
2. Get the DNS name of your default HTTP backend (e.g. default-http-backend.ingress-nginx.svc.cluster.local).
3. In the config template, replace

{{ $proxySetHeader }} Host $best_http_host;

with

{{ $proxySetHeader }} Host "default-http-backend.ingress-nginx.svc.cluster.local";
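
As a rough sketch of the first step: the ingress-nginx documentation describes loading a custom template by mounting a ConfigMap that contains nginx.tmpl at /etc/nginx/template in the controller pod. The fragment below assumes the edited template was stored in a ConfigMap named nginx-template (a hypothetical name, created for example with `kubectl create configmap nginx-template --from-file=nginx.tmpl` in the controller's namespace) and that the controller container is named nginx-ingress-controller; adjust both to match your Deployment.

```yaml
# Fragment of the nginx-ingress-controller Deployment, not a complete manifest.
# The ConfigMap name, volume name, and container name are assumptions.
spec:
  template:
    spec:
      containers:
        - name: nginx-ingress-controller
          volumeMounts:
            # The controller reads its template from this directory.
            - name: nginx-template-volume
              mountPath: /etc/nginx/template
              readOnly: true
      volumes:
        - name: nginx-template-volume
          configMap:
            name: nginx-template
            items:
              - key: nginx.tmpl
                path: nginx.tmpl
```

The template should be copied from the same controller version that is running (v0.18.0 here) before editing, since the template format changes between releases.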

@grampelberg I had to revert this fix on my staging environment as it prevents cert-manager from updating the TLS certs for each ingress. I'd appreciate some suggestions on next steps from the Linkerd team, because it currently seems like Linkerd simply doesn't support E2E TLS configurations and I'm not sure of the best way to move forward on this.

Capitrium on 1 Nov 2018

@Capitrium would you mind throwing a really simple YAML and/or set of replication steps that makes it all break (including cert-manager)? We can definitely dig in a little bit more at that point.

grampelberg on 1 Nov 2018

@grampelberg I think the issues with cert-manager are due to the fact that we're in the process of migrating our ingresses to a new controller but still have cert-manager configured to use our old ingress controller :man_shrugging:.

I was thinking that the issue was related to the upstream-vhost annotation; that's a fairly brittle solution since it forces all traffic for a given ingress to the defined backend service. Cert-manager would actually break with that annotation until v0.4 since it used to add services to existing ingresses to handle the HTTP-01 challenges, and I'm pretty sure it would also break if you use the edit-in-place annotation.

Capitrium on 1 Nov 2018

@Capitrium I'm going to close this out based on the ingress doc being up now. Mind opening an issue around cert-manager if there's something broken between that and LD2?

grampelberg on 3 Dec 2018
