First, this problem happens randomly and I still cannot repro it. I'm posting this issue to see if you have any ideas, and I will update it once I have more information.
In my case, I have a remote dependency: an NGINX server with HTTP/2 enabled. I connect to it via Envoy:
My service ---HTTP/1.1---> Envoy ---HTTP/2---> NGINX Plus (SLB)
OS: ubuntu 14.04
Envoy: v1.5.0
Here is the relevant part of my (much larger) config:
admin:
  access_log_path: /tmp/admin_access.log
  address:
    socket_address: { address: 127.0.0.1, port_value: 9901 }
static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address: { address: 127.0.0.1, port_value: 10000 }
    filter_chains:
    - filters:
      - name: envoy.http_connection_manager
        config:
          stat_prefix: ingress_http
          codec_type: AUTO
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: ["www.google.com"]
              routes:
              - match: { prefix: "/" }
                route:
                  timeout: 60s
                  host_rewrite: www.google.com
                  cluster: service_google
          http_filters:
          - name: envoy.router
  clusters:
  - name: service_google
    connect_timeout: 0.25s
    type: STRICT_DNS
    dns_lookup_family: V4_ONLY
    lb_policy: ROUND_ROBIN
    hosts:
    - socket_address:
        address: www.google.com
        port_value: 443
    http2_protocol_options: {}
    tls_context:
      sni: www.google.com
      common_tls_context:
        alpn_protocols: h2
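As a debugging aside, the admin interface configured above can show which protocol Envoy is using for this cluster's upstream connections (a sketch; upstream_cx_http1_total and upstream_cx_http2_total are Envoy's documented per-cluster counters, and service_google is the cluster name from the config above):
$ curl -s http://127.0.0.1:9901/stats | grep 'cluster.service_google.upstream_cx_http'
With http2_protocol_options set on the cluster, it should be the upstream_cx_http2_total counter that increments as traffic flows.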
From the service log, I can see that some of the response status codes are 426 (~1%), but I cannot repro it reliably.
However, I did manage to reproduce it once with nc and got the following:
$ nc 127.0.0.1 12345
GET /cms/service/search?search=SEX&page_num=0 HTTP/1.1
host: A_PRIVATE_DOMAIN
HTTP/1.1 426 Upgrade Required
server: envoy
date: Thu, 01 Feb 2018 07:42:58 GMT
content-length: 0
x-envoy-upstream-service-time: 2
GET /cms/service/search?search=SEX&page_num=0 HTTP/1.1
host: A_PRIVATE_DOMAIN
HTTP/1.1 200 OK
server: envoy
date: Thu, 01 Feb 2018 07:43:06 GMT
content-type: application/json
content-length: 5428
vary: Accept-Encoding
x-api-version: bf8d7cd
etag: "j2jBnN6O4An7p/NjdexSWjaih5Q="
content-md5: A/Vrt9tfWp4jxmmf/sbnJg==
api-version: 4.0.0
request-id: a31779e4-9ee4-409e-83da-6290f81498bb
response-time: 96
x-cache-status: REVALIDATED
x-envoy-upstream-service-time: 63
{"pages":5,...
Envoy access log:
[2018-02-01T10:07:44.756Z] "GET /cms/service/search?search=Meet%20The%20Mill&page_num=0 HTTP/1.1" 426 - 0 0 0 0 "172.30.1.168" "-" "f08b0651-572a-96c9-a9d1-ade1aac6efbe" "A_PRIVATE_DOMAIN" "172.30.1.133:443" nginx_cluster - 172.30.1.168
I can get a 400 from Envoy randomly as well:
HTTP/1.1 400 Bad Request
server: envoy
date: Thu, 01 Feb 2018 10:13:22 GMT
content-length: 0
x-envoy-upstream-service-time: 1
I am wondering whether this 426 is generated by the remote NGINX and Envoy just returns it directly to the client.
More info: www.google.com in the config above just stands in for the real upstream domain. I have tried the following variations of the config:
- http2_protocol_options: {} and alpn_protocols: h2
- alpn_protocols: h2,http/1.1
- codec_type: HTTP1
Maybe this is an issue of NGINX Plus.
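For reference, the alpn_protocols: h2,http/1.1 variant mentioned above goes in the cluster's tls_context from the config earlier in this issue (a sketch only, with www.google.com still standing in for the real domain):
    tls_context:
      sni: www.google.com
      common_tls_context:
        alpn_protocols: h2,http/1.1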
But I think Envoy should not pass the upstream's 426 status straight through to the downstream.
FYI, it looks to me like the 426 is coming from NGINX, not Envoy (same for the 400). You can tell this from the x-envoy-upstream-service-time header, which is only present when Envoy actually received a response from the upstream. Beyond that, no idea right now; this is the first I have heard of an issue like this. I might ask NGINX to help debug if you are paying for Plus already?
Yeah, I'd argue that if the upstream wants H2 and Envoy is configured to do both h2 and HTTP/1, it's an upstream bug that H2 is not negotiated, so I don't think it's worth the complexity of adding functionality to Envoy to retry with H2 only.
Admittedly it would be a bit confusing if Envoy were configured for HTTP/1.1 only, a user were doing H2-to-Envoy, and they got a proxied 426 from an Envoy-to-upstream HTTP/1 connection, but I think the best we could do at that point would be a configuration option to swallow the 426 and return a 50x instead.
One other pretty bizarre thing here: NGINX Plus appears to be returning 426 when Envoy is speaking h2, since Envoy does not negotiate. Given that it's random, this sounds to me like some caching bug in their server.
We have paid for NGINX Plus. And how should we debug this?
And how should we debug this?
Ask NGINX to help determine why it is returning a 426? I'm not going to claim there is no chance this is a bug in Envoy, but there is not much to go on here.
Thanks, we will reach out to NGINX for help.
BTW, for now we are going to allow plain HTTP traffic as a workaround.
I finally found the root cause: it's an issue on our side, not in Envoy or NGINX.
The full route is like this:
Service --> Envoy --> NGINX SLB --(x)--> Envoy --> Another Service
Route (x) is where the problem happens.
This route is for a legacy API that has the NGINX cache enabled for performance reasons, but this route's proxy config was missing a shared setting, proxy_http_version 1.1, so NGINX defaults to HTTP/1.0 for all its upstream requests.
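For illustration, a minimal sketch of what that route's NGINX proxy config needs (the location path, cache zone, and upstream name are placeholders, not our real config):
location /cms/ {
    proxy_cache legacy_api_cache;       # the cache enabled for this legacy API (placeholder name)
    proxy_http_version 1.1;             # the shared setting that was missing; NGINX otherwise uses HTTP/1.0 upstream
    proxy_pass http://internal_envoy;   # the next Envoy hop, i.e. route (x)
}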
And Envoy will return HTTP 426 if the request is HTTP/1.0. (#170)
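This is easy to confirm against the listener from the config above (a sketch; the response headers shown are illustrative, but an HTTP/1.0 request line is enough to trigger the locally generated 426 on this Envoy version):
$ nc 127.0.0.1 10000
GET / HTTP/1.0
host: www.google.com

HTTP/1.1 426 Upgrade Required
server: envoy
content-length: 0
Note there is no x-envoy-upstream-service-time header here, since this 426 comes from Envoy itself rather than from an upstream.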
The reason this happens randomly is that I had deployed to only part of the machines in the cluster.
Thank you so much for your help!