First, this problem happens randomly and I still cannot repro it. I'm posting this issue to see if you have any ideas, and I will update it once I have more information.
In my case, I have a remote dependency: an NGINX server with HTTP/2 enabled. I connect to it via Envoy:
My service ---HTTP/1.1---> Envoy ---HTTP/2---> NGINX Plus (SLB)
OS: ubuntu 14.04
Envoy: v1.5.0
Here is the relevant part of my (much larger) config:
admin:
  access_log_path: /tmp/admin_access.log
  address:
    socket_address: { address: 127.0.0.1, port_value: 9901 }
static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address: { address: 127.0.0.1, port_value: 10000 }
    filter_chains:
    - filters:
      - name: envoy.http_connection_manager
        config:
          stat_prefix: ingress_http
          codec_type: AUTO
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: ["www.google.com"]
              routes:
              - match: { prefix: "/" }
                route:
                  timeout: 60s
                  host_rewrite: www.google.com
                  cluster: service_google
          http_filters:
          - name: envoy.router
  clusters:
  - name: service_google
    connect_timeout: 0.25s
    type: STRICT_DNS
    dns_lookup_family: V4_ONLY
    lb_policy: ROUND_ROBIN
    hosts:
    - socket_address:
        address: www.google.com
        port_value: 443
    http2_protocol_options: {}
    tls_context:
      sni: www.google.com
      common_tls_context:
        alpn_protocols: h2
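As a debugging aside, the admin interface configured above can show which protocol Envoy is using for this cluster's upstream connections (a sketch; upstream_cx_http1_total and upstream_cx_http2_total are Envoy's documented per-cluster counters, and service_google is the cluster name from the config above):
$ curl -s http://127.0.0.1:9901/stats | grep 'cluster.service_google.upstream_cx_http'
With http2_protocol_options set on the cluster, it should be the upstream_cx_http2_total counter that increments as traffic flows.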
From the service log, I can see that some of the response status codes are 426 (~1%), but I cannot repro it reliably.
However, I did manage to reproduce it once with nc and got the following:
$ nc 127.0.0.1 12345
GET /cms/service/search?search=SEX&page_num=0 HTTP/1.1
host: A_PRIVATE_DOMAIN
HTTP/1.1 426 Upgrade Required
server: envoy
date: Thu, 01 Feb 2018 07:42:58 GMT
content-length: 0
x-envoy-upstream-service-time: 2
GET /cms/service/search?search=SEX&page_num=0 HTTP/1.1
host: A_PRIVATE_DOMAIN
HTTP/1.1 200 OK
server: envoy
date: Thu, 01 Feb 2018 07:43:06 GMT
content-type: application/json
content-length: 5428
vary: Accept-Encoding
x-api-version: bf8d7cd
etag: "j2jBnN6O4An7p/NjdexSWjaih5Q="
content-md5: A/Vrt9tfWp4jxmmf/sbnJg==
api-version: 4.0.0
request-id: a31779e4-9ee4-409e-83da-6290f81498bb
response-time: 96
x-cache-status: REVALIDATED
x-envoy-upstream-service-time: 63
{"pages":5,...
Envoy access log:
[2018-02-01T10:07:44.756Z] "GET /cms/service/search?search=Meet%20The%20Mill&page_num=0 HTTP/1.1" 426 - 0 0 0 0 "172.30.1.168" "-" "f08b0651-572a-96c9-a9d1-ade1aac6efbe" "A_PRIVATE_DOMAIN" "172.30.1.133:443" nginx_cluster - 172.30.1.168
I can get a 400 from Envoy randomly as well:
HTTP/1.1 400 Bad Request
server: envoy
date: Thu, 01 Feb 2018 10:13:22 GMT
content-length: 0
x-envoy-upstream-service-time: 1
I am wondering whether this 426 is generated by the remote NGINX and Envoy just returns it directly to the client.
More info: www.google.com in the config above just stands in for the real upstream domain. I have tried the following variations of the config:
- http2_protocol_options: {} and alpn_protocols: h2
- alpn_protocols: h2,http/1.1
- codec_type: HTTP1
Maybe this is an issue of NGINX Plus.
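For reference, the alpn_protocols: h2,http/1.1 variant mentioned above goes in the cluster's tls_context from the config earlier in this issue (a sketch only, with www.google.com still standing in for the real domain):
    tls_context:
      sni: www.google.com
      common_tls_context:
        alpn_protocols: h2,http/1.1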
But I think Envoy should not pass the upstream's 426 status straight through to the downstream.
FYI, it looks to me like the 426 is coming from NGINX, not Envoy (same for the 400). You can tell this from the x-envoy-upstream-service-time header, which is only present when Envoy actually received a response from the upstream. Beyond that, no idea right now; this is the first I have heard of an issue like this. I might ask NGINX to help debug if you are paying for Plus already?
Yeah, I'd argue that if the upstream wants H2 and Envoy is configured to do both h2 and HTTP/1, it's an upstream bug that H2 is not negotiated, so I don't think it's worth the complexity of adding functionality to Envoy to retry with H2 only.
Admittedly it would be a bit confusing if Envoy were configured for HTTP/1.1 only, a user were doing H2-to-Envoy, and they got a proxied 426 from an Envoy-to-upstream HTTP/1 connection, but I think the best we could do at that point would be a configuration option to swallow the 426 and return a 50x instead.
One other pretty bizarre thing here: NGINX Plus appears to be returning 426 when Envoy is speaking h2, since Envoy does not negotiate. Given that it's random, this sounds to me like some caching bug in their server.
We have paid for NGINX Plus. And how should we debug this?
And how should we debug this?
Ask NGINX to help determine why it is returning a 426? I'm not going to claim there is no chance this is a bug in Envoy, but there is not much to go on here.
Thanks, we will reach out to NGINX for help.
BTW, for now we are going to allow plain HTTP traffic as a workaround.
I finally found the root cause: it's an issue on our side, not in Envoy or NGINX.
The full route is like this:
Service --> Envoy --> NGINX SLB --(x)--> Envoy --> Another Service
Route (x) is where the problem happens.
This route is for a legacy API that has the NGINX cache enabled for performance reasons, but this route's proxy config was missing a shared setting, proxy_http_version 1.1, so NGINX defaults to HTTP/1.0 for all its upstream requests.
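For illustration, a minimal sketch of what that route's NGINX proxy config needs (the location path, cache zone, and upstream name are placeholders, not our real config):
location /cms/ {
    proxy_cache legacy_api_cache;       # the cache enabled for this legacy API (placeholder name)
    proxy_http_version 1.1;             # the shared setting that was missing; NGINX otherwise uses HTTP/1.0 upstream
    proxy_pass http://internal_envoy;   # the next Envoy hop, i.e. route (x)
}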
And Envoy will return HTTP 426 if the request is HTTP/1.0. (#170)
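This is easy to confirm against the listener from the config above (a sketch; the response headers shown are illustrative, but an HTTP/1.0 request line is enough to trigger the locally generated 426 on this Envoy version):
$ nc 127.0.0.1 10000
GET / HTTP/1.0
host: www.google.com

HTTP/1.1 426 Upgrade Required
server: envoy
content-length: 0
Note there is no x-envoy-upstream-service-time header here, since this 426 comes from Envoy itself rather than from an upstream.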
The reason this happens randomly is that I had deployed to only part of the machines in the cluster.
Thank you so much for your help!