Title: Envoy does not adhere to HTTP/2 RFC 7540
Description:
RFC 7540 Sections 9.1.1 and 9.1.2 specify that when a request arrives on a re-used HTTP/2 connection at a server that is authoritative for the hostname but is not the intended origin, the server should return a 421 (Misdirected Request) response. This can happen when two servers share an IP address and are distinguished by SNI: one (e.g., a.example.com) presents a wildcard certificate for *.example.com, while another (b.example.com) presents a non-wildcard certificate. Because the wildcard certificate also covers b.example.com, requests meant for b.example.com will be sent down the re-used HTTP/2 connection for a.example.com. In this situation a.example.com should send back a 421 to indicate the request was destined for b.example.com, which forces the browser to establish a new connection, re-negotiate SNI (and thus reach the correct backing server), and route the request to the correct origin.
@alyssawilk @PiotrSikora thoughts on this? I haven't read the relevant RFCs in detail to fully understand what is needed.
Intersection of H2 specs and TLS handshake? I eagerly anticipate Piotr sorting this out :-P
The gist is that browsers coalesce HTTP/2 connections pretty aggressively: when a browser opens a connection to www.example.com and is presented with a certificate for *.example.com during the TLS handshake, it will re-use this connection for all requests to *.example.com hostnames, as long as the hostname resolves to the same IP (and some browsers don't even care about that).
As long as all *.example.com hostnames are served by the same listener/filter chain, this shouldn't be an issue in Envoy, since routing happens on a per-request, not per-connection, basis (please correct me if I'm wrong).
However, if www.example.com (with the *.example.com certificate) is served by one listener/filter chain, and app.example.com is served by another, then we have an issue: connections are latched to a single listener/filter chain for the lifetime of the connection (again, please correct me if I'm wrong), so if the connection to www.example.com is established first, requests to app.example.com will be coalesced onto that connection, use the configuration for www.example.com, and be forwarded to the wrong backend.
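Concretely, this scenario corresponds to a listener whose filter chains are selected by SNI, roughly like the fragment below (a sketch: only the match criteria are shown, and the TLS contexts, certificates, and HTTP connection managers for each chain are assumed and omitted). The chain is chosen once, from the SNI sent during the TLS handshake, so a coalesced request for app.example.com riding on the existing www.example.com connection never reaches the second chain:

  "filter_chains": [
    {
      "filter_chain_match": { "server_names": ["www.example.com"] }
    },
    {
      "filter_chain_match": { "server_names": ["app.example.com"] }
    }
  ]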
One solution would be to send 421 Misdirected Request response to requests for hostnames that are not configured on a given listener/filter chain (but this wouldn't work if *.example.com is configured), or send 421 Misdirected Request response to requests for hostnames that are configured on other listeners/filter chains (but this requires a global list of all configured hostnames).
Another solution would be to use the HTTP/2 ORIGIN frame (RFC 8336) to advertise the allowed hostnames on a given listener/filter chain (but this requires a global list as well, and the extension is supported by only a few clients).
Is it possible to reprioritize this issue? We have a use case where we have thousands of services behind hundreds of FQDNs that are served by a set of identical Envoys (all using a wildcard TLS cert). We hit this exact issue when HTTP/2 is enabled, but not when enforcing usage of HTTP/1.1.
Here's the CVE for this vulnerability https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-11767
CC @envoyproxy/security-team
One solution would be to send 421 Misdirected Request response to requests for hostnames that are not configured on a given listener/filter chain (but this wouldn't work if *.example.com is configured), or send 421 Misdirected Request response to requests for hostnames that are configured on other listeners/filter chains (but this requires a global list of all configured hostnames).
I can think of 3 alternatives:
1. What if you could specify the HTTP response code for an RBAC filter DENY? Then the management server that configured the HCM could add an RBAC policy for the server names it allows on that HCM and generate a 421 on DENY.
2. The management server could program the SNI server name check into a Lua filter, generating a 421 when it doesn't match (a rough sketch follows below).
3. Add a dedicated filter that can be configured with the acceptable SNI server names.
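A rough sketch of alternative 2, assuming the standard Lua HTTP filter and that each filter chain is meant to serve exactly the hostname whose SNI selected it (a chain that legitimately serves several names would compare against a configured list instead):

  function envoy_on_request(request_handle)
    -- SNI that selected this filter chain during the TLS handshake.
    local sni = request_handle:streamInfo():requestedServerName()
    -- Hostname the request is actually addressed to.
    local authority = request_handle:headers():get(":authority") or ""
    local host = string.gsub(authority, ":%d+$", "")  -- drop an optional :port suffix
    if sni ~= nil and sni ~= "" and host ~= sni then
      -- Coalesced request for a hostname this chain is not meant to serve.
      request_handle:respond({[":status"] = "421"}, "misdirected request")
    end
  end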
What is the plan for fixing https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-11767 in envoy?
AFAIK there is no plan currently. Someone needs to own this issue and drive a resolution if they are passionate about fixing it.
I hacked up a 421 response when the virtual host lookup fails and this works for a simple case that I tried. If this is a reasonable approach I'd need a bit of help to polish it up and figure out how to write tests for it.
https://gist.github.com/jpeach/e01f5f752eed5ffd09ea1f18634d1fc5
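As a point of comparison (this is not what the gist does), a similar effect can be approximated purely in routing configuration by giving each listener a catch-all virtual host that answers 421 for any authority it is not meant to serve, using the standard direct_response route action. The hostnames and the www_backend cluster below are placeholders:

  "route_config": {
    "virtual_hosts": [
      {
        "name": "www",
        "domains": ["www.example.com"],
        "routes": [
          { "match": { "prefix": "/" }, "route": { "cluster": "www_backend" } }
        ]
      },
      {
        "name": "misdirected",
        "domains": ["*"],
        "routes": [
          { "match": { "prefix": "/" }, "direct_response": { "status": 421 } }
        ]
      }
    ]
  }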
I think I managed to find a workaround:
On the Envoy instance I added an "envoy.lua" HTTP filter that checks whether the response code is 404 (the code generated for a non-existent route) AND whether the "x-envoy-upstream-service-time" header is NOT present.
The Lua code:
function envoy_on_response(response_handle)
  if response_handle:headers():get(":status") == "404" and response_handle:headers():get("x-envoy-upstream-service-time") == nil then
    response_handle:headers():replace(":status", "421")
  end
end
Example configuration on Envoy (fetched by LDS):
"http_filters": [
{
"name": "envoy.lua",
"typed_config": {
"@type": "type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua",
"inline_code": `
function envoy_on_response(response_handle)
if response_handle:headers():get(":status") == "404" and response_handle:headers():get("x-envoy-upstream-service-time") == nil then
response_handle:headers():replace(":status", "421")
end
end`
}
},
{
"name": "envoy.filters.http.router"
}
]
Nice! That's a bit cleaner than my equivalent 👍
@jpeach @lunighty Is there a plan to have a fix in Envoy code or do you think the current approach with EnvoyFilter is sufficient?