At least in the cases where we use SO_ORIGINAL_DST to determine the destination IP address, the Host header field is set to the SO_ORIGINAL_DST IP address. Instead the Host header field (and :authority) should not change in these cases. In particular, if there wasn't a Host header field before, none should be added, and if there was a host header field then it should be preserved.
AFAICT, the only time we should change the Host header field is when we manually override the Host in FullyQualifiedAuthority::normalize().
/cc @seanmonstar @hawkw
AFAICT, the only time we should change the Host header field is when we manually override the Host in FullyQualifiedAuthority::normalize().
In fact, I think we shouldn't even change the Host header field in that case, as of now. When we implement TLS then we may need to have a way to do this Host header field override only for the case where we're using the Conduit TLS infrastructure (i.e. Conduit-to-Conduit transport). Otherwise I think we should not touch it.
Also, I say "Host header field" everywhere here, but the same thinking applies to :authority for HTTP/2.
To clarify the exact current behavior:
1. If an accepted HTTP/1 request has an absolute-form request-target, the authority replaces any Host header that may exist, per RFC7230 Section 5.4.
2. If the request-target is not absolute-form, and there is a Host header, it is completely untouched.
3. If the request-target is not absolute-form, and there is no Host header, the Host will indeed get set to the SO_ORIGINAL_DST address, but more as an accident than on purpose.
The reason for the 3rd case is that the request-target (Uri) must be built for telemetry recording, and to pass to hyper's Client. Inside the Client, hyper will automatically set the Host header if one does not exist.
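To make the three cases easier to compare, here is a minimal sketch of the behavior as described in this comment. It is not the proxy's actual code: `HostOutcome` and `current_host_behavior` are hypothetical names introduced only for illustration, and the inputs are assumed to have been parsed already.

```rust
/// Hypothetical summary of what ends up in the outbound Host header.
#[derive(Debug, PartialEq)]
enum HostOutcome {
    /// Absolute-form target: its authority replaces any existing Host.
    ReplacedByTargetAuthority(String),
    /// Not absolute-form, Host present: the header is left untouched.
    Untouched(String),
    /// Not absolute-form, no Host: the client library fills it in from
    /// the Uri that was built from the SO_ORIGINAL_DST address.
    FilledFromOriginalDst(String),
}

/// Models the three cases listed in the comment (an illustration only).
fn current_host_behavior(
    absolute_form_authority: Option<&str>,
    host_header: Option<&str>,
    original_dst: &str,
) -> HostOutcome {
    match (absolute_form_authority, host_header) {
        (Some(authority), _) => {
            HostOutcome::ReplacedByTargetAuthority(authority.to_string())
        }
        (None, Some(host)) => HostOutcome::Untouched(host.to_string()),
        (None, None) => HostOutcome::FilledFromOriginalDst(original_dst.to_string()),
    }
}
```

The third arm is the one this issue objects to: the original message had no Host header, yet the forwarded message gains one.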
If an accepted HTTP/1 request has an absolute-form request-target, the authority replaces any Host header that may exist, per RFC7230 Section 5.4.
RFC7230 Section 5.4 applies only to HTTP proxies. The Conduit proxy isn't acting as an HTTP proxy, because clients don't connect to it using CONNECT. It is acting as a gateway or a transparent proxy, so the UA and origin server requirements apply to it, not the proxy requirements. From the RFC:
However, an HTTP-to-HTTP gateway that wishes to interoperate with
third-party HTTP servers ought to conform to user agent requirements
on the gateway's inbound connection.
and
An "interception proxy" [RFC3040] (also commonly known
as a "transparent proxy" [RFC1919] or "captive portal") differs from
an HTTP proxy because it is not selected by the client. Instead, an
interception proxy filters or redirects outgoing TCP port 80 packets
(and occasionally other common port traffic).
The reason for the 3rd case is that the request-target (Uri) must be built for telemetry recording
Consider the two cases:
Host: 1.2.3.4
In the second case, where we used SO_ORIGINAL_DST, the original URL was probably not http://1.2.3.4/.... Much more likely, the original URL was using a hostname.
Consequently, it would be really misleading to mix the telemetry for these two cases. Instead we should bucket the telemetry for these two cases separately.
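One way to picture the separate bucketing is to make the source of the authority part of the telemetry key. This is only a sketch under that assumption; `AuthoritySource` and `RequestCounts` are hypothetical names, not existing Conduit types.

```rust
use std::collections::HashMap;

/// Hypothetical label recording where the authority came from.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum AuthoritySource {
    /// The client supplied it (Host header or absolute-form target).
    FromRequest,
    /// The request carried no authority; the address is SO_ORIGINAL_DST.
    FromOriginalDst,
}

/// Request counters keyed by (source, authority), so that an explicit
/// `Host: 1.2.3.4` never shares a bucket with an inferred 1.2.3.4.
#[derive(Default)]
struct RequestCounts {
    counts: HashMap<(AuthoritySource, String), u64>,
}

impl RequestCounts {
    fn record(&mut self, source: AuthoritySource, authority: &str) {
        *self.counts.entry((source, authority.to_string())).or_insert(0) += 1;
    }

    fn get(&self, source: AuthoritySource, authority: &str) -> u64 {
        self.counts
            .get(&(source, authority.to_string()))
            .copied()
            .unwrap_or(0)
    }
}
```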
and to pass to hyper's Client. Inside the Client, hyper will automatically set the Host header if one does not exist.
We need to be able to tell Hyper which authority to use as an override for routing purposes (the FullyQualifiedAuthority) and for telemetry purposes, in a way that doesn't cause it to change the HTTP message.
The Conduit proxy isn't acting as an HTTP proxy, because clients don't connect to it using CONNECT.
Interesting, I hadn't seen the distinction between the different intermediary types! One slight correction, though: the rule in 5.4 isn't about CONNECT requests, but rather the opposite. Say you've stood up a proxy for your PC, and want to fetch Conduit's website, and your client doesn't know how to tunnel, and just sends this:
GET https://conduit.io/docs HTTP/1.1
Host: www.conduit.io
That's a case where the proxy should replace the Host with Host: conduit.io.
Much more likely, the original URL was using a hostname.
I don't follow. If it was using a hostname, it would be included in the request, wouldn't it?
We need to be able to tell Hyper which authority
That much, at least, is possible! hyper will only set the Host header if one does not already exist. So, if in the proxy we know of a better hostname, we can set it prior to passing to hyper.
However, if we specifically want there to be no host header, hyper doesn't understand that. I've just added a config option to hyper's client to allow disabling automatic host header addition. (In most cases, it's a very useful feature, allowing client.get("https://conduit.io") to set the Host header for you.)
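The behavior described here can be modeled in a few lines. This is a plain-Rust illustration of the rule, not hyper's code or API; `outbound_host` and the `auto_set_host` flag are hypothetical names standing in for the new config option.

```rust
/// Decide what Host value (if any) the outbound request carries.
///
/// * An existing Host header is always preserved.
/// * A missing Host is filled in from the Uri authority only when
///   `auto_set_host` is true (the default described in the thread).
/// * With `auto_set_host` false, a missing Host stays missing.
fn outbound_host(
    existing_host: Option<&str>,
    uri_authority: &str,
    auto_set_host: bool,
) -> Option<String> {
    match existing_host {
        Some(host) => Some(host.to_string()),
        None if auto_set_host => Some(uri_authority.to_string()),
        None => None,
    }
}
```

For a transparent proxy the last arm is the interesting one: with automatic addition disabled, a request that arrived without a Host header leaves without one.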
I don't follow. If it was using a hostname, it would be included in the request, wouldn't it?
An HTTP/1.0 request often won't have a Host header.
(In most cases, it's a very useful feature, allowing client.get("https://conduit.io") to set the Host header for you.)
I fully support that behavior of hyper, for those high-level APIs. Really the concern I have here is for transparent proxies, which really want a very low-level interface where the default is for nothing to change unless it is explicitly asked for something to be changed.
One slight correction, though: the rule in 5.4 isn't about CONNECT requests, but rather the opposite. Say you've stood up a proxy for your PC, and want to fetch Conduit's website, and your client doesn't know how to tunnel, and just sends this:
GET https://conduit.io/docs HTTP/1.1
Host: www.conduit.io
That's a case where the proxy should replace the Host with Host: conduit.io.
That's a case where an HTTP proxy should replace the Host. However, the definition of an HTTP proxy is one that the client specifically CONNECTs to. Since the client never CONNECTs to Conduit, it isn't an HTTP proxy. Therefore, when Conduit receives a GET request like the one you quoted, it shouldn't apply the rule for HTTP proxies rewriting the Host header.
However, the definition of an HTTP proxy is one that the client specifically CONNECTs to.
Reference?
My understanding of "HTTP Proxies" is that they receive "proxy requests" where the resource is in absolute-form. The http_proxy env variable does not cause curl to issue CONNECT requests unless the target is HTTPS.
FWIW, I don't think that Conduit should do anything special with HTTP proxy requests, because they're presumably being configured in the context that the _destination_ is a proxy.
For example, imagine I run
http_proxy=http://outbound-squid-proxy.corp.example.com curl http://conduit.io
this would result in a request like
GET http://conduit.io/
Host: outbound-squid-proxy.corp.example.com
I think we want to route the request to outbound-squid-proxy.corp.example.com and not conduit.io directly.
Nope, I'm totally wrong about how that curl command would work. Disregard.
We _will_ have to figure out how to configure proxies, but that can be done later.
However, the definition of an HTTP proxy is one that the client specifically CONNECTs to. Since the client never CONNECTs to Conduit, it isn't an HTTP proxy. Therefore, when Conduit receives a GET request like the one you quoted, it shouldn't apply the rule for HTTP proxies rewriting the Host header.
Reference?
Sorry, I was sloppy. What I meant is better described in the RFC text I quoted above:
An "interception proxy" [RFC3040] (also commonly known
as a "transparent proxy" [RFC1919] or "captive portal") differs from
an HTTP proxy because it is not selected by the client. Instead, an
interception proxy filters or redirects outgoing TCP port 80 packets
(and occasionally other common port traffic).
In our case, the client application is totally unaware of Conduit, so Conduit can't be acting as an HTTP proxy for it. In particular, as you mentioned in the Gitter discussion, the application that Conduit is proxying may be intending to communicate with an HTTP proxy like this:
Application <-> Conduit <-> Proxy
Therefore, we have to be really careful that we don't do "proxy stuff" that the application is expecting that proxy to do.
More generally, it goes back to the transparency idea: Unless we've explicitly chosen for the proxy to do some non-transparent behavior, all the behavior of the proxy should be transparent by default.
If we really are trying to be transparent in the face of http-proxy requests, we should never route proxy requests through discovery/balancing, instead using the original dst. The only indication that the request is destined for a proxy is the absolute-form resource. And the only knowledge we have of the proxy address is the original dst on the socket...
We can guess that the request was meant for a proxy if it has an absolute-form target in HTTP/1, but it's only a guess! It's not illegal at all to send an absolute-form target directly to the end server. And for HTTP/2, requests always have the :authority, so we can't really even guess there.
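As an illustration of how weak that signal is, here is a rough sketch that classifies an HTTP/1 request-target into the four forms from RFC 7230 Section 5.3. The names `TargetForm`, `classify_target`, and `might_target_a_proxy` are hypothetical, and the parsing is deliberately simplified (for example, it does not validate the authority in a CONNECT target).

```rust
/// The four request-target forms from RFC 7230 Section 5.3.
#[derive(Debug, PartialEq)]
enum TargetForm {
    Origin,    // e.g. "/docs"
    Absolute,  // e.g. "http://conduit.io/docs"
    Authority, // e.g. "conduit.io:443", only used with CONNECT
    Asterisk,  // "*", only used with OPTIONS
}

/// Simplified classification; returns None for targets it cannot place.
fn classify_target(method: &str, target: &str) -> Option<TargetForm> {
    if target.is_empty() {
        return None;
    }
    if target == "*" {
        return if method == "OPTIONS" { Some(TargetForm::Asterisk) } else { None };
    }
    if method == "CONNECT" {
        return Some(TargetForm::Authority);
    }
    if target.starts_with('/') {
        return Some(TargetForm::Origin);
    }
    if let Some((scheme, rest)) = target.split_once("://") {
        let valid_scheme = scheme.chars().next().map_or(false, |c| c.is_ascii_alphabetic())
            && scheme
                .chars()
                .all(|c| c.is_ascii_alphanumeric() || matches!(c, '+' | '-' | '.'));
        if valid_scheme && !rest.is_empty() {
            return Some(TargetForm::Absolute);
        }
    }
    None
}

/// Only a hint: absolute-form may also be sent directly to an origin server.
fn might_target_a_proxy(method: &str, target: &str) -> bool {
    classify_target(method, target) == Some(TargetForm::Absolute)
}
```

A `true` result from `might_target_a_proxy` says nothing definite about the next hop, which is the point being made in this comment.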
OK, so it's probably the case that we cannot detect proxy configuration transparently and it needs to be explicitly configured via the control plane?
OK, so it's probably the case that we cannot detect proxy configuration transparently and it needs to be explicitly configured via the control plane?
I do not know what you're asking here, but I'll guess: We don't provide any way for the user to configure their app to use Conduit as an HTTP proxy, so we can assume that we're not being used as a proxy. If/when that changes, regardless of how that configuration is done, we'll have to revisit. (I think it would be nice to have a configuration option for Conduit that allows the application to use it as an HTTP proxy and avoid the iptables stuff.)
If we really are trying to be transparent in the face of http-proxy requests, we should never route proxy requests through discovery/balancing, instead using the original dst. The only indication that the request is destined for a proxy is the absolute-form resource. And the only knowledge we have of the proxy address is the original dst on the socket...
I don't think we should take on the extra complication of trying to detect when we're talking to a proxy. One of the ways in which we're intentionally non-transparent is that we do L7 load balancing and different service discovery.
Therefore, we have to be really careful that we don't do "proxy stuff" that the application is expecting that proxy to do.
In particular, in theory the application might be sending hop-by-hop header fields (as noted in the Connection header field) that are intended for its HTTP proxy to consume. It would be very wrong for us to consume the hop-by-hop header fields. In particular, "Both the Proxy-Authenticate and the Proxy-Authorization header fields are hop-by-hop". The application would fail to authenticate to the proxy if we consumed the Proxy-Authenticate and the Proxy-Authorization header fields:
Application <-> Conduit <-> Proxy
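For concreteness, here is a small sketch of how the header fields named in Connection can be identified, which is the set an intermediary would have to decide whether to consume or pass through. The function name `connection_listed_headers` is hypothetical, and headers are assumed to be available as simple name/value pairs rather than a real header map.

```rust
/// Collect the field names listed in any Connection header, lowercased.
/// These are the fields the sender marked as hop-by-hop for the next hop.
fn connection_listed_headers(headers: &[(&str, &str)]) -> Vec<String> {
    let mut names = Vec::new();
    for (name, value) in headers {
        // Header field names are case-insensitive.
        if name.eq_ignore_ascii_case("connection") {
            for token in value.split(',') {
                let token = token.trim();
                if !token.is_empty() {
                    names.push(token.to_ascii_lowercase());
                }
            }
        }
    }
    names
}
```

In the Application <-> Conduit <-> Proxy arrangement, anything returned here was addressed to the application's next HTTP hop, which the application believes is the Proxy, not Conduit.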
It would be very wrong for us to consume the hop-by-hop header fields.
So, it'd be useful to know then that we currently do strip off connection headers.
We'd also need to do so if we ever upgrade h1 to h2 internally, as most h1 connection headers are illegal in h2.
So, it'd be useful to know then that we currently do strip off connection headers.
Yeah, I know. That seems wrong to me but I don't know all the consequences of fixing it.
We'd also need to do so if we ever upgrade h1 to h2 internally, as most h1 connection headers are illegal in h2.
Upgrading to HTTP/2 isn't so much about removing hop-by-hop header fields as it is about replacing them with HTTP/2's alternative mechanisms.
I don't think that we can be transparent to HTTP proxies unless CONNECT is used; and I think that, in order to use HTTP proxies, users will have to explicitly provide configuration for this in the controller (i.e. routing).
If the proxy gets an HTTP request, it really _doesn't_ have any reliable way to know whether the request is destined for a proxy or for the end destination, and so I think we have to strip connection headers, for the sake of not leaking credentials (we definitely have to strip the Connection header, but I'll argue we need to strip others as well).
Given a configuration like:
http_proxy=http://outbound-squid-proxy.corp.example.com curl http://conduit.io/docs
We'll get a request like:
GET http://conduit.io/docs HTTP/1.1
Host: conduit.io
How are we to know that this shouldn't go directly to the external destination?
I think conduit will need to be configured explicitly to route through other intermediaries.