Kong: bug: upstream host_header is not respected when preserving host

Created on 18 Feb 2020  ·  17 Comments  ·  Source: Kong/kong

Summary

When preserve_host in the Route entity is set to true, host_header in Upstream entity is not respected.

Steps To Reproduce

Configure Kong with:

_format_version: "1.1"
services:
- name: svc1
  host: my-backend.org
  routes:
  - name: r1
    hosts:
    - bar.com
    strip_path: true
    preserve_host: true
    paths:
    - /r1
upstreams:
- name: my-backend.org
  host_header: foo.com
  algorithm: round-robin
  targets:
  - target: "66.39.74.7:80"

And then fire:

http :8000/r1/echo host:bar.com
HTTP/1.1 200 OK
Accept-Ranges: none
Connection: keep-alive
Content-Encoding: gzip
Content-Length: 483
Content-Type: text/plain;charset=UTF-8
Date: Tue, 18 Feb 2020 21:49:11 GMT
Server: Apache
Vary: Accept-Encoding,User-Agent
Via: kong/2.0.0
X-Kong-Proxy-Latency: 1
X-Kong-Upstream-Latency: 164

Simple webservice echo test: make a request to this endpoint to return the HTTP request parameters and headers. Results available in plain text, JSON, or XML formats. See http://www.cantoni.org/2012/01/08/simple-webservice-echo-test for more details, or https://github.com/bcantoni/echotest for source code.

Array
(
    [method] => GET
    [headers] => Array
        (
            [Accept] => */*
            [Accept-Encoding] => gzip, deflate
            [User-Agent] => HTTPie/0.9.8
            [X-Real-Ip] => 127.0.0.1
            [X-Forwarded-Port] => 8000
            [X-Forwarded-Host] => bar.com
            [X-Forwarded-Proto] => http
            [X-Forwarded-For] => 127.0.0.1
            [Connection] => close
            [Host] => bar.com
        )

    [request] => Array
        (
        )

    [client_ip] => 127.0.0.1
    [time_utc] => 2020-02-18T21:49:11+0000
    [info] => Echo service from Scooterlabs (http://www.scooterlabs.com)
)

As seen in the response from upstream, the upstream sees the host as bar.com and not foo.com.

Additional Details & Logs

  • Kong version ($ kong version): 2.0 (doesn't work in previous versions either)

All 17 comments

There can be an argument that route's properties should take precedence.

The reason the expected behavior makes more sense here is that the Upstream and Service entities are conceptually closer to the backend service than a Route is.
There can be multiple routes in Kong forwarding traffic to the same upstream service and it is the Service owner who gets to decide how it wants to receive traffic from Kong.

This is one of those things where the Route entity being used not just for matching but also for manipulating requests creates a problem.

Hey, @locao , you added host_header in #4959. Do you have an opinion on this one?

This is one of those things where the Route entity being used not just for matching but also for manipulating requests creates a problem.

Well, isn't it just preventing the manipulation?

Isn't it more flexible the way it is? preserve_host is false by default, in which case the upstream's host_header is used; if you really want to keep the original request's Host header, you can set preserve_host=true on the route to preserve it. With the current behavior you can create a route without preserve_host and it uses the upstream's host_header.
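A minimal declarative sketch of the flexibility described above, under the current behavior as this comment describes it (route names, paths, and the target IP are illustrative):

```yaml
_format_version: "1.1"
services:
- name: svc1
  host: my-backend.org        # points at the Upstream entity below
  routes:
  - name: keep-client-host
    paths: [ "/keep" ]
    preserve_host: true       # client's Host header is sent upstream
  - name: use-host-header
    paths: [ "/rewrite" ]
    preserve_host: false      # upstream's host_header (foo.com) is sent
upstreams:
- name: my-backend.org
  host_header: foo.com
  targets:
  - target: "66.39.74.7:80"
```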

Can't the Service owner decide that all routes created for their service need preserve_host=false?

Can't the Service owner decide that all routes created for their service need preserve_host=false?

They certainly could. But an Upstream is "closer" and more concrete when you think about an "upstream service" in Kong, hence I feel the current behavior is incorrect.

From my point of view, if the user defined the Host header that must be used for a route, that should be respected. An upstream having a host defined is a normal, common configuration; a route overwriting the Host header is specific behavior, so I would guess that when users set the Host header for a route, they really want that route to behave distinctly.

On the other hand, I also agree with:

This is one of those things where the Route entity being used not just for matching but also for manipulating requests creates a problem.

I don't like ambiguity; I feel like the best approach to this would be removing the preserve_host option altogether.

Yes, host_header was added some 4 months ago, while preserve_host has been there from day one, and yes, they sure feel a bit conflicting.

I understand "the upstream is closer", but people have proposed that we add all the Service fields to the Route entity so that they could adjust some Service parameters per route. By that logic it would only make sense to move this closer to the client, yet we have refused to do so. preserve_host is indeed a weird exception.

In general, in the chain "route", "service", "upstream", it goes from specific to generic. So from that perspective it would make sense that a route property has higher precedence than an upstream property. But that is in general.

In this case, I think it would help to create a matrix of the options that define the (outgoing) Host header, based on the input parameters:

  • incoming Host-header
  • route.preserve_host
  • service.host
  • upstream.host_header
  • target

So imo this is what should happen (I added the healthcheck host, since in that case there will be no incoming Host-header):

incoming Host-header |route.preserve_host|service.host|upstream.host_header|target | upstream Host-header | (health-check host)
---------------------|---------------------|--------------|----------------------|--------------|----------------------|--------------------
foo.com | true | bar.com | n.a. | n.a. | foo.com | n.a.
foo.com | true | [upstream] | myhost.com | internal.com | foo.com | myhost.com
foo.com | true | [upstream] | nil | internal.com | foo.com | internal.com
foo.com | false | bar.com | n.a. | n.a. | bar.com | n.a.
foo.com | false | [upstream] | myhost.com | internal.com | myhost.com | myhost.com
foo.com | false | [upstream] | nil | internal.com | internal.com | internal.com
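For concreteness, here is a small Python sketch of the precedence in the table above. It models the proposed behavior in the matrix, not necessarily Kong's current implementation, and the function name and signature are made up for illustration:

```python
def resolve_host(incoming_host, preserve_host,
                 service_host=None, upstream_host_header=None, target=None):
    """Return (proxied Host header, active health-check host) per the
    proposed precedence table.

    service_host is used only when the Service points at a plain
    hostname; upstream_host_header/target apply when it points at an
    Upstream entity. None models the "nil"/"n.a." cells in the table.
    """
    if upstream_host_header is None and target is None:
        # No Upstream entity: nothing to health-check, so the proxied
        # Host is either the client's Host or the Service's host.
        return (incoming_host if preserve_host else service_host), None
    # Upstream entity: health checks fall back from host_header to target.
    healthcheck_host = upstream_host_header or target
    proxied = incoming_host if preserve_host else healthcheck_host
    return proxied, healthcheck_host
```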

The whole argument around specific to generic is flawed as far as I can see.

We evolved from the api entity in our model to routes and services, but didn't segregate the upstream and downstream concerns enough. Route matching and request manipulation are different concerns.

While preserve_host and strip_path exist, and probably need to keep existing for a long time for backwards compatibility, we should not be afraid of inverting the behavior. An API gateway should transparently take care of converting requests from foo.external.com to an application that is running internally on the network at bar.internal.com.

On another point, I might be missing something here but having a different upstream host-header and a different health-check host header is extremely confusing without any usability benefit at all. It would introduce more confusion from an observability standpoint.

There can and will be weird corner cases here, but at that point we have to tell users they are doing something wrong and should update their application (or, at that point, use a plugin if they really want to) to use a specific Host header.

An API gateway should transparently take care of converting requests from foo.external.com to an application that is running internally on the network at bar.internal.com.

But it does, doesn't it? Set preserve_host to false, and it does exactly that.

On another point, I might be missing something here but having a different upstream host-header and a different health-check host header is extremely confusing without any usability benefit at all. It would introduce more confusion from an observability standpoint.

No way to get around this. Health probes are not based on requests, so the Host header info is simply not available when the timer runs. If you match on "path", for example, you don't even know all possible incoming hostnames, so it is impossible to make this work.

Route matching and request manipulation are different concerns.

This is a fair comment, though I am not sure whether the required changes would be justified. What specific cases do you have that you currently cannot handle?

No way to get around this. Health probes are not based on requests, so the Host header info is simply not available when the timer runs. If you match on "path", for example, you don't even know all possible incoming hostnames, so it is impossible to make this work.

This is exactly why I'm recommending that the upstream respond on a single virtual host to all requests, no matter whether it is a proxied request or a health-check request.
If we don't make that differentiation, then the upstream is available at one specific virtual host and Kong can transparently handle the other host names. The app doesn't even need to be aware of them.

Now, if for some weird reason you do want that behavior, then unset host_header in your upstream and use the preserve_host behavior.
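In declarative terms, that suggestion amounts to something like the following sketch (the upstream name and target hostname are illustrative):

```yaml
upstreams:
- name: my-backend.org
  # host_header deliberately left unset: active health checks use the
  # target hostname, and routes with preserve_host: true keep the
  # client's Host header.
  targets:
  - target: "internal.example.com:80"
```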

I don't think it is weird for a single upstream to answer:
customer-a.app.com AND
customer-b.app.com

Aka *.app.com. But yes, strip_path and preserve_host as a whole are things I don't like that much; they have been in the product for a very long time, while upstream.host_header is a new one.

In general, in the chain "route", "service", "upstream", it goes from specific to generic. So from that perspective it would make sense that a route property has higher precedence than an upstream property. But that is in general.

In this case, I think it would help to create a matrix of the options that define the (outgoing) Host header, based on the input parameters:

  • incoming Host-header
  • route.preserve_host
  • service.host
  • upstream.host_header
  • target

So imo this is what should happen (I added the healthcheck host, since in that case there will be no incoming Host-header):

incoming Host-header | route.preserve_host | service.host | upstream.host_header | target | upstream Host-header | (health-check host)
---------------------|---------------------|--------------|----------------------|--------------|----------------------|--------------------
foo.com | true | bar.com | n.a. | n.a. | foo.com | n.a.
foo.com | true | [upstream] | myhost.com | internal.com | foo.com | myhost.com
foo.com | true | [upstream] | nil | internal.com | foo.com | internal.com
foo.com | false | bar.com | n.a. | n.a. | bar.com | n.a.
foo.com | false | [upstream] | myhost.com | internal.com | myhost.com | myhost.com
foo.com | false | [upstream] | nil | internal.com | internal.com | internal.com

I am a little confused about your matrix: the first case's service.host is bar.com while the second one's is [upstream]. I guess [upstream] means that service.host is equal to the upstream.name property, but what do you mean in the first case?
I thought service.host is always equal to upstream.name; is that not so?
Otherwise, what should the value of upstream.name be when service.host is bar.com, and how does the service match its upstream? Looking forward to your reply, thanks.

@hnotyet the [upstream] could be any internal Upstream entity name. But since it is an internal name, it should never be used as a Host header going upstream. That's why I wrote it as [upstream], to indicate that the name itself is irrelevant to this discussion.

Great points raised here all around, and if the behavior was the opposite the discussion could easily go the other way around ("bug: preserve_host is not respected when setting host_header").

The points of agreement are that nobody seems to be a big fan of preserve_host or mixing route matching and request manipulation in the same entity, but here's where we are.

As of Kong 2.x we must maintain the current behavior.

For Kong 3.x, if both upstream.host_header and route.preserve_host continue to exist, I also think we should keep the current behavior because it is the more flexible one. Rationale (uppercase means Kong entity): given that healthchecks are implemented in Kong as an Upstream-level concern (i.e. they _don't_ mean "is this Service alive" but rather "is this Upstream alive") and one Upstream can serve different Services matched by different hostnames, it does make sense that different Services may want to use different hostnames and the Upstream may end up using a different one for its healthchecks. (If the counterargument to this is "people should not be doing this", then this hits the more fundamental question of how opinionated Kong should be, and so far the answer has generally been "not much".)

I think @Tieske's table would be a useful addition to the docs (perhaps changing [upstream] to something like internal-upstream to avoid the confusion seen in @hnotyet's question), and the Admin API docs for upstream.host_header could be changed from

The hostname to be used as Host header when proxying requests through Kong.

to

The hostname to be used as Host header when performing active healthchecks and proxying requests through Kong, unless overridden during proxying using route.preserve_host.

This opens another question, though: what is the current behavior for passive healthchecks? Where do they get the host information from, and does it matter? From a quick look at the sources, I couldn't grasp it.

I am a user; I think:

  1. [upstream.host_header] should be for active health checks only.
  2. Passive health checks should go by the incoming Host header.
  3. [service.host] is valid when there is no Upstream entity and [route.preserve_host] is set to false.
  4. Respect the user's choice: when the user sets [route.preserve_host=true], other settings are ignored.
  5. Users can add or replace Host headers using the request-transformer plugin; we use it often.
  6. [route.strip_path] is a very important and useful function.

Please separate the responsibilities of each entity; think about it.

@hishamhm @bungle @hbagdi @Tieske @locao @hnotyet

This opens another question, though: what is the current behavior for passive healthchecks? Where does it get the host information from, and, does it matter? From a quick look at the sources, I couldn't grasp it.

This question is irrelevant: there is no "host" with passive healthchecks, because there is no probe request. Stated otherwise, passive checks inspect the actual request and hence always have the Host from that request; that is by definition one-to-one.
This problem only exists for active healthchecks, because they run without an actual request, so the inputs in the first three columns of the table are unavailable when constructing the probe request.

ALSO: the table is what would make sense to me; I do not think it matches the current behaviour, though. If we consider it a bug we can change it; if not, we should not change it. (Router path behaviour versioning, anyone?)
