Httpx: Cannot request resources with paths that include a mix of path separator slashes and URL-encoded slashes

Created on 1 Dec 2020 · 13 Comments · Source: encode/httpx

With httpx 0.15.0 and newer, we are getting some unexpected behaviour with URLs that include url-encoded slashes.

For example, a call to the URL "http://a.b/c/a%2Fb" ends up requesting the path "/c/a/b" from the server, instead of "/c/a%2Fb", which is what we would expect to happen.

Using recent httpx versions, we don't see any way to request resources including a mix of path separator slashes and URL-encoded slashes.

bug user-experience

All 13 comments

My interpretation of https://tools.ietf.org/html/rfc3986, Section 2.1, is that "%2F" is a valid way of including a non-delimiting slash in a URL.

I get the impression that the problem is here: https://github.com/encode/httpx/blob/master/httpx/_models.py#L237

unquote(a/b) = a/b
unquote(a%2Fb) = a/b

and as such there is no way to distinguish the two, after unquoting.
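To make the point above concrete, here is a minimal sketch using only the standard library's `urllib.parse` (not httpx's own code). The variable names are purely illustrative. It shows that both forms decode to the same string and that re-quoting cannot recover which slashes were originally encoded.

```python
from urllib.parse import quote, unquote

encoded = "a%2Fb"    # a single path segment containing a literal slash
separated = "a/b"    # two path segments, "a" and "b"

# Both inputs decode to the same string, so the distinction is lost.
decoded_from_encoded = unquote(encoded)
decoded_from_separated = unquote(separated)
print(decoded_from_encoded == decoded_from_separated)

# Re-quoting does not help: "/" is in quote()'s default safe set,
# so the decoded slash stays a plain separator.
requoted_default = quote(decoded_from_encoded)
print(requoted_default)

# With safe="" every slash is encoded, including real separators.
requoted_all = quote(decoded_from_separated, safe="")
print(requoted_all)
```

In other words, once a path has been unquoted, neither quoting strategy can reproduce both "a/b" and "a%2Fb" correctly.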

Hi @jbaayen

Can you share a snippet of code you're trying to run?

In the _models.py file you linked to, you can see that there's also a raw_path property, and that's what gets used in .raw, which is used when calling the HTTP transport.

If you're just passing in a string URL then it's possible we do some pre-processing based on .path that we shouldn't be doing (eg applying base URLs).

Hi @florimondmanca, we use the AsyncClient get and post methods, passing in the URL as a string.

A snippet would be this: await client.get("http://localhost/a%2Fc")

@jbaayen The issue you mention wouldn't occur in that snippet, but does occur if the client is using a base_url. This is because the url is corrupted when accessing merge_url.path: https://github.com/encode/httpx/blob/master/httpx/_client.py#L330

In my opinion, URL.path should not be decoding slashes, and I wonder why it is doing any decoding at all.
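As a rough illustration of why merging through a decoded path loses the encoding, here is a sketch built on `urllib.parse` only. Both functions are hypothetical stand-ins written for this comment, not httpx's actual merge logic. One joins percent-decoded paths, the other joins the paths as given.

```python
from urllib.parse import unquote, urlsplit, urlunsplit


def merge_via_decoded_path(base_url: str, relative: str) -> str:
    """Hypothetical lossy merge: decodes both paths before joining them."""
    base = urlsplit(base_url)
    merged = unquote(base.path).rstrip("/") + "/" + unquote(relative).lstrip("/")
    return urlunsplit((base.scheme, base.netloc, merged, "", ""))


def merge_via_raw_path(base_url: str, relative: str) -> str:
    """Hypothetical lossless merge: joins the paths without decoding them."""
    base = urlsplit(base_url)
    merged = base.path.rstrip("/") + "/" + relative.lstrip("/")
    return urlunsplit((base.scheme, base.netloc, merged, "", ""))


lossy = merge_via_decoded_path("http://localhost:8000", "/a%2Fb")
lossless = merge_via_raw_path("http://localhost:8000", "/a%2Fb")
print(lossy)
print(lossless)
```

Under these assumptions, only the second variant keeps "%2F" intact in the merged URL.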

Thanks all, useful pointers. Here's a sample reproduction case:

"Echo path" server:

# Deps: pip install uvicorn
# Run: uvicorn app:app

async def app(scope, receive, send):
    assert scope["type"] == "http"

    body = b'{"raw_path": "%s"}' % scope["raw_path"]

    await send(
        {
            "type": "http.response.start",
            "status": 200,
            "headers": [[b"content-type", b"text/plain"]],
        }
    )
    await send({"type": "http.response.body", "body": body})

Test script:

import httpx


with httpx.Client() as client:
    r = client.get("http://localhost:8000/a%2Fb")
    print("default:", r.json())


with httpx.Client(base_url="http://localhost:8000") as client:
    r = client.get("/a%2Fb")
    print("base_url:", r.json())

Output:

$ python example.py
default: {'raw_path': '/a%2Fb'}
base_url: {'raw_path': '/a/b'}

So indeed base_url seems to corrupt any URL-encoded slash characters, instead of passing them through.

#1406 is an interesting take, but for all intents and purposes it might be too broad of a fix if the issue is really just that base_url does something off.

I _guess_ we had some discussion about whether .path should URL-decode, or just return a string-decoded equivalent of .raw_path. I guess the rationale for URL-decoding .path would be to print a "human-friendly" version of the URL by default, while allowing users to pass machine-ready paths using .raw_path. So anywhere we deal with paths internally, we should do so via .raw_path. But then there's probably a case to be made wrt user expectations: would users know that there are two different properties, one doing URL-decoding and the other not doing URL-decoding?

We can treat that as a separate discussion though, since the particular problem in this issue seems to be fixable independently of what we decide .path should do…

So there's a useful prompt here about tweaking the behaviour of Client(base_url=...) which @florimondmanca has followed up nicely in #1407. However I think that's unrelated to the behaviour that @jbaayen is seeing.

As noted, we don't escape the path when making the request. The report mentions version 0.15.0 onwards, and the place where we access the URL for sending the request is...

https://github.com/encode/httpx/blob/0.15.0/httpx/_models.py#L260-L272

Which uses raw_path, which is the raw byte-encoded path+querystring without any modification or decoding.

@jbaayen - More likely what you're seeing is the server automatically decoding the escaped '/' characters. For example Python WSGI and ASGI servers will do this, as it is spec'ed and expected behaviour. See eg. Gunicorn... https://github.com/benoitc/gunicorn/blob/548d5828da6b93fa6a14217742c6e6d2c7b2b900/gunicorn/http/wsgi.py#L184
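For illustration, the kind of server-side decoding described above can be sketched as follows. The helper `split_request_target` is hypothetical and written only for this example. It is not Gunicorn's or any ASGI server's real code. It mimics the ASGI convention of exposing a percent-decoded "path" next to the unmodified "raw_path" bytes.

```python
from urllib.parse import unquote


def split_request_target(target: bytes) -> dict:
    """Hypothetical helper: split a request target into ASGI-style fields."""
    raw_path, _, query_string = target.partition(b"?")
    return {
        # The bytes as received on the wire, untouched.
        "raw_path": raw_path,
        # The percent-decoded form that most frameworks route on.
        "path": unquote(raw_path.decode("ascii")),
        "query_string": query_string,
    }


scope = split_request_target(b"/c/a%2Fb?x=1")
print(scope["raw_path"])
print(scope["path"])
```

An application that only looks at the decoded "path" would see "/c/a/b" even when the client sent "%2F" correctly, which is why checking "raw_path" (or the wire itself) matters when debugging this.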

Hi @tomchristie, it certainly is _not_ the server automatically decoding the escaped slashes. With the very same server, we get the expected behaviour with requests, aiohttp, and httpx < 0.15.0. It's only httpx >= 0.15.0 that ends up making incorrect requests ...

So, help us help you...

I can verify with the following that httpx isn't escaping the path component...

$ pip install httpx==0.15.0
$ venv/bin/python -c "import httpx; print(httpx.get('http://www.httpbin.org/a%2Fb'))"

(Using fiddler to inspect the request.)

That's not to say that there might be some difference in behaviour that you're seeing, but you'll need to describe how to replicate the issue you're seeing.

With

12:35 $ pip show httpx
Name: httpx
Version: 0.15.0
Summary: The next generation HTTP client.
Home-page: https://github.com/encode/httpx
Author: Tom Christie
Author-email: [email protected]
License: BSD
Location: /usr/local/lib/python3.8/site-packages
Requires: certifi, httpcore, sniffio, rfc3986
Required-by: kisters.water.time-series, kisters.water.time-series.tsa, kisters.water.operational.access-control
✔ ~

and the snippet

client = AsyncClient(base_url="http://www.httpbin.org/")
print(await client.get('a%2Fb'))

I see the following in Wireshark:

[image: Wireshark capture]

Right, in this snippet you are using base_url. That was definitely decoding the URL unnecessarily and has been fixed in https://github.com/encode/httpx/pull/1407

Okay, yup. I've verified that...

import httpx

# Reproduces issue
client = httpx.Client(base_url="http://www.httpbin.org/")
client.get("a%2Fb")

# Does not reproduce issue
client = httpx.Client()
client.get("http://www.httpbin.org/a%2Fb")