With httpx 0.15.0 and newer, we are getting some unexpected behaviour with URLs that include url-encoded slashes.
For example, a call to the URL "http://a.b/c/a%2Fb" ends up requesting the path "/c/a/b" from the server, instead of "/c/a%2Fb", which is what we would expect to happen.
Using recent httpx versions, we don't see any way to request resources including a mix of path separator slashes and URL-encoded slashes.
My interpretation of https://tools.ietf.org/html/rfc3986, Section 2.1, is that "%2F" is a valid way of including a non-delimiting slash in a URL.
I get the impression that the problem is here: https://github.com/encode/httpx/blob/master/httpx/_models.py#L237
unquote("a/b") == "a/b"
unquote("a%2Fb") == "a/b"
and as such there is no way to distinguish the two, after unquoting.
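A minimal sketch of that ambiguity, using only the standard library's urllib.parse.unquote rather than httpx itself. Decoding the whole path collapses both forms, while splitting on "/" first and decoding each segment afterwards keeps the encoded slash inside its segment. The helper name split_then_decode is made up for illustration.

```python
from urllib.parse import unquote

# Decoding the whole path first makes the two inputs identical.
assert unquote("/c/a/b") == unquote("/c/a%2Fb") == "/c/a/b"


def split_then_decode(raw_path: str) -> list:
    """Hypothetical helper: split on literal slashes, then decode each segment."""
    return [unquote(segment) for segment in raw_path.split("/")]


print(split_then_decode("/c/a/b"))    # ['', 'c', 'a', 'b']
print(split_then_decode("/c/a%2Fb"))  # ['', 'c', 'a/b']
```

The second result still has three segments, which is the distinction that is lost once the full path has been unquoted.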
Hi @jbaayen
Can you share a snippet of code you're trying to run?
In the models.py file you linked to, you can see that there's also a raw_path property, and that's what gets used in .raw, which is used when calling the HTTP transport.
If you're just passing in a string URL then it's possible we do some pre-processing based on .path that we shouldn't be doing (eg applying base URLs).
Hi @florimondmanca, we use the AsyncClient get and post methods, passing in the URL as a string.
A snippet would be this: await client.get("http://localhost/a%2Fc")
@jbaayen The issue you mention wouldn't occur in that snippet, but does occur if the client is using a base_url. This is because the url is corrupted when accessing merge_url.path: https://github.com/encode/httpx/blob/master/httpx/_client.py#L330
In my opinion, URL.path should not be decoding slashes, and I wonder why it is doing any decoding at all.
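To illustrate why going through a decoded path during a merge is lossy, here is a rough sketch using only urllib.parse. This is not httpx's actual merge code. Both function names are hypothetical, and the join logic is a simplified assumption about what a base URL merge might do.

```python
from urllib.parse import quote, unquote


def merge_via_decoded_path(base_path: str, relative: str) -> str:
    """Hypothetical lossy merge: decode, join, then re-encode."""
    decoded = unquote(relative)  # "%2F" becomes "/" here
    joined = base_path.rstrip("/") + "/" + decoded.lstrip("/")
    return quote(joined)  # "/" is safe by default, so it is not re-encoded


def merge_via_raw_path(base_path: str, relative: str) -> str:
    """Hypothetical lossless merge: join the still-encoded strings."""
    return base_path.rstrip("/") + "/" + relative.lstrip("/")


print(merge_via_decoded_path("/", "a%2Fb"))  # /a/b
print(merge_via_raw_path("/", "a%2Fb"))      # /a%2Fb
```

Once the slash has been decoded, re-quoting cannot bring the "%2F" back, because a literal "/" is a legal path character.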
Thanks all, useful pointers. Here's a sample reproduction case:
"Echo path" server:
# Deps: pip install uvicorn
# Run: uvicorn app:app
async def app(scope, receive, send):
    assert scope["type"] == "http"
    body = b'{"raw_path": "%s"}' % scope["raw_path"]
    await send(
        {
            "type": "http.response.start",
            "status": 200,
            "headers": [[b"content-type", b"text/plain"]],
        }
    )
    await send({"type": "http.response.body", "body": body})
Test script:
import httpx

with httpx.Client() as client:
    r = client.get("http://localhost:8000/a%2Fb")
    print("default:", r.json())

with httpx.Client(base_url="http://localhost:8000") as client:
    r = client.get("/a%2Fb")
    print("base_url:", r.json())
Output:
$ python example.py
default: {'raw_path': '/a%2Fb'}
base_url: {'raw_path': '/a/b'}
So indeed base_url seems to corrupt any URL-encoded slash characters, instead of passing them through.
base_url does something off. I _guess_ we had some discussion about whether .path should URL-decode, or just return a string-decoded equivalent of .raw_path. I guess the rationale for URL-decoding .path would be to print a "human-friendly" version of the URL by default, while allowing machine-ready paths to be passed using .raw_path. So anywhere we deal with paths internally, we should do so via .raw_path. But then there's probably a case to be made about user expectations: would users know that there are two different properties, one that URL-decodes and one that doesn't?
We can treat that as a separate discussion though, since the particular problem in this issue seems to be fixable independently of what we decide .path should do…
So there's a useful prompt here about tweaking the behaviour of Client(base_url=...) which @florimondmanca has followed up nicely in #1407. However I think that's unrelated to the behaviour that @jbaayen is seeing.
As noted, we don't escape the path when making the request. The report mentions version 0.15.0 onwards, and in that version the place where we access the URL for sending the request is...
https://github.com/encode/httpx/blob/0.15.0/httpx/_models.py#L260-L272
That uses raw_path, which is the raw byte-encoded path plus query string, without any modification or decoding.
@jbaayen - More likely what you're seeing is the server automatically decoding the escaped '/' characters. For example Python WSGI and ASGI servers will do this, as it is spec'ed and expected behaviour. See eg. Gunicorn... https://github.com/benoitc/gunicorn/blob/548d5828da6b93fa6a14217742c6e6d2c7b2b900/gunicorn/http/wsgi.py#L184
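For context on the server side, here is a rough sketch of how an ASGI-style server could derive both a decoded path and an undecoded raw_path from the request target. The function name build_scope_paths is hypothetical and real servers do more work than this. It only illustrates why an application reading the decoded path sees "/" where the client sent "%2F", while raw_path keeps the original bytes.

```python
from urllib.parse import unquote


def build_scope_paths(request_target: bytes) -> dict:
    """Hypothetical helper mimicking the path-related keys of an ASGI scope."""
    raw_path, _, query_string = request_target.partition(b"?")
    return {
        "path": unquote(raw_path.decode("ascii")),  # percent-decoded
        "raw_path": raw_path,                       # bytes as sent by the client
        "query_string": query_string,
    }


scope = build_scope_paths(b"/c/a%2Fb?x=1")
print(scope["path"])      # /c/a/b
print(scope["raw_path"])  # b'/c/a%2Fb'
```

This is why the echo server in the reproduction case reports scope["raw_path"]: it reflects what was actually sent on the wire.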
Hi @tomchristie, it certainly is _not_ the server automatically decoding the escaped slashes. With the very same server, we get the expected behaviour with requests, aiohttp, and httpx < 0.15.0. It's only httpx >= 0.15.0 that ends up making incorrect requests ...
So, help us help you...
I can verify with the following that httpx isn't escaping the path component...
$ pip install httpx==0.15.0
$ venv/bin/python -c "import httpx; print(httpx.get('http://www.httpbin.org/a%2Fb'))"
(Using fiddler to inspect the request.)
That's not to say that there might be some difference in behaviour that you're seeing, but you'll need to describe how to replicate the issue you're seeing.
With
$ pip show httpx
Name: httpx
Version: 0.15.0
Summary: The next generation HTTP client.
Home-page: https://github.com/encode/httpx
Author: Tom Christie
Author-email: [email protected]
License: BSD
Location: /usr/local/lib/python3.8/site-packages
Requires: certifi, httpcore, sniffio, rfc3986
Required-by: kisters.water.time-series, kisters.water.time-series.tsa, kisters.water.operational.access-control
and the snippet
client = AsyncClient(base_url="http://www.httpbin.org/")
print(await client.get('a%2Fb'))
I see the following in Wireshark:

Right, in this snippet you are using base_url. That was definitely decoding the URL unnecessarily and has been fixed in https://github.com/encode/httpx/pull/1407
Okay, yup. I've verified that...
import httpx
# Reproduces issue
client = httpx.Client(base_url="http://www.httpbin.org/")
client.get("a%2Fb")
# Does not reproduce issue
client = httpx.Client()
client.get("http://www.httpbin.org/a%2Fb")