Per #539 there may be a continued problem in parsing unicode characters due to a dependency.
Per @vltr: /中国 or even /jacaré would break on request.py, mostly because of httptools.parse_url, which receives bytes and is unable to parse non-ascii characters. I'll need to dig deeper to see whether this is a restriction in httptools itself or just in the Python bindings.
Skimming through Node's http-parser source code (the base of httptools), I've seen in their tests that some level of unicode characters should be possible to use. So the next step would be to test httptools itself - in a next chapter :wink:
I remembered there was a PR addressing this: https://github.com/huge-success/sanic/pull/1081
@huge-success/sanic-release-managers add this to 18.12LTS milestone?
We need to re-test; per #1081 this may have been solved, as it was merged back in January. There is a unit test verifying it.
@yunstanford would this be a valid test?
```python
from sanic.request import Request

uri = "/jacaré"
Request(
    url_bytes=uri.encode("utf-8"),
    headers={},
    version=None,
    method='GET',
    transport=None,
)
```
The result:

```
$ python test-sanic-unicode.py
Traceback (most recent call last):
  File "test-sanic-unicode.py", line 10, in <module>
    transport=None
  File "/home/richard/.pyenv/versions/playground-3.7.0/lib/python3.7/site-packages/sanic/request.py", line 57, in __init__
    self._parsed_url = parse_url(url_bytes)
  File "httptools/parser/parser.pyx", line 468, in httptools.parser.parser.parse_url
httptools.parser.errors.HttpParserInvalidURLError: invalid url b'/jacar\xc3\xa9'
```
Test looks valid to me.
After digging, it looks like the issue still exists in the dependency. It actually seems to exist upstream from there in http-tools (see here, where they still rely on RFC 2616 for valid URL characters - which are latin-1 only). While RFC 2616 was updated by RFC 3986 and obsoleted by RFC 7230, the latin-1 character set (back-referenced to RFC 3986) and percent encoding remain the only officially valid characters for URIs.
So we have two problems, one of which I don't care about. i18n and non-latin-1 characters are still technically disallowed - don't care. W3C allows internationalization of domain names using non-latin-1 characters, and so should we.
The second part is that our dependency chain DOES care. The http-tools parser will and does fail if it tries to parse any portion of the schema that is non-latin-1 encoded. I don't think we can fix this unless we eliminate the dependency, and I'm sure that's going to be a problem. So I do care about this issue.
I think there is a workaround, but it will be costly: if the unicode characters are intercepted before parsing and translated to their percent-encoded equivalents, that solves the URL parsing. But we also have to do that for route registration. I'd like to get some more eyes on this issue to make sure I'm not crazy. @ahopkins @r0fls @seemethere @yunstanford
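As a rough sketch of that interception idea (stdlib only, nothing Sanic-specific - this is an illustration, not a proposed implementation), percent-encoding the path before the bytes ever reach the parser would look like:

```python
from urllib.parse import quote

# Translate non-ascii characters to their percent-encoded octets
# before handing the path to httptools.
uri = "/jacaré"
encoded = quote(uri)  # '/' is in quote()'s default safe set
print(encoded)        # -> /jacar%C3%A9
```

The resulting string is pure ascii, which is exactly what the parser's RFC 2616 character table accepts.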
I'm also concerned because @vltr has a test that is clearly failing yet the test from #1081 passes
@sjsadowski for this simple script I made, using latin1 produces the same error:

```python
from sanic.request import Request

uri = "/jacaré"
Request(
    url_bytes=uri.encode("latin1"),
    headers={},
    version=None,
    method='GET',
    transport=None,
)
```
Result:

```
$ python test-sanic-unicode.py
Traceback (most recent call last):
  File "test-sanic-unicode.py", line 10, in <module>
    transport=None
  File "/home/richard/.pyenv/versions/playground-3.7.0/lib/python3.7/site-packages/sanic/request.py", line 57, in __init__
    self._parsed_url = parse_url(url_bytes)
  File "httptools/parser/parser.pyx", line 468, in httptools.parser.parser.parse_url
httptools.parser.errors.HttpParserInvalidURLError: invalid url b'/jacar\xe9'
```

httptools version:

```
$ pip freeze | grep httptools
httptools==0.0.11
```
But, even if I create a simple server:

```python
from sanic import Sanic
from sanic.response import text

app = Sanic(__name__)

@app.route("/test/<hello>")
async def myroute(request, hello):
    return text("hello, {}!".format(hello))

if __name__ == '__main__':
    app.run(host="0.0.0.0", port=8000, workers=1)
```
It will raise (basically) the same error.
Server:

```
$ python test-sanic-unicode.py
[2018-09-28 10:28:54 -0300] [8411] [INFO] Goin' Fast @ http://0.0.0.0:8000
[2018-09-28 10:28:54 -0300] [8411] [INFO] Starting worker [8411]
[2018-09-28 10:29:05 -0300] [8411] [ERROR] Traceback (most recent call last):
  File "/home/richard/.pyenv/versions/playground-3.7.0/lib/python3.7/site-packages/sanic/server.py", line 216, in data_received
    self.parser.feed_data(data)
  File "httptools/parser/parser.pyx", line 193, in httptools.parser.parser.HttpParser.feed_data
httptools.parser.errors.HttpParserInvalidURLError: invalid URL
```
cURL:

```
$ curl -v http://127.0.0.1:8000/test/jacaré
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to 127.0.0.1 (127.0.0.1) port 8000 (#0)
> GET /test/jacaré HTTP/1.1
> Host: 127.0.0.1:8000
> User-Agent: curl/7.61.1
> Accept: */*
>
< HTTP/1.1 400 Bad Request
< Connection: close
< Content-Length: 18
< Content-Type: text/plain; charset=utf-8
<
* Closing connection 0
Error: Bad Request
```
I think the encoding is supposed to be 'latin-1', but either way 'é' is not valid per RFC 2616.
here is a direct link to the character table for RFC2616 that is implemented in http-tools
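For illustration only (this is an approximation of the constraint, not the actual http-parser character table): the characters RFC 2616 permits in a request path all fall within printable ASCII, so a quick check shows why the percent-encoded form parses while the raw bytes of 'é' fail in both encodings:

```python
def looks_parseable(path: bytes) -> bool:
    # Rough approximation of the RFC 2616 constraint: every byte must
    # be printable ASCII (0x21-0x7E). Raw UTF-8 (b'\xc3\xa9') or
    # latin-1 (b'\xe9') bytes for 'é' fall outside this range.
    return all(0x21 <= b <= 0x7E for b in path)

print(looks_parseable(b"/jacar%C3%A9"))             # True
print(looks_parseable("/jacaré".encode("utf-8")))   # False
print(looks_parseable("/jacaré".encode("latin1")))  # False
```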
Oh. Sorry, my bad. I'll pay more attention to the RFC :grin:
Regarding the dash in latin-1, it works with or without:

```python
>>> "é".encode("utf-8")
b'\xc3\xa9'
>>> "é".encode("utf8")
b'\xc3\xa9'
>>> "é".encode("latin1")
b'\xe9'
>>> "é".encode("latin-1")
b'\xe9'
```
Yeah it's just dumb. It's very outdated (1999 - almost 20 years) and even though it's been superseded twice, they haven't updated the "legal" characters for a URI.
We need to see whether any newer RFC revisits this question, since RFC 2616 has been updated and obsoleted.
@sjsadowski wow, almost at the same time (the comments). I'll take a look into these other RFCs just for precaution.
@vltr you want 3986 which is an update and 7230 which obsoletes 2616
Thanks, @sjsadowski, I'll read them carefully when I get some free time and will get back here with (or without) new conclusions :+1:
I have not had a full chance to review the issue... but my first thought when I read the idea of catching non-latin characters and translating them was the potential performance hit relative to the use case. I'll come back after the weekend with some more thoughts.
@ahopkins indeed, depending on how to address this issue. I would not want to address this inside Sanic because of the performance hit. If possible, I would like to address this on httptools because it seems to work on Node's http-parser (see here).
didn't get time to dig into this issue, will take a look when getting a chance.
If you guys have investigated and think it's because of httptools, we can pull @1st1 in for discussion.
Yeah, it's not httptools - it's the upstream parser it relies on, http-tools. I don't think the httptools package can implement a fix.
I wonder if we want to consider a workaround that would enable unicode parsing with a flag.
Everyone, I did some more digging into this and things got a little more complicated. From the RFC 3986, see the topic "2.4. When to Encode or Decode".
From what I could understand:

What happens: for every non-ascii character (like é) that is used, it should be encoded to its "percent-encoded octet", which in this example would be %C3%A9; httptools would then correctly parse everything and all is good.

What can be done:
- quote the URL (using `urllib.parse.quote`) prior to sending it to httptools, adding some overhead;
- unquote (using `urllib.parse.unquote`) what httptools provides us (if it doesn't break) - my personal opinion is that this should be considered (I can send /%C3%A9 and read /é); or
- nothing.

Thoughts?
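A minimal stdlib illustration of that quote/unquote round trip (send /%C3%A9, read /é), using only functions that exist in `urllib.parse`:

```python
from urllib.parse import quote, unquote

# The client sends the percent-encoded form; after httptools parses it,
# unquoting recovers the original unicode path for handlers to use.
wire = quote("/é")       # what goes over the wire
print(wire)              # -> /%C3%A9
print(unquote(wire))     # -> /é
```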
I'd lean towards nothing. If a client is supposed to percent-encode non-ascii characters that aren't in the RFC 2616 charmap, then bare unicode characters should not be parsed, and httptools/http-tools are doing the right thing.
So long as routes that have unicode characters in them are getting percent-encoded as well, we have done (what I think is) our job, as the routes will match the correct client encoding.
@sjsadowski that would be my opinion too. We can _perhaps_ add a server option to unquote after httptools has returned correctly, but that's just an idea -- the developer can implement this as well. It may be more trouble than expected because it requires overriding Request.__init__ and passing the new class to the application constructor if it's intended to work with the Router, so perhaps this should be a matter of heavy documentation on the subject?
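To sketch the shape such an override could take (this is a hypothetical stand-in class, not Sanic's actual Request API, so the example stays self-contained and runnable):

```python
from urllib.parse import unquote

class UnquotingRequest:
    """Hypothetical stand-in for a Request subclass, modelling only the
    unquoting step that would follow a successful httptools parse."""

    def __init__(self, url_bytes: bytes):
        # httptools would parse url_bytes here; parsing only succeeds
        # when the path is already percent-encoded ascii.
        self.raw_path = url_bytes.decode("ascii")
        # Unquote afterwards so handlers see the unicode path.
        self.path = unquote(self.raw_path)

req = UnquotingRequest(b"/jacar%C3%A9")
print(req.path)  # -> /jacaré
```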
@vltr unless we get a chorus of 'no' I'm going to say we just need to document the hell out of it. A special section on unicode characters or something for handling unicode in Request.
@sjsadowski I completely agree with you.
Perhaps we can even provide some pre-built examples of extending the Request class to deal with all these questions - even though I personally think this kind of parsing should be done inside the server, but anyway...
I'm actually closing this per #1424
@sjsadowski I'm sorry, I really forgot to document some examples with the Request class. Please leave it open and I'll create a PR for this in a couple of hours - I just have to revise what I have written in here already, etc. I'm really sorry.
@vltr no worries, just trying to do some house keeping!
@sjsadowski I know :wink: And I'm glad you're doing it, because I'm so full of stuff to do that this went straight through my todo list and I completely forgot about it.
@sjsadowski I made a mess in my mind here - I thought this was related to httptools on the Request object, but it is in fact in the HttpProtocol. It's better to close this _or_ take the bumpy road: add optional (and configurable) hooks to the protocol, which might not even solve the _issue_ :confused:
I'm going to close it. Solving the issue itself is going to require significant work and should be tracked separately - if we choose to take it on - and I think this issue number has run its course.