Per #539 there may be a continued problem in parsing unicode characters due to a dependency.
Per @vltr: /中国 or even /jacaré would break on request.py, mostly because of httptools.parse_url, which receives bytes and is unable to parse non-ascii characters. I'll need to dig deeper to see whether this is a restriction in httptools itself or just in the Python bindings.
Skimming through Node's http-parser source code (the base of httptools), I've seen in their tests that some level of unicode characters should be possible to use. So the next step would be to test httptools itself - in a next chapter :wink:
I remembered there was a PR addressing this: https://github.com/huge-success/sanic/pull/1081
@huge-success/sanic-release-managers add this to 18.12LTS milestone?
We need to re-test; per #1081 this may have been solved, as it was merged back in January. There is a unit test verifying it.
@yunstanford would this be a valid test?
```python
from sanic.request import Request

uri = "/jacaré"
Request(
    url_bytes=uri.encode("utf-8"),
    headers={},
    version=None,
    method='GET',
    transport=None,
)
```
The result:

```
$ python test-sanic-unicode.py
Traceback (most recent call last):
  File "test-sanic-unicode.py", line 10, in <module>
    transport=None
  File "/home/richard/.pyenv/versions/playground-3.7.0/lib/python3.7/site-packages/sanic/request.py", line 57, in __init__
    self._parsed_url = parse_url(url_bytes)
  File "httptools/parser/parser.pyx", line 468, in httptools.parser.parser.parse_url
httptools.parser.errors.HttpParserInvalidURLError: invalid url b'/jacar\xc3\xa9'
```
Test looks valid to me.
After digging, it looks like the issue still exists in the dependency. It actually seems to exist upstream from there in http-tools (see here, where they still rely on RFC 2616 for valid URL characters - which are latin-1 only). While RFC 2616 was updated by RFC 3986 and obsoleted by RFC 7230, the latin-1 character set (back-referenced to RFC 3986) and percent encoding remain the only officially valid characters for URIs.
So we have two problems, one of which I don't care about. i18n and non-latin-1 characters are still technically disallowed - don't care. W3C allows internationalization of domain names using non-latin-1 characters, and so should we.
The second part is that our dependency chain DOES care. The http-tools parser will and does fail if it tries to parse any portion of the schema that is non-latin-1 encoded. I don't think we can fix this unless we eliminate the dependency, and I'm sure that's going to be a problem. So I do care about this issue.
I think there is a workaround, but it will be costly: if the unicode characters are intercepted before parsing and translated to their percent-encoded equivalents, that solves the URL parsing. But we also have to do that for route registration. I'd like to get some more eyes on this issue to make sure I'm not crazy. @ahopkins @r0fls @seemethere @yunstanford
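As a rough sketch of that interception idea (stdlib only, nothing Sanic-specific - this is an illustration, not a proposed implementation), percent-encoding the path before the bytes ever reach the parser would look like:

```python
from urllib.parse import quote

# Translate non-ascii characters to their percent-encoded octets
# before handing the path to httptools.
uri = "/jacaré"
encoded = quote(uri)  # '/' is in quote()'s default safe set
print(encoded)        # -> /jacar%C3%A9
```

The resulting string is pure ascii, which is exactly what the parser's RFC 2616 character table accepts.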
I'm also concerned because @vltr has a test that is clearly failing yet the test from #1081 passes
@sjsadowski for this simple script I made, using latin1 produces the same error:

```python
from sanic.request import Request

uri = "/jacaré"
Request(
    url_bytes=uri.encode("latin1"),
    headers={},
    version=None,
    method='GET',
    transport=None,
)
```
Result:

```
$ python test-sanic-unicode.py
Traceback (most recent call last):
  File "test-sanic-unicode.py", line 10, in <module>
    transport=None
  File "/home/richard/.pyenv/versions/playground-3.7.0/lib/python3.7/site-packages/sanic/request.py", line 57, in __init__
    self._parsed_url = parse_url(url_bytes)
  File "httptools/parser/parser.pyx", line 468, in httptools.parser.parser.parse_url
httptools.parser.errors.HttpParserInvalidURLError: invalid url b'/jacar\xe9'
```

httptools version:

```
$ pip freeze | grep httptools
httptools==0.0.11
```
But, even if I create a simple server:

```python
from sanic import Sanic
from sanic.response import text

app = Sanic(__name__)

@app.route("/test/<hello>")
async def myroute(request, hello):
    return text("hello, {}!".format(hello))

if __name__ == '__main__':
    app.run(host="0.0.0.0", port=8000, workers=1)
```
It will raise (basically) the same error.
Server:

```
$ python test-sanic-unicode.py
[2018-09-28 10:28:54 -0300] [8411] [INFO] Goin' Fast @ http://0.0.0.0:8000
[2018-09-28 10:28:54 -0300] [8411] [INFO] Starting worker [8411]
[2018-09-28 10:29:05 -0300] [8411] [ERROR] Traceback (most recent call last):
  File "/home/richard/.pyenv/versions/playground-3.7.0/lib/python3.7/site-packages/sanic/server.py", line 216, in data_received
    self.parser.feed_data(data)
  File "httptools/parser/parser.pyx", line 193, in httptools.parser.parser.HttpParser.feed_data
httptools.parser.errors.HttpParserInvalidURLError: invalid URL
```
cURL:

```
$ curl -v http://127.0.0.1:8000/test/jacaré
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to 127.0.0.1 (127.0.0.1) port 8000 (#0)
> GET /test/jacaré HTTP/1.1
> Host: 127.0.0.1:8000
> User-Agent: curl/7.61.1
> Accept: */*
>
< HTTP/1.1 400 Bad Request
< Connection: close
< Content-Length: 18
< Content-Type: text/plain; charset=utf-8
<
* Closing connection 0
Error: Bad Request
```
I think the encoding is supposed to be 'latin-1', but either way 'é' is not valid per RFC 2616.
here is a direct link to the character table for RFC2616 that is implemented in http-tools
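For illustration only (this is an approximation of the constraint, not the actual http-parser character table): the characters RFC 2616 permits in a request path all fall within printable ASCII, so a quick check shows why the percent-encoded form parses while the raw bytes of 'é' fail in both encodings:

```python
def looks_parseable(path: bytes) -> bool:
    # Rough approximation of the RFC 2616 constraint: every byte must
    # be printable ASCII (0x21-0x7E). Raw UTF-8 (b'\xc3\xa9') or
    # latin-1 (b'\xe9') bytes for 'é' fall outside this range.
    return all(0x21 <= b <= 0x7E for b in path)

print(looks_parseable(b"/jacar%C3%A9"))             # True
print(looks_parseable("/jacaré".encode("utf-8")))   # False
print(looks_parseable("/jacaré".encode("latin1")))  # False
```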
Oh. Sorry, my bad. I'll pay more attention to the RFC :grin:
Regarding the dash in latin-1, it works with or without:

```python
>>> "é".encode("utf-8")
b'\xc3\xa9'
>>> "é".encode("utf8")
b'\xc3\xa9'
>>> "é".encode("latin1")
b'\xe9'
>>> "é".encode("latin-1")
b'\xe9'
```
Yeah it's just dumb. It's very outdated (1999 - almost 20 years) and even though it's been superseded twice, they haven't updated the "legal" characters for a URI.
We need to see whether any newer RFC revisits this question, since RFC 2616 has been updated and obsoleted.
@sjsadowski wow, almost at the same time (the comments). I'll take a look into these other RFCs just for precaution.
@vltr you want 3986 which is an update and 7230 which obsoletes 2616
Thanks, @sjsadowski, I'll read them carefully when I get some free time and will get back here with (or without) new conclusions :+1:
I have not had a full chance to review the issue... but my first thought when I read the idea of catching non-latin characters and translating them was the potential performance hit relative to the use case. I'll come back after the weekend with some more thoughts.
@ahopkins indeed, depending on how to address this issue. I would not want to address this inside Sanic because of the performance hit. If possible, I would like to address this on httptools because it seems to work on Node's http-parser (see here).
didn't get time to dig into this issue, will take a look when getting a chance.
If you guys have investigated and think it's because of httptools, we can pull @1st1 in for discussion.
Yeah, it's not httptools - it's the upstream parser it relies on, http-tools. I don't think the httptools package can implement a fix.
I wonder if we want to consider a workaround that would enable unicode parsing with a flag.
Everyone, I did some more digging into this and things got a little more complicated. From the RFC 3986, see the topic "2.4. When to Encode or Decode".
From what I could understand:

What happens: for every non-ascii character (like é) that is used, it should be encoded to its "percent-encoded octet", which in this example would be %C3%A9; httptools would then correctly parse everything and all is good.

What can be done:
- quote the URL (using `urllib.parse.quote`) prior to sending it to httptools, adding some overhead;
- unquote (using `urllib.parse.unquote`) what httptools provides us (if it doesn't break) - my personal opinion is that this should be considered (I can send /%C3%A9 and read /é); or
- nothing.

Thoughts?
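A minimal stdlib illustration of that quote/unquote round trip (send /%C3%A9, read /é), using only functions that exist in `urllib.parse`:

```python
from urllib.parse import quote, unquote

# The client sends the percent-encoded form; after httptools parses it,
# unquoting recovers the original unicode path for handlers to use.
wire = quote("/é")       # what goes over the wire
print(wire)              # -> /%C3%A9
print(unquote(wire))     # -> /é
```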
I'd lean towards nothing. If a client is supposed to percent-encode non-ascii characters that aren't in the RFC 2616 charmap, then bare unicode characters should not be parsed, and httptools/http-tools are doing the right thing.
So long as routes that have unicode characters in them are getting percent-encoded as well, we have done (what I think is) our job, as the routes will match the correct client encoding.
@sjsadowski that would be my opinion too. We can _perhaps_ add a server option to unquote after httptools has returned correctly, but that's just an idea -- the developer can implement this as well. It may be more trouble than expected because it requires overriding Request.__init__ and passing the new class to the application constructor if it's intended to work with the Router, so perhaps this should be a matter of heavy documentation on the subject?
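To sketch the shape such an override could take (this is a hypothetical stand-in class, not Sanic's actual Request API, so the example stays self-contained and runnable):

```python
from urllib.parse import unquote

class UnquotingRequest:
    """Hypothetical stand-in for a Request subclass, modelling only the
    unquoting step that would follow a successful httptools parse."""

    def __init__(self, url_bytes: bytes):
        # httptools would parse url_bytes here; parsing only succeeds
        # when the path is already percent-encoded ascii.
        self.raw_path = url_bytes.decode("ascii")
        # Unquote afterwards so handlers see the unicode path.
        self.path = unquote(self.raw_path)

req = UnquotingRequest(b"/jacar%C3%A9")
print(req.path)  # -> /jacaré
```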
@vltr unless we get a chorus of 'no' I'm going to say we just need to document the hell out of it. A special section on unicode characters or something for handling unicode in Request.
@sjsadowski I completely agree with you.
Perhaps we can even provide some pre-built examples of extending the Request class to deal with all these questions - even though I personally think this kind of parsing should be done inside the server, but anyway...
I'm actually closing this per #1424
@sjsadowski I'm sorry, I really forgot to document some examples with the Request class. Please leave it open and I'll create a PR for this in a couple of hours - I just have to revise what I have written in here already, etc. I'm really sorry.
@vltr no worries, just trying to do some house keeping!
@sjsadowski I know :wink: And I'm glad you're doing it, because I'm so full of stuff to do that this went straight through my todo list and I completely forgot about it.
@sjsadowski I made a mess in my mind here - I thought this was related to httptools on the Request object, but it is in fact in the HttpProtocol. It's better to close this _or_ take the bumpy road: add optional (and configurable) hooks to the protocol, which might not even solve the _issue_ :confused:
I'm going to close it. Solving the issue itself is going to require significant work and should be tracked separately - if we choose to take it on - and I think this issue number has run its course.