Applications, like flask, which expect that the headers are using latin-1 encoding as specified in [rfc5987] will break gunicorn because gunicorn uses ASCII encoding.
Issue #1353 was closed with the comment
I'm going to close this as it's well documented around the web that HTTP headers should be ASCII, unfortunately.
which is, according to [rfc5987] wrong. Headers should be encoded ISO-8859-1 (aka latin-1). However, a comment in the issue also mentioned commit 5f4ebd2eb2b08783a5fbefe79d09fcb3fc1fbc73, in which the encoding of the headers was switched back from latin-1 to ascii based on a section in [rfc7230] which states that US-ASCII should be used:
Historically, HTTP has allowed field content with text in the
ISO-8859-1 charset [ISO-8859-1], supporting other charsets only
through use of [RFC2047] encoding. In practice, most HTTP header
field values use only a subset of the US-ASCII charset [USASCII].
Newly defined header fields SHOULD limit their field values to
US-ASCII octets. [...]
Now, I'm not into all this RFC specifications and don't know which is the right way to do it. However, as other applications expect latin-1 as specified in [rfc5987] but gunicorn implements the version given in [rfc7230], it might be worth to discuss this, even if it's just for a clarification on which track gunicorn is.
Related Issues in gunicorn:
https://github.com/benoitc/gunicorn/issues/1353 : wsgi.py > send_headers: encoding problem.
Related Issues in Flask:
https://github.com/pallets/flask/issues/2766 : send_file: latin-1 encoding not compatible with gunicorn
https://github.com/pallets/flask/pull/2223 : Fix send_file's attachment_filename to work with non-ascii filenames
Related RFC:
https://tools.ietf.org/html/rfc5987#section-1
gunicorn implements the last specification (and it used to work with previous versions of flask).
If we want to relax it for old applications maybe we may want yo try to encode to latin1 if encoding to ascii didn't work. Thoughts ? cc @tilgovi
to be more complete, gunicorn use the updated HTTP 1.1 spec (7230 and related). That could be indeed documented.
From my position, the spec says "SHOULD" and PEP 333 mentions latin-1 so I think Gunicorn could support latin-1 even if applications are advised not to use non-ASCII characters.
That's my initial reading, but I'll wait for others to weigh in.
I think Gunicorn could support latin-1
By this I mean only that Gunicorn would not be breaking any spec. Right now Gunicorn is more strict than it needs to be. It is not necessarily wrong to be so strict. We can decide together.
RFC 7320 says that new headers "SHOULD" use ASCII. That most headers use ASCII right now seems to be just information about the current situation. But I think this sentence is more important:
Historically, HTTP has allowed field content with text in the
ISO-8859-1 charset [ISO-8859-1], supporting other charsets only
through use of [RFC2047] encoding.
So, latin-1 is clearly allowed.
If RFC7320 says that new headers "SHOULD" use ASCII, the question is, who defines the headers? Is it the application using gunicorn or is it gunicorn itself. As I understand, the application defines the headers.
What annoyed me is that change has been done 3 years ago without much people beeing blocked by it on the different frameworks until that recent change in flask. I would have thought that in 2018 all applications would follow a specification published in 2014. Part of the issue is that PEP3333 allowing ISO-8859-1 aka latin 1.
Anyway I don't disagree to be more relaxed on it. I propose to do the following :
Thoughts ?
@benoitc sorry for causing trouble! It's hard to keep track of all the different specs. :sweat:
If you show that warning, I predict a regular stream of issues being open on Flask about it though. I think it makes more sense to just allow Latin-1 silently, as it's part of the PEP.
@davidism well that's kind of the point: bugging the user until it makes the correct change ;) I think that would be better but I have no strong opinion on it. @tilgovi @berkerpeksag ?
@davidism no worries :) Thanks for bringing it up.
I'm inclined towards being tolerant in this case, since it aligns with the PEP and with popular frameworks.
I'm inclined towards being tolerant in this case, since it aligns with the PEP and with popular frameworks.
+1
Until when the status quo would be kept though... yes users will open some bugs which prompted our change. I don’t see why it couldn’t be also fixed in other libraries and the pep updated.
At least we should document it somewhere so users using a strict client won’t be surprised as well. Also if we do the change it will have to be in a major version since it’s a breaking change.
I'm happy to make this change. Do we need a major version? It makes requirements looser for applications. Existing applications should continue to work. Applications which did not work before, but which violate the "SHOULD" will now work.
In any case, I can document it somehow.
Maybe 19.9? We can keep making 19.9.x or 19.x.y releases for Python 2 maintenance critical bug fixes for a bit, but still keep 20 for Python 3 only?
I've been thinking about this more, and I'm ok switching Flask and Werkzeug to use ASCII here.
I still think it's probably good if Gunicorn supports it, since it's part of the PEP. I'm assuming you've never had an issue reported about this before now so it seems like it's probably not well known anyway. If you're worried about breaking changes, I can handle it on my side only.
I do plan on implementing ASGI in Werkzeug, which leaves the header encoding to the application, so that would be a good time for me to fix inconsistent use of ASCII vs Latin-1 on my side.
I support the idea of @davidism: Flask should follow the recommendation and use ASCII, but gunicorn should also follow PEP3333 and allow Latin-1, even if other specifications do not recommend to do so.
I guess I should have done this sooner, but I just tried sending a unicode filename with Flask and Gunicorn and it worked fine: https://github.com/pallets/flask/issues/2766#issuecomment-391736114. Now I'm not clear where the issue is.
@tilgovi 19.9 would be ok. also #1151 that was closed by the latest change will need to be addressed.
Related to the last comment of @davidism , @jrast what is the filename that was failing? I think we should have a test to reproduce the issue before making any change.
I can't remeber exactly and I don't have a logfile at hand at the moment. But the problem arised with filenames containing german umlauts, like äöü. For example a filename like "Sterntour-Glärnischhütte.xlsx" would cause problems, caused by the "ä" and "ü" in the filename.
I can search for the logfile if you need more details.
Thanks, that filename caused the issue. The name used in the test in Flask triggered the filename*
behavior, which ended up as ASCII. Your example encoded to Latin-1, so there was no extra encoding done and it was passed on as-is.
Just for the reference: @davidism fixed the encoding in flask with pull request https://github.com/pallets/flask/pull/2804, which will be released in flask 1.0.3. This will make flask compatible with the current behaviour of gunicorn.
At this occasion, i want to thank @benoitc , @tilgovi and @davidism (and all other involved devs!) for their work on gunicorn and flask. I am positively surprised about the discussion which unfold in this issue! The issue was addressed very quick, the discussion was based on facts and a good solution for both packages was found very quickly! I'm allways surprised how quick issues are addressed in the open source world and that, most of the time, good solutions can be found for all involved parties. I think a fix would have taken much longer if both packages would have been closed source libraries, owned by some big coorperations. Thank you very much and keep up the great work!
flask/werkzeug will encode than decode cookie value to Unicode with latin1:
https://github.com/pallets/werkzeug/blob/master/werkzeug/http.py#L1107-L1108
If a unicode string is returned it’s tunneled through latin1 as required by PEP 3333.
@georgexsh so what's your expectation?
@benoitc to point out flask/werkzeug is still had incompatible behavior with gunicorn. I will be happier if https://github.com/benoitc/gunicorn/issues/1778#issuecomment-387686906 get implemented.
btw, your reply is very agile, I admire your diligence.
Given that it's never had a problem before between that code and Gunicorn, I'm not too concerned, although I may switch more things to ASCII eventually. Do you have a real failing example?
from werkzeug.wrappers import Request, Response
class MyResp(Response):
charset = 'utf-8-sig'
@Request.application
def application(request):
resp = MyResp()
resp.set_cookie('foo', 'bar')
return resp
run it with gunicorn yields an error:
Traceback (most recent call last):
File "gunicorn/workers/sync.py", line 135, in handle
self.handle_request(listener, req, client, addr)
File "gunicorn/workers/sync.py", line 183, in handle_request
resp.close()
File "gunicorn/http/wsgi.py", line 409, in close
self.send_headers()
File "gunicorn/http/wsgi.py", line 329, in send_headers
util.write(self.sock, util.to_bytestring(header_str, "ascii"))
File "gunicorn/util.py", line 507, in to_bytestring
return value.encode(encoding)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 155-157: ordinal not in range(128)
the error is from this header:
Set-Cookie: foo="\357\273\277bar"; Path=%EF%BB%BF/?%EF%BB%BF#%EF%BB%BF
when werkzeug dump cookies with utf-8-sig
, BOM is added, it could be decoded with 8bit latin1 but can not reencode with as 7bit ASCII:
>>> ''.encode('utf-8-sig').decode('latin1')
''
>>> ''.encode('utf-8-sig').decode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)
there is more detail in https://github.com/apache/incubator-superset/issues/5377
but I think this is a bug in the application (superset), it should not rely on flask/werkzeug to encode CSV string to bytes but encode by itself.
That's definitely an issue with that application.
Some notes on the behaviour differences between Werkzeug(+Flask) and Gunicorn(+Flask):
Application attempts to set header value|Werkzeug/0.14.1+Flask|Gunicorn 19.9/master
---------------------------------------- | ---------------------- | --------------------
b'f\xc3\xb6o'.decode("utf-8", "strict") = "föo"
|200 OK, header value is sent, contains bytes \x66\xf6\x6f = föo
|500, UnicodeEncodeError: 'ascii' codec can't encode character '\xf6' in position 164: ordinal not in range(128)
b'Swan: \xF0\x9F\xA6\xA2'.decode("utf-8", "strict") = "Swan: (swan glyph)"
|200 but empty response, errors out while encoding headers - UnicodeEncodeError: 'latin-1' codec can't encode character '\U0001f9a2' in position 19: ordinal not in range(256)
|500, UnicodeEncodeError: 'ascii' codec can't encode character '\U0001f9a2' in position 169: ordinal not in range(128)
Note that when writing-out the response headers, Werkzeug delegates to Python builtin BaseHTTPRequestHandler
:
... which uses latin-1
:
That will explain the behaviour differences.
My reading of the considerable number and history of RFCs on this suggests that applications and servers should be restricting themselves to US-ASCII
, but that (perhaps for historical reasons) latin-1 is allowed, but clients should treat such octets as opaque.
Either way, ideally Gunicorn and Werkzeug should behave the same here if possible, and it seems that a latin-1 character should not error the response. Gunicorn has a superior way of serializing the header values, so if there is a non-latin-1 character present, it detects this before committing 200 OK (like happens in Werkzeug), and can send 500. But it appears to me that it should allow latin-1
.
Another possibly interesting check in corelibs:
header_re = re.compile(r'^[a-zA-Z][a-zA-Z0-9\-_]*$')
bad_header_value_re = re.compile(r'[\000-\037]')
So Python WSGI is very strict on characters allowed in the header name (alpha-num, hyphen, underscore), and bans non-printables from values.
Most helpful comment
@davidism no worries :) Thanks for bringing it up.
I'm inclined towards being tolerant in this case, since it aligns with the PEP and with popular frameworks.