Httpx: Replace chardet with plain UTF-8 detection

Created on 9 Jun 2020  Â·  14Comments  Â·  Source: encode/httpx

I'd suggest replacing chardet with dead-simple decoding logic:

encoding = "UTF-8"
try:
    text = databytes.decode(encoding)
except UnicodeDecodeError:
    encoding = "ISO-8859-1"
    text = databytes.decode(encoding)

It is fairly safe to assume that any valid UTF-8 sequence is in fact UTF-8, and should be decoded as such.

Otherwise ISO-8859-1 (as specified in HTTP RFC) should be used instead of trying to detect languages, which is always quite unreliable. Further, ISO-8859-1 maps all possible byte values directly to Unicode code points 0-255, so the final decoding step in the above code will never raise UnicodeDecodeError.

Since the chardet module has been around a long time and is known as the standard go-to solution for charset detection, many modules depend on it, and this is a frequent source of dependency conflicts such as the one reported here: https://github.com/huge-success/sanic/issues/1858

A more serious problem is that chardet is notoriously unreliable. Especially its tendency to misdetect anything as Turkish is well known:

>>> chardet.decode("Testiä".encode("UTF-8"))   # "Testing" in Finnish
{'encoding': 'ISO-8859-9',
 'confidence': 0.5521177026603239,
 'language': 'Turkish'}

>>> "Testiä".encode("UTF-8").decode("ISO-8859-9")  # Yields gibberish, not Turkish
'Testiä'
enhancement

Most helpful comment

@alexchamberlain Possibly, but it'd need some careful thinking about around how to make Response API nicely coherent in that area. (Eg. see https://github.com/encode/httpx/issues/1091)

I probably wouldn't want us to accept any API changes on it at the moment.

All 14 comments

Hiya. Okay, this is well pitched, tho it's quite hard to get a handle onto if it'd be a net positive or net negative for us to switch away from.

We use chardet in cases where the charset hasn't been included in the Content-Type, because one of the initial design goals for httpx was requests compatiblility and that's historically been their behaviour. However I'm open to reassessing this. I expect the landscape has kinda change with the prevalence of UTF-8 having dramatically risen in that time.

  • It'd be interesting to collate some issues from requests to see what their prior experiences are here.
  • It'd be interesting to get an insight into browser behaviours wrt. fuzzily determining content types where it's unset in the headers.
  • We might well do a better job by using some fuzzy pattern matching for HTML meta tags as a fallback when headers are unset, with utf-8, and then iso-8859-1 as absolute fallbacks.

I guess the upshot of this is likely to be...

  • Start by allowing the "apparent encoding" mechanism to be configurable.
  • Switch chardet out from being the default, and instead fallback on 1. fuzzy pattern match for meta tags 2. utf-8 3. iso-8859-1 in that order.

Hrm, so it looks to me like chardet is what Mozilla use if both Content-Type and Meta tags fail... https://www-archive.mozilla.org/projects/intl/chardet.html

As such I think we ought to be a bit careful about the assumption in "It is fairly safe to assume that any valid UTF-8 sequence is in fact UTF-8, and should be decoded as such." - it's not necessarily a given that'd match up with actual real-world behaviours, or with current browser behaviours.

One thing that'd be helpful would be some examples of URLs which are resolving with undesired content-types. Having a dig into real world cases would help given a better overall orientation of cases where HTML pages are omitting encoding information.

Also useful context https://github.com/aio-libs/aiohttp/issues/1811 (aiohttp diverges from requests here)

Benchmark of the OP code against chardet detect, using 100 kB of ASCII and 100 kB of ISO-8859-1 (only 'ä' repeated):

>>> %timeit auto_decode(ascii_bytes)
7.66 µs ± 84.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

>>> %timeit auto_decode(latin1_bytes)
12.1 µs ± 61.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

>>> %timeit chardet.detect(ascii_bytes)
1.32 ms ± 8.15 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

>>> %timeit chardet.detect(latin1_bytes)
2.89 s ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

There is a huge difference even with ASCII, and non-ASCII characters totally devastate chardet performance (tried with UTF-8 too, with similar timing results as latin1 on both implementations).

It might be tempting to try and optimise by only using the initial few kilobytes for detection but doing so is likely to catch only HTML markup, missing the actual native characters that would appear somewhere in the middle of the document. Fortunately Python's built-in decoding code is extremely fast and breaks on first error, so any such optimisations are not needed, especially when the decoding result is also actually used.

Okay, so given that the landscape has moved on since requests opted with chardet for determining unknown character encodings, and that UTF-8 is so much more prevalent now, I'm feeling pretty open to us considering a change in behaviour here.

There's two practical things we'd need to consider to make that change...

  1. A pull request updating TextDecoder to instead use UTF-8 with ISO-8859-1 fallback. It needs to support iterative decoding, so probably using IncrementalDecoder, and initially waiting until at least (say) 4096 bytes have been buffered before the initial decode pass?
  2. We might(?) want to consider an API to allow customizing the text decoder, to allow for alternate users to switch to alternate behaviours such as ChardetDecoder or an HTMLMetaTagDecoder?

Another note on this issue: https://github.com/psf/requests/issues/4848 mentions that chardet is problematic when bundling it with an application in productive deployments. I also like the suggestion to make it pluggable with https://github.com/Ousret/charset_normalizer (MIT license) or https://github.com/PyYoshi/cChardet (which is MPL, still easier to comply than LGPL). It'd be great if you could consider making the dependency optional :)

Buffering causes issues for applications that expect partial messages to come through without delay (long polling, logging and other such hacks). Thus, maybe it would be best to just always use ISO-8859-1 when nothing else is explicitly specified?

Section 3.7.1 of https://tools.ietf.org/html/rfc2616 says:

When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value.

I wonder where the autodetection is actually used because I just tried serving a UTF-8 document as text/html with no charset or meta tags, and Chrome, Firefox, httpx and requests all decoded it as 8859-1 or similar, not as UTF-8, although chardet detected the content as UTF-8.

The autodetection is only used for media types that are not text/*, and that don't have a charset set... https://github.com/encode/httpx/blob/master/httpx/_models.py#L777

And yes, I'm more inclined to diverge from requests behaviour here, since it's anyway applied to only a pretty narrow subset of cases, and the landscape wrt. utf-8 has moved on since their initial design decision here.

I found this issue as I was reporting an issue with response.text taking 10 seconds, which I tracked down to chardet.detect. For what it's worth, in my case, I'm parsing a sitemap, which is XML, so the algorithm for guessing the encoding should be something like: https://www.oreilly.com/library/view/python-cookbook/0596001673/ch12s11.html.

I've implemented this outside of httpx via response.content. Do you see supporting the XML encoding guessing or a custom guessing function fitting in here?

Not really, no, although if we could provide enough good API for allowing it as a third party option then that'd be a good route.

There is a setter for encoding exposed on Response, which I think is sufficient now that I've thought about it a little more.

Would you be open to being able to set the encoding in __init__? I'd use it when reconstructing Response from an on-disk cache.

@alexchamberlain Possibly, but it'd need some careful thinking about around how to make Response API nicely coherent in that area. (Eg. see https://github.com/encode/httpx/issues/1091)

I probably wouldn't want us to accept any API changes on it at the moment.

Was this page helpful?
0 / 5 - 0 ratings