Requests: Charset detection in text() can be pathologically slow

Created on 27 Nov 2014 · 10 Comments · Source: psf/requests

When response.text is accessed and no encoding was set or determined, requests relies on chardet to detect the encoding:

    @property
    def apparent_encoding(self):
        """The apparent encoding, provided by the chardet library"""
        return chardet.detect(self.content)['encoding']

Unfortunately, chardet can be pathologically slow and memory-hungry while doing its job. For example, processing the text property of a response with the following content:

"a" * (1024 * 1024) + "\xa9"

causes python-requests to use 131 MB of memory and take 40+ seconds (for a 1 MB response!)
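The slowdown is easy to reproduce outside of requests. A minimal sketch (assuming chardet is installed, and guarding the import so the snippet still runs without it):

```python
import time

# Reproduce the pathological input from the report: 1 MiB of ASCII
# followed by a single non-ASCII byte (0xA9), which defeats chardet's
# ASCII fast path and forces its probers to walk the whole buffer.
payload = b"a" * (1024 * 1024) + b"\xa9"

try:
    import chardet
    start = time.perf_counter()
    guess = chardet.detect(payload)
    elapsed = time.perf_counter() - start
    print(guess['encoding'], 'detected in', round(elapsed, 2), 's')
except ImportError:
    pass  # chardet not installed; the payload alone illustrates the input shape
```

Timings will vary by machine and chardet version, but the detection time grows with the size of the buffer chardet has to scan.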


All 10 comments

We're aware of this. Do you have a proposal to replace it with something better?

How about using cChardet ?

We can't vendor anything that uses C extensions, so no, @rsnair2, we can't use cChardet.

Couldn't it be a solution in @marcocova's case, though? Instead of letting requests determine the charset, he could use cChardet and tell requests what charset to expect.

@Terr the solution of setting the encoding manually is well documented, so I would hope that @marcocova had considered that.

@sigmavirus24, yes, workarounds are well understood. I was just concerned that the default behavior leaves the user exposed to this issue (with no warnings in the docs - that I could find at least).

Closing due to inactivity

@marcocova I had the same issue, then I found this: https://pypi.python.org/pypi/cchardet/
From 5 seconds (chardet), I got down to 1 millisecond :)

In case others are coming here, as I did, and wondering why cchardet isn't included in requests, well, I can provide an answer just gleaned by talking to one of the maintainers on IRC. There was, at one time, a conditional import for cchardet, but it has since been removed. I asked why. Two reasons. First, chardet and cchardet are not fully compatible and have different strengths and accuracies. So, having a conditional import means that requests wouldn't be deterministic. This is a very bad thing.

The second reason is that conditional imports are vaguely causing trouble in other areas of requests that the devs want to trim down. I don't know the details here, exactly, but there's an import for simplejson that they say has caused trouble, so they're disinclined to do more conditional imports.

So, if you want faster processing and you're comfortable with the fact that cchardet can return different results than chardet, you can do this in your code just before the first time you access r.text:

    import cchardet

    if r.encoding is None:
        # Requests detects the encoding from the HTTP headers when the
        # item is GET'ed, and falls back to charset detection when
        # r.text is accessed if no encoding has been set by that point.
        # By setting the encoding here, we ensure the fallback detection
        # is done by cchardet, before r.text is ever accessed (which
        # would otherwise do it with vanilla chardet). This is a big
        # performance boon.
        r.encoding = cchardet.detect(r.content)['encoding']

Following up on @sigmavirus24's and @mlissner's comments (advising you to set r.encoding yourself before any call to .text):

Note that if the payload is binary (e.g. downloading a .gz file from S3), cchardet will quickly return an encoding of None.
But setting r.encoding = None is a no-op, so you still have to refrain from calling .text or .apparent_encoding afterwards, or these would trigger a new, slow chardet detection.
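One way to handle that case is a small guard that only sets the encoding when detection succeeds, and otherwise applies an explicit fallback so a later .text access can't fall through to chardet. A minimal sketch (the set_encoding_safely helper and the utf-8 fallback are my own choices here, not part of requests):

```python
def set_encoding_safely(r, fallback='utf-8'):
    """Set r.encoding via cchardet before r.text is first accessed.

    If detection fails (e.g. on a binary payload), apply `fallback`
    instead of leaving encoding as None, so a later .text access
    cannot trigger requests' slow chardet-based apparent_encoding.
    """
    try:
        import cchardet
    except ImportError:
        return  # cchardet unavailable; keep requests' default behaviour
    if r.encoding is None:
        guess = cchardet.detect(r.content)['encoding']
        r.encoding = guess if guess is not None else fallback
```

Whether forcing a fallback encoding is acceptable depends on your use case; for binary payloads you probably shouldn't be calling .text at all.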

I took the path of... monkey-patching apparent_encoding... ¯\\_(ツ)_/¯
The following seems to work (Python 3.7, Requests 2.22.0, YMMV):

    import requests
    import cchardet

    class ForceCchardet:
        @property
        def apparent_encoding(self):
            return cchardet.detect(self.content)['encoding']

    requests.Response.apparent_encoding = ForceCchardet.apparent_encoding