Requests: Charset detection in text() can be pathologically slow

Created on 27 Nov 2014 · 10 Comments · Source: psf/requests

When response.text is accessed and no encoding was set or determined, requests relies on chardet to detect the encoding:

    @property
    def apparent_encoding(self):
        """The apparent encoding, provided by the chardet library"""
        return chardet.detect(self.content)['encoding']

Unfortunately, chardet can be pathologically slow and memory-hungry while doing its job. For example, processing the text property of a response with the following content:

"a" * (1024 * 1024) + "\xa9"

causes python-requests to use 131 MB of memory and take 40+ seconds (for a 1 MB response!)
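The slowdown is easy to reproduce outside of requests. A minimal sketch (assuming chardet is installed, and guarding the import so the snippet still runs without it):

```python
import time

# Reproduce the pathological input from the report: 1 MiB of ASCII
# followed by a single non-ASCII byte (0xA9), which defeats chardet's
# ASCII fast path and forces its probers to walk the whole buffer.
payload = b"a" * (1024 * 1024) + b"\xa9"

try:
    import chardet
    start = time.perf_counter()
    guess = chardet.detect(payload)
    elapsed = time.perf_counter() - start
    print(guess['encoding'], 'detected in', round(elapsed, 2), 's')
except ImportError:
    pass  # chardet not installed; the payload alone illustrates the input shape
```

Timings will vary by machine and chardet version, but the detection time grows with the size of the buffer chardet has to scan.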


All 10 comments

We're aware of this. Do you have a proposal to replace it with something better?

How about using cChardet ?

We can't vendor anything that uses C extensions, so no, @rsnair2, we can't use cChardet.

Couldn't it be a solution in @marcocova's case, though? Instead of letting requests determine the charset, he could use cChardet and tell requests what charset to expect.

@Terr the solution of setting the encoding manually is well documented, so I would hope that @marcocova had considered that.

@sigmavirus24, yes, workarounds are well understood. I was just concerned that the default behavior leaves the user exposed to this issue (with no warnings in the docs - that I could find at least).

Closing due to inactivity

@marcocova I had the same issue, then I found this: https://pypi.python.org/pypi/cchardet/
From 5 seconds (chardet), I got down to 1 millisecond :)

In case others are coming here, as I did, and wondering why cchardet isn't included in requests, well, I can provide an answer just gleaned by talking to one of the maintainers on IRC. There was, at one time, a conditional import for cchardet, but it has since been removed. I asked why. Two reasons. First, chardet and cchardet are not fully compatible and have different strengths and accuracies. So, having a conditional import means that requests wouldn't be deterministic. This is a very bad thing.

The second reason is that conditional imports are vaguely causing trouble in other areas of requests that the devs want to trim down. I don't know the details here, exactly, but there's an import for simplejson that they say has caused trouble, so they're disinclined to do more conditional imports.

So, if you want faster processing and you're comfortable with the fact that cchardet can return different results than chardet, you can do this in your code just before the first time you access r.text:

    import cchardet

    if r.encoding is None:
        # Requests detects the encoding from the HTTP headers when the
        # item is GET'ed, and falls back to charset detection when
        # r.text is accessed if no encoding has been set by that point.
        # By setting the encoding here, we ensure the fallback detection
        # is done by cchardet, before r.text is ever accessed (which
        # would otherwise do it with vanilla chardet). This is a big
        # performance boon.
        r.encoding = cchardet.detect(r.content)['encoding']

Following up on @sigmavirus24's and @mlissner's comments (advising you to set r.encoding yourself before any call to .text):

Note that if the payload is binary (e.g. downloading a .gz file from S3), cchardet will quickly return an encoding of None.
But setting r.encoding = None is a no-op, so you still have to refrain from calling .text or .apparent_encoding afterwards, or these would trigger a new, slow chardet detection.
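One way to handle that case is a small guard that only sets the encoding when detection succeeds, and otherwise applies an explicit fallback so a later .text access can't fall through to chardet. A minimal sketch (the set_encoding_safely helper and the utf-8 fallback are my own choices here, not part of requests):

```python
def set_encoding_safely(r, fallback='utf-8'):
    """Set r.encoding via cchardet before r.text is first accessed.

    If detection fails (e.g. on a binary payload), apply `fallback`
    instead of leaving encoding as None, so a later .text access
    cannot trigger requests' slow chardet-based apparent_encoding.
    """
    try:
        import cchardet
    except ImportError:
        return  # cchardet unavailable; keep requests' default behaviour
    if r.encoding is None:
        guess = cchardet.detect(r.content)['encoding']
        r.encoding = guess if guess is not None else fallback
```

Whether forcing a fallback encoding is acceptable depends on your use case; for binary payloads you probably shouldn't be calling .text at all.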

I took the path of... monkey-patching apparent_encoding... ¯\\_(ツ)_/¯
The following seems to work (Python 3.7, Requests 2.22.0, YMMV):

    import requests
    import cchardet

    class ForceCchardet:
        @property
        def apparent_encoding(self):
            return cchardet.detect(self.content)['encoding']

    requests.Response.apparent_encoding = ForceCchardet.apparent_encoding