Requests: Requests extremely slow compared to urllib.request.urlopen

Created on 16 May 2017 · 9 Comments · Source: psf/requests

I'm seeing poor performance when fetching the following URL with requests:

http://greycite.knowledgeblog.org/json?uri=http%3A%2F%2Fblog.dhimmel.com%2Firreproducible-timestamps%2F

The following notebook screenshot highlights the issue:

[Screenshot: requests-greycite]

The request takes less than a second using urllib.request.urlopen but 15 seconds using requests.get. I've also tried curl and my web browser, both of which retrieve the response quickly. Profiling suggests the holdup is in {method 'recv_into' of '_socket.socket' objects}.

Any help would be appreciated.

All 9 comments

Just to remove influence from importing the modules, could you repeat the tests measuring the time, after the import statements?

could you repeat the tests measuring the time, after the import statements?

@HelioGuilherme66 see below:

[Screenshot: requests-greycite-2]

To see if the problem replicates on your system, check whether the following is slow or fast:

import requests
url = "http://greycite.knowledgeblog.org/json?uri=http%3A%2F%2Fblog.dhimmel.com%2Firreproducible-timestamps%2F"
response = requests.get(url)

Note that I've also confirmed the request is slow in requests version 2.14.2.

The remote server is at fault here: its response is not really valid HTTP/1.1. Here are the response headers:

HTTP/1.1 200 OK
Date: Wed, 17 May 2017 17:11:52 GMT
Server: Apache
Content-Type : application/citeproc+json
Access-Control-Allow-Origin: *
Content-Length: 426
Keep-Alive: timeout=15, max=50
Connection: Keep-Alive
Content-Type: text/html; charset=UTF-8

The problem is the first Content-Type header. You are not allowed to have whitespace between the header name and the colon: the specification forbids it. Python 3 hits a parsing problem on this line, and so only sees the headers before it:

>>> r.headers
{'Date': 'Wed, 17 May 2017 17:11:52 GMT', 'Server': 'Apache'}
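The rule about whitespace before the colon can be checked mechanically. Below is a minimal sketch, not what Requests or the standard library actually does: the helper name `find_malformed_headers` is made up for illustration, and the token pattern follows the field-name grammar in RFC 7230 as I understand it.

```python
import re

# Characters permitted in a header field name (an RFC 7230 "token").
_FIELD_NAME = re.compile(r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+")


def find_malformed_headers(raw_header_lines):
    """Return header lines whose field name is not a valid token directly followed by a colon."""
    malformed = []
    for line in raw_header_lines:
        name, sep, _value = line.partition(":")
        # A valid field name contains no whitespace, so "Content-Type "
        # (with a trailing space) fails the check.
        if not sep or not _FIELD_NAME.fullmatch(name):
            malformed.append(line)
    return malformed


# Header lines copied from the response quoted above (status line omitted).
response_headers = [
    "Date: Wed, 17 May 2017 17:11:52 GMT",
    "Server: Apache",
    "Content-Type : application/citeproc+json",
    "Access-Control-Allow-Origin: *",
    "Content-Length: 426",
    "Keep-Alive: timeout=15, max=50",
    "Connection: Keep-Alive",
    "Content-Type: text/html; charset=UTF-8",
]

print(find_malformed_headers(response_headers))
# → ['Content-Type : application/citeproc+json']
```

Only the third header line is flagged, which matches the point at which r.headers stops.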

Because we only see those headers, we don't know what the content-length is, so we have to wait for the TCP FIN to work out when the end of the body is. That's why we're taking so long. I suspect this is related to a bug in the standard library: probably CPython issue 24363.
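To make the reasoning concrete, here is a simplified sketch of how a client decides where the body ends. It is an illustration only, not the code path Requests or http.client actually uses: `read_body` is a hypothetical helper, chunked transfer encoding is ignored, and an in-memory stream stands in for the socket.

```python
import io


def read_body(stream, headers):
    """Read a response body and report how the end of the body was found."""
    length = headers.get("Content-Length")
    if length is not None:
        # The length is known: read exactly that many bytes and stop.
        return stream.read(int(length)), "content-length"
    # No usable length: the only end marker is the peer closing the
    # connection (TCP FIN), which read() reports as end of stream.
    return stream.read(), "connection-close"


body = b'{"title": "example"}'  # placeholder payload, not the real response

# All headers parsed: the client stops after len(body) bytes.
full_headers = {"Server": "Apache", "Content-Length": str(len(body))}
print(read_body(io.BytesIO(body), full_headers)[1])  # → content-length

# Only the headers before the malformed line were parsed, as in r.headers above.
truncated_headers = {"Server": "Apache"}
print(read_body(io.BytesIO(body), truncated_headers)[1])  # → connection-close
```

With an in-memory stream the second call returns at once. On a real keep-alive socket it would block until the server gives up on the idle connection, which is consistent with the `Keep-Alive: timeout=15` header and the 15 seconds reported.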

(For note, the reason urllib.request is so fast is because it doesn't use persistent connections: that is, it sends the header Connection: close. This forces the server to close the connection immediately, so that TCP FIN comes quickly. You can reproduce this in Requests by sending that same header.)
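A minimal sketch of that workaround follows. It only builds the request and inspects its headers, so it runs without network access. The helper name `prepare_with_connection_close` is made up for this example.

```python
import requests

url = "http://greycite.knowledgeblog.org/json?uri=http%3A%2F%2Fblog.dhimmel.com%2Firreproducible-timestamps%2F"


def prepare_with_connection_close(session, url):
    """Build a GET request whose Connection header is overridden to 'close'."""
    request = requests.Request("GET", url, headers={"Connection": "close"})
    # prepare_request merges per-request headers over the session defaults,
    # so this replaces the session's default "Connection: keep-alive".
    return session.prepare_request(request)


with requests.Session() as session:
    prepared = prepare_with_connection_close(session, url)
    print(prepared.headers["Connection"])  # → close
    # To actually send it (network access required):
    # response = session.send(prepared)
```

For a one-off call, the shorter form `requests.get(url, headers={"Connection": "close"})` should have the same effect.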

@Lukasa thanks for diagnosing the problem. I've confirmed that I get the same faulty header of Content-Type : application/citeproc+json when I run:

curl --head http://greycite.knowledgeblog.org/json?uri=http%3A%2F%2Fblog.dhimmel.com%2Firreproducible-timestamps%2F

The Connection: close workaround was successful as shown below:

[Screenshot: requests-fix]

Tagging @phillord who I believe is an author of Greycite and thus may be able to fix the incorrect header.

Thanks for the information. I'll forward it to Lindsay Marshall (who is actually the author of greycite) and will see if we can get this fixed.

Hopefully fixed at our end now.

 In [1]: import requests
    ...: url = "http://greycite.knowledgeblog.org/json?uri=http%3A%2F%2Fblog.dhimmel.com%2Firreproducible-timestamps%2F"
    ...: 

 In [2]: url
 Out[2]: 'http://greycite.knowledgeblog.org/json?uri=http%3A%2F%2Fblog.dhimmel.com%2Firreproducible-timestamps%2F'

 In [3]: %%time
    ...: response = requests.get(url)
    ...: 
 CPU times: user 8 ms, sys: 4 ms, total: 12 ms
 Wall time: 104 ms

Can you confirm? Thanks for the report BTW.

@phillord confirming that the requests.get without the Connection: close header is now speedy. Thanks!

Have you considered putting the Greycite source on GitHub? It's a really great application, and I'm sure many people would be interested in seeing how it works and contributing enhancements.

Hello,

I have a problem very similar to the one described above, with the following URL:

https://api.coinmarketcap.com/v1/ticker/

curl --head https://api.coinmarketcap.com/v1/ticker/
HTTP/1.1 200 OK
Date: Tue, 20 Mar 2018 03:19:00 GMT
Content-Type: application/json
Content-Length: 53976
Connection: keep-alive
Set-Cookie: __cfduid=dda658faa04cd96eb95607fa899d2bcc41521515940; expires=Wed, 20-Mar-19 03:19:00 GMT; path=/; domain=.coinmarketcap.com; HttpOnly; Secure
Access-Control-Allow-Origin: *
CF-Cache-Status: HIT
Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
Server: cloudflare
CF-RAY: 3fe508e2ba069b68-DFW

The response is very slow when fetched with requests:

import requests
API_URL = "https://api.coinmarketcap.com/v1/ticker/"
r = requests.get(API_URL)

headers --->
{'Set-Cookie': '__cfduid=d3a92212eccddc19e51f7d9872801cc5d1521520491; expires=Wed, 20-Mar-19 04:34:51 GMT; path=/; domain=.coinmarketcap.com; HttpOnly; Secure', 'Date': 'Tue, 20 Mar 2018 04:34:51 GMT', 'Access-Control-Allow-Origin': '*', 'Content-Encoding': 'gzip', 'Vary': 'Accept-Encoding', 'Connection': 'keep-alive', 'Expect-CT': 'max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"', 'Server': 'cloudflare', 'CF-Cache-Status': 'HIT', 'Transfer-Encoding': 'chunked', 'CF-RAY': '3fe57802aeac1fe8-DFW', 'Content-Type': 'application/json'}

Regards,
Ed
