Using requests to download http://hdfeos.org/software/pyhdf/pyhdf-0.9.0.tar.gz results in a corrupted file.
Other methods of downloading the file (wget, urllib.urlretrieve) download the file correctly.
When the file downloads correctly, the sha256 listed here can be obtained:
$ sha256sum pyhdf-0.9.0.tar.gz
c20c58e53f8fbdc47a1fcdec954262528f486cfcb4efa7e1c2e8847ad3e8092f pyhdf-0.9.0.tar.gz
When the file is downloaded with requests, the sha256 is different:
$ sha256sum pyhdf-0.9.0.tar.gz
f1fd2d72838f30fc4e3a7688a4e1a395483b08e54f4955f3b4d384639e13c67a pyhdf-0.9.0.tar.gz
import requests
resp = requests.get('http://hdfeos.org/software/pyhdf/pyhdf-0.9.0.tar.gz')
with open('pyhdf-0.9.0.tar.gz', 'w') as f:
    f.write(resp.content)
$ python -m requests.help
{
  "chardet": {
    "version": "3.0.2"
  },
  "cryptography": {
    "version": "1.9"
  },
  "idna": {
    "version": ""
  },
  "implementation": {
    "name": "CPython",
    "version": "2.7.13"
  },
  "platform": {
    "release": "2.6.32-696.10.3.el6.x86_64",
    "system": "Linux"
  },
  "pyOpenSSL": {
    "openssl_version": "100020bf",
    "version": "16.2.0"
  },
  "requests": {
    "version": "2.18.4"
  },
  "system_ssl": {
    "version": "100020cf"
  },
  "urllib3": {
    "version": "1.21.1"
  },
  "using_pyopenssl": true
}
Requests gunzips the file:
$ file pyhdf-0.9.0.tar.gz
pyhdf-0.9.0.tar.gz: POSIX tar archive (GNU)
$ tar tf pyhdf-0.9.0.tar.gz
pyhdf-0.9.0/
pyhdf-0.9.0/PKG-INFO
pyhdf-0.9.0/pyhdf/
pyhdf-0.9.0/pyhdf/error.py
pyhdf-0.9.0/pyhdf/HC.py
pyhdf-0.9.0/pyhdf/HDF.py
pyhdf-0.9.0/pyhdf/hdfext.py
pyhdf-0.9.0/pyhdf/hdfext_wrap.c
pyhdf-0.9.0/pyhdf/SD.py
pyhdf-0.9.0/pyhdf/six.py
pyhdf-0.9.0/pyhdf/V.py
pyhdf-0.9.0/pyhdf/VS.py
pyhdf-0.9.0/pyhdf/__init__.py
pyhdf-0.9.0/pyhdf.egg-info/
pyhdf-0.9.0/pyhdf.egg-info/dependency_links.txt
pyhdf-0.9.0/pyhdf.egg-info/PKG-INFO
pyhdf-0.9.0/pyhdf.egg-info/SOURCES.txt
pyhdf-0.9.0/pyhdf.egg-info/top_level.txt
pyhdf-0.9.0/README.rst
pyhdf-0.9.0/setup.cfg
pyhdf-0.9.0/setup.py
Use stream=True:
Great, thanks for your help!
Sorry, I closed this prematurely. stream=True gives the same result...
Now:
with open('pyhdf-0.9.0.tar.gz', 'wb') as f:
    f.write(resp.raw.data)
@djkirkham, stream=True should not be necessary unless you have a need for streaming the content.
You need to write resp.content to a file opened with the wb flag, not w. We've already ungzipped the content for you, so it will just be the tar file.
The goal is to verify the SHA256, so writing the ungzipped content is probably not an option.
Actually it's not in my code. I'm using conda build, which is failing on this file.
What are the rules about when requests will gunzip a file? This is the first time I've run into this problem, and I've been using conda build extensively with .tar.gz files.
Requests will return uncompressed content by default so it can be written directly to file or passed around your code in the intended base format. Many web servers gzip content that would otherwise be sent uncompressed, to reduce the number of bytes going across the wire.
The solution @lutzhorn noted is probably the best one if you need the content compressed. Otherwise, you'll receive the tar file.
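For anyone who needs the bytes exactly as the server sent them, here is a minimal sketch of the raw-stream approach. The helper name save_raw and the chunk size are made up for illustration. It relies on the assumption that reading from resp.raw after a stream=True request yields the body without Content-Encoding being undone, which is my understanding of how Requests sets up the underlying urllib3 response.

```python
import hashlib


def save_raw(raw, path, chunk_size=64 * 1024):
    """Copy bytes from a file-like object to path unchanged.

    Returns the SHA-256 hex digest of the bytes written.
    save_raw is a hypothetical helper, not part of Requests.
    """
    digest = hashlib.sha256()
    with open(path, 'wb') as f:
        while True:
            chunk = raw.read(chunk_size)
            if not chunk:
                break
            f.write(chunk)
            digest.update(chunk)
    return digest.hexdigest()


# Intended use with Requests (needs network access, so left commented out).
# Assumption: resp.raw.read() does not decode Content-Encoding here.
#   import requests
#   url = 'http://hdfeos.org/software/pyhdf/pyhdf-0.9.0.tar.gz'
#   resp = requests.get(url, stream=True)
#   print(save_raw(resp.raw, 'pyhdf-0.9.0.tar.gz'))
```

Hashing while writing avoids reading the file a second time, and the returned digest can be compared directly against the published sha256.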
If this is conda-build's code, it sounds like they may not be using Requests with this use case in mind. You'll likely want to open an issue there to discuss with them.
I've noticed that the response headers for the URL in question contain the key/value pair 'Content-Encoding': 'gzip'. Is this what causes requests to gunzip it? Other files I've looked at don't contain this key.
This seems like a problem with the response itself. Based on my limited understanding, it seems like specifying a Content-Encoding of gzip is incorrect, as the intention is to provide a gzipped file to the client.
In any case, it's not an issue with requests. Thanks for your help.
Yes, that is why Requests decompresses it. Content-Encoding means that the compression is not an intrinsic part of the file, so we remove it.
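To illustrate why the checksums differ, here is a rough stdlib-only sketch. The function decode_body and the byte strings are invented for the example. It only mimics the general idea of a client undoing Content-Encoding: gzip and is not how Requests is actually implemented.

```python
import gzip
import hashlib


def decode_body(headers, body):
    """Mimic a client that undoes Content-Encoding: gzip (hypothetical helper)."""
    if headers.get('Content-Encoding') == 'gzip':
        return gzip.decompress(body)
    return body


# Stand-ins for the real files: the tar archive and the .tar.gz stored on the server.
tar_bytes = b'pretend this is an uncompressed tar archive' * 10
on_disk = gzip.compress(tar_bytes)

# The published checksum is computed over the .tar.gz as stored.
expected = hashlib.sha256(on_disk).hexdigest()

# Server labels the file's own compression as a content encoding: client strips it.
with_header = decode_body({'Content-Encoding': 'gzip'}, on_disk)
# Server sends the same bytes without that header: client leaves them alone.
without_header = decode_body({}, on_disk)

print(hashlib.sha256(with_header).hexdigest() == expected)     # False
print(hashlib.sha256(without_header).hexdigest() == expected)  # True
```

With the header present, the client ends up holding the tar bytes, so the checksum of the saved file can no longer match the one published for the .tar.gz.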