Environment
Description
Downloads of github release packages are not cached.
Expected behavior
Subsequent downloads read from the local cache, instead of downloading again.
How to Reproduce
$ python3 -m venv --clear /tmp/venv
$ /tmp/venv/bin/pip install --upgrade pip
Collecting pip
Using cached https://files.pythonhosted.org/packages/36/74/38c2410d688ac7b48afa07d413674afc1f903c1c1f854de51dc8eb2367a5/pip-20.2-py2.py3-none-any.whl
Installing collected packages: pip
Found existing installation: pip 18.1
Uninstalling pip-18.1:
Successfully uninstalled pip-18.1
Successfully installed pip-20.2
$ /tmp/venv/bin/pip install https://github.com/tekumara/spark/releases/download/v2.4.5-cloud/pyspark-2.4.5.tar.gz
Collecting https://github.com/tekumara/spark/releases/download/v2.4.5-cloud/pyspark-2.4.5.tar.gz
Downloading https://github.com/tekumara/spark/releases/download/v2.4.5-cloud/pyspark-2.4.5.tar.gz (268.3 MB)
|███████▌ | 63.1 MB 4.5 MB/s eta 0:00:46
When the above is repeated, pyspark-2.4.5.tar.gz is downloaded again, rather than used from the cache.
Strange, it is successfully cached for me.
Could you provide the full logs of pip install https://github.com/tekumara/spark/releases/download/v2.4.5-cloud/pyspark-2.4.5.tar.gz -v --no-deps on the first and second runs?
As far as I know, pip caches HTTP downloads based on the standard HTTP caching headers. Is GitHub sending the correct ETag headers to confirm that the file hasn't changed?
A quick check with curl shows the pip wheel having a Cache-Control header, but github not having one.
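The kind of header check described above could be sketched in Python roughly as follows. This is only an illustration: the helper name `describe_cacheability` and the sample header values are hypothetical, and the rules are a simplification of HTTP caching, not what pip or its vendored cachecontrol actually implement.

```python
def describe_cacheability(headers):
    """Roughly classify a response by its caching-related headers.

    `headers` is a plain dict of header name -> value (hypothetical input).
    """
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}

    # Split "max-age=60, public" into the directive names {"max-age", "public"}.
    directives = {
        part.strip().lower().split("=", 1)[0]
        for part in lowered.get("cache-control", "").split(",")
        if part.strip()
    }

    if "no-store" in directives:
        return "must not be stored"
    if "no-cache" in directives:
        return "may be stored, but must be revalidated before reuse"
    if "max-age" in directives or "expires" in lowered:
        return "can be reused until it expires"
    if "etag" in lowered or "last-modified" in lowered:
        return "can be revalidated with a conditional request"
    return "no caching information"


# Hypothetical header sets, loosely modelled on the two cases in this thread.
wheel_like = {"Cache-Control": "max-age=365000000, immutable, public"}
release_like = {"cache-control": "no-cache"}

print(describe_cacheability(wheel_like))
print(describe_cacheability(release_like))
```

Passing in the real headers printed by `curl -I` would show which of these rough categories each URL falls into.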
I'm not sure I personally would want it cached. pip has no way to tell whether an arbitrary URL points to a stable target, e.g. if the URL points to a branch instead of a tag. It has to download the package to make sure.
@uranusjr HTTP-level caching should be fine (but as I say, github needs to send the cache control headers). But I agree pip shouldn't do any caching of its own because we can't validate that the file is unchanged.
@pfmoore Ah right, I confused this with the wheel cache rules, which do not apply here, of course. HTTP caching should be fine. Sorry.
The issue report is slightly confusing, because the pip upgrade shows use of the wheel cache, which is different. But the pyspark install is from a direct URL, so the wheel cache doesn't apply.
Yes, I also noticed the GitHub URL has a cache-control: no-cache header.
Here's the verbose logs from the first run:
$ /tmp/venv/bin/pip install https://github.com/tekumara/spark/releases/download/v2.4.5-cloud/pyspark-2.4.5.tar.gz -v --no-deps
Using pip 20.2 from /tmp/venv/lib/python3.6/site-packages/pip (python 3.6)
Non-user install because user site-packages disabled
Created temporary directory: /private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-ephem-wheel-cache-kvd8heku
Created temporary directory: /private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-req-tracker-999d4rq7
Initialized build tracking at /private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-req-tracker-999d4rq7
Created build tracker: /private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-req-tracker-999d4rq7
Entered build tracker: /private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-req-tracker-999d4rq7
Created temporary directory: /private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-install-ezxfg51e
Collecting https://github.com/tekumara/spark/releases/download/v2.4.5-cloud/pyspark-2.4.5.tar.gz
Created temporary directory: /private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-req-build-ntw5z_qc
Created temporary directory: /private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-unpack-33i419vo
Looking up "https://github.com/tekumara/spark/releases/download/v2.4.5-cloud/pyspark-2.4.5.tar.gz" in the cache
No cache entry available
Starting new HTTPS connection (1): github.com:443
https://github.com:443 "GET /tekumara/spark/releases/download/v2.4.5-cloud/pyspark-2.4.5.tar.gz HTTP/1.1" 302 633
Status code 302 not in (200, 203, 300, 301)
Looking up "https://github-production-release-asset-2e65be.s3.amazonaws.com/264437006/1b585a80-b367-11ea-9ddd-7283bd314577?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20200729%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20200729T112206Z&X-Amz-Expires=300&X-Amz-Signature=44170681905235fed8b3bf1db42ca98f3cf88d8a1144f7f3482367e9863f0cdb&X-Amz-SignedHeaders=host&actor_id=0&repo_id=264437006&response-content-disposition=attachment%3B%20filename%3Dpyspark-2.4.5.tar.gz&response-content-type=application%2Foctet-stream" in the cache
No cache entry available
Starting new HTTPS connection (1): github-production-release-asset-2e65be.s3.amazonaws.com:443
https://github-production-release-asset-2e65be.s3.amazonaws.com:443 "GET /264437006/1b585a80-b367-11ea-9ddd-7283bd314577?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20200729%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20200729T112206Z&X-Amz-Expires=300&X-Amz-Signature=44170681905235fed8b3bf1db42ca98f3cf88d8a1144f7f3482367e9863f0cdb&X-Amz-SignedHeaders=host&actor_id=0&repo_id=264437006&response-content-disposition=attachment%3B%20filename%3Dpyspark-2.4.5.tar.gz&response-content-type=application%2Foctet-stream HTTP/1.1" 200 268330064
Downloading https://github.com/tekumara/spark/releases/download/v2.4.5-cloud/pyspark-2.4.5.tar.gz (268.3 MB)
|████████████████████████████████| 268.3 MB 5.0 MB/s eta 0:00:01
Updating cache with response from "https://github-production-release-asset-2e65be.s3.amazonaws.com/264437006/1b585a80-b367-11ea-9ddd-7283bd314577?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20200729%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20200729T112206Z&X-Amz-Expires=300&X-Amz-Signature=44170681905235fed8b3bf1db42ca98f3cf88d8a1144f7f3482367e9863f0cdb&X-Amz-SignedHeaders=host&actor_id=0&repo_id=264437006&response-content-disposition=attachment%3B%20filename%3Dpyspark-2.4.5.tar.gz&response-content-type=application%2Foctet-stream"
Caching due to etag
|████████████████████████████████| 268.3 MB 13 kB/s
Added https://github.com/tekumara/spark/releases/download/v2.4.5-cloud/pyspark-2.4.5.tar.gz to build tracker '/private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-req-tracker-999d4rq7'
Running setup.py (path:/private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-req-build-ntw5z_qc/setup.py) egg_info for package from https://github.com/tekumara/spark/releases/download/v2.4.5-cloud/pyspark-2.4.5.tar.gz
Created temporary directory: /private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-pip-egg-info-nhk_8sb5
Running command python setup.py egg_info
Could not import pypandoc - required to package PySpark
zip_safe flag not set; analyzing archive contents...
pypandoc.__pycache__.__init__.cpython-36: module references __file__
Installed /private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-req-build-ntw5z_qc/.eggs/pypandoc-1.5-py3.6.egg
Searching for wheel>=0.25.0
Reading https://pypi.org/simple/wheel/
Downloading https://files.pythonhosted.org/packages/8c/23/848298cccf8e40f5bbb59009b32848a4c38f4e7f3364297ab3c3e2e2cd14/wheel-0.34.2-py2.py3-none-any.whl#sha256=df277cb51e61359aba502208d680f90c0493adec6f0e848af94948778aed386e
Best match: wheel 0.34.2
Processing wheel-0.34.2-py2.py3-none-any.whl
Installing wheel-0.34.2-py2.py3-none-any.whl to /private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-req-build-ntw5z_qc/.eggs
writing requirements to /private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-req-build-ntw5z_qc/.eggs/wheel-0.34.2-py3.6.egg/EGG-INFO/requires.txt
Installed /private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-req-build-ntw5z_qc/.eggs/wheel-0.34.2-py3.6.egg
running egg_info
creating /private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-pip-egg-info-nhk_8sb5/pyspark.egg-info
writing /private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-pip-egg-info-nhk_8sb5/pyspark.egg-info/PKG-INFO
writing dependency_links to /private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-pip-egg-info-nhk_8sb5/pyspark.egg-info/dependency_links.txt
writing requirements to /private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-pip-egg-info-nhk_8sb5/pyspark.egg-info/requires.txt
writing top-level names to /private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-pip-egg-info-nhk_8sb5/pyspark.egg-info/top_level.txt
writing manifest file '/private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-pip-egg-info-nhk_8sb5/pyspark.egg-info/SOURCES.txt'
package init file 'deps/bin/__init__.py' not found (or not a regular file)
package init file 'deps/jars/__init__.py' not found (or not a regular file)
package init file 'pyspark/python/pyspark/__init__.py' not found (or not a regular file)
package init file 'lib/__init__.py' not found (or not a regular file)
package init file 'deps/data/__init__.py' not found (or not a regular file)
package init file 'deps/licenses/__init__.py' not found (or not a regular file)
package init file 'deps/examples/__init__.py' not found (or not a regular file)
reading manifest file '/private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-pip-egg-info-nhk_8sb5/pyspark.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no previously-included files matching '*.py[cod]' found anywhere in distribution
warning: no previously-included files matching '__pycache__' found anywhere in distribution
warning: no previously-included files matching '.DS_Store' found anywhere in distribution
writing manifest file '/private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-pip-egg-info-nhk_8sb5/pyspark.egg-info/SOURCES.txt'
Source in /private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-req-build-ntw5z_qc has version 2.4.5, which satisfies requirement pyspark==2.4.5 from https://github.com/tekumara/spark/releases/download/v2.4.5-cloud/pyspark-2.4.5.tar.gz
Removed pyspark==2.4.5 from https://github.com/tekumara/spark/releases/download/v2.4.5-cloud/pyspark-2.4.5.tar.gz from build tracker '/private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-req-tracker-999d4rq7'
Using legacy 'setup.py install' for pyspark, since package 'wheel' is not installed.
Installing collected packages: pyspark
....
The second run looks the same.
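One possible reading of the log above: the final download URL carries time-bound signing parameters (X-Amz-Date, X-Amz-Expires=300, X-Amz-Signature), so a later run would be redirected to a different URL string. The sketch below illustrates that idea only. The two URLs are hypothetical, shortened stand-ins, the helper name `signed_params` is made up, and it assumes the cache is looked up by the full URL, as the 'Looking up "..." in the cache' lines suggest.

```python
from urllib.parse import parse_qs, urlsplit


def signed_params(url):
    """Return the X-Amz-* query parameters of a URL as a dict."""
    query = parse_qs(urlsplit(url).query)
    return {key: values[0] for key, values in query.items() if key.startswith("X-Amz-")}


# Hypothetical redirect targets for two separate runs (not real URLs).
first = (
    "https://example.invalid/asset"
    "?X-Amz-Date=20200729T112206Z&X-Amz-Expires=300&X-Amz-Signature=aaaa"
)
second = (
    "https://example.invalid/asset"
    "?X-Amz-Date=20200729T113000Z&X-Amz-Expires=300&X-Amz-Signature=bbbb"
)

first_params = signed_params(first)
second_params = signed_params(second)

# Parameters whose values differ between the two runs.
changed = sorted(key for key in first_params if first_params[key] != second_params.get(key))

print(changed)
# Under the full-URL assumption, differing strings mean differing cache keys.
print(first == second)
```

If that assumption holds, the entry stored for the first signed URL would never be found under the second one, which would be consistent with both runs reporting "No cache entry available".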
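Hi,
after reviewing the reported issue, IMO there is a limitation in the vendored cachecontrol library. Cachecontrol only caches a response if its status code is one of 200, 203, 300 or 301 (see "Status code 302 not in (200, 203, 300, 301)" in the log above), and GitHub responds with a 302.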
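As a minimal sketch of that check, assuming only what the log line "Status code 302 not in (200, 203, 300, 301)" shows: the constant and function names below are hypothetical and are not cachecontrol's actual API.

```python
# Status codes copied from the log line
# "Status code 302 not in (200, 203, 300, 301)".
CACHEABLE_STATUS_CODES = (200, 203, 300, 301)


def is_status_cacheable(status_code):
    """Return True if a response with this status code would be considered for caching."""
    return status_code in CACHEABLE_STATUS_CODES


# 301 (permanent redirect) passes the check; 302 (temporary redirect) does not.
for code in (200, 301, 302):
    print(code, is_status_cacheable(code))
```

Under this check the 302 from github.com is never stored, so every run has to follow the redirect again.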
So if I use a named urlspec, pip will use the cached version rather than downloading it again:
$ pip install 'pyspark @ https://github.com/tekumara/spark/releases/download/v2.4.5-cloud/pyspark-2.4.5.tar.gz'
Processing /Users/tekumara/Library/Caches/pip/wheels/4d/8c/99/636fbcc2942d25483272d77c0654cd921907ff73717f7b5627/pyspark-2.4.5-py2.py3-none-any.whl
Collecting py4j==0.10.7
Using cached py4j-0.10.7-py2.py3-none-any.whl (197 kB)
Installing collected packages: py4j, pyspark
This is using pip 20.2.3. Seems like a pretty workable solution.
Hi, after the feedback in https://github.com/pypa/pip/pull/8960#issuecomment-704294934 and https://github.com/ionrock/cachecontrol/pull/234#issuecomment-710095678, I understand that this response is intended (and expected) to change. So IMO this ticket can be considered not a bug: the current behavior is the expected one, and the ticket can be closed, can't it? @uranusjr @pfmoore
Yup. thanks @eamanu! ^>^