Pip: downloads of github releases aren't cached

Created on 29 Jul 2020 · 11 Comments · Source: pypa/pip

Environment

  • pip version: 20.2
  • Python version: 3.6.10
  • OS: macOS

Description

Downloads of github release packages are not cached.

Expected behavior

Subsequent downloads read from the local cache, instead of downloading again.

How to Reproduce

$ python3 -m venv --clear /tmp/venv
$ /tmp/venv/bin/pip install --upgrade pip
Collecting pip
  Using cached https://files.pythonhosted.org/packages/36/74/38c2410d688ac7b48afa07d413674afc1f903c1c1f854de51dc8eb2367a5/pip-20.2-py2.py3-none-any.whl
Installing collected packages: pip
  Found existing installation: pip 18.1
    Uninstalling pip-18.1:
      Successfully uninstalled pip-18.1
Successfully installed pip-20.2

$ /tmp/venv/bin/pip install https://github.com/tekumara/spark/releases/download/v2.4.5-cloud/pyspark-2.4.5.tar.gz

Collecting https://github.com/tekumara/spark/releases/download/v2.4.5-cloud/pyspark-2.4.5.tar.gz
  Downloading https://github.com/tekumara/spark/releases/download/v2.4.5-cloud/pyspark-2.4.5.tar.gz (268.3 MB)
     |███████▌                        | 63.1 MB 4.5 MB/s eta 0:00:46

When the above is repeated, pyspark-2.4.5.tar.gz is downloaded again, rather than used from the cache.

cache

All 11 comments

Strange, it is successfully cached for me.
Could you provide the full logs of pip install https://github.com/tekumara/spark/releases/download/v2.4.5-cloud/pyspark-2.4.5.tar.gz -v --no-deps on the first and second runs?

Pip caches HTTP downloads based on the standard HTTP caching headers, as far as I know. Is GitHub sending the correct ETag headers to confirm that the file hasn't changed?

A quick check with curl shows the pip wheel having a Cache-Control header, but github not having one.
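The header check described above can be sketched as a small decision function. This is a simplified illustration, not CacheControl's actual code: the function name `looks_cacheable` is hypothetical, and the status tuple is copied from the "Status code 302 not in (200, 203, 300, 301)" line in the verbose log further down this thread.

```python
# Simplified sketch of an HTTP cache's "should I store this?" decision.
# Not CacheControl's real implementation; looks_cacheable is a made-up name.

# Status codes taken from the pip -v log line in this thread.
CACHEABLE_STATUSES = (200, 203, 300, 301)


def looks_cacheable(status, headers):
    """Return True if a response would plausibly be stored by an HTTP cache."""
    if status not in CACHEABLE_STATUSES:
        return False

    # Header names are case-insensitive, so normalise them first.
    h = {k.lower(): v for k, v in headers.items()}

    cache_control = h.get("cache-control", "").lower()
    directives = [d.strip() for d in cache_control.split(",") if d.strip()]
    if "no-store" in directives:
        return False

    # Either a validator (ETag) or explicit freshness information is enough
    # for this sketch to consider the response worth storing.
    has_validator = "etag" in h
    has_freshness = "expires" in h or any(
        d.startswith("max-age=") for d in directives
    )
    return has_validator or has_freshness
```

Under these assumptions a 200 response carrying an ETag is stored, while a 302 is rejected on its status code alone, whatever headers it carries.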

I'm not sure I personally would want it cached. pip has no way to tell whether an arbitrary URL points to a stable target, e.g. if the URL points to a branch instead of a tag. It has to download the package to make sure.

@uranusjr HTTP-level caching should be fine (but as I say, github needs to send the cache control headers). But I agree pip shouldn't do any caching of its own because we can't validate that the file is unchanged.

@pfmoore Ah right, I confused this with the wheel cache rules, which do not apply here, of course. HTTP caching should be fine. Sorry.

The issue report is slightly confusing, because the pip upgrade shows use of the wheel cache, which is different. But the pyspark install is from a direct URL, so the wheel cache doesn't apply.

Yes, I also noticed the GitHub URL has a cache-control: no-cache header.

Here are the verbose logs from the first run:

$ /tmp/venv/bin/pip install https://github.com/tekumara/spark/releases/download/v2.4.5-cloud/pyspark-2.4.5.tar.gz -v --no-deps
Using pip 20.2 from /tmp/venv/lib/python3.6/site-packages/pip (python 3.6)
Non-user install because user site-packages disabled
Created temporary directory: /private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-ephem-wheel-cache-kvd8heku
Created temporary directory: /private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-req-tracker-999d4rq7
Initialized build tracking at /private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-req-tracker-999d4rq7
Created build tracker: /private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-req-tracker-999d4rq7
Entered build tracker: /private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-req-tracker-999d4rq7
Created temporary directory: /private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-install-ezxfg51e
Collecting https://github.com/tekumara/spark/releases/download/v2.4.5-cloud/pyspark-2.4.5.tar.gz
  Created temporary directory: /private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-req-build-ntw5z_qc
  Created temporary directory: /private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-unpack-33i419vo
  Looking up "https://github.com/tekumara/spark/releases/download/v2.4.5-cloud/pyspark-2.4.5.tar.gz" in the cache
  No cache entry available
  Starting new HTTPS connection (1): github.com:443
  https://github.com:443 "GET /tekumara/spark/releases/download/v2.4.5-cloud/pyspark-2.4.5.tar.gz HTTP/1.1" 302 633
  Status code 302 not in (200, 203, 300, 301)
  Looking up "https://github-production-release-asset-2e65be.s3.amazonaws.com/264437006/1b585a80-b367-11ea-9ddd-7283bd314577?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20200729%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20200729T112206Z&X-Amz-Expires=300&X-Amz-Signature=44170681905235fed8b3bf1db42ca98f3cf88d8a1144f7f3482367e9863f0cdb&X-Amz-SignedHeaders=host&actor_id=0&repo_id=264437006&response-content-disposition=attachment%3B%20filename%3Dpyspark-2.4.5.tar.gz&response-content-type=application%2Foctet-stream" in the cache
  No cache entry available
  Starting new HTTPS connection (1): github-production-release-asset-2e65be.s3.amazonaws.com:443
  https://github-production-release-asset-2e65be.s3.amazonaws.com:443 "GET /264437006/1b585a80-b367-11ea-9ddd-7283bd314577?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20200729%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20200729T112206Z&X-Amz-Expires=300&X-Amz-Signature=44170681905235fed8b3bf1db42ca98f3cf88d8a1144f7f3482367e9863f0cdb&X-Amz-SignedHeaders=host&actor_id=0&repo_id=264437006&response-content-disposition=attachment%3B%20filename%3Dpyspark-2.4.5.tar.gz&response-content-type=application%2Foctet-stream HTTP/1.1" 200 268330064
  Downloading https://github.com/tekumara/spark/releases/download/v2.4.5-cloud/pyspark-2.4.5.tar.gz (268.3 MB)
     |████████████████████████████████| 268.3 MB 5.0 MB/s eta 0:00:01  Updating cache with response from "https://github-production-release-asset-2e65be.s3.amazonaws.com/264437006/1b585a80-b367-11ea-9ddd-7283bd314577?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20200729%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20200729T112206Z&X-Amz-Expires=300&X-Amz-Signature=44170681905235fed8b3bf1db42ca98f3cf88d8a1144f7f3482367e9863f0cdb&X-Amz-SignedHeaders=host&actor_id=0&repo_id=264437006&response-content-disposition=attachment%3B%20filename%3Dpyspark-2.4.5.tar.gz&response-content-type=application%2Foctet-stream"
  Caching due to etag
     |████████████████████████████████| 268.3 MB 13 kB/s
  Added https://github.com/tekumara/spark/releases/download/v2.4.5-cloud/pyspark-2.4.5.tar.gz to build tracker '/private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-req-tracker-999d4rq7'
    Running setup.py (path:/private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-req-build-ntw5z_qc/setup.py) egg_info for package from https://github.com/tekumara/spark/releases/download/v2.4.5-cloud/pyspark-2.4.5.tar.gz
    Created temporary directory: /private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-pip-egg-info-nhk_8sb5
    Running command python setup.py egg_info
    Could not import pypandoc - required to package PySpark
    zip_safe flag not set; analyzing archive contents...
    pypandoc.__pycache__.__init__.cpython-36: module references __file__

    Installed /private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-req-build-ntw5z_qc/.eggs/pypandoc-1.5-py3.6.egg
    Searching for wheel>=0.25.0
    Reading https://pypi.org/simple/wheel/
    Downloading https://files.pythonhosted.org/packages/8c/23/848298cccf8e40f5bbb59009b32848a4c38f4e7f3364297ab3c3e2e2cd14/wheel-0.34.2-py2.py3-none-any.whl#sha256=df277cb51e61359aba502208d680f90c0493adec6f0e848af94948778aed386e
    Best match: wheel 0.34.2
    Processing wheel-0.34.2-py2.py3-none-any.whl
    Installing wheel-0.34.2-py2.py3-none-any.whl to /private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-req-build-ntw5z_qc/.eggs
    writing requirements to /private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-req-build-ntw5z_qc/.eggs/wheel-0.34.2-py3.6.egg/EGG-INFO/requires.txt

    Installed /private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-req-build-ntw5z_qc/.eggs/wheel-0.34.2-py3.6.egg
    running egg_info
    creating /private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-pip-egg-info-nhk_8sb5/pyspark.egg-info
    writing /private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-pip-egg-info-nhk_8sb5/pyspark.egg-info/PKG-INFO
    writing dependency_links to /private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-pip-egg-info-nhk_8sb5/pyspark.egg-info/dependency_links.txt
    writing requirements to /private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-pip-egg-info-nhk_8sb5/pyspark.egg-info/requires.txt
    writing top-level names to /private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-pip-egg-info-nhk_8sb5/pyspark.egg-info/top_level.txt
    writing manifest file '/private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-pip-egg-info-nhk_8sb5/pyspark.egg-info/SOURCES.txt'
    package init file 'deps/bin/__init__.py' not found (or not a regular file)
    package init file 'deps/jars/__init__.py' not found (or not a regular file)
    package init file 'pyspark/python/pyspark/__init__.py' not found (or not a regular file)
    package init file 'lib/__init__.py' not found (or not a regular file)
    package init file 'deps/data/__init__.py' not found (or not a regular file)
    package init file 'deps/licenses/__init__.py' not found (or not a regular file)
    package init file 'deps/examples/__init__.py' not found (or not a regular file)
    reading manifest file '/private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-pip-egg-info-nhk_8sb5/pyspark.egg-info/SOURCES.txt'
    reading manifest template 'MANIFEST.in'
    warning: no previously-included files matching '*.py[cod]' found anywhere in distribution
    warning: no previously-included files matching '__pycache__' found anywhere in distribution
    warning: no previously-included files matching '.DS_Store' found anywhere in distribution
    writing manifest file '/private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-pip-egg-info-nhk_8sb5/pyspark.egg-info/SOURCES.txt'
  Source in /private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-req-build-ntw5z_qc has version 2.4.5, which satisfies requirement pyspark==2.4.5 from https://github.com/tekumara/spark/releases/download/v2.4.5-cloud/pyspark-2.4.5.tar.gz
  Removed pyspark==2.4.5 from https://github.com/tekumara/spark/releases/download/v2.4.5-cloud/pyspark-2.4.5.tar.gz from build tracker '/private/var/folders/00/3ln54lf50bv0bhyp38nskdhr0000gn/T/pip-req-tracker-999d4rq7'
Using legacy 'setup.py install' for pyspark, since package 'wheel' is not installed.
Installing collected packages: pyspark
....

The second run looks the same.
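One plausible reading of the log, offered as an assumption rather than a confirmed diagnosis: the 302 hop is never stored, and the 200 response is stored under the signed S3 URL, whose X-Amz-Date and X-Amz-Signature parameters would differ on the next redirect. The toy sketch below uses a hypothetical host, made-up signature values, and a plain dict standing in for the cache, keyed by the full URL as the "Looking up ... in the cache" lines suggest.

```python
from urllib.parse import parse_qs, urlsplit


def signed_url(date, signature):
    """Build a hypothetical signed asset URL (host and values are made up)."""
    return (
        "https://assets.example.invalid/264437006/asset"
        f"?X-Amz-Date={date}&X-Amz-Expires=300&X-Amz-Signature={signature}"
    )


first_run = signed_url("20200729T112206Z", "sig-one")
second_run = signed_url("20200729T113000Z", "sig-two")

# Toy cache keyed by the full URL, query string included.
cache = {first_run: b"tarball bytes"}

# The second redirect target is a different key, so the lookup misses.
hit = second_run in cache

# Both URLs still name the same object; only the signing parameters moved.
a, b = urlsplit(first_run), urlsplit(second_run)
same_object = (a.netloc, a.path) == (b.netloc, b.path)
qa, qb = parse_qs(a.query), parse_qs(b.query)
changed = sorted(k for k in qa if qa[k] != qb.get(k))

print(hit, same_object, changed)
```

If that reading is right, the entry written by "Caching due to etag" can never be found again, which would match the second run reporting no cache entry.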

Hi,

after reviewing the reported issue, IMO there is a limitation in the vendored cachecontrol. CacheControl will cache a redirect only if the response status code is 301, and GitHub responds with 302.

So if I use a named urlspec, pip will use the cached version rather than downloading it again:

$ pip install 'pyspark @ https://github.com/tekumara/spark/releases/download/v2.4.5-cloud/pyspark-2.4.5.tar.gz'
Processing /Users/tekumara/Library/Caches/pip/wheels/4d/8c/99/636fbcc2942d25483272d77c0654cd921907ff73717f7b5627/pyspark-2.4.5-py2.py3-none-any.whl
Collecting py4j==0.10.7
  Using cached py4j-0.10.7-py2.py3-none-any.whl (197 kB)
Installing collected packages: py4j, pyspark

This is using pip 20.2.3. Seems like a pretty workable solution.

Hi, after the feedback in https://github.com/pypa/pip/pull/8960#issuecomment-704294934 and https://github.com/ionrock/cachecontrol/pull/234#issuecomment-710095678, I understand that this response is intended (and expected) to change. So IMO this ticket can be considered not a bug: the current behavior is the expected one, and the issue can be closed, can't it? @uranusjr @pfmoore

Yup. thanks @eamanu! ^>^
