I've followed the instructions to get pip cached successfully.
It's important to understand that the pip cache only affects downloading of packages, not the actual installation. My app is not huge by any means, but still, even with cached downloads, installation takes about a minute or more.
In fact, without pip cache at all, downloads are actually pretty quick.
I had much better performance with this hack:

Notice that I'm caching the actual pip directories. I could hard code the value (I only test on ubuntu), but wanted to stick to the principles of getting the site path.
Still hacky, because of binaries. I need to be able to run pytest, so just caching site-packages was not enough. site.getsitepackages() will give you something like /opt/hostedtoolcache/Python/3.7.6/x64/lib/python3.7/site-packages/. If you restore that directory, pip will know not to reinstall anything.
However, your binaries will be missing from the path. So I traverse out a few dirs and cache the whole thing.
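For readers following along, here is a minimal sketch of the "traverse out a few dirs" step. It is not the snippet from the original comment. The helper name `install_root` and the choice of three levels are assumptions based on the Linux path shown above; other layouts (Windows in particular) will differ.

```python
from pathlib import PurePosixPath

def install_root(site_packages: str, levels_up: int = 3) -> str:
    """Walk up from site-packages to the directory that also contains bin/.

    Hypothetical helper: levels_up=3 matches the Linux hosted-toolcache
    layout quoted in this thread (<root>/lib/pythonX.Y/site-packages).
    """
    path = PurePosixPath(site_packages)
    # parents[0] is the immediate parent, so N levels up is parents[N - 1].
    return str(path.parents[levels_up - 1])

# On a real runner the input would come from site.getsitepackages()[0];
# a literal path is used here so the example is reproducible.
example = "/opt/hostedtoolcache/Python/3.7.6/x64/lib/python3.7/site-packages/"
print(install_root(example))  # /opt/hostedtoolcache/Python/3.7.6/x64
```

The resulting directory holds both lib/ and bin/, which is why restoring it brings back entry points such as pytest.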
Not sure if it's worth documenting this approach, or if there's a better layer-based cache of some sort coming that will make this moot.
But it cut down my pip setup from about 1 minute to 1 second.
Thanks @silviogutierrez, I'll try and get some Python expertise to look at this and see if this approach is valid and will work for all scenarios.
It's "kinda" valid, but it is safe to use only if the OS and Python version are exactly the same. So this has to be accounted for in the cache key.
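As a rough illustration of what "accounted for in the cache key" could look like, the sketch below builds a key from the OS name, the full Python version including the patch level, and a hash of the requirements. The function names and key format are hypothetical. Note that `platform.system()` reports only the OS family, so a runner image upgrade would not change the key.

```python
import hashlib
import platform

def cache_key(requirements_text: str, os_name: str, python_version: str) -> str:
    """Build a key that changes when the OS, Python version, or deps change."""
    # Hash the requirements so any dependency change invalidates the cache.
    digest = hashlib.sha256(requirements_text.encode("utf-8")).hexdigest()[:16]
    return f"{os_name}-py{python_version}-{digest}"

def current_cache_key(requirements_text: str) -> str:
    """Same key, filled in from the interpreter that is actually running."""
    return cache_key(
        requirements_text,
        platform.system(),          # e.g. "Linux" or "Windows"
        platform.python_version(),  # full version, e.g. "3.7.6"
    )
```

With this scheme a bump from 3.7.6 to 3.7.7 produces a different key, so the old cache is simply not restored.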
@pradyunsg can this explode? ^
We cache the .venv folder in GitLab CI and install all dependencies in that folder only. It takes 1 second to verify all dependencies are installed, compared to 40 seconds when following the example. But in GitHub Actions, this works once or twice and then breaks after a cache hit and restore, with the error:
.venv/bin/python: bad interpreter: No such file or directory
@string-areeb this probably means that they upgrade the Python interpreter and the virtualenv you created for another version doesn't fit anymore.
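One possible way to guard against this, sketched under the assumption of a POSIX-style virtualenv (bin/python; Windows uses Scripts\python.exe instead): after restoring the cache, check that the interpreter inside the venv still resolves, and throw the venv away if it does not. Both helper names are made up for illustration.

```python
import shutil
from pathlib import Path

def venv_is_usable(venv_dir: str) -> bool:
    """Return True if the venv's interpreter path still resolves to a file."""
    interpreter = Path(venv_dir) / "bin" / "python"
    # Path.exists() follows symlinks, so a link that points at a removed
    # interpreter (the "bad interpreter" case) counts as missing.
    return interpreter.exists()

def drop_stale_venv(venv_dir: str) -> bool:
    """Delete a restored venv whose interpreter is gone; report if deleted."""
    if Path(venv_dir).is_dir() and not venv_is_usable(venv_dir):
        shutil.rmtree(venv_dir)
        return True
    return False
```

If `drop_stale_venv(".venv")` returns True, the job would recreate the environment from scratch instead of failing on the stale one.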
@webknjaz Thanks a lot. That must be it. We use our own Docker image for GitLab tests, so it always has the same Python version; that's why I didn't notice it here. Thanks again.
It took me several tries to get this working, but I did. There is not much of an improvement. I don't see the minutes to seconds change noted by others.
On Windows, grabbing the whole hostedtoolcache is way too greedy. The cache archive for that is ~4 GB and takes 42 min to create.
I needed to reduce it to caching only the Python tree. After that, builds needed to be triggered twice before the caching worked: once to build and save the Python tools cache, and a second time to build and save what our program adds to the cache when it builds its artifact.
Now restoring the cache takes 4.5 min and all other steps are under a minute each (total 7.5 min). Before I started experimenting with caching, it took ~6 min for all steps. I'll let it stand for a while and see what the average is over several builds.
Maybe I'm doing something wrong though. If you see something awry with the workflow file please feel free to point it out.
@maphew usually only pip's wheel cache should be cached in GHA. Caching the whole interpreter installation is going to create problems every time they upgrade it.
Also, AFAIU the slow part is iterating over individual files. Uploading one file of the same size is faster. So the hack would be manually archiving what you need and caching just that file, and then unpacking that also manually.
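A minimal sketch of that single-file idea, using only the standard library: pack the directory into one compressed archive before the cache step, and unpack it after a restore. The function names and the choice of gzip are assumptions, and this has not been measured against the built-in caching.

```python
import tarfile
from pathlib import Path

def pack(src_dir: str, archive_path: str) -> None:
    """Bundle src_dir into a single .tar.gz so the cache sees one file."""
    src = Path(src_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps only the top-level folder name inside the archive.
        tar.add(str(src), arcname=src.name)

def unpack(archive_path: str, dest_dir: str) -> None:
    """Restore the bundled directory underneath dest_dir."""
    with tarfile.open(archive_path, "r:gz") as tar:
        # Only unpack archives produced by your own pipeline.
        tar.extractall(dest_dir)
```

The cache step would then point at the archive file alone, rather than at the directory with its thousands of entries.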
Please don't cache site-packages or entire interpreter trees. That is fragile (sensitive to Python patch version and OS) and pip should be pretty fast as long as its cache is populated.
If there are instances where pip is still slow, please file an issue on pip's issue tracker (look for an existing one before filing a new one), and we should put in the work to enhance pip. It's not gonna magically happen on its own, but I don't think pushing for fragile optimizations on CI providers is a good alternative approach.
Users who are willing to deal with the consequences of such a caching strategy can do so in their specific CI pipelines themselves. I would be very concerned if this became something that GitHub Actions or Azure Pipelines started suggesting to users, or even made easier by providing a mechanism to do so via an official-esque package.
@pradyunsg and @webknjaz thanks for the comments.
I simulated a CI on my home Win10 machine. Using tar to cache ./Scripts and ./Lib/site-packages instead of relying on pip install -r requirements with a populated pip cache: tar is 17% faster. 120s vs 145s.
So an appreciable difference, but not _that_ much of a difference overall. Windows is just slow when handling thousands of small files (10,000 files in 400 MB in this instance).
@pradyunsg and others: fair enough. Because of that brittleness, my cache broke maybe once every couple of months, when GitHub upgraded Python minor versions.
But then I found this article: https://medium.com/ai2-blog/python-caching-in-github-actions-e9452698e98d
It seems to factor it all in, so maybe that should be the official recommendation.
Credit to @epwalsh for his work.