**Environment**

**Description**

`pip download` does not prefer a package found locally, even if it satisfies the requirements, when a newer version is available at the remote package index.

**Expected behavior**

Prefer the already existing package as long as it satisfies the dependency requirements.

**How to Reproduce**

Create a `pkg_cache` directory, then run:

```
pip3 download --dest pkg_cache/ --find-links pkg_cache/ setuptools==39.0.1 && pip3 download --dest pkg_cache/ --find-links pkg_cache/ setuptools
```
**Output**

```
pip3 download --dest pkg_cache/ --find-links pkg_cache/ setuptools==39.0.1 && pip3 download --dest pkg_cache/ --find-links pkg_cache/ setuptools
Looking in links: pkg_cache/
Collecting setuptools==39.0.1
  Using cached https://files.pythonhosted.org/packages/20/d7/04a0b689d3035143e2ff288f4b9ee4bf6ed80585cc121c90bfd85a1a8c2e/setuptools-39.0.1-py2.py3-none-any.whl
  Saved ./pkg_cache/setuptools-39.0.1-py2.py3-none-any.whl
Successfully downloaded setuptools
Looking in links: pkg_cache/
Collecting setuptools
  Using cached https://files.pythonhosted.org/packages/7f/e1/820d941153923aac1d49d7fc37e17b6e73bfbd2904959fffbad77900cf92/setuptools-39.2.0-py2.py3-none-any.whl
  Saved ./pkg_cache/setuptools-39.2.0-py2.py3-none-any.whl
Successfully downloaded setuptools
```
That behaviour is by design. Pip will always prefer the latest available version, it takes no account of where a package comes from.
@pfmoore I see. We have multiple requirement files, and since pip does not handle double requirements it is necessary to do multiple calls to `pip download`, one for each requirements file. With the current behavior of pip, where one file has `setuptools` and another has `setuptools==39.0.1`, both `39.0.1` and `39.2.0` will be downloaded.
So? That's the point of `pip download`. I don't know if I'm missing something here but I can't see what the problem is. What exactly do you use the files downloaded via `pip download` for? As per the docs, the intention is that you use `pip download` to populate a directory from which you can later use `pip install --find-links` to do an install while offline. The `pip install` command is perfectly capable of handling a `--find-links` directory with multiple versions of the same package in it, so why are you bothered that this is happening?
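For reference, the workflow described in the docs looks roughly like this (paths are illustrative):

```
# Populate a local directory with everything the requirements need
pip download --dest ./wheelhouse -r requirements.txt

# Later, install entirely from that directory, without consulting any index
pip install --no-index --find-links ./wheelhouse -r requirements.txt
```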
The point is that consistency is useful. Things that behave differently all the time are less useful than things that do the same thing every time. Had pip supported multiple requirement files and dealt properly with the dependencies, this wouldn't be a problem though.
With two requirement files, as explained earlier, you never actually know exactly what package versions will be downloaded.
> The pip install command is perfectly capable of handling a --find-links directory with multiple versions of the same package in it, so why are you bothered that this is happening?
Depending on the order of the requirement files you provide, different package versions are installed. Consistency is key.
The second reason is speed. By looking locally and finding a package that satisfies the dependencies, there is no need to check remotely. A call to `pip download` would therefore be blazing fast if the packages are already downloaded. Currently it's very slow.
I'm not sure I follow. Pip's current behaviour is perfectly consistent - I described it above:

> Pip will always prefer the latest available version, it takes no account of where a package comes from.

In fact, if we preferred local files, we'd be harming consistency, because you'd get something different installed depending on what was present locally.

I don't see anything actionable here. Pip's current behaviour is by design; if you want to propose a change, you'll need to provide details of what you propose, and you'll probably need more persuasive arguments than you've currently offered.
> I'm not sure I follow. Pip's current behaviour is perfectly consistent - I described it above:

True. It's consistent in that you never know which version it will download in the scenario I describe.

> In fact, if we preferred local files, we'd be harming consistency, because you'd get something different installed depending on what was present locally.

The whole point is to know exactly what will be installed based on the local files. But having pip download the same package versions each time is not possible with multiple requirement files, as I describe.
I agree that the current default behaviour shouldn't be changed, but an option to be able to prefer local packages over checking remotely would still be useful.
What I propose is an option that makes pip check whether a package satisfying the given dependency already exists locally, and if so, not check remotely.
OK, so what you're suggesting is an option to `pip download` that says "for each requirement, if it can already be satisfied from the destination directory, skip it, otherwise download the requirement as normal and store the downloaded file in the destination directory."

I can see the logic in that. If you wanted to create a PR implementing it, I'm not going to object. I can't say that I find your justification for the behaviour compelling, but that's something that can be debated later, when there's a PR to review.
This would also be very useful for HPC clusters on which the staff may build python wheels that are optimized for their CPU architecture. The current behavior requires HPC staff to always be recompiling new versions as soon as they are out, or risk users using dramatically slower python packages in some situations. Being able to tell pip to favor a local wheelhouse over some minor version increase found online would be very useful to us.
@bendikro Any news/updates on this? This would be very useful for us.
An example of this causing an issue in practice:

Let's say I'm using python 2.7. Matplotlib 3 supports only python 3.5+. If I install a package that has `matplotlib>=2.0` as a requirement (e.g. `scikit-image`), then even though I have matplotlib 2.x installed locally, pip will try to download and install matplotlib 3.x, which of course will fail.
I've created a PR suggesting a new option `--prefer-local-compatible`: https://github.com/pypa/pip/pull/6023
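Assuming the option lands as proposed there, usage would presumably look something like this (semantics as described earlier in this thread, not final):

```
# Hit the remote index only for requirements that cannot already be
# satisfied by a file sitting in pkg_cache/
pip download --dest pkg_cache/ --find-links pkg_cache/ --prefer-local-compatible -r requirements.txt
```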
I remain unconvinced that this is a good idea, but as I said above, if someone else feels it's worth taking this forward, I won't object.
For the record, though, my objection to this isn't so much that it's difficult to implement or explain the basics of the proposed behaviour, it's more about the maintenance burden: people will inevitably ask for the same option on `pip install`, which is something I remain strongly against, as I noted above. Having to repeatedly argue against such requests is going to be a drain on developer resources.

Wait... this _is_ for `pip install`, that's the whole point of it. Isn't it?
Absolutely not. See all of my previous comments about why this should not be added to `pip install`.

However, I've just noticed that the PR adds the option to `pip install`. I'll register my objection to that on the PR, as well.
I disagree. As the manager of an HPC cluster and a very comprehensive wheelhouse, we strongly want that for install too.
Wheels downloaded from online repositories break or under-perform way too often.
> I'm not sure I follow. Pip's current behaviour is perfectly consistent - I described it above:
>
> > Pip will always prefer the latest available version, it takes no account of where a package comes from.
>
> In fact, if we preferred local files, we'd be _harming_ consistency, because you'd get something different installed depending on what was present locally.
>
> I don't see anything actionable here. Pip's current behaviour is by design, if you want to propose a change, you'll need to provide details of what you propose, and you'll probably need more persuasive arguments than you've currently offered.
"Always prefer the latest available version" is the complete opposite of consistency. It means that any two successive installation will yield different results, even when performed on the exact same host.
For the sake of argument, let's define consistency.

My definition of something consistent: something is consistent when it yields the same result when executed

1) at two different times
2) in two different places

Getting both 1) and 2) is very hard. It basically requires having the whole software stack/operating system managed by the same system. This is never going to be achieved by pip alone, and is - as far as I know - only managed by NixOS (https://github.com/NixOS/nixpkgs).

Getting 2) cannot possibly be achieved without having 1), unless you are executing things at the exact same time or unless you pin down the version of every package you install.

What is left is 1). Current pip behaviour does not give 1) at all. If I install packages today, I will get widely different versions than what I installed 6 months ago.

Item 1) can, however, be achieved assuming there is a local set of packages that are fixed/supported. This is the case on our HPC clusters. We also achieve 2) as long as users remain on our infrastructures (multiple clusters).

However, both 1) and 2) are jeopardized by the current pip behaviour and the lack of ability to tell pip that the packages available in our wheelhouse are preferred over the more recent versions that can be downloaded.
Sigh. I guess we're simply going to have to disagree. Pip has mechanisms (version pinning, hash checking) to satisfy your requirement (1). Just because you choose not to use them, or because they don't work easily in your particular situation, doesn't mean pip can't do that. Nor does it mean that pip needs another method of doing the same thing.
I remain -1 on this whole proposal, and you've pretty much convinced me that accepting the option for `pip download` will set a precedent that will make it impossible to resist demands that we add it to `pip install`. So I'm no longer willing to make an exception for `pip download`.
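For reference, the pinning and hash-checking mechanisms referred to above are existing pip features; a minimal sketch of how they are typically combined (file names and the wheel path are illustrative):

```
# Pin exact versions...
pip freeze > requirements.txt
# ...record a hash for each pinned file (paste the output into requirements.txt
# as a --hash=sha256:... annotation on the matching requirement line)...
pip hash ./wheelhouse/requests-2.18.3-py2.py3-none-any.whl
# ...then refuse to install anything that is not pinned and hashed
pip install --require-hashes -r requirements.txt
```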
It's not that I choose not to use them, it's because nobody (i.e. package developers) ever does.
I guess that an option `--try-no-index`, which would try to install/download/update without considering indexes first and go to the index only if it did not work, would get the same -1 from you too, @pfmoore?
I would like to note that at work we always use version pinning and/or constraint files; it's simply insane not to have that in place in production environments where consistency is a must.
Also, I wonder if the "strategy" option for `pip install -U` would make sense for download, downloading only as needed to fulfill the requirement set.
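For reference, the install-side option being alluded to is presumably `--upgrade-strategy`; a download equivalent would mean fetching only what the requirement set actually needs:

```
# Existing behaviour on the install side: only upgrade a dependency
# if the requirement set cannot be satisfied by what is already installed
pip install --upgrade --upgrade-strategy only-if-needed -r requirements.txt
```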
Yeah, advanced HPC users will use version pinning. But the average user won't. When managing an HPC cluster, you are dealing with thousands of users who know very little about good practices. Any small step you can take to reduce the amount of rope with which they can hang themselves means tickets and problems avoided.
> Absolutely not. See all of my previous comments about why this should not be added to `pip install`.
>
> However, I've just noticed that the PR adds the option to `pip install`. I'll register my objection to that on the PR, as well.
@pfmoore I must admit I did not understand from the previous discussion that you were so strongly against having this option for `pip install` as well. I understood the earlier discussion to be about changing the default behavior of pip, which I agree is a very bad idea.

Due to the additional interest in this ticket, I wanted to put together a PR with a prototype implementation of the new option. Including the option for `pip install` was not due to actively ignoring your comments, but simply because it can be useful for the `install` command as well, as commented by @mboisson.
OK, fair enough. I still don't see sufficient benefit in this change to justify the cost, though.
Just as a question, why don't you use something like a local devpi instance that serves your "local" files, but if there are no local files for a package falls back to PyPI? I'm pretty sure devpi can do things like this (and if it's not the default behaviour there is a plugin system that lets you customise the behaviour). Or just simply write a small webapp that serves an index that behaves as you want it to?
I was unaware of devpi, but running a web server is not an option. We can't run a web server on an HPC cluster, and compute nodes on which jobs run and pip may be called don't necessarily have access to the web. We want the packages we serve to be available without needing web access. It currently works nicely by just having a directory containing the wheels which is accessible on our filesystems, and configuring `find-links` to point to that directory in the `PIP_CONFIG_FILE`.
The only caveat, as is being discussed, is that it will not limit itself to whatever it found first in the directory pointed to by `find-links`, even if that matches the requirements. We even had to globally tell pip that our system is not `manylinux1` compatible, because those wheels were taking precedence over our locally compiled wheels even when we had the most up-to-date version (it considers `manylinux1` to be "more recent" than `linux`).
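For anyone in the same situation: the usual way to declare a system not `manylinux1` compatible (per PEP 513) is a tiny `_manylinux` module importable by the cluster's default Python; presumably something along these lines was used here:

```python
# _manylinux.py -- placed on the interpreter's default sys.path.
# Per PEP 513, pip imports this module (when present) and, if the attribute
# below is False, stops treating manylinux1 wheels as compatible with the host.
manylinux1_compatible = False
```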
> Wait... this _is_ for `pip install`, that's the whole point of it. Isn't it?
The initial reason I wanted to change the behavior is to make `pip download` faster by avoiding lookups to remote indexes when not needed.

There are two cases where _not needed_ can apply:

1) When a local package satisfies the requirement, e.g. `requests>=2.18.3` where version `2.18.3` exists locally but is not the latest.
2) When a local package satisfies a pinned requirement, e.g. `requests==2.18.3` where version `2.18.3` exists locally.

Point 1 conflicts with the current pip behavior: `--prefer-local-compatible` causes the package version found locally to be installed even when newer versions are available on the remote index.

Point 2 does _not_ differ from the current pip behavior; the result, i.e. the installed packages, is the same.

Currently, even with pinned versions for all packages, where all packages exist locally, `pip download` still retrieves the available package versions from the remote indexes.

With a requirements file of 90 packages, `--prefer-local-compatible` reduces the `pip download` time from ~26 to ~7 seconds.
> OK, fair enough. I still don't see sufficient benefit in this change to justify the cost, though.
>
> Just as a question, why don't you use something like a local devpi instance that serves your "local" files, but if there are no local files for a package falls back to PyPI? I'm pretty sure devpi can do things like this (and if it's not the default behaviour there is a plugin system that lets you customise the behaviour). Or just simply write a small webapp that serves an index that behaves as you want it to?
I'll try to explain our use case.
We have a multitude of projects that rely on different virtual environments for different tasks, e.g. system tests, unit tests, running various python scripts, etc.
We used to have multiple requirement files containing only the strictly necessary requirement specifications for each virtual environment. However, due to the issue mentioned above, we ended up generating one requirement file with pinned package versions for each virtual environment instead.
Whenever the requirements file changes, or the virtual environment is removed (make clean), the required package versions are first downloaded to a local cache directory with `pip download`, and then the virtual environment is created from these packages.

`pip download` is run quite frequently to ensure all the required packages are available, which currently takes more time than strictly necessary when most or all of the packages are already in the cache directory.

There is not one set of specific package versions we use for all the projects; rather, each project and each virtual environment has a set of requirements with pinned versions. Therefore, running a devpi instance is not very convenient.
> I guess that an option `--try-no-index`, which would try to install/download/update without considering indexes first and go to the index only if it did not work, would get the same -1 from you too, @pfmoore?
@mboisson How does that differ from the `--prefer-local-compatible` option proposed in #6023?
@bendikro, it does not need to define a new "local" concept and check for `file:`, so it reuses more mechanisms that are already in place.
This thread is getting very confused. I suspect we're hitting a case of the XY Problem.
The original statement of the problem here was that "pip download does not prefer package found locally even if it satisfies the requirements when there is a newer available at the remote package index". That's not a problem, because that's not how `pip download` is defined to work. So taking a very naive viewpoint, this issue can be closed as "not a problem - user had misunderstood how pip works". But that's not very helpful.
It's possible that in attempting to solve an issue in their local environments, @bendikro and/or @mboisson have identified that if pip preferred "local" files over "remote" ones, then they could use that to solve their problem. That's fine, but as noted, it's not how pip works.
Rather than proposing that pip gets changed to work the way you wish it would in order to implement the solution you'd thought of, can I suggest that we go back to the underlying problems? If you raise one or more new issues describing what your underlying problem is, maybe we can either find a solution using pip as it currently works, or we can identify a change to pip that doesn't have the difficulties that "prefer local files" does but still helps address the problem.
(Disclaimer: My personal feeling is that there's likely an acceptable solution using pip as it stands, maybe with some local environment config changes, or with a process change in how you're working. What I've understood of the underlying problems so far doesn't seem like it's something that needs a pip change. But I may be wrong.)
@pfmoore How would you solve, with current pip, that users need to install the latest version in a specific wheelhouse even if there's a more recent version on PyPI? Ideally, the user only needs to `pip install <name>`.

For example, in the local wheelhouse there's matplotlib v3.0.1, but on PyPI there's v3.0.2. The v3.0.1 is the _preferred candidate_ to be installed. As mentioned by @mboisson, `--find-links` and `--no-index` are already known and used.
@ccoulombe Pin the version. "Ideally, the user only needs to `pip install <name>`" isn't a requirement, just a preference. Or if you don't want to specify the version, `--no-index`.
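Concretely, those two workarounds would look roughly like this for the matplotlib example above (the wheelhouse path is illustrative):

```
# Either pin the version the wheelhouse actually provides...
pip install "matplotlib==3.0.1" --find-links /path/to/wheelhouse

# ...or skip PyPI entirely and resolve only against the wheelhouse
pip install --no-index --find-links /path/to/wheelhouse matplotlib
```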
@pfmoore ok, let me roll back to our problems.

Problem 1) We build wheels optimized for our clusters' CPU architectures, and those should be preferred over what can be downloaded, because wheels from online repositories break or under-perform way too often.

Problem 2) Our users will simply `pip install X`. They will not pin versions, especially when they don't know that we have built specific versions for them. When they don't and it fails, they will contact our support and create an undue workload on our staff.

Problem 3) As soon as a newer version is released, `pip` will try to download it from online unless we specify `--no-index`, even if that newer version is not needed by the requirements.

Problem 4) If we set `--no-index` in the `PIP_CONFIG_FILE`, `pip` won't even attempt to download packages, even those that are pure Python and would work just fine. This means that we would have to host a complete repository of all possible Python packages, which is just unwieldy.

Please suggest a solution that solves all 4 problems that does not equate to "prefer locally built packages".

Can we somehow tell pip to not ever download binary (i.e. compiled) packages from online repositories? Pure Python packages are usually alright.
@mboisson Thanks for the clarification. It'll take me a while to digest that but I appreciate the explanation.

> Can we somehow tell pip to not ever download binary (i.e. compiled) packages from online repositories

Yes - `--no-binary :all:`.
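For reference, that looks like the following; note that the format control applies to every source, including `--find-links` directories, which is the catch raised in the next comment:

```
# Refuse all wheels, from any source; pip will only consider sdists
pip download --no-binary :all: -r requirements.txt
```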
> @mboisson Thanks for the clarification. It'll take me a while to digest that but I appreciate the explanation.
>
> > Can we somehow tell pip to not ever download binary (i.e. compiled) packages from online repositories
>
> Yes - `--no-binary :all:`.
`--no-binary :all:` will also block binary packages that are hosted in our wheelhouse (i.e. found through `--find-links`).
I also realize that my question was not precise enough.

Can we somehow tell pip not to ever download binary packages, nor their source equivalent (i.e. only ever download pure Python packages)? Not downloading the binary version of `numpy`, for example, is no better, because pip will download the source version and try (and fail) to compile it optimally.
@mboisson If I follow that set of requirements, the only control you have over what options pip sees when run by your users is the global configuration file?
Also, you stated earlier that running a local index wasn't an option, but I don't see anything in your problem statement that precludes it. And my immediate thought when seeing your requirements is that PyPI is not a good fit for your requirements, and running a local index (that passes through to PyPI when appropriate) is exactly the solution that other environments I've heard of with similar constraints tend to use...
> Can we somehow tell pip not to ever download binary packages, nor their source equivalent (i.e. only ever download pure python packages)?

I think we're confusing each other here. What do you mean by "download"? From PyPI? If that, then only by using `--no-index` and hosting a local index (which is why I think that's the most appropriate solution for you).
> @mboisson If I follow that set of requirements, the only control you have over what options pip sees when run by your users is the global configuration file?
>
> Also, you stated earlier that running a local index wasn't an option, but I don't see anything in your problem statement that precludes it. And my immediate thought when seeing your requirements is that PyPI is not a good fit for your requirements, and running a local index (that passes through to PyPI when appropriate) is exactly the solution that other environments I've heard of with similar constraints tend to use...

Correct, the only control we have over what options pip sees is the global configuration file.

Running an index which requires running a server is not an option. Having some sort of script that `pip` would query locally, without requiring a web server, to figure out whether it's redirected to PyPI or the local repository, could work.
> > Can we somehow tell pip not to ever download binary packages, nor their source equivalent (i.e. only ever download pure python packages)?
>
> I think we're confusing each other here. What do you mean by "download"? From PyPI? If that, then only by using `--no-index` and hosting a local index (which is why I think that's the most appropriate solution for you).

I mean that unless it's a pure Python package, act as if you were using `--no-index` (i.e. just look in our local repository).
> Running an index which requires running a web server is not an option.

Hmm, I'd like to say "why not?" but I'll accept that as a fact for now. In which case, you could (note, this is untested!) write a script that grabs https://pypi.org/simple and modifies it so that links to packages you have locally point to your local copies, and put that somewhere you can reference via a file: URL (which you can then treat as a repository index via `--index-url` - PEP 503 doesn't mandate that an index is available via HTTP).

You'll need to refresh your local index regularly, but that's a cost of not being able to support a web server (which could do the refresh on the fly).

In spite of agreeing to accept "not able to use a web server" as a constraint, I'd also like to point out that you could put an index on an external site like Heroku - after all, your users can access the internet, so it's not like they couldn't access that as an index...
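A rough sketch of the `file:` URL idea (untested, as noted above; paths and the helper script itself are made up for illustration): generate a static PEP 503 "simple" layout on the shared filesystem from the wheelhouse contents, then point `index-url` at it.

```python
# build_simple_index.py -- hypothetical helper: build a minimal PEP 503 "simple"
# index on the shared filesystem so pip can use it via --index-url file://...
import html
import pathlib
import re

WHEELHOUSE = pathlib.Path("/the/local/shared/filesystem/wheelhouse")
INDEX_ROOT = pathlib.Path("/the/local/shared/filesystem/simple")


def normalize(name: str) -> str:
    # PEP 503 project name normalization
    return re.sub(r"[-_.]+", "-", name).lower()


# Group wheel files by normalized project name; wheel filenames start with the
# distribution name, with dashes escaped as underscores.
projects = {}
for wheel in WHEELHOUSE.glob("*.whl"):
    name = normalize(wheel.name.split("-")[0])
    projects.setdefault(name, []).append(wheel)

INDEX_ROOT.mkdir(parents=True, exist_ok=True)

# Top-level page listing every project
links = "\n".join(f'<a href="{name}/">{name}</a><br/>' for name in sorted(projects))
(INDEX_ROOT / "index.html").write_text(f"<html><body>\n{links}\n</body></html>")

# One page per project, linking straight to the files on the shared filesystem
for name, files in projects.items():
    project_dir = INDEX_ROOT / name
    project_dir.mkdir(exist_ok=True)
    file_links = "\n".join(
        f'<a href="{f.resolve().as_uri()}">{html.escape(f.name)}</a><br/>'
        for f in sorted(files)
    )
    (project_dir / "index.html").write_text(
        f"<html><body>\n{file_links}\n</body></html>"
    )
```

Users would then get `index-url = file:///the/local/shared/filesystem/simple/` from the global config. The pages need regenerating whenever wheels are added, and this sketch does not cover the "fall back to PyPI for everything else" part of the suggestion.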
> I mean that unless it's a pure python package act as if you were using --no-index (i.e. just look in our local repository).

How do you detect that it's pure Python? That's not possible without building the package (as some packages have _optional_ C extensions).
> > Running an index which requires running a web server is not an option.
>
> Hmm, I'd like to say "why not?" but I'll accept that as a fact for now. In which case, you could (note, this is untested!) write a script that grabs https://pypi.org/simple and modifies it so that links to packages you have locally point to your local copies, and put that somewhere you can reference via a file: URL (which you can then treat as a repository index via `--index-url` - PEP 503 doesn't mandate that an index is available via HTTP). You'll need to refresh your local index regularly, but that's a cost of not being able to support a web server (which could do the refresh on the fly).

That's an interesting idea. We'll think about it.
> In spite of agreeing to accept "not able to use a web server" as a constraint, I'd also like to point out that you could put an index on an external site like Heroku - after all, your users can access the internet, so it's not like they couldn't access that as an index...

In some cases, our users do have Internet access, but in others they don't (i.e. when they are running on compute nodes on the cluster). So running a web server would be an option for some cases, but it would require keeping two distinct solutions, with the risk that they would eventually diverge in the list of packages they provide.
> > I mean that unless it's a pure python package act as if you were using --no-index (i.e. just look in our local repository).
>
> How do you detect that it's pure Python? That's not possible without building the package (as some packages have _optional_ C extensions).

I'd say that anything for which the wheel is `cp27-cp27mu-linux_x86_64` (or similar for other versions of Python) is compiled? Although I guess that's hard to test without trying to install it?
> In some cases, our users do have Internet access, but in others they don't

Well, if they don't, they can't access PyPI so problem solved :smile:

> I'd say that anything for which the wheel is cp27-cp27mu-linux_x86_64 (or similar for other versions of python) is compiled?

But you said you didn't want to try to compile sdists that aren't "pure Python" either? There's no way of telling whether a sdist is "pure Python".
> > In some cases, our users do have Internet access, but in others they don't
>
> Well, if they don't, they can't access PyPI so problem solved :smile:

Well, yes, if they can still access our local index (and don't need packages that aren't in there), which they can't if it's a web server.
> > I'd say that anything for which the wheel is cp27-cp27mu-linux_x86_64 (or similar for other versions of python) is compiled?
>
> But you said you didn't want to try to compile sdists that aren't "pure Python" either? There's no way of telling whether a sdist is "pure Python".

"Pure Python" packages will usually end up as a `{py2,py3,py2.py3}-none-any.whl`, no?
> Well, yes, if they can still access our local index (and don't need packages that aren't in there), which they can't if it's a web server.

... and again we hit something that I don't follow. You say they don't have "internet access". What precisely do you mean by that? No access to any sort of IP connection other than the local machine? Or no access outside of the local network? How do they currently access your local index? As a shared filesystem? There's no reason that wouldn't still be possible (all of my suggestions have only been about hosting an _index_ on a web server - the actual distribution files themselves would remain on the local filesystem).
You could have the global config set up as:

```
# For people with no web access at all
find-links = /the/local/shared/filesystem

# For people who can access PyPI (and hence the Internet).
# This service points to the local shared filesystem for projects you want to serve locally,
# and PyPI for all other projects. It can be a web service as shown here, or probably just
# a file: URL pointing to a PEP 503 format simple index, if you're willing to handle
# regularly refreshing the HTML pages.
index-url = https://our.local/index
```
That's as far as I can reasonably go designing this for you - you'll need to do some work yourself to fill in the blanks, but hopefully it's enough to give you the idea.
"pure python" packages will usually end-up as a
{py2,py3,py2.py3}-none-any.whl
, no?
End up as, yes. But at the point we're trying to make a decision, they are just .tar.gz
sdists.
> > Well, yes, if they can still access our local index (and don't need packages that aren't in there), which they can't if it's a web server.
>
> ... and again we hit something that I don't follow. You say they don't have "internet access". What precisely do you mean by that? No access to any sort of IP connection other than the local machine?

Yes.

> Or no access outside of the local network? How do they currently access your local index? As a shared filesystem?

Yes, shared filesystem.

> There's no reason that wouldn't still be possible (all of my suggestions have only been about hosting an _index_ on a web server - the actual distribution files themselves would remain on the local filesystem).

Yes, a local index file sounds like it could work.
Mmm, I believe that our issue has been addressed by the now merged `--prefer-binary` option from issue https://github.com/pypa/pip/issues/3785.

Also, a colleague mentioned the option of having a constraints file, which can be set globally in a `PIP_CONFIG_FILE`, and in which one could exclude any version of a package that is more recent than the locally available version.
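A sketch of that constraints idea (the file location and pins are illustrative; the pinned versions would be whatever the wheelhouse currently holds):

```
# /the/local/shared/filesystem/constraints.txt
# Exclude anything newer than what the wheelhouse provides
numpy<=1.15.4
matplotlib<=3.0.1
```

The file can then be referenced from the pip config pointed to by `PIP_CONFIG_FILE`, e.g. `constraint = /the/local/shared/filesystem/constraints.txt` under the `[install]` section, so every `pip install` picks it up without users having to pass `-c` themselves.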