Carrying forward discussion from #8210
The discussion concerns a constraints.txt file with the following content:
packaging @ git+https://github.com/pypa/[email protected]
To me, this means: whenever `packaging` is requested, use this link I provided. Further version specifiers should still apply (and fail resolution if they are not satisfied by the link target). The behaviour would be similar to `pip install <packages-i-want> "packaging @ git+https://github.com/pypa/[email protected]"`, but without `packaging` being marked as `user_requested` (unless it is listed in `<packages-i-want>`, of course).
I think of it as constraining the location from which distributions can be downloaded, just as a version expression like >=20.1 would constrain the version(s) that could be downloaded.
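As a purely illustrative sketch of that reading (the `<some-ref>` placeholder and the file names are made up here):

```
# constraints.txt -- constrain where "packaging" may come from, if it is needed at all
packaging @ git+https://github.com/pypa/packaging@<some-ref>
```

followed by something like `pip install -r requirements.txt -c constraints.txt`: the git link would only be used if `packaging` is actually pulled in by `requirements.txt` or its dependencies, and `packaging` itself would not be marked as user-requested.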
OK. I have done a more detailed investigation into the way that we'd implement `NAME @ URL` style constraints. The idea in principle is straightforward, but it has a number of interactions with other features of pip that make the overall behaviour far more complex.
The current implementation will have behaviours for all of these points that arise "by accident" as a result of the implementation choice to treat constraints as "just requirements that we don't install". The new resolver doesn't have that option, because the underlying model is completely different.
It's also true that there are likely reasonable choices to be made as to "how should this interaction work". It's not necessarily hard to choose a "correct" behaviour for any of the following points. But the implementation is decidedly non-trivial in the new resolver's existing model, so the cost of these features is disproportionately high, and we don't really have any idea at the moment if there's any actual use cases for these sorts of interactions (that's actual use cases as opposed to "yes I can see how this might be useful" 🙂)
If a constraint said `foo @ https://some/url` and a requirements file said `foo --hash=xxx`, what should happen? Should we hash-check the URL? Is it of any practical use? Should we select one of multiple URL constraints based on the hash? There are quite likely other questions. These are basically just some that came up in the process of trying to see how to implement this and saying "ouch, that would be a problem" a few times 🙂
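Concretely, the conflicting inputs in that question would be something like the following (both files and the digest are invented):

```
# constraints.txt
foo @ https://some/url

# requirements.txt (hash-checking mode)
foo --hash=sha256:<expected-digest>
```

It isn't obvious whether the artifact behind the URL should be downloaded and checked against that hash, or whether the combination should simply be rejected.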
As I said, it's almost certainly possible to come up with a well defined spec for this feature. And maybe that spec would even match the current behaviour. But we don't have that spec right now, and no-one has really done any of the work needed to come up with one.
So I propose that we consider URLs as constraints as "deprecated pending work being done to properly define the feature"[1], and assume that they won't be supported in at least the initial version of the new resolver. People currently using URLs in constraint files will need to either stick to the old resolver, or find a workaround.
The only actual user of URLs as constraints that I am aware of is @sbidoul, who explained his use here. Maybe there are (more or less clumsy) workarounds that could be used instead? @dhellmann seemed OK here with not including this in the new resolver, at least initially.
To reiterate, this is not a blanket refusal to accept this feature. It's simply noting that we need to clarify the design in order to rewrite it for the new resolver, and that we're not blocking the release of the new resolver on this feature.
[1] This has horrible echoes of the dependency links mess. I really want to not have that happen again, so I want to be very clear that the plan this time is to drop the feature without any replacement if necessary, and not to leave it hanging around forever "until we have the replacement defined".
@pfmoore don't the edge cases you mention also arise when direct URLs are used as regular install-requires?
@sbidoul I have no idea to be honest. I've not spotted any test failures that do that, but if there are any, then yes, that's something we'd need to address (and having addressed it, we may well then have a better chance of implementing URLs as constraints).
OTOH I very much doubt there are robust tests or user experience on this. The legacy resolver handles non-user-specified direct URLs very poorly, so there are likely many edge cases currently precluded by other restrictions. I’m +1 on delaying thoughts on this until we are able to make the new resolver widely available for a while, to let the users come up with crazy cool use cases that expose considerations we missed.
Agreed, this whole area is really emergent behaviour from a set of features that were implemented independently and never really considered in combination.
This is probably something that needs to be covered under #6536 (new resolver rollout), which is something that needs to be prioritised at some point. There will be areas where the new resolver works differently, that either aren't covered by tests or where we've made a decision to keep a difference, and we need to look at how we get feedback on that.
I had opened an issue here https://github.com/pypa/pip/issues/8757
Just wanted to add a mention here that it is very helpful to be able to provide, as a constraint, the URL to install a package from while waiting for a contributed patch to be applied by upstream and released. I.e. something like `https://github.com/some/repo/archive/bugfix-issue-foo.tar.gz#egg=somepackage` to allow installing the bugfix for package "somepackage" until upstream can cut a new release. Otherwise all packages have to be vendored and renamed, which is very complicated and duplicates a lot of work.
Right now if you drop `https://github.com/some/repo/archive/bugfix-issue-foo.tar.gz#egg=somepackage` in a constraint file and then install a requirements.txt that lists `somepackage`, you get `https://github.com/some/repo/archive/bugfix-issue-foo.tar.gz`. Please do not remove that workflow if you are able to keep it in some form.
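Spelled out, the workflow is roughly this (names and URLs are the made-up ones from above):

```
# constraints.txt
https://github.com/some/repo/archive/bugfix-issue-foo.tar.gz#egg=somepackage

# requirements.txt
somepackage
```

With the legacy resolver, `pip install -r requirements.txt -c constraints.txt` then installs `somepackage` from the patched archive rather than from the index.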
@thenewguy Thanks for your input. I've seen a few people describe use cases for URLs as constraints that all seem to have the same underlying motivation, and I'm fairly clear that we should have this feature in some form.
However, the key here is defining the details. I posted above a list of questions about how URLs as constraints should interact with other features of pip, and so far, no-one has come up with any answers. I don't personally have any intuition on what the answers should be, so it's essentially impossible to implement the feature for the new resolver. To be clear, the existing implementation is tightly linked to the old resolver code - it's not a case of "keeping what's there", we absolutely have to do a complete re-implementation, and as such, we need to know what to implement.
Please do not remove that workflow if you are able to keep it in some form
As I say, we aren't able to keep it. What we can do is to re-implement it. But to do that we need to understand what to implement - and "what the old implementation does" isn't even meaningful in terms of the new resolver...
Gotcha - well in my experience - this is typically used for temporary bugfix releases. It sounds like there is a lot more to consider than my knowledge of pip's internals allows, but imo, if a named package URL is provided, it should be forced and installed regardless of any version specifier. If that is too much, perhaps failing if the version doesn't match. I've never used it in combination with strict versioning (because, as a bugfix, there really isn't a version to it, since I am not in control of the release process for upstream). So perhaps, conceptually, it is an override rather than a constraint. I've not come across another way to do this.
However, it is also handy when working with a pre-release package that isn't published somewhere like pypi yet. Although this case is easier to work around than the previous use.
I don't use the feature, but it sounds like maybe one interpretation of how others use it is to (partially?) disable the resolver for a specific dependency and use the given distribution as the only available location and version. Maybe that implies some other behaviors about how that affects the other dependencies, I'm not sure. For example, should the resolver assume that the distribution at the given URL meets the version requirements of anything that depends on it, even if perhaps the version number in the distribution metadata says otherwise?
That would be kind of a "just do what I tell you" mode, which given the conversation elsewhere about how to reduce the scope of pip itself and encourage an ecosystem of surrounding tools, might make a lot of these sorts of features buildable outside of pip. For example, if pip could report the results of resolving a requirement set by producing a list of URLs without installing them (maybe it can do this already, or it means building another tool on top of the resolver library?), then someone could build that full list from a partial requirements.txt, edit the results to replace the URL for the dependency they are patching, and then install from that list of URLs without resolving dependencies at all (with pip, or something else) to get exactly what they want. That would give the user the convenience of pip helping to build the full dependency list, and mean that PyPA developers wouldn't have to entertain every variation of dependency management that anyone can come up with.
The “do as I tell you” usage is more difficult to implement, and should be discussed separately IMO (#8076 covers it). Constraints are currently implemented as adding to existing requirements. This makes them fit more nicely with the rest of the dependency resolution logic, and relatively easier to make sense of. This is enough for most use cases in practice as well, since existing dependencies are correct most of the time, and a URL constraint just needs to add additional information to tell the resolver to specifically use that one particular artifact, much like version constraints do.
if a named package URL is provided, it should be forced and installed regardless of any version specifier
Conceptually, that behaviour doesn't feel like a "constraint" to me, but more like an "override". I think it's important that constraints do what the name suggests, for both discoverability and understandability reasons. Overrides are being discussed separately in #8076 as @uranusjr says.
For me, conceptually, a URL constraint says "you can only get this project from this specific URL. If what's there doesn't satisfy the dependencies pip has calculated, you're out of luck".
To be fair, that logic mostly answers my above questions:
For the last question, "should we reject the edge cases that we don't support" I would be pragmatic and say that if we end up in a code branch that the above logic doesn't give us an answer for, we fail with an error saying it's unsupported. On the other hand, we don't go out of our way to check for special cases, and we document the intention and explicitly state in the documentation that any other usage is not supported, and behaviour is undefined and may change or be removed without notice.
If that's an acceptable approach, we can go ahead on that basis. (However, note that the period of paid work on developing the new resolver has completed now, so implementing this will be done on volunteer time - personally, I'd like to look at it but I don't know when I'll next have time to do so).
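To make the semantics described above concrete, here is a rough Python sketch of the rule being proposed — purely illustrative pseudocode, not pip's actual internals or API (all names here are hypothetical):

```python
def candidates_for(project, version_specifier, index_candidates, url_constraint):
    """Apply a URL constraint: the project may only come from that URL.

    `index_candidates` is whatever the finder would normally offer;
    `url_constraint` is the single artifact named in the constraints file,
    or None if the project is not URL-constrained.
    """
    if url_constraint is None:
        # No URL constraint: normal resolution over index candidates.
        return [c for c in index_candidates if version_specifier.contains(c.version)]

    # URL-constrained: the constraint's artifact is the only allowed source.
    if version_specifier.contains(url_constraint.version):
        return [url_constraint]

    # "If what's there doesn't satisfy the dependencies pip has
    # calculated, you're out of luck" -- resolution fails for this project.
    return []
```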
The “do as I tell you” usage is more difficult to implement, and should be discussed separately IMO (#8076 covers it).
Sure, I'll read that, too.
Constraints are currently implemented as adding to existing requirements. This makes them fit more nicely with the rest of the dependency resolution logic, and relatively easier to make sense of. This is enough for most use cases in practice as well, since existing dependencies are correct most of the time, and a URL constraint just needs to add additional information to tell the resolver to specifically use that one particular artifact, much like version constraints do.
Sure, that's another way to look at it. I was trying to point out that if the phases of resolving the dependencies and installing the packages were exposed explicitly, then many (most?) other use cases, such as this one, could be addressed by modifying the output of the resolver before passing the list to the installer. Those modifications could be left up to individual users or authors of other tools that integrate with pip, but pip's resolver wouldn't have to contain the additional complexity.
I was trying to point out that if the phases of resolving the dependencies and installing the packages were exposed explicitly, then many (most?) other use cases, such as this one, could be addressed by modifying the output of the resolver before passing the list to the installer.
That's an interesting idea. We could have something like `pip install --resolve-only --out=some_file.txt` and `pip install --from-resolve-data=some_file.txt`. I could see some pretty significant issues to consider here (what if the pip options used in the two phases were different, for a start?) and I can't imagine that we'd ever support editing that intermediate file (it's expected that people will do this, but it would have to be on a "we won't help if you break stuff" basis, I'd have thought), but I can see it would be useful.
(Of course, taking that idea to its logical conclusion, we'd break pip up into a suite of tools doing the various "bits" of the process, much like the Unix idea of combining many small tools - and I doubt we're actually likely to go down that route in reality).
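Put together, the two-phase workflow being imagined with those (purely hypothetical, currently non-existent) flags would presumably look something like:

```
pip install --resolve-only --out=resolved.txt -r requirements.txt   # hypothetical: resolve, write the plan, install nothing
# hand-edit resolved.txt, e.g. swap in a patched URL for one package
pip install --from-resolve-data=resolved.txt                        # hypothetical: install exactly what the plan lists
```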
I was trying to point out that if the phases of resolving the dependencies and installing the packages were exposed explicitly, then many (most?) other use cases, such as this one, could be addressed by modifying the output of the resolver before passing the list to the installer.
That's an interesting idea. We could have something like `pip install --resolve-only --out=some_file.txt` and `pip install --from-resolve-data=some_file.txt`. I could see some pretty significant issues to consider here (what if the pip options used in the two phases were different, for a start?)
Yes, that's close to what I was thinking. It might be easier to break down which options apply to each phase by thinking about new sub-commands with names like `pip resolve` and `pip deploy`, with `pip install` encompassing both phases transparently. Most of the options to the existing `install` command would apply to the `resolve` command, but not to `deploy`. Separate sub-commands also have the benefit of separate argument parsers, and if the options aren't available for `pip deploy`, they can't be different from the values given to `pip resolve`. :-)
and I can't imagine that we'd ever support editing that intermediate file (it's expected that people do this, but it would have to be on a "we won't help if you break stuff" basis, I'd have thought), but I can see it would be useful.
Yes, exactly. The `deploy` step would just take the data as input and do what it needs to do so that the listed packages end up on the import path (downloading things, turning them into wheels, writing files to the filesystem, etc.). It would only work from the list, though, without applying any additional rules or processing. It has to assume that the list is "correct". It would be up to the user to make the list contain what they want, and if what they want is broken somehow then that's not pip's problem.
I originally said a list of URLs, but the output of the `resolve` phase might be easier to consume if it includes more of the data that the resolver has. For example, the original requirement, the reason for a dependency being added if it wasn't in the original requirement list, the URL to the package, an optional second URL for a local cached copy, etc. A lot of that data could be optional because `deploy` wouldn't need it, but a program that wanted to modify the data could use it to make decisions about the modifications.
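As a purely hypothetical sketch of what one entry in that intermediate data might carry (names invented here, not an actual pip format):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ResolvedEntry:
    """One entry in the hypothetical output of the resolve phase."""
    name: str                           # project name, e.g. "somepackage"
    version: str                        # version the resolver settled on
    url: str                            # where the deploy step should fetch it from
    cached_url: Optional[str] = None    # optional second URL for a local cached copy
    requested_by: Optional[str] = None  # original requirement, or the dependant that
                                        # caused this entry to be added
```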
(Of course, taking that idea to its logical conclusion, we'd break pip up into a suite of tools doing the various "bits" of the process, much like the Unix idea of combining many small tools - and I doubt we're actually likely to go down that route in reality).
Separate sub-commands may make it easier for users to understand conceptually, but I wouldn't go so far as to create separate executables or main programs. And I wouldn't necessarily say that `pip install` should write the intermediate data to a file before deploying, but internally it should build the same data structure with the resolver and pass it to the deployment code.
IIUC, we're going in the direction of #53, lockfiles, #7819 here.
Looking at the implementation of the install command, I see there are probably some options that would be needed by both phases. Whether to install to the user dir, the target directory, etc. would affect what is seen as already present by the resolver and then would also be needed by the deployer code to know where to put things. So, maybe some of those sorts of values need to be part of the data file produced by the resolve phase.
For what it's worth, "constraints file containing lines of the form `package @ file:///local/path`" was my interpretation of the following lines from the constraints file documentation:
Constraints files are used for exactly the same reason as requirements files when you don’t know exactly what things you want to install. For instance, say that the “helloworld” package doesn’t work in your environment, so you have a local patched version. Some things you install depend on “helloworld”, and some don’t.
One way to ensure that the patched version is used consistently is to manually audit the dependencies of everything you install, and if “helloworld” is present, write a requirements file to use when installing that thing.
Constraints files offer a better way: write a single constraints file for your organisation and use that everywhere. If the thing being installed requires “helloworld” to be installed, your fixed version specified in your constraints file will be used.
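That is, I read it as allowing a line like this (path and version invented) in the organisation-wide constraints file:

```
# constraints.txt
helloworld @ file:///opt/patches/helloworld-1.0+local1.tar.gz
```

so that anything which depends on "helloworld" picks up the patched build.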
If that's not the intention, should there be a documentation bug or enhancement?
If that's not the intention, should there be a documentation bug or enhancement?
Yes. The existing documentation is inaccurate for the new resolver, in that it doesn't explain exactly what is valid in a constraints file (and the new resolver has changed the specific details there). If anyone wants to raise a documentation PR for this, that would be helpful.
Or a PR to implement URL constraints along the path sketched in https://github.com/pypa/pip/issues/8253#issuecomment-674372005 ;)
Or a PR to implement URL constraints along the path sketched in #8253 (comment) ;)
Indeed, or that 🙂 I'm trying to do too many things at once at the moment, and not doing any of them well...
I'd definitely like to see URL constraints along those lines. So, I'm right now working my way through the developer documentation, in hopes of eventually understanding things well enough to write a PR.
I'm currently minimally familiar with pip's codebase, so I probably won't make good time on my own. I'll keep on with this on my own if need be, but if anyone with more experience decides to work on this, I'd probably be more effective if I were helping with their efforts in some way.
Looking over the code, here's what I think needs to happen:
I'm not sure if I accidentally read code that's exclusive to the legacy resolver at some point, which would confuse things. In any case, it looks to me like most of the required changes follow naturally from representing URL constraints as Constraint objects; all I know about those so far is that they should carry hashes and some way of representing the URL.
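For concreteness, the kind of representation I have in mind is roughly the following — a hypothetical sketch only, not pip's existing Constraint class:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Constraint:
    """Hypothetical per-project constraint record (illustrative only)."""
    specifier: str = ""                              # e.g. ">=20.1", merged version constraints
    hashes: List[str] = field(default_factory=list)  # allowed artifact hashes, if any
    link: Optional[str] = None                       # URL the project must come from, if URL-constrained
```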
Are there any big stumbling blocks I'm missing out on?