pip 🚀 - Select file to download based on existing hashes when --require-hashes

FWIW, rgd this specific incidence, I don't care anymore. We'll replace msgpack-python with u-msgpack-python: https://github.com/crossbario/autobahn-python/issues/656

Note, dstufft explained some things on IRC:

<oberstet_> because this is what I think just happened with one of our dependencies, and that broke our (version and hash pinned) builds https://github.com/msgpack/msgpack-python/issues/183
* anthony25_ is now known as anthony25
<dstufft> oberstet_: correct
<oberstet_> dstufft: thanks. too bad. we're gonna switch to u-msgpack, as upstream doesn't want to change what he is doing =(
<dstufft> oberstet_: Ideally pip would be smarter here, and if there are 5 files available and one of them doesn't match a hash, it'll just ignore that file
<oberstet_> right. or seal the whole set of artifacts for a package version. or, what does ToR (was it?) have for "doing watertight releases" again?
<oberstet_> but thanks for comfirming!
<oberstet_> dstufft: meejah pointed me to something state of the art rgd all that (how to move bits to users tamperfree etc), but I forgot

oberstet on 29 Apr 2016

dstufft: thanks. too bad. we're gonna switch to u-msgpack, as upstream doesn't want to change what he is doing =(

It's OK to switch u-msgpack.
But I can't stand about you treated me as I ill-mannered person.
Releasing wheels after sdist is so bad thing?

About this specific version, I just tried manylinux1 wheel. (pip install msgpack-python is VERY FAST, yay!).
But I build wheels in other machines I usually use, especially windows. So I sometimes release wheels few day after sdist.

Sometimes, wheels is released much later because dependency problem.
For example, I'm waiting MySQL Connector/C supporting VC14. After it released, I'll add cp35-win wheels to PyPI.

methane on 30 Apr 2016

For example other than my project, Cython added manylinux1 wheels.
https://pypi.python.org/pypi/Cython/0.22

Cython 0.22 is released 2015-02-12. And manylinux1 wheels are added at 2016-04-22.
It makes me happy since it makes builds on Travis much faster.

Thanks to manylinux1, many projects are adding binary wheel for linux and CI on linux gets faster and faster.

methane on 30 Apr 2016

To stop creating such a dissension, I think:

pip uses sdist when hash mismatch happens
Officially recommend --no-binary option when using --require-hashes.
Officially recommend "release wheels at same time to sdist, increase version number only for adding wheels after".

methane on 30 Apr 2016

Releasing wheels after sdist is so bad thing?

In my opinion, changing the contents of a released package after the fact (4 months in this case) is bad practice. It is similar to moving a published Git release tag, which is even worse, granted. Technically, you can move tags around on a public repo, but you really shouldn't.

A released package on PyPI is identified by its version. Adding wheels to a release changes the set of artifacts inside this package. And that effectively changes the meaning of: msgpack==0.4.7 meant something different in 2016/01 (when it was released) as it means as of today.

And the problems with such practice are many fold. For example, would we (Crossbar.io) already use hashed/pinned requirements instead of only open ended requirements for the PyPI release (which luckily, we don't yet do - but we totally want to), then above would have broken not only our CI, but also our released package! And would have forced us to push a new release - only because upstream added a new wheel to an older release.

Looking at https://pypi.python.org/pypi/Cython/0.23. They do the same. The release artifacts are dated from: 2015-08-08, 2015-08-09 and 2016-04-22!

I uphold my view that this is really bad practice. @methane you are right then, it is probably even "common practice" - but it is bad common practice IMO.

Changing the set of release artifacts does change the meaning of a release (as in it is identified by the version only), and that should only be done in a new release, on a new release version of the package.

So instead of adding new wheels to an old release, you could have pushed a new release (0.4.8) with an extended set of artifacts.

Note that analyzing this took me like an hour alone, as the new artifacts only broke some of our CI platforms. Eg. CPython 3.3 continued to work - because wheels only were published for other versions (3.4./3.5).

oberstet on 30 Apr 2016

Now, that being said, here is my take on how pip could improve things:

pip install --require-hashes -r requirements.txt

should only ever consider the artifacts for which there are hashes in requirements.txt. It should ignore artifacts that it discovers as eligible when looking on PyPI, but for which it doesn't have hashes.

Currently, it will bail out. By changing that behavior, a package releaser can effectively seal the set of permissible dependents.

So, for example, in our case, we had a hash for the sdist of msgpack-python-0.4.7 inside our requirements.txt - but only that single hash (because the sdist was the only thing available at the time I did pip freeze our package).

Now, when a new artifact suddenly appears for 0.4.7 (like msgpack_python-0.4.7-cp35-cp35m-manylinux1_x86_64.whl), there is no hash in requirements.txt, and pip would simply have ignored that, even if the wheel would in principle be the preferred artifact (preferred over the sdist).

oberstet on 30 Apr 2016

As to my view why not sealing a released package is bad: I am concerned about security. Eg by allowing "open ended" dependencies, this effectively creates an uncontrolled vector of attack. Its just a matter of pushing a new artifact to an already released package (extending the set) which is preferred over other artifacts - this allows to inject anything of course.

I do understand that the PyPI/pip packaging infrastructure isn't (yet?) designed for such water tight releases. As a publisher of a security sensitive networking software, I want to protect my users. I want to assure them that they exactly run the bits that we have tested.

We do release binary OS packages for that matter (deb/rpm/pkg) which provide more of that confidence. These packages, for the same reasons, contain not only Crossbar.io, but everything but the libc! They bundle PyPy. They bundle OpenSSL. And hence, they are nearly self-contained (other than the C run-time library, which simply isn't possible to bundle).

FWIW, in my impression (thanks to @meejah for pointing me this!), this is going in the right direction: https://theupdateframework.github.io/

oberstet on 30 Apr 2016

In my opinion, changing the contents of a released package after the fact (4 months in this case) is bad practice.

No, I didn't change contents of package.
I added another format package of exactly same version.
Hash mismatch is happen because pip compares tar.gz hash with .whl.

methane on 30 Apr 2016

You did change the contents of the package (which is the set of all artifacts as far as PyPI is concerned) by adding new artifacts (the wheels) that weren't there before.

And yes, I do understand how hashing works.

oberstet on 30 Apr 2016

Hash checking is very new feature of pip. I think this issue is problem of current pip behavior.
When pinning package hash, it should pin package format too.

methane on 30 Apr 2016

See, we simply disagree on this. And that is just fine! I do not want to insult you. I just disagree. And I care enough about this stuff that I am willing to bite the bullet and move to a different dependency.

oberstet on 30 Apr 2016

I don't care about moving umsgpack. It's good pure Python implementation. Bye-bye.

I'm talking about how hash pinning feature should work.

methane on 30 Apr 2016

You did change the contents of the package (which is the set of all artifacts as far as PyPI is concerned) by adding new artifacts (the wheels) that weren't there before.

pip installs _one artifact_, not _set of artifacts_. Do you mean "pip installs not package, but one part of package?". I don't want to do such a word game.

If uploading one artifact first and another artifact after is bad practice, it should be officially deprecated on packaging.python.org and PyPI should provide some way to publish new version atomically.

If it's not a bad practice, pip should behave more friendly.

methane on 30 Apr 2016

PyPI should provide some way to publish new version atomically

I agree with that.

Either that, or being able to "seal" a release.

Like as in: have a hash of hashes of all artifacts of a release associated with the release version. And then I could get away with sticking that single hash into my requirement.txt (instead of having to stick many hashes in there).

oberstet on 30 Apr 2016

I added another format package of exactly same version.

In my view, this is a perfectly legitimate thing to do. The last thing we want is to _discourage_ people from publishing wheels, simply because they have to publish them with the initial release, or inflict a version bump containing no changes on users.

Hash mismatch is happen because pip compares tar.gz hash with .whl

This sounds to me like it's simply a bug in the hash pinning code. No big deal, just something that needs fixing, I'd say the hash checking code needs to be more clever about looking for "the same thing it downloaded originally" (whether that's a sdist, or a wheel that has since been superseded by a more specific wheel). But I don't use the hash pinning feature, so I'll leave the design to those that do - just don't impose new rules like "you can't add wheels to an already released version" that affect people who don't use hash pinning in the process of fixing the issue.

pfmoore on 30 Apr 2016

Hash mismatch is happen because pip compares tar.gz hash with .whl

It happens because pip decides the wheel is superior to source dist, and if the hash for the wheel isn't in the requirements.txt, it will bail out.

Put differently, --require-hashes requires all hashes for all build artifacts to be in the requirements.txt. I can do that - but if a dependency chooses to upload more artifacts, I have to update the requirements.txt and do another release myself. So essentially, this just move the problem further down the line ..

If pip --require-hashes would instead only consider artifacts eligible for which there is a hash in the requirments.txt, all would be fine. But that is not how it works today:

Hash verification is an all-or-nothing proposition.. which imposes several other security restrictions: hashes are required for all requirements. Hashes are required for all dependencies.

Source: https://pip.pypa.io/en/stable/reference/pip_install/#hash-checking-mode

I can't find were it says: "Hashes are required for all release artifacts" though ..

oberstet on 30 Apr 2016

This is harder to do than it appears on the surface because the hash checking mode is based off of the hash of the file we're downloading, but we don't know that hash until we've downloaded it so there is a Catch-22. We could possibly let PyPI _optionally_ tell us the hashes of the files and then if PyPI provides them then we can use that to make a better guess about what files are acceptable (while obviously still actually verifying what file we finally download).

dstufft on 7 Sep 2016

I just got bit by this. The maintainers of protobuf at 3.5.0.post1 added some wheels 5 days after the version was originally uploaded: https://pypi.python.org/pypi/protobuf/3.5.0.post1.

As I'd generated my Pipfile.lock between the initial publish and the wheel upload, my Pipfile.lock did not include the hashes for the new wheels. As such, my builds started failing once the new wheels were available, and Pip ended up downloading those instead of the earlier source distribution.

taion on 27 Nov 2017

For the "All packages for a release should be released as a single unit" option, see https://github.com/pypa/warehouse/issues/720 The general concept there is to start releases in some form of staging mode that allows artifacts to be freely added and removed (but doesn't make them available through the regular download APIs), and then locks them against further changes (including addition of new artifacts) once they're promoted to fully released.

I don't think we want to prohibit backfilling wheel archives for earlier releases though, so I like the idea of falling back to checking the sdist hash if the wheel hash doesn't match. (Adding a query API for artifact hashes wouldn't be as helpful, since it would take a while for caching proxies to catch up, whereas "try the sdist if the wheel hash fails to validate" is a client-side only change)

ncoghlan on 29 Nov 2017

That sounds reasonable.

Is there some sort of hole there where it would make builds less reproducible, though? Say initially we resolve to hash A for some wheel, then later another wheel makes us resolve to hash B, in which case the fallback logic might point us to the sdist with hash C?

For the build to be fully reproducible, I think there ought to be the option to get back exactly the same wheel.

taion on 29 Nov 2017

Thinking about this a bit more, it's possible that the implementations for getting hashes in pip-tools and Pipenv are just not quite right. Currently, both tools lock down all hashes available for a package version on PyPI.

For packages with an sdist or a universal wheel, I probably _do_ want to lock down the hash for that specific artifact. If my dev environment (or original build environment) is using the universal wheel, I want to keep using that universal wheel – I don't want to fall back to the sdist without action on my part, because the point of using hashes is to not do so.

It's tricky in a pragmatic sense for packages that aren't pure Python, though. For a typical dev setup where dev boxes use OS X, but deployment boxes use Linux, there isn't really a good answer, I think, unless I want to always use the sdist, but not all packages even have that (e.g. tensorflow-gpu, but also building NumPy from source is just not a great idea).

I don't really have an answer here in general, but I do think that if I'm currently using a universal wheel, I should be able to continue using that universal wheel, even if a more specific wheel becomes available.

taion on 29 Nov 2017

@taion

I don't want to fall back to the sdist without action on my part, because the point of using hashes is to not do so.

You should be able to control this with the --only-binary option.

Kentzo on 22 Dec 2017

@Kentzo That wouldn't solve the problem where a new wheel could cause a build failure when using --require-hashes, right? Because I'd get the new wheel, and I wouldn't have the hash for that wheel.

taion on 22 Dec 2017

@taion I'd expect pip to always bring a distribution with a matching hash. That was the topic of my original issue.

The --only-binary / --no-binary are good enough to pick between wheels and sdists.
Picking among wheels is more complicated. I believe pip should prefer the most specific version by default, but with user specifying a hash it makes sense to respect this choice. Only a warning is acceptable.

Kentzo on 22 Dec 2017

I am a +1 to this. This is basically a bug in pip's current version pinning logic.

I'm willing to fix this as a part of bringing in a proper resolver, so I'll assign this to myself. If someone else wants to fix this before that, I'm happy to help with that too. :)

pradyunsg on 5 Mar 2018

I took a look at this. The patch required (if my assessment is correct) would benefit from #5971, so I’ll probably wait to see whether that works out.

A question to people interested in this: What would be the expected error if hashes are provided, but no files match any of them? Hash mismatch (same as the one raised currently), or No matching distribution found (as the one if there’s no applicable files)?

uranusjr on 30 Mar 2019

👍1

Thanks for reopening and working on this! Cool.

So my main goal is to get to 100% reproducible builds (to the single bit .. I am aware this probably has more to it than only this issue).

pip install --require-hashes -r requirements.txt

should only ever consider the artifacts for which there are hashes in requirements.txt. It should ignore artifacts that it discovers as _technically_ eligible when looking on PyPI, but for which it doesn't have hashes in requirements.txt.

Essentially, my argument is: when there is no hash, it shouldn't be considered eligible no matter what.

Now, rgd your Q: I don't care much about the specific error raised: reuse existing (a) "hash mismatch" vs (b) "no matching distribution" are both fine. As long as the pip install comes back with a non-zero return code breaking the build I am cool. A build broken because of this should require user/developer investigation ..

That being said: (b) seems more explicit, as given the fixed and sealed set of hashes in requirements.txt does not contain a hash for the supposedly new wheel added after the fact (for which there is no hash in requirements.txt - the premise / situation this issue is about). Hence, it is not exactly a "mismatch", but a "no match" ..

oberstet on 30 Mar 2019

FYI, this has been addressed by PR #6699 (also see issue #5874 for more recent discussion).

cjerdonek on 15 Jul 2019

🎉1

Pip: Select file to download based on existing hashes when --require-hashes

Description:

All 28 comments

Related issues