Warehouse: What metadata/installability checks should Warehouse check uploads for?

Created on 6 Feb 2014  路  30Comments  路  Source: pypa/warehouse

Recently, a project that I worked on pushed a new version to PyPI and it turned out to be completely broken, because the setup.py was referencing a README.rst file that was not present in the sdist.

It would be awesome if PyPI could so checking of packages that are uploaded.

To start with, it could create a virtualenv and try pip installing the package and make sure that exits with status 0. Perhaps it could also try easy_install to make sure that works too.

There are more elaborate things that could be done like run tests if they are included (many packages don't bundle their tests though) or try to validate the RST, but I think just pip installing is a very good first start, as it detects packages that are completely broken.

feature request needs discussion testing

Most helpful comment

All 30 comments

What I'd love is to see the packaging/build tooling help prevent this before it ever gets up to pypi/warehouse, although it might be nice to do some sanity checking of its own as well.

Its a weird issue since the way this is done is people reading files in setup.py, but would be nice if this was somehow detected and alert you that you are referencing things that wont end up in the dist.

I don't think executing untrusted code on the PyPI server is an option. Can this be done without?

This is partly what the cheesecake project attempted to do; I like the idea in theory but it requires a bunch of work and support to make happen in practise.

I guess tox would've caught this. Not everyone uses tox though and even people who do sometimes forget. Also tox could probably pass inadvertently because of a file in the dev's workspace that's not pushed up to the canonical git repo when using setuptools_git

I note that CPAN does automated testing of packages - I have no idea how they sandbox. VMs would be my guess. Obviously Travis CI and drone.io sandbox code to run tests. A system with VMs or Docker containers could help.

That said I recognize that this requires non-trivial resources so it's not easy.

How awesome would it be to search PyPI and be able to sort by test coverage?!

What if setup.py had a field for ci_server and you could plug in Travis, Circle, Drone.io, etc. and they could run on your release tarball rather than your VCS repo?

What would be even more awesome is if we ran pyroma against uploads and users could filter out projects with low scores? I would personally find that VERY helpful. I would even vote for rejecting uploads that get lower than a minimum score. IIRC pyroma doesn't even require sandboxing.

Example of how CPAN runs tests on packages:

http://ppm4.activestate.com/i686-linux/5.16/1600/M/MS/MSABRAMO/App-TarColor-0.011.d/log-20120801T133902.txt

We can't let Perl show us up! :smile:

ActiveState is active in Python as well as Perl -- maybe they are interesting in hosting a service like this like they do for the Perl community?

@agronholm Ok so Looking at pyroma it has the following checks

  1. The package should have a name, a version and a Description. If it does not, it will recieve a rating of 0.
  2. The version number should be a string. A floating point number will work with distutils, but most other tools will fail.
  3. The version number should comply to PEP386.
  4. The long_description should be over a 100 characters.
  5. Pyroma will convert your long_description to HTML using Docutils, to verify that it is possible. This guarantees pretty formatting of your description on PyPI. As long as Docutils can convert it, this passes, even if there are warnings or error in the conversion. These warnings and errors are printed to stdout so you will see them.
  6. You should have a the following meta data fields filled in: classifiers, keywords, author, author_email, url and license.
  7. You should have classifiers specifying the sypported Python versions.
  8. If you are using setuptools or distribute you should specify zip_safe, as it defaults to "true" and that's probably not what you want.
  9. If you are using setuptools or distribute you can specify a test_suite to run tests with 'setup.py test'. This makes it easy to run tests for both humans and automated tools.
  10. If you are checking on a PyPI package, and not a local directory or local package, pyroma will check the number of owners the package has on PyPI. It should be three or more, to minimize the "Bus factor", the risk of the index owners suddenly going off-line for whatever reason.
  11. If you are checking on a PyPI package, and not a local directory or local package, Pyroma will look for documentation for your package at pythonhosted.org and readthedocs.org. If it can't find it, it prints out a message to that effect. However, since you may have documentation elsewhere, this does not affect your rating.

So going through these checks!

  1. Yes, Yes, Description or Summary? Either way, Maybe.
  2. Yes, but I don't think PyPI can actually look at this without executing the setup.py
  3. Yes, once PEP440 is finalized I want to enforce this.
  4. Maybe
  5. Pyroma probably does it wrong, because just because it works in docutils doesn't mean PyPI will render it. But I agree with the check in spirit.
  6. * Classifiers -> Yes, but specific ones

    • Keywords -> No, they are kinda dumb

    • Author -> Sure

    • Author Email -> No

    • URL -> No, sometimes the PyPI Page _is_ the URL

    • License -> I think this is better handled by classifiers

  7. Yes-ish, In reality a dedicated metadata would be better for this but this is what we have ATM
  8. We can't determine this from PyPI w/o executing
  9. Useful, but I'm not sure we want to be hardcoding in setup.py stuff like this right now
  10. Ehh, I'm not real excited about this one. 3 is an arbitrary number and I don't think giving people a negative mark because they didn't convince two other people to maintain their library isn't very nice.
  11. Documentation is a big one i'd like to do as well, Docs on pythonhosted.org we can detect super easily, however RTD is harder. Crate did this and had to maintain a table of mappings because it had to guess at what the RTD link would be. This will be easier if we key it off a specially named project-url, but setuptools doesn't submit project urls ATM so it's hard to give people a negative mark for something that they can't actually give us yet if they aren't using pythonhosted.org

What I propose is:

  1. Enforcing that all the minimum files for installation (setup.py and egg-info/dist-info/whatever) is in place
  2. Enforcing that name, description, long_description and license are filled out
  3. Enforcing that the version conforms to the relevant PEPs (for new projects at least)
  4. Enforcing that the supported range of Python versions is somehow indicated in the metadata
  5. Filtering out projects in search that don't have downloadable distributions available
  1. Yes!, PyPI Legacy "tries" to do this now, but I don't think it does a very good job at it.
  2. Techincally License field isn't supposed to be filled out if a license classifier fits, so as long as we have that logic in place, then OK by me.
  3. Need to figure out what sort of coverage the related PEPs have over the current versions on Python, but I agree.
  4. Probably OK
  5. Yes!

This would be a fantastic feature. I'm only moving this to a future milestone because it is not on the critical path for the immediate goal of switching from the old PyPI to Warehouse.

Is this something people are still interested in?

If we had a system in place that allows to fully install a package there鈥檚 lots of other useful information we can extract from it. Transitive dependencies, metadata, even tests.

Yes, people are still interested in this but it's been deferred until the immediate goals have been met and Warehouse has replaced Cheeseshop as the official PyPI.

Best thing is probably to make this a sandboxed standalone project with some kind of API that warehouse can call.

I鈥檓 currently working on a POC for this. If anyone is interested, please let me know.

@jayfk I would love for you to post to pypa-dev to give a heads-up that you're doing this, so people can speak up there if they're interested in helping.

@jayfk could you post to pypa-dev or the distutils-sig list about your proof of concept? Thanks!

Jannis posted to distutils-sig -- thanks @jayfk!

Yep! Thanks for pointing that out @brainwane!

Actually, Perl has CPANTesters, which lets the community help to do the testing over a lot of different platforms and version combinations.
And they have had it for more than a decade.

This is definitely something where Python can learn from Perl.

This is an issue that came up in a discussion of improving pip's automated tests last week; pip would find this feature helpful.

Reminder to folks following this: as @ewdurbin and @pradyunsg and I talked about in a meeting about the pip resolver work, getting this implemented might help us smooth testing of and the effects of the resolver rollout. So if any volunteers could help get this finished and merged, we'd appreciate that! cc @yeraydiazdiaz

@uranusjr @pfmoore @pradyunsg I'd appreciate if you could review this issue, especially https://github.com/pypa/warehouse/issues/194#issuecomment-41196203 and the comments below it, which discuss some upload checks we could implement in Warehouse (blocking noncompliant uploads). Which of those checks would be particularly valuable to the pip resolver work? Once we have that list, we can make a checklist so this issue becomes more completable.

One wild wish I鈥檝e always hoped for is to eliminate dynamic dependency altogether, and ensure all dists of a given version specify exactly the same set of dependencies. One of the most resource-consuming (both for development and runtime) part of dependency resolution is the need to download a package matching the host environment, extract it, and potentially build it to get dependencies. This can cost a tremendous amount of time if you鈥檙e on a platform without wheels (e.g. musl), and it would be a vast improvement if we are able to download (say) a wheel for Windows and know its metadata would match if I build it on a random Linux machine. m

It would be great to have a check that the name and version in the metadata match the filename (sdist or wheel), so that we don't have to abort on mismatches. (We'd still have to check, for local files and other edge cases, but knowing that the filename is reliable would still be useful).

As @uranusjr mentioned, the only really useful checks from the pip resolver POV are ones which would allow us to minimise downloading of distributions. All we use is name, version and dependencies.

So for us:

  1. As I said, knowing that the filename gives us the correct name/version is massive. Sufficiently so that we make that assumption already and crash out if it turns out to be wrong.
  2. Short of Warehouse growing a "give me the metadata" API, we have to download to get dependencies, so the only savings here for us would be across files (i.e., checks that all sdists and wheels for package FOO version X.Y.Z return the same dependency metadata). If we had that, we could cache dependencies for name/version, or extract them from a wheel rather than the corresponding sdist, saving a build step. But we'd still need to special-case PyPI to enable that, so the cost in complexity may mean it's not worth it.

As I said, knowing that the filename gives us the correct name/version is massive. Sufficiently so that we make that assumption already and crash out if it turns out to be wrong.

I'd argue this is already something you should be able to assume. If those two things don't line up I'd just call it an error. We've never made any promises that it would work, and if it does currently work I'd call it an implementation detail that just happened to allow it for awhile.

Short of Warehouse growing a "give me the metadata" API

I'm fairly sure that at some point we will supplant the simple API with a better one that will include the needed information, but that is obviously not yet there. That being said, it's probably useful to start checking/enforcing any useful checks now to get better data in the long run.

eliminate dynamic dependency altogether, and ensure all dists of a given version specify exactly the same set of dependencies.

I'm not sure that this is possible, or even desirable. At least when we talked about PEP 517 @njsmith argued pretty strongly that we need some sort of hook that enabled programatic dependencies, and that we simply could not depend on static only metadata for sdists.

At the summit last year @crwilcox asked:

can we fail when there's missing author, author_email, URL. Currently warnings on setup.

@dstufft said, above (in 2014):

  • Author -> Sure
  • Author Email -> No
  • URL -> No, sometimes the PyPI Page _is_ the URL

@dstufft do you agree with your 2014 self that author email and project URL should not be mandatory?

@dstufft @pfmoore @pradyunsg @ewdurbin @di @uranusjr @techalchemy OK, I reread the above discussion. It sounds like the metadata/installability checks that make sense for Warehouse to mandate are:

  • Package name

    • must exist

    • must match filename (sdist or wheel)

  • Version

    • must exist

    • must be PEP 440-compliant

    • must match filename (sdist or wheel)

  • Description: must exist
  • Author name: must exist
  • Classifiers for supported Python versions - potentially supplanted by #3889/Requires-Python
  • Certain additional classifiers (which ones?)

    • License

Does this sound right? Are any of these already mandated?

Sounds reasonable to me.

I'd argue that there should be some means of contacting the author, but I concede that "issue tracker on the project webpage" is entirely reasonable for that. So we move out of the area of what should be checked, and into a new metadata field.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

hartwork picture hartwork  路  4Comments

mbakke picture mbakke  路  3Comments

Lawouach picture Lawouach  路  3Comments

nlhkabu picture nlhkabu  路  4Comments

toddrme2178 picture toddrme2178  路  3Comments