Warehouse: Add API endpoint to get latest version of all projects

Created on 9 Jan 2015 · 28 comments · Source: pypa/warehouse

There are projects like https://release-monitoring.org/ which want to monitor PyPI to see when new versions of specific projects are released. Currently this requires one HTTP request per tracked project, which can easily turn into hundreds or thousands of HTTP requests. A JSON endpoint that simply lists the names and versions of all projects would reduce this to a single HTTP request.

Labels: APIs/feeds, feature request

dstufft

👍10

Most helpful comment

+1 here, the same is needed for repology.org, from which I've had to remove PyPI support since https://pypi.python.org/pypi/, which it used, got deprecated.

So, it'd be nice to have a machine-readable dump of all PyPI packages with versions and preferably other metadata such as summaries and licenses. It doesn't have to be real-time; regenerating it hourly is OK for my purposes.

Example of what I'd like to have for repology:

[
    {
        "name": "requests",
        "version": "2.18.1",
        "summary": "Python HTTP for Humans."
    }
    ...
]
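
To make the request concrete, here is a minimal sketch of how a consumer might index such a dump by project name. It assumes the format proposed above, which is not an existing PyPI endpoint; `index_dump` and the sample payload are hypothetical names used only for illustration.

```python
import json


def index_dump(raw):
    """Map lower-cased project name -> (version, summary) for a dump in the proposed format."""
    projects = json.loads(raw)
    return {
        entry["name"].lower(): (entry["version"], entry.get("summary", ""))
        for entry in projects
    }


# Hypothetical payload in the proposed format (no such endpoint exists today).
sample = '[{"name": "requests", "version": "2.18.1", "summary": "Python HTTP for Humans."}]'
index = index_dump(sample)
print(index["requests"][0])
```

Keying by a lower-cased name is a simplification; a real consumer would probably apply full project-name normalization.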

AMDmi3 on 19 Jul 2017

👍4

All 28 comments

\cc

pypingou on 19 Feb 2015

It'd also be helpful if it was possible to search through the list with some filters. I've been meaning to add an instant answer to DuckDuckGo for Python packages, but sadly it's not possible right now as there is no search API. I can also work on it. Just wondering why it doesn't exist yet.

iambibhas on 4 Apr 2015

👍2

Would it be possible to solve this need (the use case for release-monitoring.org) instead by streaming and parsing the PyPI update changelog, similar to how bandersnatch figures out what new updates to sync on a mirror?
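
A rough sketch of that changelog-driven approach: fold changelog events into a map of the most recently announced version per project. The event shape is an assumption, namely the (name, version, timestamp, action, serial) tuples that PyPI's XML-RPC `changelog_since_serial()` is documented to return; the `"new release"` action string, the helper name, and the sample events are also assumptions for illustration.

```python
def latest_from_changelog(events, known=None):
    """Fold changelog events into ({project: last announced version}, last serial seen).

    Each event is assumed to be a (name, version, timestamp, action, serial) tuple.
    """
    latest = dict(known or {})
    last_serial = 0
    for name, version, _timestamp, action, serial in events:
        last_serial = max(last_serial, serial)
        # Assumption: only "new release" events announce a version; others are skipped.
        if version and action == "new release":
            latest[name] = version
    return latest, last_serial


# Hypothetical events; live ones might come from something like
# xmlrpc.client.ServerProxy("https://pypi.org/pypi").changelog_since_serial(serial).
events = [
    ("requests", "2.18.0", 1497500000, "new release", 101),
    ("requests", "2.18.0", 1497500005, "add source file requests-2.18.0.tar.gz", 102),
    ("requests", "2.18.1", 1497600000, "new release", 103),
]
latest, serial = latest_from_changelog(events)
print(latest, serial)
```

Note that this records the most recently uploaded release, which is not necessarily the highest version number (for example when a maintenance release for an older series is uploaded later).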

fungi on 13 Jul 2016

It would also be useful for the PyCharm IDE, to notify users about outdated packages in the integrated package manager.

traff on 13 Jul 2016

Could it be efficient to also serve cached deltas of the whole catalog, e.g. with an optional from=iso8601datetime parameter?

Is there something like delta RPMs (DRPMs) or rsync over HTTP that could reasonably be added in a view and generated in a celery WarehouseTask?
ETags, If-Modified-Since
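
As a sketch of the ETag idea from the client side: a poller can send the ETag it saw last time in `If-None-Match` and skip the download when the server answers 304 Not Modified. This uses only the standard library; the `opener` parameter and the function name are hypothetical conveniences, and whether a given endpoint honors conditional requests is an assumption.

```python
import urllib.error
import urllib.request


def conditional_get(url, etag=None, opener=urllib.request.urlopen):
    """Return (body, etag); body is None when the server answers 304 Not Modified."""
    request = urllib.request.Request(url)
    if etag:
        # Ask the server to send the body only if it changed since this ETag.
        request.add_header("If-None-Match", etag)
    try:
        with opener(request) as response:
            return response.read(), response.headers.get("ETag")
    except urllib.error.HTTPError as error:
        # urllib reports 304 as an HTTPError; treat it as "cached copy still valid".
        if error.code == 304:
            return None, etag
        raise
```

A caller would keep the returned ETag next to its cached copy and pass it back on the next poll.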

westurner on 13 Jul 2016

👍1

@traff it seems that pulling current versions for _all_ packages on the index may be a bit overzealous. getting a list from /simple should be sufficient for offering users a dialog of installable packages... the current/latest version identifier is arguably unimportant at that point.

perhaps after a user has identified a package by name, a cheap (cached at edge) call to the json API would be successful for offering a list of installable versions. GET https://pypi.python.org/pypi/<package_name>/json

regarding offering updates, again the existing json api would support a call per package to get current versions using the same call as above. these calls are cached at the edge, should be fast for end users, and are 100% fair game for community use.

i suppose my question is how is 1 long (2-10s) request to obtain a list of all package/versions better than N 200-500ms requests (which can be submitted concurrently) in order to provide information on current versions and available updates.
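
for illustration, the "N concurrent requests" approach could look roughly like the sketch below. it reads `info.version` from the per-project JSON API; the pypi.org host, the helper names, and the worker count are assumptions for the example, not a recommendation from the PyPI admins.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch_latest(name):
    """Fetch one project's latest version from the per-project JSON API."""
    url = f"https://pypi.org/pypi/{name}/json"
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)["info"]["version"]


def latest_versions(names, fetch=fetch_latest, max_workers=8):
    """Resolve many projects concurrently; returns {name: version}."""
    names = list(names)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each name with its result.
        return dict(zip(names, pool.map(fetch, names)))
```

calling `latest_versions(["pip", "pytest"])` would issue the requests in parallel; the `fetch` parameter exists only so the network call can be swapped out.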

ewdurbin on 13 Jul 2016

to be clear, i'm not opposed to supporting the specific feature request in this issue. but am trying to aid in guiding PyCharm off of the currently used index page they are scraping for this information.

that page is incredibly expensive to generate and often causes congestion for the PyPI backends.

ewdurbin on 13 Jul 2016

👍2

@ewdurbin Yes, that makes sense. Our only doubt was that we thought making N requests is worse than making one. But in this case, it could be the opposite.

traff on 13 Jul 2016

indeed, @dstufft did a great job of summarizing the state of the world:

here and here

ewdurbin on 13 Jul 2016

Relevant mailing list threads:

"[Distutils] PyPI index workaround" https://mail.python.org/pipermail/distutils-sig/2016-July/029236.html
- re: efficient ways to poll for latest versions of multiple PyPI packages

westurner on 13 Jul 2016

One thing this could allow for is providing a JSON API for this info instead of having to parse the HTML of the /simple page just to get a list of projects.

brettcannon on 6 Aug 2016

❤1

So, what is the objective here:

- "Add API endpoint to get latest version of all projects"
  - add versions to /simple
  - cache invalidate on every package upload and then JOIN all packages
  - (so that everyone re-downloads the whole catalog every time)
- "Add API endpoint to get latest versions and package checksums of a specific subset of projects matching version and/or package and platform constraints"

On Aug 6, 2016 3:18 PM, "Brett Cannon" wrote:

One thing like this could allow for is providing a JSON API for this info
instead of having to parse the HTML of the /simple page to just get a list
of projects.

So, caching and conditional requests (with etags) would probably be the
only way to afford this functionality (SELECT * FROM packages WHERE package
IN {pip, pytest, virtualenv, scripttest, mock, pretend})

Seemingly OT, but this may be the best guide to REST API cache management
(cache keys, etags, invalidation) I've ever read:
http://chibisov.github.io/drf-extensions/docs/#caching

westurner on 6 Aug 2016

👍1

Hi there, just to throw in my $0.02 here - I use a utility which checks a requirements.txt file's pinned versions to see if there are newer releases on PyPI. If there are N packages in the file I currently make N requests to PyPI. I make a GET at, for example, https://pypi.python.org/pypi/django/json - which is a fairly large response (last time I checked, content-length: 128597), whilst the only data I'm actually interested in is under the key data['info'], to see the latest version details. Most of the response bytes are in data['releases'], which I just ignore.

It would be very helpful if there was an API endpoint which could return just this latest version info for the package. Or, even better, for a user-specified list of packages - so I could make just one request instead of N requests. Thanks!
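
A minimal sketch of that kind of check, assuming simple `name==version` pins and per-project JSON payloads that have already been fetched. Only `data['info']['version']` is read; the function names and the sample versions are hypothetical.

```python
def parse_pins(requirements_text):
    """Extract {name: version} from simple 'name==version' lines only."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip().lower()] = version.strip()
    return pins


def outdated(pins, payloads):
    """Return {name: (pinned, latest)} where the pin differs from info.version."""
    stale = {}
    for name, pinned in pins.items():
        latest = payloads[name]["info"]["version"]  # data['releases'] is ignored
        if latest != pinned:
            stale[name] = (pinned, latest)
    return stale


# Hypothetical inputs standing in for a requirements file and fetched JSON.
requirements = "django==1.9.0  # web framework\nrequests==2.18.1\n"
payloads = {
    "django": {"info": {"version": "1.10"}, "releases": {}},
    "requests": {"info": {"version": "2.18.1"}, "releases": {}},
}
print(outdated(parse_pins(requirements), payloads))
```

The comparison is plain string inequality, so it flags any difference rather than deciding which version is newer; extras, environment markers, and other requirement syntax are not handled.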

wimglenn on 9 Aug 2016

👍4

Would love to be able to get checksums for the releases from this API somehow.

jakirkham on 30 Aug 2016

@jakirkham Checksum for what? The files?

dstufft on 30 Aug 2016

👍1

Sorry for the delay. Yes, for the files.

jakirkham on 21 Sep 2016

Added issue ( https://github.com/pypa/warehouse/issues/1638 ) for getting the checksums via the API.

jakirkham on 3 Jan 2017

I recognize that this is beyond the scope of this particular issue, but,
how is this problem different from BitTorrent w/ HTTP web seeds?
https://en.wikipedia.org/wiki/BitTorrent#Web_seeding

manifest
hashes
mirroring

westurner on 11 Aug 2017

On Thursday, August 10, 2017, Wes Turner wes.turner@gmail.com wrote:

I recognize that this is beyond the scope of this particular issue, but,
how is this problem different from BitTorrent w/ HTTP web seeds?
https://en.wikipedia.org/wiki/BitTorrent#Web_seeding

manifest

hashes

mirroring

(In context to mirroring with e.g. bandersnatch).

MD5 is no longer recommended?
https://en.wikipedia.org/wiki/MD5#Overview_of_security_issues
- [EDIT] https://cryptography.io/en/latest/hazmat/primitives/cryptographic-hashes/#md5

westurner on 11 Aug 2017

Could it be efficient to also serve cached deltas of the whole catalog, e.g. with an optional from=iso8601datetime parameter?

Is there something like delta RPMs (DRPMs) or rsync over HTTP that could reasonably be added in a view and generated in a celery WarehouseTask?

ETags, If-Modified-Since

Zsync is like rsync over HTTP
http://zsync.moria.org.uk/

zsync provides transfers that are nearly as efficient as rsync -z or cvsup, without the need to run a special server application. All that is needed is an HTTP/1.1-compliant web server.

[...]

Single meta-file — zsync downloads are offered by building a .zsync file, which contains the meta-data needed by zsync. This file contains the precalculated checksums for the rsync algorithm; it is generated on the server, once, and is then used by any number of downloaders.

westurner on 17 Oct 2017

I'm grateful for the discussion here and apologize for the slow response.

There's now an open issue #1478 for getting a regular dump of the PyPI database, and other open issues tagged as "APIs/feeds". And now, https://warehouse.readthedocs.io/api-reference/ has a bunch more guidance on how developers can use the Warehouse APIs (RSS feeds, JSON, /simple/ emulation of the legacy API, and XML-RPC methods) already available at https://pypi.org .

The folks working on Warehouse have gotten funding to concentrate on improving and deploying Warehouse, and have kicked off work towards our development roadmap -- the most urgent task is to improve Warehouse to the point where we can redirect pypi.python.org to pypi.org so the site is more sustainable and reliable, and shut down the legacy site. We discussed this issue in our core developers' meeting today. Since this feature isn't something that the legacy site has, I've moved it to a future milestone.

I would be pleased to add this issue to the list of things people work on at this year's PyCon sprints if folks are interested.

Thanks and sorry again for the wait.

brainwane on 12 Mar 2018

👍1

Folks who need this might want to check whether the Libraries.io API for https://libraries.io/pypi might suit their needs in the short term.

brainwane on 15 Mar 2018

👍1

It doesn't look suitable - I see no means to get information on all projects in bulk. The closest search endpoint returns the right information, but packages can only be requested page by page, with a maximum page size of 100 items. Given the API rate limit of 60 requests/minute and a PyPI size of >136k packages, it would take more than 20 minutes to get all the data, which is too much. There are other issues as well:

- there's no option of getting results sorted by package name (other sort variants may lead to lost or duplicate packages because of reordering during retrieval)
- the API doesn't seem to be stable, I'm getting 500 errors when requesting pages above the 100th
- the requirement to get an API key may be unacceptable to some users (including me)
- the requirement to use a third-party site is also not good
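
The "more than 20 minutes" estimate follows directly from the numbers above; a small sketch of the arithmetic, using 136,000 as a stand-in for ">136k" (the function name is just for illustration):

```python
import math


def minutes_to_crawl(total_packages, page_size, requests_per_minute):
    """Return (pages needed, minutes at the rate limit) for a paginated crawl."""
    pages = math.ceil(total_packages / page_size)
    return pages, pages / requests_per_minute


pages, minutes = minutes_to_crawl(136_000, 100, 60)
print(pages, round(minutes, 1))  # 1360 22.7
```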

AMDmi3 on 6 Jun 2018

👍2

Hi there, just to throw in my $0.02 here - I use a utility which checks a requirements.txt file's pinned versions to see if there are newer releases on PyPI. If there are N packages in the file I currently make N requests to PyPI.

@wimglenn which util is this? I'm looking for this type of tool.

retr0h on 16 Jul 2018

I predict that work on this may depend on the progress of #284.

From December 2017 till the end of April 2018, PyPI had funding to get the new site up and running and perform the switchover. Then the grant ran out and we have, as far as I know, no one paid to work on PyPI; volunteers are maintaining and improving the software and infrastructure sides of things, but we need dedicated funding to add complex features. The Packaging Working Group is seeking donations and applying for further grants to fund more design work, more and faster development (including reviewing code contributed by volunteers), and requisite project management.

Sorry for the wait.

brainwane on 10 Aug 2018

@retr0h This was a tool developed internally at $EMPLOYER, but I've since requested permission to open source it and $EMPLOYER has agreed. It's on PyPI, so you can pip install luddite, and the project homepage is now right here on GitHub.

wimglenn on 14 Aug 2018

🎉1

process_line() in https://github.com/pypa/pip/blob/master/src/pip/_internal/req/req_file.py may be helpful for parsing requirements files; though this does not solve for "API endpoint to get latest version of all projects".

westurner on 14 Aug 2018
