There are projects like https://release-monitoring.org/ which want to monitor PyPI to see when new versions of specific projects are released. Currently this requires doing 1 HTTP request per tracked project which can easily turn into hundreds or thousands of HTTP requests. Offering a JSON endpoint that simply lists the names and versions of all projects can make this take a single HTTP request.
It'd also be helpful if it was possible to search through the list with some filters. I've been meaning to add an instant answer to DuckDuckGo for Python packages, but sadly it's not possible right now as there is no search API. I can also work on it. Just wondering why it doesn't exist yet.
Would it be possible to solve this need (the use case for release-monitoring.org) instead by streaming and parsing the PyPI update changelog, similar to how bandersnatch figures out what new updates to sync on a mirror?
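For illustration, here is a rough sketch of what consuming the changelog could look like. It assumes the XML-RPC method `changelog_since_serial` returns entries shaped like `(name, version, timestamp, action, serial)`. The helper name `projects_changed_since` is made up for this example and is not part of bandersnatch or PyPI.

```python
import xmlrpc.client

# Assumption: the XML-RPC endpoint lives at this URL.
PYPI_XMLRPC_URL = "https://pypi.org/pypi"


def projects_changed_since(client, last_serial):
    """Return (names, newest_serial) for changelog entries after last_serial.

    Each entry is assumed to be (name, version, timestamp, action, serial).
    """
    names = set()
    newest_serial = last_serial
    for name, _version, _timestamp, _action, serial in client.changelog_since_serial(last_serial):
        names.add(name)  # every project touched since the last poll
        newest_serial = max(newest_serial, serial)
    return names, newest_serial
```

A monitor could create `xmlrpc.client.ServerProxy(PYPI_XMLRPC_URL)` once, store the returned serial between runs, and then re-check only the projects whose names come back, rather than polling every tracked project.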
It would also be useful for the PyCharm IDE, to notify users about outdated packages in the integrated package manager.
It could be efficient to also serve cached deltas of the whole catalog? e.g. an optional from=iso8601datetime?
@traff it seems that pulling current versions for _all_ packages on the index may be a bit overzealous. getting a list from /simple should be sufficient for offering users a dialog of installable packages... the current/latest version identifier is arguably unimportant at that point.
perhaps after a user has identified a package by name, a cheap (cached at edge) call to the json API would be successful for offering a list of installable versions. GET https://pypi.python.org/pypi/<package_name>/json
regarding offering updates, again the existing json api would support a call per package to get current versions using the same call as above. these calls are cached at the edge, should be fast for end users, and are 100% fair game for community use.
i suppose my question is how is 1 long (2-10s) request to obtain a list of all package/versions better than N 200-500ms requests (which can be submitted concurrently) in order to provide information on current versions and available updates.
to be clear, i'm not opposed to supporting the specific feature request in this issue, but i am trying to aid in guiding PyCharm off of the index page they are currently scraping for this information.
that page is incredibly expensive to generate and often causes congestion for the PyPI backends.
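As a sketch of the "N concurrent requests" approach described above: the following uses the per-package JSON URL quoted earlier and reads the latest version from `info.version`. The function names and the worker count are illustrative assumptions, not anything PyPI prescribes.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

JSON_URL = "https://pypi.python.org/pypi/{name}/json"


def fetch_latest_version(name):
    """Fetch one package's JSON document and return its latest version string."""
    with urlopen(JSON_URL.format(name=name), timeout=10) as response:
        data = json.load(response)
    return data["info"]["version"]


def latest_versions(names, fetch=fetch_latest_version, max_workers=8):
    """Look up many packages concurrently; returns {name: latest_version}."""
    names = list(names)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with names.
        versions = list(pool.map(fetch, names))
    return dict(zip(names, versions))
```

Calling `latest_versions(["pip", "pytest", "virtualenv"])` would issue three requests in parallel. The `fetch` parameter is only there so the lookup can be swapped out, for example in tests.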
@ewdurbin Yes, that makes sense. The only doubt we had was that we thought making N requests would be worse than making one. But in this case, it could be the opposite.
Relevant mailing list threads:
One thing like this could allow for is providing a JSON API for this info instead of having to parse the HTML of the /simple page to just get a list of projects.
So, what is the objective here:
On Aug 6, 2016 3:18 PM, "Brett Cannon" [email protected] wrote:
One thing like this could allow for is providing a JSON API for this info
instead of having to parse the HTML of the /simple page to just get a list
of projects.
So, caching and conditional requests (with etags) would probably be the
only way to afford this functionality (SELECT * FROM packages WHERE package
IN {pip, pytest, virtualenv, scripttest, mock, pretend})
Seemingly OT, but this may be the best guide to REST API cache management
(cache keys, etags, invalidation) I've ever read:
http://chibisov.github.io/drf-extensions/docs/#caching
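On the client side, a conditional request with ETags might look roughly like the sketch below. It assumes the server sends an `ETag` header and answers `304 Not Modified` to a matching `If-None-Match`. The function name `fetch_with_etag`, the plain dict cache, and the `opener` parameter are all illustrative.

```python
import urllib.error
import urllib.request


def fetch_with_etag(url, cache, opener=urllib.request.urlopen):
    """Return the body for url, reusing the cached copy on 304 Not Modified.

    cache maps url -> (etag, body); a plain dict is enough for a sketch.
    """
    headers = {}
    cached = cache.get(url)
    if cached is not None:
        headers["If-None-Match"] = cached[0]
    request = urllib.request.Request(url, headers=headers)
    try:
        with opener(request) as response:
            body = response.read()
            etag = response.headers.get("ETag")
    except urllib.error.HTTPError as error:
        # urllib reports 304 as an HTTPError; it means our copy is still valid.
        if error.code == 304 and cached is not None:
            return cached[1]
        raise
    if etag:
        cache[url] = (etag, body)
    return body
```

With this shape, repeated polls of an unchanged resource cost only a header exchange, which is the property that would make a heavily requested endpoint affordable to serve.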
Hi there, just to throw in my $0.02 here - I use a utility which checks a requirements.txt file's pinned versions to see if there are newer releases existing on PyPI. If there are N packages in the file I currently make N requests to PyPI. I make a GET at, for example, https://pypi.python.org/pypi/django/json - which is a fairly large response (last time I checked, content-length: 128597) whilst the only data I'm actually interested in is under the key data['info'] to see the latest version details. Most of the response bytes are in data['releases'] which I just ignore.
It would be very helpful if there was an API endpoint which could return just this latest version info for the package. Or, even better, for a user-specified list of packages - so I could make just one request instead of N requests. Thanks!
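A minimal sketch of that kind of check, for readers unfamiliar with it: it handles only simple `name==version` pins, and it flags a package whenever the latest version string differs from the pin rather than doing a real version comparison. All function names here are hypothetical, and this is not the utility mentioned above.

```python
import json
from urllib.request import urlopen


def parse_pins(lines):
    """Extract {name: pinned_version} from simple 'name==version' lines."""
    pins = {}
    for line in lines:
        # Drop comments and environment markers, then surrounding whitespace.
        line = line.split("#", 1)[0].split(";", 1)[0].strip()
        if "==" not in line:
            continue  # unpinned or unsupported requirement syntax
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def fetch_latest(name):
    """One GET per package; only data['info']['version'] is used."""
    with urlopen(f"https://pypi.python.org/pypi/{name}/json", timeout=10) as response:
        return json.load(response)["info"]["version"]


def outdated(pins, fetch=fetch_latest):
    """Return {name: (pinned, latest)} where the latest version differs from the pin."""
    result = {}
    for name, pinned in pins.items():
        latest = fetch(name)
        if latest != pinned:
            result[name] = (pinned, latest)
    return result
```

Each call to `fetch_latest` downloads the full per-package document even though a single field is read, which is the overhead the request above is about.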
Would love to be able to get checksums for the releases from this API somehow.
@jakirkham Checksum for what? The files?
Sorry for the delay. Yes, for the files.
Added issue ( https://github.com/pypa/warehouse/issues/1638 ) for getting the checksums via the API.
+1 here, the same is needed for repology.org, from which I've had to remove PyPI support since https://pypi.python.org/pypi/, which it used, got deprecated.
So, it'd be nice to have a (not necessarily realtime, regenerated hourly is ok for my purposes) machine-readable dump of all PyPI packages with versions and preferably other metadata such as summaries and licenses.
Example of what I'd like to have for repology:
[
{
"name": "requests",
"version": "2.18.1",
"summary": "Python HTTP for Humans."
}
...
]
I recognize that this is beyond the scope of this particular issue, but,
how is this problem different from BitTorrent w/ HTTP web seeds?
https://en.wikipedia.org/wiki/BitTorrent#Web_seeding
On Thursday, August 10, 2017, Wes Turner wes.turner@gmail.com wrote:
>
I recognize that this is beyond the scope of this particular issue, but,
how is this problem different from BitTorrent w/ HTTP web seeds?
https://en.wikipedia.org/wiki/BitTorrent#Web_seeding
- manifest
- hashes
- mirroring
(In context to mirroring with e.g. bandersnatch).
It could be efficient to also serve cached deltas of the whole catalog? e.g. an optional from=iso8601datetime?
- Is there something like DRPMS or rsync over HTTP that could reasonably be added in a view and generated in a celery WarehouseTask?
- etags, If-Modified-Since
Zsync is like rsync over HTTP
http://zsync.moria.org.uk/
zsync provides transfers that are nearly as efficient as rsync -z or cvsup, without the need to run a special server application. All that is needed is an HTTP/1.1-compliant web server.
[...]
Single meta-file: zsync downloads are offered by building a .zsync file, which contains the meta-data needed by zsync. This file contains the precalculated checksums for the rsync algorithm; it is generated on the server, once, and is then used by any number of downloaders.
I'm grateful for the discussion here and apologize for the slow response.
There's now an open issue #1478 for getting a regular dump of the PyPI database, and other open issues tagged as "APIs/feeds". And now, https://warehouse.readthedocs.io/api-reference/ has a bunch more guidance on how developers can use the Warehouse APIs (RSS feeds, JSON, /simple/ emulation of the legacy API, and XML-RPC methods) already available at https://pypi.org .
The folks working on Warehouse have gotten funding to concentrate on improving and deploying Warehouse, and have kicked off work towards our development roadmap -- the most urgent task is to improve Warehouse to the point where we can redirect pypi.python.org to pypi.org so the site is more sustainable and reliable, and shut down the legacy site. We discussed this issue in our core developers' meeting today. Since this feature isn't something that the legacy site has, I've moved it to a future milestone.
I would be pleased to add this issue to the list of things people work on at this year's PyCon sprints if folks are interested.
Thanks and sorry again for the wait.
Folks who need this might want to check whether the Libraries.io API for https://libraries.io/pypi might suit their needs in the short term.
It doesn't look suitable - I see no means to bulk get information on all projects. The closest search endpoint returns the right information, but it's only possible to request packages page by page and the maximum page size is 100 items, which, given the API rate limit of 60 requests/minute and PyPI's size of >136k packages, means more than 20 minutes are needed to get all the data - this is too much. There are other issues as well:
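The time estimate above can be checked with a few lines of arithmetic, using only the figures quoted in the comment (136k packages, 100 items per page, 60 requests per minute). The function name is illustrative.

```python
import math


def minutes_to_page_through(total_items, page_size, requests_per_minute):
    """Return (pages, minutes) needed to page through a rate-limited listing."""
    pages = math.ceil(total_items / page_size)
    return pages, pages / requests_per_minute


pages, minutes = minutes_to_page_through(136_000, 100, 60)
print(pages, round(minutes, 1))  # 1360 pages, about 22.7 minutes
```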
Hi there, just to throw in my $0.02 here - I use a utility which checks a requirements.txt file's pinned versions to see if there are newer releases existing on PyPI. If there are N packages in the file I currently make N requests to PyPI.
@wimglenn which util is this? I'm looking for this type of tool.
I predict that work on this may depend on the progress of #284.
From December 2017 till the end of April 2018, PyPI had funding to get the new site up and running and perform the switchover. Then the grant ran out and we have, as far as I know, no one paid to work on PyPI; volunteers are maintaining and improving the software and infrastructure sides of things, but we need dedicated funding to add complex features. The Packaging Working Group is seeking donations and applying for further grants to fund more design work, more and faster development (including reviewing code contributed by volunteers), and requisite project management.
Sorry for the wait.
process_line() in https://github.com/pypa/pip/blob/master/src/pip/_internal/req/req_file.py may be helpful for parsing requirements files; though this does not solve for "API endpoint to get latest version of all projects".