Warehouse: Public Dataset for distribution metadata

Created on 18 Feb 2020 · 11 Comments · Source: pypa/warehouse

As requested in https://github.com/pypa/packaging-problems/issues/323, we should explore publishing the metadata for each released distribution in a public dataset via BigQuery.

I'm imagining that each row would contain all the core metadata fields included in each release, as well as filename, digests, file size, upload time, URL to the distribution, etc. Essentially everything in the "Release" JSON API, with the per-release info field included for every individual distribution.
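A minimal sketch of the row shape described above, assuming the input looks like the per-release JSON API response (an `info` object plus a `urls` list with one entry per distribution). The function name `rows_for_release` and the sample data are hypothetical, and the real dataset schema may differ.

```python
from typing import Any


def rows_for_release(release: dict[str, Any]) -> list[dict[str, Any]]:
    """Flatten one release document into one row per distribution file."""
    info = release["info"]
    rows = []
    for dist in release["urls"]:
        # Copy the per-release metadata so it is repeated on every row.
        row = dict(info)
        # Add the per-file fields mentioned in the proposal.
        row.update(
            filename=dist["filename"],
            digests=dist["digests"],
            size=dist["size"],
            upload_time=dist["upload_time"],
            url=dist["url"],
        )
        rows.append(row)
    return rows


# Hand-written sample for illustration only, not real PyPI data.
sample = {
    "info": {"name": "example", "version": "1.0", "summary": "demo"},
    "urls": [
        {
            "filename": "example-1.0.tar.gz",
            "digests": {"sha256": "abc"},
            "size": 10,
            "upload_time": "2020-02-18T00:00:00",
            "url": "https://files.example.invalid/example-1.0.tar.gz",
        },
        {
            "filename": "example-1.0-py3-none-any.whl",
            "digests": {"sha256": "def"},
            "size": 12,
            "upload_time": "2020-02-18T00:00:01",
            "url": "https://files.example.invalid/example-1.0-py3-none-any.whl",
        },
    ],
}

rows = rows_for_release(sample)
```

Each element of `rows` carries the full release metadata plus the fields specific to one file, so a sdist and a wheel from the same release become two rows.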

Once we're publishing to the dataset on upload, we'd also need to backfill prior distributions.

Not entirely sure what we'd name it. Does the-psf:pypi.distributions make sense?

feature request

All 11 comments

One problem with distributing via BigQuery is that it adds barriers to accessing the data, although it is sometimes useful to run a quick query on the data without setting anything up.

What will the size of the metadata dump be? I think it should only be a few GB. Can't it be distributed via alternate channels?

@ChillarAnand you bring up a great point about barrier to entry, but as far as I'm aware there aren't really any good "requestor pays" options for online queryable datasets aside from BigQuery. We could publish it as a single file, but I'm not sure how much less of a barrier that is.

In addition, when combined with the existing data that we already have in BigQuery, this metadata would provide all kinds of interesting options for analyzing downloads.
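As a rough illustration of the kind of analysis this would allow, the sketch below joins download events to metadata rows on filename and counts downloads per `requires_python` value. Everything here is hypothetical: the data is made up, it runs in memory rather than in BigQuery, and the real tables may use different column names.

```python
from collections import Counter

# Hypothetical metadata rows, one per distribution file.
metadata = [
    {"filename": "example-1.0.tar.gz", "requires_python": ">=3.6"},
    {"filename": "example-0.9.tar.gz", "requires_python": None},
]

# Hypothetical download events, one per download.
downloads = [
    {"filename": "example-1.0.tar.gz"},
    {"filename": "example-1.0.tar.gz"},
    {"filename": "example-0.9.tar.gz"},
    {"filename": "unknown-2.0.tar.gz"},  # no matching metadata row
]


def downloads_by_requires_python(metadata, downloads):
    """Count downloads grouped by the requires_python of the downloaded file."""
    by_filename = {row["filename"]: row["requires_python"] for row in metadata}
    counts = Counter()
    for event in downloads:
        filename = event["filename"]
        # Inner join: skip downloads that have no metadata row.
        if filename in by_filename:
            counts[by_filename[filename]] += 1
    return counts


counts = downloads_by_requires_python(metadata, downloads)
```

In BigQuery the same idea would be a join between the downloads table and the new metadata table, which is something the download data alone cannot answer.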

Another thought is how we handle when releases and such are deleted, should they be removed from the public dataset? If the public dataset mirrors PyPI's database 1:1, deletions would be a real headache for people doing retrospective analysis.

> Another thought is how we handle when releases and such are deleted, should they be removed from the public dataset?

IMO, they should not.

@ewdurbin I agree. I was wondering whether, if the dump is small, we could also distribute it via Google Drive, Dropbox, or other channels. That would make it easy to play with the data offline.

@ChillarAnand Ultimately I'm not sure the limited volunteer admin time can be spent maintaining two sources, but the dataset is permissively licensed under a Creative Commons Attribution 4.0 International License, so redistributing dumps of the metadata discussed here would be 100% OK.

For the time being I'm going to work under the assumption that we'll be outputting to BigQuery.

Actually, this brings up a possible concern with licensing. We'd need to be careful to ensure that what is published in these tables _can_ be licensed under Creative Commons.

This may prevent us from including some fields, like description/description_html.

Leaving this open until the dataset has been fully backfilled and is ready for use. We should probably update our documentation about the datasets as well.

Quick update here: this has been enabled for PyPI and TestPyPI. The backfilling has been completed for TestPyPI and is in progress for PyPI.

Once the backfilling is complete, we can merge https://github.com/pypa/warehouse/pull/8240 (documentation updates) and this issue should be complete.

Could it be exported to Wikidata?

> We could publish it as a single file, but I'm not sure how much less of a barrier that is.

Please do so, because it actually is much less of a barrier. BigQuery is not an option at all, as it makes the data unavailable to people without a Google account.
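For reference, one way a single-file dump could look is gzip-compressed JSON Lines, with one metadata row per line. This is only a sketch under that assumption: the thread does not settle on a file format, and the helper names `dump_rows` and `load_rows` and the sample rows are hypothetical.

```python
import gzip
import io
import json


def dump_rows(rows, fileobj):
    """Write rows as gzip-compressed JSON Lines to a binary file object."""
    with gzip.GzipFile(fileobj=fileobj, mode="wb") as gz:
        for row in rows:
            line = json.dumps(row, sort_keys=True) + "\n"
            gz.write(line.encode("utf-8"))


def load_rows(fileobj):
    """Read rows back from a gzip-compressed JSON Lines file object."""
    with gzip.GzipFile(fileobj=fileobj, mode="rb") as gz:
        text = gz.read().decode("utf-8")
    return [json.loads(line) for line in text.splitlines()]


# Made-up rows for illustration only.
rows = [
    {"name": "example", "version": "1.0", "filename": "example-1.0.tar.gz"},
    {"name": "example", "version": "0.9", "filename": "example-0.9.tar.gz"},
]

# An in-memory buffer stands in for a file on disk.
buffer = io.BytesIO()
dump_rows(rows, buffer)
buffer.seek(0)
restored = load_rows(buffer)
```

A file like this can be read offline with only the standard library, without any account.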
