This is a discussion issue at the moment. It's also a place to ask questions about how best to patch OpenRefine to support this.
Here's a link to the specification: http://www.dataprotocols.org/en/latest/simple-data-format.html
@rufuspollock We have a first working version of this, just merged. @jackyq2015 has implemented import and export for data packages distributed as Zip files, and also data packages hosted on the web as a JSON file.
The various properties of columns (types, constraints, descriptions, …) are not currently exposed in the UI - this will require more work and coordination.
I also plan to work on import/export for data packages with embedded tabular data. This might require PRs to the Java library for data packages.
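To make the formats above concrete: a data package distributed as a Zip is just a `datapackage.json` descriptor at the root plus the resource files it lists. Here is a minimal stdlib-only sketch of writing and reading such a package (file names and descriptor contents are made up for illustration; this is not the datapackage-java API):

```python
import csv
import io
import json
import zipfile

# Minimal descriptor per the Data Package spec: a name plus one
# tabular resource with an inline Table Schema.
descriptor = {
    "name": "example-package",
    "resources": [{
        "name": "cities",
        "path": "data/cities.csv",
        "schema": {
            "fields": [
                {"name": "city", "type": "string"},
                {"name": "population", "type": "integer"},
            ]
        },
    }],
}

# Write the package as a Zip: descriptor at the root, data alongside.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("datapackage.json", json.dumps(descriptor))
    zf.writestr("data/cities.csv", "city,population\nParis,2148000\n")

# An importer reads the descriptor first, then each resource it lists.
with zipfile.ZipFile(buf) as zf:
    pkg = json.loads(zf.read("datapackage.json"))
    resource = pkg["resources"][0]
    data = io.TextIOWrapper(io.BytesIO(zf.read(resource["path"])), encoding="utf-8")
    rows = list(csv.DictReader(data))

print(rows[0]["city"])  # Paris
```

The same descriptor can also be hosted on the web as a bare JSON file, with `path` entries pointing at remote CSVs, which is the other import case mentioned above.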
By the way, thanks for setting up https://github.com/frictionlessdata/test-data, this is very useful.
@wetneb awesome 👍 👏 /cc @pwalsh @vitorbaptista
Amazing @wetneb . Happy to support with reviews and discussion on the Data Package Java Library.
@wetneb So what's left to deliver the features on this issue? I'd like to see a task breakdown, please, with linked issue numbers for each remaining task.
I don't think we have issues for these yet:
Hi! I've noticed that the datapackage import & export functionality is no longer present in the latest versions (3.2 and 3.3) but it was there in 3.0. Are there any plans to re-implement this functionality? If so, is there anything you need help with to get it working?
Thanks!
Hi @lwinfree,
Yes indeed, we removed it because it relied on a non-free library, see https://github.com/frictionlessdata/datapackage-java/issues/26.
It would be great to have this back though! We do not have short-term plans to work on this but would surely welcome PRs in that direction.
In my opinion, the integration we had in 3.0 lacked vision a bit: we should think about concrete user workflows where the integration would really make a difference. As a user, how do I want to turn a messy CSV into a nice validated data package? This means thinking about the interaction between the spec's notions (such as type constraints on columns) and OpenRefine's data model, for instance. What I mean is that it's not enough to just have an importer and an exporter if the importer discards most of the interesting metadata and the exporter produces a jsonified CSV… that defeats the purpose of data packages!
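As one illustration of the interaction just described: a Table Schema field carries a type and optional constraints that an importer could check cell by cell, rather than discard. A minimal sketch, with a hypothetical `check_field` helper (not part of any real OpenRefine or datapackage-java API):

```python
def check_field(field, raw):
    """Cast a raw CSV cell against a Table Schema field; return (ok, value)."""
    field_type = field.get("type", "string")
    try:
        value = {"integer": int, "number": float, "string": str}[field_type](raw)
    except (KeyError, ValueError):
        return False, raw
    # Enforce the numeric constraints defined by the Table Schema spec.
    constraints = field.get("constraints", {})
    if "minimum" in constraints and value < constraints["minimum"]:
        return False, value
    if "maximum" in constraints and value > constraints["maximum"]:
        return False, value
    return True, value

field = {"name": "population", "type": "integer",
         "constraints": {"minimum": 0}}
print(check_field(field, "2148000"))  # (True, 2148000)
print(check_field(field, "-5"))       # (False, -5)
print(check_field(field, "many"))     # (False, 'many')
```

An exporter going the other way would do the reverse: turn OpenRefine's cell types and facet-derived knowledge into field types and constraints, instead of emitting a JSON-ified CSV.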
It might be worth looking at use cases in communities that already rely on data packages: seeing what benefit they get out of them, how they produce them, how they could use OpenRefine as part of their existing workflows, and so on.
Hi @wetneb, Thanks this is really helpful information! I work on the frictionless data team and we are interested in getting this functionality fixed. I'll keep you posted on our progress :-)
Also, yes, I agree it would be great to have use cases. One of our current tool fund grantees (https://github.com/frictionlessdata/FrictionlessDarwinCore) actually inspired this issue, as he was planning on working with OpenRefine.
Thanks for the quick response, and I'll stay in touch.
It looks like datapackage-java has been updated with a new JSON library.
https://github.com/frictionlessdata/datapackage-java/pull/35
Doesn't look like they do formal releases: https://github.com/frictionlessdata/datapackage-java/releases
so not sure how long a cooling off period we should allow before grabbing a snapshot.
I would be wary of simply restoring the previous integration with the migrated library, since it did not really enable any useful workflow for users as far as I can tell.
I would be interested to hear from @lwinfree what sort of workflow their tool grantee had in mind - that could help drive the integration in the right direction.
@wetneb I've been told that Data Packages are used in a lot of statistical & bio tools. To name a few important ones in the R lang community:
https://cran.r-project.org/web/packages/datapackage.r/index.html
https://cran.r-project.org/web/packages/dpmr/index.html
https://cran.r-project.org/web/packages/codebook/index.html
The bio/stat/scientific community is looking for more data tools that support editing metadata and help improve reproducibility by pushing good practices for publishing data, which involves producing a data dictionary, making it machine-readable, etc.
https://arxiv.org/pdf/2002.11626.pdf
The last 3 times I went to the R lang meetup in Dallas, they all asked "Does OpenRefine support adding metadata editing of the table schema yet?" My reply 3 times: "nope"
I personally kinda like the approach, used by many existing data tools, of vertically scrolling through the columns to edit metadata. It makes the metadata entry faster:

What I would like is a concrete description of a workflow in OpenRefine.
This seems to relate somewhat to the FAIR OpenRefine plugin project : https://github.com/FAIRDataTeam/OpenRefine-metadata-extension
FAIR metadata would seem to aspire to some of the same goals as Data Packages tech. I came across the FAIR plugin earlier in the year but have not had the chance to play with it much. FAIR data is very much about replicability, data rights, and data re-use. https://www.go-fair.org/fair-principles/
> I would be interested to hear from @lwinfree what sort of workflow their tool grantee had in mind - that could help drive the integration in the right direction.
@wetneb I'm not fully up to speed on the exact flow in OpenRefine itself, but the overall workflow this supports is something like the following (I imagine).
Let me know if this is the kind of thing you were looking for or not.
I imagine there are situations where OpenRefine would benefit from being able to consume data that is already described as a data package, for example, where a user has already:
- declared column types (e.g. 1989 is a year, not an integer), etc.
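The "1989 is a year, not an integer" point can be sketched concretely: the Table Schema type vocabulary includes `year` as a distinct type, so an importer that reads the descriptor can treat that column differently from a plain number column. A minimal illustration, where the `TYPE_HINTS` mapping is entirely hypothetical (not a real OpenRefine concept):

```python
import json

# A Table Schema can say a column holds years, not plain integers;
# that is exactly the metadata a bare CSV import would lose.
schema = json.loads("""
{
  "fields": [
    {"name": "title", "type": "string"},
    {"name": "released", "type": "year", "description": "Release year"}
  ]
}
""")

# Hypothetical mapping from Table Schema types to how an importer
# might treat each column; the labels are illustrative only.
TYPE_HINTS = {"string": "text", "integer": "number", "year": "date (year only)"}

hints = {f["name"]: TYPE_HINTS.get(f.get("type"), "text")
         for f in schema["fields"]}
print(hints)  # {'title': 'text', 'released': 'date (year only)'}
```

In other words, the descriptor lets the user's earlier typing work carry over into the tool instead of being re-inferred (or lost) on import.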
Tagging @andrejjh. Andre, we have updated our datapackage-java so it can theoretically be integrated with OpenRefine again. Would you be able to write a short summary of the use case you would pursue if this integration were added back in? It would help the OpenRefine team understand and prioritize. Thanks!
Thanks @rufuspollock for the workflows, they sound very sensible to me!
I do not think bringing back the integration we had before will address these. More work is needed to make these workflows possible and smooth:
The reason I'm desperately waiting for a data-package-compatible OpenRefine is this: OpenRefine is quite popular amongst the GBIF community. It is often used by data holders to prepare institutional data for publication. On the other side, GBIF data users mostly use RGBIF, but some download data in simple CSV or DwC-Archive format.

With the tool I made, any DwC-A can be converted into a data package, which enables great tools such as GoodTables and... OpenRefine, if it can ingest data packages. I'm pretty sure this functionality was supported by OpenRefine 2.x and would be greatly appreciated by the GBIF community. Ingested data packages contain enough information to be saved back after processing in OpenRefine. On top of that, OpenRefine is a powerful tool that could help people dealing with any data packages.

I hate walls, love bridges. Best regards,
André Heughebaert https://andrejjh.github.io/
So many acronyms, so few links or definitions. Here are some links to help others:
[GBIF](https://www.gbif.org/) - Global Biodiversity Information Facility - "international network and research infrastructure funded by the world's governments and aimed at providing anyone, anywhere, open access to data about all types of life on Earth."
RGBIF - R package for dealing with GBIF data (currently 404 for its web page)
DwC-A - Darwin Core Archive - "biodiversity informatics data standard that makes use of the Darwin Core terms to produce a single, self-contained dataset." Domain specific packaging/archiving (yet another). Appears to be zip file containing two XML files and one or more CSVs, all tied together.
"tool I made" - perhaps Darwin Core Archive Assistant
Now that I understand the terms, I'm not sure I'm any closer to understanding the workflow.
"prepare institutional data for publication" = ?
Hi Tom,
Sorry for assuming everybody knows all the acronyms I used.
But you found almost all of them except these two:
This is a graph explaining the complete data workflow if the missing link (data package ingestion in OpenRefine) is added: see annex.
Hope it clarifies a bit,
with the correct graph now ;-)
@andrejjh I take it you're using email rather than the web interface? If there was an attachment, it got stripped along the way.
Please go to https://github.com/OpenRefine/OpenRefine/issues/778 and add it (or just post the URL).