This is a discussion issue at the moment. It's also a place to ask questions about how best to patch OpenRefine to support this.
Here's a link to the specification: http://www.dataprotocols.org/en/latest/simple-data-format.html
@rufuspollock We have a first working version of this, just merged. @jackyq2015 has implemented import and export for data packages distributed as Zip files, and also data packages hosted on the web as a JSON file.
The various properties of columns (types, constraints, descriptions, …) are not currently exposed in the UI - this will require more work and coordination.
I also plan to work on import/export for data packages with embedded tabular data. This might require PRs to the Java library for data packages.
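To make the formats above concrete: a data package distributed as a Zip is just a `datapackage.json` descriptor at the root plus the resource files it lists. Here is a minimal stdlib-only sketch of writing and reading such a package (file names and descriptor contents are made up for illustration; this is not the datapackage-java API):

```python
import csv
import io
import json
import zipfile

# Minimal descriptor per the Data Package spec: a name plus one
# tabular resource with an inline Table Schema.
descriptor = {
    "name": "example-package",
    "resources": [{
        "name": "cities",
        "path": "data/cities.csv",
        "schema": {
            "fields": [
                {"name": "city", "type": "string"},
                {"name": "population", "type": "integer"},
            ]
        },
    }],
}

# Write the package as a Zip: descriptor at the root, data alongside.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("datapackage.json", json.dumps(descriptor))
    zf.writestr("data/cities.csv", "city,population\nParis,2148000\n")

# An importer reads the descriptor first, then each resource it lists.
with zipfile.ZipFile(buf) as zf:
    pkg = json.loads(zf.read("datapackage.json"))
    resource = pkg["resources"][0]
    data = io.TextIOWrapper(io.BytesIO(zf.read(resource["path"])), encoding="utf-8")
    rows = list(csv.DictReader(data))

print(rows[0]["city"])  # Paris
```

The same descriptor can also be hosted on the web as a bare JSON file, with `path` entries pointing at remote CSVs, which is the other import case mentioned above.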
By the way, thanks for setting up https://github.com/frictionlessdata/test-data, this is very useful.
@wetneb awesome 👍 👏 /cc @pwalsh @vitorbaptista
Amazing @wetneb . Happy to support with reviews and discussion on the Data Package Java Library.
@wetneb So what's left to deliver the features on this issue? I'd like to see a task breakdown, please, with linked issue numbers for each remaining task.
I don't think we have issues for these yet:
Hi! I've noticed that the datapackage import & export functionality is no longer present in the latest versions (3.2 and 3.3) but it was there in 3.0. Are there any plans to re-implement this functionality? If so, is there anything you need help with to get it working?
Thanks!
Hi @lwinfree,
Yes indeed, we removed it because it relied on a non-free library, see https://github.com/frictionlessdata/datapackage-java/issues/26.
It would be great to have this back though! We do not have short-term plans to work on this but would surely welcome PRs in that direction.
In my opinion, the integration we had in 3.0 lacked vision a bit: we should think about concrete user workflows where the integration would really make a difference. As a user, how do I want to turn a messy CSV into a nice validated data package? This means thinking about the interaction between the spec's notions (such as type constraints on columns) and OpenRefine's data model, for instance. What I mean is that it's not enough to just have an importer and an exporter if the importer discards most of the interesting metadata and the exporter produces a jsonified CSV… that defeats the purpose of data packages!
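As one illustration of the interaction just described: a Table Schema field carries a type and optional constraints that an importer could check cell by cell, rather than discard. A minimal sketch, with a hypothetical `check_field` helper (not part of any real OpenRefine or datapackage-java API):

```python
def check_field(field, raw):
    """Cast a raw CSV cell against a Table Schema field; return (ok, value)."""
    field_type = field.get("type", "string")
    try:
        value = {"integer": int, "number": float, "string": str}[field_type](raw)
    except (KeyError, ValueError):
        return False, raw
    # Enforce the numeric constraints defined by the Table Schema spec.
    constraints = field.get("constraints", {})
    if "minimum" in constraints and value < constraints["minimum"]:
        return False, value
    if "maximum" in constraints and value > constraints["maximum"]:
        return False, value
    return True, value

field = {"name": "population", "type": "integer",
         "constraints": {"minimum": 0}}
print(check_field(field, "2148000"))  # (True, 2148000)
print(check_field(field, "-5"))       # (False, -5)
print(check_field(field, "many"))     # (False, 'many')
```

An exporter going the other way would do the reverse: turn OpenRefine's cell types and facet-derived knowledge into field types and constraints, instead of emitting a JSON-ified CSV.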
It might be worth looking at use cases in communities that already rely on data packages: seeing what benefit they get out of them, how they produce them, how they could use OpenRefine as part of their existing workflows, and so on.
Hi @wetneb, Thanks this is really helpful information! I work on the frictionless data team and we are interested in getting this functionality fixed. I'll keep you posted on our progress :-)
Also, yes, I agree it would be great to have use cases. One of our current tool fund grantees (https://github.com/frictionlessdata/FrictionlessDarwinCore) actually inspired this issue, as he was planning on working with OpenRefine.
Thanks for the quick response, and I'll stay in touch.
It looks like datapackage-java has been updated with a new JSON library.
https://github.com/frictionlessdata/datapackage-java/pull/35
Doesn't look like they do formal releases: https://github.com/frictionlessdata/datapackage-java/releases
so not sure how long a cooling off period we should allow before grabbing a snapshot.
I would be wary of simply restoring the previous integration with the migrated library, since it did not really enable any useful workflow for users as far as I can tell.
I would be interested to hear from @lwinfree what sort of workflow their tool grantee had in mind - that could help drive the integration in the right direction.
@wetneb I've been told that Data Packages are used in a lot of statistical & bio tools. To name a few important ones in the R lang community:
https://cran.r-project.org/web/packages/datapackage.r/index.html
https://cran.r-project.org/web/packages/dpmr/index.html
https://cran.r-project.org/web/packages/codebook/index.html
The bio/stat/scientific community is looking for more data tools that support editing metadata and help improve reproducibility by pushing good practices for publishing data, which involves producing a data dictionary, making it machine-readable, etc.
https://arxiv.org/pdf/2002.11626.pdf
The last 3 times I went to the R lang meetup in Dallas, they all asked "Does OpenRefine support adding metadata editing of the table schema yet?" My reply 3 times: "nope"
I personally kinda like the approach, used by many existing data tools, of vertically scrolling through the columns to edit metadata. It makes the metadata entry faster:

What I would like is a concrete description of a workflow in OpenRefine.
This seems to relate somewhat to the FAIR OpenRefine plugin project : https://github.com/FAIRDataTeam/OpenRefine-metadata-extension
FAIR metadata would seem to aspire to some of the same goals as Data Packages tech. I came across the FAIR plugin earlier in the year but have not had the chance to play with it much. FAIR data is very much about replicability, data rights, and data re-use. https://www.go-fair.org/fair-principles/
> I would be interested to hear from @lwinfree what sort of workflow their tool grantee had in mind - that could help drive the integration in the right direction.
@wetneb I'm not fully up to speed on the exact flow in OpenRefine itself, but the overall workflow this supports is something like the following (I imagine).
Let me know if this is the kind of thing you were looking for or not.
I imagine there are situations where OpenRefine would benefit from being able to consume data that is already described as a data package, for example, where a user has already:
- declared column types (e.g. 1989 is a year, not an integer), etc.
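The "1989 is a year, not an integer" point can be sketched concretely: the Table Schema type vocabulary includes `year` as a distinct type, so an importer that reads the descriptor can treat that column differently from a plain number column. A minimal illustration, where the `TYPE_HINTS` mapping is entirely hypothetical (not a real OpenRefine concept):

```python
import json

# A Table Schema can say a column holds years, not plain integers;
# that is exactly the metadata a bare CSV import would lose.
schema = json.loads("""
{
  "fields": [
    {"name": "title", "type": "string"},
    {"name": "released", "type": "year", "description": "Release year"}
  ]
}
""")

# Hypothetical mapping from Table Schema types to how an importer
# might treat each column; the labels are illustrative only.
TYPE_HINTS = {"string": "text", "integer": "number", "year": "date (year only)"}

hints = {f["name"]: TYPE_HINTS.get(f.get("type"), "text")
         for f in schema["fields"]}
print(hints)  # {'title': 'text', 'released': 'date (year only)'}
```

In other words, the descriptor lets the user's earlier typing work carry over into the tool instead of being re-inferred (or lost) on import.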
Tagging @andrejjh. Andre, we have updated our datapackage-java so it can theoretically be integrated with OpenRefine again. Would you be able to write a short summary of the use case you would pursue if this integration were added back in? It would help the OpenRefine team understand and prioritize. Thanks!
Thanks @rufuspollock for the workflows, they sound very sensible to me!
I do not think bringing back the integration we had before will address these. More work is needed to make these workflows possible and smooth:
The reason I'm desperately waiting for a data-package-compatible OpenRefine is this: OpenRefine is quite popular amongst the GBIF community. It is often used by data holders to prepare institutional data for publication. On the other side, GBIF data users mostly use RGBIF, but some download data in simple CSV or DwC-Archive format.

With the tool I made, any DwC-A can be converted into a data package, which enables great tools such as GoodTables and... OpenRefine, if it can ingest data packages. I'm pretty sure this functionality was supported by OpenRefine 2.x and would be greatly appreciated by the GBIF community. Ingested data packages contain enough information to be saved back after processing in OpenRefine. On top of that, OpenRefine is a powerful tool that could help people dealing with any data packages.

I hate walls, love bridges. Best regards,
André Heughebaert https://andrejjh.github.io/
So many acronyms, so few links or definitions. Here are some links to help others:
[GBIF](https://www.gbif.org/) - Global Biodiversity Information Facility - "international network and research infrastructure funded by the world's governments and aimed at providing anyone, anywhere, open access to data about all types of life on Earth."
RGBIF - R package for dealing with GBIF data (currently 404 for its web page)
DwC-A - Darwin Core Archive - "biodiversity informatics data standard that makes use of the Darwin Core terms to produce a single, self-contained dataset." Domain specific packaging/archiving (yet another). Appears to be zip file containing two XML files and one or more CSVs, all tied together.
"tool I made" - perhaps Darwin Core Archive Assistant
Now that I understand the terms, I'm not sure I'm any closer to understanding the workflow.
"prepare institutional data for publication" = ?
Hi Tom,
Sorry for assuming everybody knows all the acronyms I used.
But you found almost all of them except these two:
This is a graph explaining the complete data workflow if the missing link (data package ingestion in OpenRefine) is added: see annex.
Hope it clarifies a bit,
with the correct graph now ;-)
@andrejjh I take it you're using email rather than the web interface? If there was an attachment, it got stripped along the way.
Please go to https://github.com/OpenRefine/OpenRefine/issues/778 and add it (or just post the URL).