OpenRefine: Add column from Wikidata

Created on 11 Mar 2017 · 13 Comments · Source: OpenRefine/OpenRefine

The Freebase extension for OpenRefine had a nice feature: once you had reconciled a column to Freebase, you could fetch data from Freebase using the properties associated to the items.

https://youtu.be/5tsyz3ibYzk

It is currently possible to do that for Wikidata, using the "Add column by fetching URLs" feature. The reconciliation endpoint provides an (undocumented) API to help you do that.

https://tools.wmflabs.org/openrefine-wikidata/en/fetch_values?item=Q1377&prop=P856

With the following parameters:

item=Q1377 gives the item to fetch the value from
prop=P856 gives the property storing the value
flat=true returns the plain value instead of a JSON payload
label=false can be used when the property points to an item and we want to retrieve the identifier instead of the label
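As a rough illustration of how these parameters combine, here is a minimal Python sketch that only builds the request URL (it does not call the service). The helper name `build_fetch_values_url` and its defaults are assumptions made for this example; the thread only states what `flat=true` and `label=false` do, so the sketch simply omits each parameter when it is not requested.

```python
from urllib.parse import urlencode

# Base URL copied from the example above; the helper below is hypothetical.
FETCH_VALUES_URL = "https://tools.wmflabs.org/openrefine-wikidata/en/fetch_values"


def build_fetch_values_url(item, prop, flat=False, label=True):
    """Build a fetch_values URL for one item and one property."""
    params = {"item": item, "prop": prop}
    if flat:
        # Ask for the plain value instead of a JSON payload.
        params["flat"] = "true"
    if not label:
        # Ask for the identifier of the target item instead of its label.
        params["label"] = "false"
    return FETCH_VALUES_URL + "?" + urlencode(params)


url = build_fetch_values_url("Q1377", "P856", flat=True)
print(url)
```

In OpenRefine itself, the same URL would be assembled per row in the "Add column by fetching URLs" dialog, using the identifier of the reconciled cell as the `item` value.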

This works well, but is clearly not as user-friendly as the Freebase interface! So I see two options:

either we migrate the Freebase extension to Wikidata
or we recognize that fetching data associated with ids is a fairly generic use case, so we augment the reconciliation API with an additional endpoint to do that. This would enable other reconciliation endpoints to implement the feature and have it nicely integrated in OpenRefine.

What do you think?

P.S: bounties welcome!

enhancement reconciliation

wetneb

👍2

All 13 comments

I vote for option 2: make it a generic use case. There are a lot of
reconciliation endpoints out there, and I think the community will greatly
appreciate it.

magdmartin on 11 Mar 2017

👍2

@wetneb yes, option 2. I'm not sure whether some of the older recon services, like VIAF, still provide the service, but that shouldn't stop us from pursuing our goal of extending a user's data with additional data. https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation-Service-API#examples

Just to make sure you're looking at the right places as well, Antonin...

2.6-rc1 was the last time we had the Freebase extension fully included here: https://github.com/OpenRefine/OpenRefine/tree/v2.6-rc1/extensions
and the start of the operation was called ExtendDataOperation.java, btw.

I snipped the following from our wiki as a bit of provenance as well... https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation-Service-API

-- snip --
Because the reconciliation service only returns a single ID, getting additional column values could be done one of two ways:

The Freebase way using a separate command a la "Add column by fetching from Freebase" (perhaps just using "Add column by fetching URL" against a local REST service)
By defining multiple "types" in the reconciliation service mapped to the same table, but returning different columns as the ID. In Mr/Ms 9er's example, you could have:

CA Corporation ID returning the id column value as the ID
CA Corporation CorpID returning the corpid column value as the ID

both mapped to the same table. The user would then choose the "type" that they want based on which ID they want returned.
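The second approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table contents, the `TYPE_TO_ID_COLUMN` mapping, and the `reconcile` function are all hypothetical, and the exact-match lookup stands in for whatever matching a real service would do.

```python
# Hypothetical table of corporations with two identifier columns.
CORPORATIONS = [
    {"id": 1, "corpid": "C0001", "name": "Acme Widgets"},
    {"id": 2, "corpid": "C0002", "name": "Globex"},
]

# Each reconciliation "type" maps to the same table but to a different ID column.
TYPE_TO_ID_COLUMN = {
    "CA Corporation ID": "id",
    "CA Corporation CorpID": "corpid",
}


def reconcile(name, type_name):
    """Return matching candidates, using the ID column tied to the chosen type."""
    id_column = TYPE_TO_ID_COLUMN[type_name]
    return [
        {"id": str(row[id_column]), "name": row["name"]}
        for row in CORPORATIONS
        if row["name"].lower() == name.lower()
    ]


print(reconcile("acme widgets", "CA Corporation ID"))
print(reconcile("acme widgets", "CA Corporation CorpID"))
```

The same row is matched in both calls; only the value returned as the ID changes with the type.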

thadguidry on 11 Mar 2017

👍1

I just pinged @codeforkjeff and @nickynicolson on Twitter for their input. The conciliator and RBGKEW Reconciliation-and-Matching-Framework projects will definitely benefit from it.

magdmartin on 13 Mar 2017

Hi, thank you for the ping! This definitely sounds useful, and I would be happy to add support for a new endpoint to conciliator for VIAF. I guess the details to be figured out are: how to provide a list of available properties to OpenRefine (should the service metadata be extended?) and making sure that the new endpoint supports a "multiple query mode" for better performance.

I can also try implementing some "test ideas" for this new API in a branch, if anyone wants to test alpha/beta code for this feature against a live service. Let's use email, gitter, or slack, to coordinate that?

codeforkjeff on 14 Mar 2017

@codeforkjeff Yes, we need to design the API for essentially two things:

Retrieving the list of properties that can be fetched from a reconciled column. This could take as parameter the type we reconciled the column against, and/or a few sample reconciled identifiers. In the case of Wikidata, the proposed property list will never be complete, so we need to let users set their own properties using the existing suggest features.
Retrieving the column itself. One HTTP request should cover a batch of rows.

I think these new functionalities should be exposed in two new endpoints that the service metadata would point to.
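To make the two endpoints concrete, here is a minimal in-memory sketch in Python. It is not the API that was eventually specified: the function names `propose_properties` and `extend`, the data shapes, the type name `example-type`, and the placeholder value are all assumptions for illustration.

```python
# Hypothetical data standing in for a real service's backing store.
PROPERTIES_BY_TYPE = {
    "example-type": [{"id": "P856", "name": "official website"}],
}
VALUES = {
    "Q1377": {"P856": ["https://example.org/"]},  # placeholder value, not real data
}


def propose_properties(type_id):
    """First endpoint: list the properties that can be fetched for a type."""
    return PROPERTIES_BY_TYPE.get(type_id, [])


def extend(ids, property_ids):
    """Second endpoint: return values for a whole batch of ids in one call."""
    return {
        item_id: {
            prop: VALUES.get(item_id, {}).get(prop, [])
            for prop in property_ids
        }
        for item_id in ids
    }


print(propose_properties("example-type"))
print(extend(["Q1377", "Q999"], ["P856"]))
```

Because `extend` accepts any property identifier, not only the proposed ones, a user can still request properties that the first endpoint did not list.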

About discussing things, how about the OpenRefine mailing list? (I am also happy with this GitHub issue)

wetneb on 14 Mar 2017

@wetneb Sounds good. I see what you mean about users needing to set their own properties. Yes, the mailing list works for me. I've just subscribed to it and will keep a lookout for developments on this front.

codeforkjeff on 14 Mar 2017

@codeforkjeff @magdmartin As announced on the mailing list, I have drafted an API specification here:
https://github.com/OpenRefine/OpenRefine/wiki/Data-Extension-API
Comments welcome!

wetneb on 6 Jul 2017

@wetneb quick question: in the example response shown in the API spec, there is an "ids" key in addition to "rows" and "meta", but this key isn't described. Should that really be part of the response or is that a copy and paste error?

codeforkjeff on 20 Jul 2017

@codeforkjeff that is definitely a copy and paste error, sorry about that!

wetneb on 20 Jul 2017

👍1

@wetneb Another option for standardizing would be to not use REST and instead work on adding support for GraphQL to eliminate extra round trips. This also aligns with how the MQL query engine worked in Freebase and still does at Facebook and the Financial Times: query for the exact data that you need. These articles explain it nicely: https://medium.freecodecamp.org/rest-apis-are-rest-in-peace-apis-long-live-graphql-d412e559d8e4 and https://code.facebook.com/posts/1691455094417024/graphql-a-data-query-language/

thadguidry on 25 Jul 2017

@thadguidry GraphQL is fantastic, but as I explained on the mailing list, I don't think it would be a good choice for this protocol. Here are the reasons. (Some of them are new!)

The reconciliation API already uses a JSON-based format (not sure we can really call that REST), so it makes more sense to extend it in a way that is consistent with the rest (no pun intended). Moreover, we need other endpoints such as the one for property proposal: these are outside the scope of GraphQL (or at least it would be quite weird to construe them in this language, I think).
GraphQL is hard to implement for service providers: although quite a few bindings are provided for various languages, this is quite heavy machinery to plug into reconciliation services, which are usually fairly simple services.
GraphQL makes it a lot harder to ensure that batches of queries will be evaluated efficiently by the service. With the current API, if you run a reconciliation service from a SQL database (for instance), it is straightforward to convert a data extension query to a single efficient SQL query that will return all the results you need. If you had to write a full GraphQL server, good luck with that. (For instance, graphene, the standard GraphQL binding for Python, uses aiodataloader to ensure that, which requires the native asynchronous I/O capabilities of Python 3.5. That's not widely available.)
Requiring services to implement GraphQL would go against the goal of presenting users with only a very simple interface that does not expose the query language. Sure, we would be able to build such an interface, but it would be frustrating for the service providers to comply with such a rich API only to be called with a very particular form of query (which happens to be hard to evaluate efficiently).
Sure, GraphQL "eliminates extra round trips", but so does this API draft! And similarly, the current API also "queries for the exact data that you need" and nothing else. So using GraphQL would not bring any efficiency gain at all (in fact, as explained above, it would make it harder for data providers to evaluate queries efficiently).
It would also introduce an asymmetry between the notion of properties the reconciliation API currently uses (an ID and a name), and the fields GraphQL uses (just an ID). That's probably not too big a deal but it's just the tip of the iceberg: I'm sure if you try to actually implement what you are proposing you will run into other discrepancies like that, and as you cannot tweak the way GraphQL works, it is going to feel awkward for both the service provider and the user.
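The point about SQL-backed services can be illustrated with a small Python sketch using the standard `sqlite3` module. The table `items`, its columns, and the function `extend_from_sql` are hypothetical; the sketch only shows that a batch of ids and a list of requested columns translate into one `SELECT ... WHERE id IN (...)` query.

```python
import sqlite3


def extend_from_sql(conn, ids, columns, allowed_columns):
    """Fetch the requested columns for a batch of ids with a single SQL query."""
    # Column names are interpolated into the SQL text, so check them first.
    for column in columns:
        if column not in allowed_columns:
            raise ValueError(f"unknown column: {column}")
    if not ids:
        return {}
    placeholders = ", ".join("?" for _ in ids)
    sql = (
        f"SELECT id, {', '.join(columns)} FROM items "
        f"WHERE id IN ({placeholders})"
    )
    rows = conn.execute(sql, list(ids)).fetchall()
    return {row[0]: dict(zip(columns, row[1:])) for row in rows}


# Hypothetical data for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id TEXT PRIMARY KEY, name TEXT, website TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [
        ("a1", "First", "https://example.org/a1"),
        ("a2", "Second", "https://example.org/a2"),
        ("a3", "Third", "https://example.org/a3"),
    ],
)

result = extend_from_sql(conn, ["a1", "a2"], ["website"], {"name", "website"})
print(result)
```

The ids are passed as bound parameters, while the column names are validated against an allow-list because SQL placeholders cannot stand in for identifiers.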

Anyway: I think GraphQL looks very promising and it would be fantastic to have support for it in OpenRefine (just like SQL, SPARQL, and probably many others I am not aware of). The reconciliation API is just not the right place for that, as far as I can tell.

But enough said, I think I'll just wait for @codeforkjeff's feedback on this as he has had a look at the current draft.

wetneb on 25 Jul 2017

@codeforkjeff: friendly ping! We are thinking about making a release in a few weeks, so the API specs will be harder to change after that.

wetneb on 20 Sep 2017

I have reached out to various organizations that run reconciliation services for feedback, and no issues were raised about these specifications. I will therefore consider this data extension API good to go.

wetneb on 10 Oct 2017

👍1
