One of the powerful features of OpenRefine is the reconciliation service, which allows users to normalise values in their own dataset against matches or near matches within other datasets.
At present, it is difficult for a general user to create a simple reconciliation service directly from their own dataset (e.g. one held in a simple CSV file) that they can then use to normalise or clean other datasets they are working on.
In the past, there have been examples of simple reconciliation services (e.g. https://github.com/OpenRefine/reconciliation_service_skeleton ) but these appear to have long since rotted or been deprecated.
It would be useful if there was an example of a simple working reconciliation server such as a revised version of https://github.com/OpenRefine/reconciliation_service_skeleton . (More generally, it might make sense to explore the feasibility of a plugin for the datasette ecosystem to implement a reconciliation service against data contained in a SQLite database via a datasette API service. For example, I note another recent plugin to that system that implements fuzzy searching.)
However, I also wonder if it would make sense for OpenRefine to bundle a service that would allow a user to define a reconciliation service from an OpenRefine project, for example by:
1) selecting a project;
2) identifying the "key" column that a reconciliation match attempt is applied to;
3) identifying a fuzzy match function used to return a match score.
Users could then try to reconcile data in one project against data in a second project mediated by a reconciliation service running via OpenRefine applied to the second project.
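As a rough illustration of the three steps above (pick a dataset, pick a key column, pick a fuzzy match function), here is a minimal Python sketch. It is not how OpenRefine or any existing reconciliation service works: the function names, the sample data and the use of `difflib.SequenceMatcher` as the scoring function are all assumptions made for the example, and a real service would also need to expose this over HTTP.

```python
import csv
import difflib
import io


def load_candidates(csv_text, key_column):
    """Parse CSV text and keep the rows that have a non-empty key column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row.get(key_column)]


def similarity(a, b):
    """Hypothetical fuzzy match function: case-insensitive ratio scaled to 0-100."""
    return 100.0 * difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()


def reconcile(query, candidates, key_column, score=similarity, limit=3):
    """Score every candidate against the query and return the best few."""
    scored = [
        {"name": row[key_column], "score": score(query, row[key_column])}
        for row in candidates
    ]
    scored.sort(key=lambda c: c["score"], reverse=True)
    return scored[:limit]


# Stand-in for "the second project": a tiny CSV with a `name` key column.
SAMPLE = (
    "name,country\n"
    "University of Oxford,UK\n"
    "University of Cambridge,UK\n"
    "Open University,UK\n"
)

candidates = load_candidates(SAMPLE, "name")
results = reconcile("Universty of Oxford", candidates, "name")
for r in results:
    print(r["name"], round(r["score"], 1))
```

The `score` parameter is what step 3 would let the user choose; swapping in a different function (for example one based on edit distance) would not change the rest of the sketch.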
That was indeed suggested before: #941.
David Huynh and I have wanted to make this easier since almost day one.
See #176 "Reconcile (extension or enhancement) between 2 projects". The reference to the "matchmaker" interface is essentially Reconciling, which was a service hosted in Freebase at the time.
It would be nice to note the differences between "Reconciling 2 projects" and "Joining across 2 projects".
That was always something that confused a few folks, and yet, we always heard the needs as being different. Some thought that Joining was cross() or a smarter cross() with Reconcile features. Others thought a single self-hosted Reconcile service could perform any number of Joining or Reconciling functions, some automated, some manual that needed review.
I suspect that there is at least one requirement missing from the list of things needed here:
However, I also wonder if it would make sense for OpenRefine to bundle a service that would allow a user to define a reconciliation service from an OpenRefine project, for example by:
- selecting a project;
- identifying the "key" column that a reconciliation match attempt is applied to;
- identifying a fuzzy match function used to return a match score.
I think in addition reconciliation would require
This would need to be something that could be referenced over time and even given changes to the project serving as a reconciliation data source to make sure functions like "add column from reconciled value" can work reliably
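To make the concern about references surviving changes concrete, here is a small Python sketch. It assumes, as one possible reading of the requirement, that each row in the source project carries an explicit identifier column; the `id` column, the helper names and the sample rows are hypothetical. It compares references stored by row position with references stored by identifier after a row is inserted into the source.

```python
def index_by_position(rows):
    """Key each row by its position in the project (fragile if rows move)."""
    return {str(i): row["name"] for i, row in enumerate(rows)}


def index_by_id_column(rows, id_column="id"):
    """Key each row by an explicit identifier column (assumed to be stable)."""
    return {row[id_column]: row["name"] for row in rows}


# The source project as it was when another project was reconciled against it.
before = [
    {"id": "org-1", "name": "Open University"},
    {"id": "org-2", "name": "University of Oxford"},
]

# The same project later, with a new row inserted at the top.
after = [{"id": "org-0", "name": "Aston University"}] + before

# A reference stored by position now points at a different entity...
print(index_by_position(before)["0"], "->", index_by_position(after)["0"])

# ...whereas a reference stored by identifier still resolves to the same one.
print(index_by_id_column(before)["org-1"], "->", index_by_id_column(after)["org-1"])
```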
I've been wondering about this again. Would it make sense to explore this in the context of a reconciliation plugin to datasette?
datasette fronts SQLite, has in-built support for publishing a datasette server to various (free) online hosts, and is increasingly customisable. As well as being able to get data into SQLite using OpenRefine, there's a wide range of tools to support getting data into SQLite from various file formats.
It looks like a cookie-cutter for datasette plugins is on the way, which would simplify things even further...
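For a sense of what the core of such a plugin might have to do, here is a minimal sketch using only Python's standard `sqlite3` and `difflib` modules. It does not use datasette or its plugin API at all; the `places` table, its columns and the `reconcile_sqlite` function are assumptions made for the example, and scoring every row in Python would not scale to large tables.

```python
import difflib
import sqlite3

# Hypothetical reconciliation source: an in-memory SQLite table of places.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE places (id TEXT PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO places VALUES (?, ?)",
    [("p1", "Milton Keynes"), ("p2", "Milton Abbas"), ("p3", "Newport")],
)


def reconcile_sqlite(conn, query, limit=3):
    """Score every row's name against the query and return the best matches."""
    rows = conn.execute("SELECT id, name FROM places").fetchall()
    scored = []
    for row_id, name in rows:
        ratio = difflib.SequenceMatcher(None, query.lower(), name.lower()).ratio()
        scored.append({"id": row_id, "name": name, "score": round(100 * ratio, 1)})
    scored.sort(key=lambda c: c["score"], reverse=True)
    return scored[:limit]


matches = reconcile_sqlite(conn, "Milton Keyns")
for m in matches:
    print(m["id"], m["name"], m["score"])
```

Because each candidate comes back with the table's primary key, the result can be referenced later even if rows are added to the table.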