Describe the bug
Reconciling a list of 250 author names against Wikidata type Q5 (human) stayed at 0% complete with the spinner running indefinitely.
The console showed a stack trace from an unhandled exception:
Exception in thread "Thread-9" java.lang.IllegalArgumentException: Strings must not be null
at org.apache.commons.lang.StringUtils.getLevenshteinDistance(StringUtils.java:6164)
at com.google.refine.model.recon.StandardReconConfig.computeFeatures(StandardReconConfig.java:577)
at com.google.refine.model.recon.StandardReconConfig.createReconServiceResults(StandardReconConfig.java:560)
at com.google.refine.model.recon.StandardReconConfig.batchRecon(StandardReconConfig.java:489)
at com.google.refine.operations.recon.ReconOperation$ReconProcess.run(ReconOperation.java:282)
at java.lang.Thread.run(Thread.java:748)
Expected behavior
At a minimum, errors in the reconciliation process shouldn't kill the whole process and hang the UI. It looks like any exception other than an IOException will currently kill the reconciliation process (as noted in #1128).
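As an illustration of the "don't let one failure kill the run" point, here is a minimal sketch of a reconciliation loop that records per-query failures and keeps going. This is a hypothetical structure, not the actual `ReconOperation` code; the `Batch` interface and `reconcileAll` helper are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class ResilientLoop {
    // Hypothetical stand-in for a call to a reconciliation service.
    interface Batch {
        String recon(String query);
    }

    // Run every query; catch any exception (not just IOException),
    // record it, and continue with the remaining queries.
    static int reconcileAll(List<String> queries, Batch service, List<String> errors) {
        int ok = 0;
        for (String q : queries) {
            try {
                service.recon(q);
                ok++;
            } catch (Exception e) {
                errors.add(q + ": " + e.getMessage());
            }
        }
        return ok;
    }

    public static void main(String[] args) {
        List<String> errors = new ArrayList<>();
        int ok = reconcileAll(List.of("a", "bad", "c"),
                q -> {
                    if (q.equals("bad")) {
                        throw new IllegalArgumentException("Strings must not be null");
                    }
                    return q;
                },
                errors);
        System.out.println(ok + " succeeded, " + errors.size() + " failed");
    }
}
```

The errors collected this way could then be surfaced in the UI instead of silently hanging the progress bar.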
This specific bug should be fixed as well. It looks like the reconciliation service is returning null for a candidate, since that's the only way I can see to trigger the error in the computeFeatures code shown in the stack trace above.
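The crash happens because `StringUtils.getLevenshteinDistance` throws `IllegalArgumentException` on null input. A minimal sketch of the kind of null guard that would avoid it is below; this is a hypothetical helper, not the actual `StandardReconConfig` code, and a small stdlib Levenshtein stands in for the Commons Lang call.

```java
public class NullSafeDistance {
    // Minimal Levenshtein distance, standing in for
    // StringUtils.getLevenshteinDistance (which rejects null arguments).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Guard: treat a missing candidate name as "feature unavailable"
    // instead of letting the distance call throw.
    static int safeDistance(String cellText, String candidateName) {
        if (cellText == null || candidateName == null) {
            return -1; // sentinel: skip this feature
        }
        return levenshtein(cellText, candidateName);
    }

    public static void main(String[] args) {
        System.out.println(safeDistance("Arthur Conan Doyle", null)); // -1, no crash
        System.out.println(safeDistance("kitten", "sitting"));        // 3
    }
}
```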
There seem to be many error-handling gaps like this in the reconciliation code, both in OpenRefine as a client and in the Wikidata reconciliation service as a server…
The candidate coming back for "Arthur Conan Doyle" is
{"result":[{"id":"Q35610","score":0.5,"match":false,"type":[{"id":"Q5","name":"human"}]}],"total_search_results":375}
which has no name key, causing things to blow up.
@wetneb Any idea what's going on here on the Wikidata side of things?
I can fix the error handling, but it seems like there's more going on.
@tfmorris Wasn't name optional back in the days of Freebase? I recall we would sometimes only get the mid back if a name hadn't been applied to the topic yet.
@tfmorris which Wikidata reconciliation service are you using, by the way? I don't recognize the total_search_results key at all; it does not seem to be my service.
With the one I maintain:
https://tools.wmflabs.org/openrefine-wikidata/en/api?queries=%7B%22q0%22%3A%7B%22query%22%3A%22Arthur+Conan+Doyle%22%2C%22type%22%3A%22Q5%22%7D%7D
Names, particularly in a given language, are logically optional for both Freebase and Wikidata entities, but the OpenRefine code appears to have always (i.e., for 10 years) assumed that a name will be returned, so I'm guessing it was never null for the Freebase reconciliation service. These example entities in Wikidata all have English-language labels, so I have no idea why they're not being returned. While I can make the code not error on a missing name, that doesn't present anything useful in the UI.
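On the "doesn't present anything useful in the UI" point, one common approach is to fall back to the entity id when no label comes back, so the candidate at least remains identifiable and clickable. A minimal sketch, with a hypothetical helper that is not the actual OpenRefine rendering code:

```java
public class DisplayLabel {
    // Prefer the human-readable name; fall back to the entity id so a
    // label-less candidate still shows something selectable.
    static String displayLabel(String name, String id) {
        return (name != null && !name.isEmpty()) ? name : "<" + id + ">";
    }

    public static void main(String[] args) {
        System.out.println(displayLabel("Arthur Conan Doyle", "Q35610")); // Arthur Conan Doyle
        System.out.println(displayLabel(null, "Q35610"));                 // <Q35610>
    }
}
```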
I thought Wikidata was commonly used as a reconciliation service. It doesn't appear usable in its current form at all.
@wetneb Apologies I thought this was your service, but following the link in https://github.com/OpenRefine/OpenRefine/issues/1128, I see that it belongs to Magnus. How many Wikidata reconciliation services are there?
Those are the only two services I am aware of. With a recent version of OpenRefine you should have mine installed by default.
I'm running a private build from the current master branch and it appears that his service is the default. I'll investigate why.
For the error handling/reporting, I'll improve 3 things:
I suspect you might have tried his service in the past, and this has persisted in your workspace configuration ever since. I think we might only register my Wikidata reconciliation service if there are no others in the list.
Ideally the JSON schemas we wrote in the reconciliation CG could be useful for this validation work. When deserializing JSON responses from a service, we could check them against the schema, although this should ideally be redundant with the checks Jackson does when deserializing JSON to our POJOs, if they are annotated correctly. One interesting experiment would be to use Jackson's own JSON schema generation capabilities to see if the schema it generates from our classes matches what we came up with in the specs…
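In the meantime, even a plain required-field check on each parsed candidate would turn a missing name into a loggable, skippable condition rather than a crash deep in feature computation. A minimal sketch, using a plain `Map` to stand in for the parsed JSON; the `REQUIRED` list reflects the keys this example assumes the client reads and should be adjusted to whatever the spec actually requires:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class CandidateValidator {
    // Keys this sketch assumes the client needs on each candidate.
    static final List<String> REQUIRED = List.of("id", "name", "score", "match");

    // Return the missing keys so the caller can log and skip the
    // candidate instead of crashing on a null later.
    static List<String> missingFields(Map<String, ?> candidate) {
        return REQUIRED.stream()
                .filter(k -> candidate.get(k) == null)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Shaped like the response quoted above, which lacks "name".
        Map<String, Object> fromService = Map.of(
                "id", "Q35610", "score", 0.5, "match", false);
        System.out.println(missingFields(fromService)); // [name]
    }
}
```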
@tfmorris Hmm... reading through this... Does Magnus have more work to do to complete the picture for this issue? I didn't see any Phabricator issues mentioned. Are we coordinating with him, or do we need to?
No, it's an abandoned project. I'm just using it for negative testing.