Describe the bug
Some changes to a project disappeared when I closed and reopened it. Before closing it I exported the project, and all the changes are present in its history folder. But even after re-importing the project I can't see them on screen.
See: https://groups.google.com/forum/#!topic/openrefine/Pw1Bk6eV9hY
To Reproduce
It seems that the bug can be reproduced using a reconciliation service such as the Library of Congress Conciliator or the Geonames conciliator. It is strange behaviour that I had never encountered before.
@marlara I think it would help a lot if you could share your exported project with us. If you do not want to make it public, you can also email it to me (first name @ last name .eu).
Edit: I have the project now, thanks!
I am actually using a reconciliation service in OR 3.1 and, under the hood, I see these messages. Could it be related to this bug?
18:06:34.541 [ ProjectManager] Saving some modified projects ... (300022ms)
java.lang.NullPointerException
at com.google.refine.operations.recon.ReconOperation.getBriefDescription(ReconOperation.java:107)
at com.google.refine.operations.recon.ReconOperation.write(ReconOperation.java:116)
at com.google.refine.history.HistoryEntry.write(HistoryEntry.java:115)
at com.google.refine.io.FileHistoryEntryManager.save(FileHistoryEntryManager.java:69)
at com.google.refine.history.HistoryEntry.save(HistoryEntry.java:121)
at com.google.refine.history.History.save(History.java:297)
at com.google.refine.model.Project.saveToWriter(Project.java:156)
at com.google.refine.model.Project.saveToOutputStream(Project.java:138)
at com.google.refine.io.ProjectUtilities.saveToFile(ProjectUtilities.java:103)
at com.google.refine.io.ProjectUtilities.save(ProjectUtilities.java:66)
at com.google.refine.io.FileProjectManager.saveProject(FileProjectManager.java:254)
at com.google.refine.ProjectManager.saveProjects(ProjectManager.java:316)
at com.google.refine.ProjectManager.save(ProjectManager.java:231)
at com.google.refine.RefineServlet$AutoSaveTimerTask.run(RefineServlet.java:92)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask.runAndReset(Unknown Source)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(Unknown Source)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
@ettorerizza could well be! Can you let us know which service it is?
My own reconciliation service with AAT Thesaurus, but copied without shame from LC Reconcile, which served as a basis for quite a few other services in Python.
Thanks.
Based on the project communicated by @marlara, it looks like the issue could well come from problems during JSON serialization or deserialization of reconciliation operations. This is probably due to some unusual settings in the reconciliation service (such as not providing an identifierSpace and schemaSpace). Otherwise it could be an issue in OR which might have been fixed in master by the Jackson migration.
Because the change files are still in the history/ folder of the project it is conceivable to recover these changes: it would require re-generating the JSON items for the corresponding operations, but it is not clear if that would be easier than redoing the work directly.
So it's indeed because these reconciliation services do not declare an identifierSpace and a schemaSpace. These are two fields that should be added to the service metadata: https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation-Service-API#service-metadata. These fields are required, so it's the services' fault.
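For illustration, here is a minimal sketch of service metadata that declares both fields, assuming a Python service like the ones discussed in this thread. The service name and both URIs are hypothetical placeholders, not values taken from any real service.

```python
import json


def build_service_metadata(name, identifier_space, schema_space):
    """Return service metadata that declares both spaces as plain strings."""
    return {
        "name": name,
        "identifierSpace": identifier_space,
        "schemaSpace": schema_space,
    }


# Hypothetical values, for illustration only.
metadata = build_service_metadata(
    "Example reconciliation service",
    "http://example.org/identifier-space",
    "http://example.org/schema-space",
)

# This is the JSON document the service would return as its metadata.
print(json.dumps(metadata, indent=2))
```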
That being said OR should handle these more gracefully than this. In OR 3.1 we have exceptions like this:
18:32:19.335 [ recon-config] Reconstruct failed (2ms)
java.lang.reflect.InvocationTargetException
at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.google.refine.model.recon.ReconConfig.reconstruct(ReconConfig.java:100)
at com.google.refine.model.changes.ReconChange.load(ReconChange.java:177)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.google.refine.history.History.readOneChange(History.java:83)
at com.google.refine.history.History.readOneChange(History.java:69)
at com.google.refine.io.FileHistoryEntryManager.loadChange(FileHistoryEntryManager.java:96)
at com.google.refine.io.FileHistoryEntryManager.loadChange(FileHistoryEntryManager.java:80)
at com.google.refine.history.HistoryEntry.revert(HistoryEntry.java:155)
at com.google.refine.history.History.undo(History.java:236)
at com.google.refine.history.History.undoRedo(History.java:172)
at com.google.refine.history.HistoryProcess.performImmediate(HistoryProcess.java:82)
at com.google.refine.process.ProcessManager.queueProcess(ProcessManager.java:97)
at com.google.refine.commands.history.UndoRedoCommand.doPost(UndoRedoCommand.java:69)
at com.google.refine.RefineServlet.service(RefineServlet.java:190)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1166)
at org.mortbay.servlet.UserAgentFilter.doFilter(UserAgentFilter.java:81)
at org.mortbay.servlet.GzipFilter.doFilter(GzipFilter.java:132)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:418)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:938)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:755)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.json.JSONException: JSONObject["identifierSpace"] not a string.
at org.json.JSONObject.getString(JSONObject.java:721)
at com.google.refine.model.recon.StandardReconConfig.reconstruct(StandardReconConfig.java:116)
... 42 more
In master, the services are not usable because running a reconciliation operation gives this:
18:37:04.795 [ refine] POST /command/core/reconcile (1726ms)
Exception in thread "Thread-6" java.lang.NullPointerException
at com.google.refine.model.recon.StandardReconConfig.computeFeatures(StandardReconConfig.java:567)
at com.google.refine.model.recon.StandardReconConfig.createReconServiceResults(StandardReconConfig.java:555)
at com.google.refine.model.recon.StandardReconConfig.batchRecon(StandardReconConfig.java:485)
at com.google.refine.operations.recon.ReconOperation$ReconProcess.run(ReconOperation.java:282)
at java.lang.Thread.run(Thread.java:748)
That's already better because it will not cause data losses in the first place.
So to solve this we need to change the way reconciliation services are parsed to include default schema spaces and identifier spaces for these reconciliation services.
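The idea of falling back to defaults could look roughly like the sketch below. It is written in Python for brevity and is not the actual change made in OpenRefine; the function name and the default URIs are assumptions made up for this example.

```python
# Hypothetical fallback values; a real client would choose its own defaults.
DEFAULT_IDENTIFIER_SPACE = "http://example.org/default-identifier-space"
DEFAULT_SCHEMA_SPACE = "http://example.org/default-schema-space"


def with_default_spaces(metadata):
    """Return a copy of the service metadata where a missing, null or
    non-string identifierSpace / schemaSpace is replaced by a default."""
    fixed = dict(metadata)
    defaults = {
        "identifierSpace": DEFAULT_IDENTIFIER_SPACE,
        "schemaSpace": DEFAULT_SCHEMA_SPACE,
    }
    for key, default in defaults.items():
        if not isinstance(fixed.get(key), str):
            fixed[key] = default
    return fixed


# A service that declares neither field still yields usable metadata.
incomplete = {"name": "Service without spaces", "identifierSpace": None}
complete = with_default_spaces(incomplete)
```

Values that are already strings are left untouched, so well-behaved services are not affected.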
@marlara so I think I have a fix in #1937 - but unfortunately it is not going to get you your data back by itself.
I am sure it is very frustrating to lose hours of work (apologies for this on behalf of the rest of the team!).
If the data loss is a serious problem for you, I can try to hack something together to get the data back.
Hi, sorry for answering so late. It's great that you have understood the issue so quickly, and I'm happy to have been useful.
I would like to have the data back; if you could manage something, that would be fine. But I don't want to bother you too much, so if I can do the work myself with some tips, I could try.
You can write me some instruction, if you like, also by mail.
Since at least one other OpenRefine user has been faced with the same problem (Micon, in the Google Group), could the hack to retrieve the data be documented somewhere publicly?
I would mostly like for it not to happen again, so I can proceed with my task. Is an update for OR expected soon?
@MiconSchorsij The fastest for you would probably be to modify your reconciliation service to add the missing items. Which service do you use?
I use the LoC reconcile script by mphilli: https://github.com/mphilli/LoC-reconcile
Yes, right, you already mentioned it. This fork of the service contains default values for identifierSpace and schemaSpace, and should therefore avoid the bug in question.
Basically, this is the approach I would use to recover the data: for each history file that does not correspond to an operation in the history, generate a JSON representation of an operation with the same timestamp (the description and the actual operation should not matter as long as they correspond to some actual operation). Then add these JSON representations back in the data.txt file (in the section of unapplied changes). This should make it possible to redo the changes from the UI, updating the grid with the new data.
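The first step of this approach, finding the history files that have no matching entry in data.txt, could be sketched as follows. Two assumptions about the project layout are made here and should be checked against a real project: change files in the history/ folder are named `<id>.change.zip`, and data.txt lists history entries as one JSON object per line after a `pastEntryCount=N` or `futureEntryCount=N` line. The function names are made up for this example.

```python
import json
import re


def entry_ids_in_data_txt(lines):
    """Collect the ids of the history entries listed in data.txt.

    Assumes entries follow a 'pastEntryCount=N' or 'futureEntryCount=N'
    line, one JSON object per line, each with an "id" field.
    """
    ids = set()
    remaining = 0
    for raw in lines:
        line = raw.strip()
        header = re.fullmatch(r"(pastEntryCount|futureEntryCount)=(\d+)", line)
        if header:
            remaining = int(header.group(2))
        elif remaining > 0:
            ids.add(int(json.loads(line)["id"]))
            remaining -= 1
    return ids


def orphan_change_ids(history_filenames, known_ids):
    """Return ids of change files (assumed '<id>.change.zip') that have
    no corresponding history entry."""
    orphans = []
    for name in history_filenames:
        match = re.fullmatch(r"(\d+)\.change\.zip", name)
        if match and int(match.group(1)) not in known_ids:
            orphans.append(int(match.group(1)))
    return sorted(orphans)
```

For each id returned, a JSON entry would then have to be written back into data.txt by hand, as described above; this sketch does not attempt that part.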
Apparently a broken project stays broken. Or something else fails. I followed the following steps:
Ok, I'll try, hope I can manage
If you wish, I can also take a look. Just send me the project to ettorerizza at gmail.com.
Apparently a broken project stays broken.
@MiconSchorsij Sorry for the late response. Indeed, it seems that as long as the file data.txt in data.zip still contains a reconciliation performed with an incorrectly configured service, the error will continue. It even looks like the projects are corrupted to the point that we cannot extract the history by clicking on "undo/redo -> Extract". It does seem possible to repair a project, though. This requires replacing, in data.txt:
"identifierSpace":null,"schemaSpace":null by "identifierSpace":"something","schemaSpace":"something". I can do that for you if you want.
This won't restore lost operations. In your case, you will recover only 59 changes, but your project will now be healthy. To recover the missing operations, it would be necessary to parse each history file. But as @wetneb already said, it is not certain that this would be faster than redoing some thirty manipulations.
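The repair described above, replacing null identifierSpace and schemaSpace values in data.txt, can be sketched as below. The function name and the replacement URIs are assumptions for illustration; the function works on the text of data.txt, which would first have to be extracted from data.zip and zipped back afterwards. It is safer to try this on a copy of the project.

```python
import json
import re


def patch_null_spaces(data_txt, identifier_space, schema_space):
    """Replace '"identifierSpace":null' and '"schemaSpace":null' in the
    text of data.txt with the given string values."""
    replacements = {
        "identifierSpace": identifier_space,
        "schemaSpace": schema_space,
    }

    def fill(match):
        key = match.group(1)
        # json.dumps adds the quotes and escapes the value if needed.
        return '"%s":%s' % (key, json.dumps(replacements[key]))

    return re.sub(r'"(identifierSpace|schemaSpace)"\s*:\s*null', fill, data_txt)


# Hypothetical replacement values, for illustration only.
broken = '{"service":"x","identifierSpace":null,"schemaSpace":null}'
patched = patch_null_spaces(
    broken, "http://example.org/id", "http://example.org/schema"
)
```

Values that are already strings are not matched by the pattern, so healthy reconciliation entries stay unchanged.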
@marlara I will try to find the time to recover your data tomorrow.
@marlara I have tried to recover the data with the idea sketched above, however I suspect that the state of the project is inconsistent with the position in the history. If you had the source file for this project it would help: that would let me replay all the changes from the start.
@wetneb thanks a lot, I'll send you an email
Hi @wetneb, I have the same problem described here. I tried to use the mailing list but had no luck yet.
I'll try to elaborate a little:
Last Thursday, by 13:00/1PM, I still had a column of VIAF_ID fully reconciled with Wikidata (a large part of it was matched, but not all) and many more thousands of values in 2 columns (split 1 and split 2) reconciled with data. Also, the recent history in the "undo" was from the morning and the night and day before.
Something happened on Thursday afternoon, and in the evening the project suddenly lost the VIAF reconciled column and most of the reconciliation on the other columns. The "undo" panel showed actions that I did back in June.
When I look at the workspace directory, I see files in the history folder of the project through Thursday, before and after this change, but after copying the project folder, removing history files from Thursday afternoon and reopening Open Refine, nothing changes and I still get the old state from June. Is there a way to re-update the project using the history files? I uploaded the project to here: https://drive.google.com/drive/folders/1FYgAIHV2ZDeJIZre4JLChn3_u5f6Zy1D?usp=sharing
many thanks,
Sinai