Sometimes large strings are being stored in the metadata.json that can reach into the Megabyte and Gigabytes of size.
Issue was found at the following JSON paths (could not provide JSON snippets since it exceeded the memory needed to show in full hierarchy form):
Steps to reproduce the behavior:
Project metadata should not be exceedingly large for projectName or sheets name and should probably have a confined limit and some error handling to warn user when this limit is exceeded (sometimes for known reasons, and proactively, for unknown reasons).
Additional Exception handling should also be in place with appropriate logging for users to capture and give us feedback on issues, just in case we miss some serialization issues or POI corner cases, etc. for Project Name writing/reading from metadata.json.


https://drive.google.com/file/d/1l9B-77ixA7mmghSbCUwH9tj3aRLC_ci7/view?usp=sharing
https://groups.google.com/g/openrefine/c/GVNKIM8BM1M/m/_5KNHbiSBAAJ
@thadguidry This is the XLSX you asked for:
https://drive.google.com/file/d/1cVMO5X7hdVMG9J6-sjTzq4t3jjmxmTfm/view?usp=sharing
@VojtechDostal
I was able to download the XLSX file correctly.
Thanks for the info privately... I wanted to be sure and reproduce the steps exactly on my Windows 10 machine.
Is it possible that some of the older projects, perhaps like that XLSX file... was initially created with an older OpenRefine version? Quite likely or no?
Specifically, for this XLSX file, I see a date of "created":"2020-03-30T20:31:43Z" and wondering what version of OpenRefine was used during that time or what other factors might have contributed (new computer, default Locale changed in Windows 10, etc.)
Reason I ask is because we did have a few bugs that might be closely aligned to affecting something like this.
I imported your XLSX file and didn't see any problem with the metadata... it looks fine to me with the current 3.4.1 handling, and does not have the super long json as your metadata files contain and causing the memory loading issue for you.

@VojtechDostal Can you tell us when you downloaded and actually began using OpenRefine 3.4.1? That will help eliminate my concerns around the Importers refactoring that our kind @tfmorris did beginning in #2963 where a bunch of bugs were indeed found, fixed, and then landed in 3.4.1 release on 2020-09-24
regarding this screenshot:
https://user-images.githubusercontent.com/986438/102655165-74972000-4137-11eb-8701-86e2596f9718.png
It's interesting that the metadata is still kept from the time when I included non-ASCII characters in descriptions. I then deleted those descriptions with non-ASCII characters, but it seems you can still retrieve them from the project.
Is it possible that some of the older projects, perhaps like that XLSX file... was initially created with an older OpenRefine version? Quite likely or no?
Can you tell us when you downloaded and actually began using OpenRefine 3.4.1?
Yes, that is quite possible. I only downloaded 3.4.1. a few months ago but I don't remember when precisely. Could have been in November or October. Most likely before 24 September if that helps. My download log in Chrome suggests it might have been on 10 November.
Then that might explain it "I think".
Now the question is HOW to fix all your metadata.json files. That could probably be done with a cleanup script that would iterate over all of them and then automatically substring cut down to only the first 255 chars of each ones sheets [{ "name"="..." field.
Do you have bash or python skills to handle that script yourself?
This doesn't sound familiar to me. What is the content of the multihundreds of MB (ie where did it come from)? Is it the data content of the original file or ...? Is it always the sheet names that are the offenders? Are they all from XLSX imports?
@thadguidry were you able to reproduce the problem when importing into the current version or is this purely a historical issue? (not to say that that's not a problem too)
@tfmorris It is always the sheet names field from XLSX files in the metadata.json. I tried for 4 hours this afternoon to try to reproduce with 3.4.1 and could not using weird XLSX files and sheet names with Unicode as I could find. So likely this is a POI issue now fixed with update, or your refactoring that has now fixed something. I did try 3.3 to see if I could get really long sheet name in metadata, but again, it seemed to work OK for me. It's likely some system issue, locale issue, etc. that I couldn't reproduce on my English Windows 10.
Now the question is HOW to fix all your metadata.json files. That could probably be done with a cleanup script that would iterate over all of them and then automatically substring cut down to only the first 255 chars of each ones sheets [{ "name"="..." field.
Do you have bash or python skills to handle that script yourself?
Unfortunately I don't have programming skills. I tried to open one of the larger metadata.json files (about 500 MB) and edit it manually, but I cannot even open it. Running a script might be not only the simplest, but possibly the only option... I am able to run simple commands from the command line, but iterating over a directory tree and replacing some text in these large files is above my level.
Potentially relevant issues / PRs:
There are 10 projects with large (>2KB) metadata (sizes in MB):
536 ./1781864111473.project/metadata.json
478 ./1621259914321.project/metadata.json
303 ./1865935361423.project/metadata.json
238 ./2042959776995.project/metadata.json
76 ./2207521617959.project/metadata.json
53 ./2193321283605.project/metadata.json
49 ./1752256550554.project/metadata.json
47 ./1638916963113.project/metadata.json
11 ./2335512028683.project/metadata.json
which were created between 2020-03-30 and 2020-09-29:
2020-03-30T20:28:55Z ./2042959776995.project/metadata.json.new
2020-03-30T20:31:09Z ./2335512028683.project/metadata.json.new
2020-03-30T20:31:43Z ./1752256550554.project/metadata.json.new
2020-04-02T15:32:42Z ./2207521617959.project/metadata.json.new
2020-04-02T15:32:42Z ./2207521617959.project/metadata.json.new.new
2020-04-02T15:35:08Z ./1638916963113.project/metadata.json.new
2020-04-02T15:35:08Z ./1638916963113.project/metadata.json.new.new
2020-08-17T17:22:34Z ./1621259914321.project/metadata.json.new
2020-08-30T13:50:54Z ./2193321283605.project/metadata.json.new
2020-09-29T07:02:38Z ./1865935361423.project/metadata.json.new
2020-09-29T07:15:58Z ./1781864111473.project/metadata.json.new
Unfortunately, we don't keep the OpenRefine version in the project metadata. Dates for release are:
v3.4.1 - Sep 2020
v3.3 - Jan 2020
v3.2 - Jul 2019
The most likely explanation would seem to be that either Apache POI or OpenRefine used the entire sheet object instead of just it's name, but a cursory search doesn't turn up an matching bug.
It looks like the files are not too big to process with jq, so this one-liner will take care of most of them (after you make sure that you've got a good backup):
find . -name metadata.json -size +1M -exec sh -c 'jq -c -r "del(.importOptionMetadata[].sheets) | del(.importOptionMetadata[].fileSource)" "$1" > "$1.new"' -- {} \; -print
You can run it from your workspace directory and it will create a new set of metadata files:
1 ./2042959776995.project/metadata.json.new
1 ./2335512028683.project/metadata.json.new
1 ./1621259914321.project/metadata.json.new
1 ./1752256550554.project/metadata.json.new
47 ./1638916963113.project/metadata.json.new
1 ./2193321283605.project/metadata.json.new
1 ./1781864111473.project/metadata.json.new
81 ./2207521617959.project/metadata.json.new
1 ./1865935361423.project/metadata.json.new
Two of the files are smaller, but still much too large. They have 47 MB and 81 MB projectName values. They can be fixed up with
find . -name metadata.json.new -size +1M -exec sh -c 'jq -c -r "del(.importOptionMetadata[].projectName)" "$1" > "$1.new"' -- {} \; -print
which will create *.new.new versions with that field deleted as well.
If, after careful inspection, the new metadata files look correct, you can rename them on top of the old bloated files and delete the originals.
If you don't have jq installed, you can find it here: https://stedolan.github.io/jq/
Since it's not believed to be a bug in the current version and we have a way to fix up the metadata, I've removed the Critical priority from this.
@VojtechDostal If you have trouble with jq on Windows (like I did this afternoon) I can help you get it setup correctly with Windows Terminal and a Profile that includes (commandline "cmd /K doskey jq=E:\Downloads\jq-win64.exe")
or ideally through Git for Windows installed and then just use Git Bash:
$ alias jq='/e/Downloads/jq-win64.exe'
then Tom's one-liner should work.
Thanks so much @tfmorris and @thadguidry . I wasn't able to install jq from Git (not sure why) but I set it up in my Cygwin. The commands worked perfectly using Cygwin and I am now able to open the whole original project directory without problems! I recommend the solution above to anyone having the same sort of issues.
@VojtechDostal Great! We'll close this issue out now.
@tfmorris Thanks so much for providing great support on this!
Most helpful comment
Thanks so much @tfmorris and @thadguidry . I wasn't able to install jq from Git (not sure why) but I set it up in my Cygwin. The commands worked perfectly using Cygwin and I am now able to open the whole original project directory without problems! I recommend the solution above to anyone having the same sort of issues.