When importing JSON to OpenRefine through 'Create Project', if there is a tab inside a value in the JSON file, the import succeeds but only imports data up to the tab character. e.g. JSON:
{
"rows": [
{
"Content": "A"
},
{
"Content": " B"
},
{
"Content": "C"
}
]
}
(tab character is before the 'B')
Since tab is an illegal character in this position (should be escaped as \t) it makes sense for the import to fail - but failing silently is a problem.
Not sure what the best solution is - three possibilities:
I think 3 would be most appropriate given OpenRefine is a data cleansing tool. And we should try to cleanse a bit of that. Folks expect it not to fail for a String value and we can just escape for those cases.
Looks like we might be able to just expand on the CASE logic for VALUE_EMBEDDED_OBJECT
https://github.com/OpenRefine/OpenRefine/blob/master/main/src/com/google/refine/importers/JsonImporter.java#L125
using JsonToken
http://fasterxml.github.io/jackson-core/javadoc/2.2.0/index.html?com/fasterxml/jackson/core/JsonParseException.html
or perhaps using JsonParser.Feature methods might do it
http://fasterxml.github.io/jackson-core/javadoc/2.2.0/index.html?com/fasterxml/jackson/core/JsonParseException.html
Hi, I am now working with this issue and had make some progress.
Just double check if I understand the expected effect correct: file with { "rows": [ { "Content": "A" }, { "Content": " B" }, { " Content": "C" } ] } should be able to get loaded and there are two types of Content, one with a tab and one do not. And there is a tab before value B .
I am not sure what you mean by "two types of Content": I would expect A, B and C to end up in the same column.
I see, so unquoted control chars (like \t, \n) should be ignored during the parsing right?
I would not ignore these unescaped control characters - see the 3 options mentioned by Owen above.
This currently does produce an error:
Illegal unquoted character ((CTRL-CHAR, code 9)): has to be escaped using backslash to be included in string value
although it doesn't happen until you make your record selection (ie not in initial preview).
Is there ever a time that we don't want to be permissive in parsing? ie Do we really need an import option for this? It's an invalid JSON file, but we're a data wrangling tool, not a validation tool.
We clean messy data. We DO NOT clean messy formats? I guess that's what we should adhere to since we are dependent on a large ecosystem of helper libraries for parsing and validating before importing the data.
In general, I take back my comment above that we should offer an option to cleanse unescaped values to clean messy JSON, since I was not aware that this was in fact an invalid or unknown format parsing error for JSON (Thanks to @tfmorris for helping me understand now on this)
Let's close this issue if we all agree and that our existing error message is adequate enough for this.
Or if we need to work further on a fix regarding @ostephens possible solution 3:
Import should fail completely with an error
I think option 3, permissive parsing, always enabled with no option setting is the right answer here.
- Import should succeed with full data, converting illegal character to escaped version
More generally, I'd really like to avoid the situation where someone works on a PR that addresses an issue only to have the code review devolved into a labored discussion as to whether the original issue was valid or not. Let's get that resolved up front and not flip flop on our decisions. We have a number of open PRs where this is happening and it's got to be discouraging to new contributors.
@tfmorris Ah, so you agreed with my original feeling, OK. I thought you were disagreeing. The flip-flopping happens because I read one thing instead of hearing you voice the comment. So I think about it more, because I value your opinion @tfmorris and think around it. No problem. I'm very used to code review opening up more questions back to the original design issue #. I think new contributors should be comfortable with some of that; they'll navigate that in the real world soon enough. But agree that the issues are where we should do the collaborating on the design or on the dev mailing list.
Closed by #2431
Most helpful comment
This currently does produce an error:
although it doesn't happen until you make your record selection (ie not in initial preview).
Is there ever a time that we don't want to be permissive in parsing? ie Do we really need an import option for this? It's an invalid JSON file, but we're a data wrangling tool, not a validation tool.