Openrefine: Tab character in a value in a JSON file results in a truncated data import

Created on 9 Dec 2016 · 11Comments · Source: OpenRefine/OpenRefine

When importing JSON to OpenRefine through 'Create Project', if there is a tab inside a value in the JSON file, the import succeeds but only imports data up to the tab character. e.g. JSON:

{ "rows": [ { "Content": "A" }, { "Content": " B" }, { "Content": "C" } ] }
(tab character is before the 'B')

Since tab is an illegal character in this position (should be escaped as \t) it makes sense for the import to fail - but failing silently is a problem.

Not sure what the best solution is - three possibilities:

Import should fail completely with an error
Import should succeed with the data truncated as currently but with feedback to the user that the truncation has occurred and why
Import should succeed with full data, converting illegal character to escaped version

JSON enhancement import

Source

ostephens

Most helpful comment

This currently does produce an error:

Illegal unquoted character ((CTRL-CHAR, code 9)): has to be escaped using backslash to be included in string value

although it doesn't happen until you make your record selection (ie not in initial preview).

Is there ever a time that we don't want to be permissive in parsing? ie Do we really need an import option for this? It's an invalid JSON file, but we're a data wrangling tool, not a validation tool.

tfmorris on 21 May 2020

👍2

All 11 comments

I think 3 would be most appropriate given OpenRefine is a data cleansing tool. And we should try to cleanse a bit of that. Folks expect it not to fail for a String value and we can just escape for those cases.

thadguidry on 9 Dec 2016

Looks like we might be able to just expand on the CASE logic for VALUE_EMBEDDED_OBJECT
https://github.com/OpenRefine/OpenRefine/blob/master/main/src/com/google/refine/importers/JsonImporter.java#L125
using JsonToken
http://fasterxml.github.io/jackson-core/javadoc/2.2.0/index.html?com/fasterxml/jackson/core/JsonParseException.html
or perhaps using JsonParser.Feature methods might do it
http://fasterxml.github.io/jackson-core/javadoc/2.2.0/index.html?com/fasterxml/jackson/core/JsonParseException.html

thadguidry on 9 Dec 2016

Hi, I am now working with this issue and had make some progress.

Just double check if I understand the expected effect correct: file with { "rows": [ { "Content": "A" }, { "Content": " B" }, { " Content": "C" } ] } should be able to get loaded and there are two types of Content, one with a tab and one do not. And there is a tab before value B .

zengchu2 on 17 Mar 2020

I am not sure what you mean by "two types of Content": I would expect A, B and C to end up in the same column.

wetneb on 17 Mar 2020

I see, so unquoted control chars (like \t, \n) should be ignored during the parsing right?

zengchu2 on 17 Mar 2020

I would not ignore these unescaped control characters - see the 3 options mentioned by Owen above.

wetneb on 17 Mar 2020

This currently does produce an error:

Illegal unquoted character ((CTRL-CHAR, code 9)): has to be escaped using backslash to be included in string value

although it doesn't happen until you make your record selection (ie not in initial preview).

tfmorris on 21 May 2020

👍2

We clean messy data. We DO NOT clean messy formats? I guess that's what we should adhere to since we are dependent on a large ecosystem of helper libraries for parsing and validating before importing the data.

In general, I take back my comment above that we should offer an option to cleanse unescaped values to clean messy JSON, since I was not aware that this was in fact an invalid or unknown format parsing error for JSON (Thanks to @tfmorris for helping me understand now on this)

Let's close this issue if we all agree and that our existing error message is adequate enough for this.
Or if we need to work further on a fix regarding @ostephens possible solution 3:

Import should fail completely with an error

thadguidry on 22 May 2020

I think option 3, permissive parsing, always enabled with no option setting is the right answer here.

Import should succeed with full data, converting illegal character to escaped version

More generally, I'd really like to avoid the situation where someone works on a PR that addresses an issue only to have the code review devolved into a labored discussion as to whether the original issue was valid or not. Let's get that resolved up front and not flip flop on our decisions. We have a number of open PRs where this is happening and it's got to be discouraging to new contributors.

tfmorris on 22 May 2020

👍1

@tfmorris Ah, so you agreed with my original feeling, OK. I thought you were disagreeing. The flip-flopping happens because I read one thing instead of hearing you voice the comment. So I think about it more, because I value your opinion @tfmorris and think around it. No problem. I'm very used to code review opening up more questions back to the original design issue #. I think new contributors should be comfortable with some of that; they'll navigate that in the real world soon enough. But agree that the issues are where we should do the collaborating on the design or on the dev mailing list.

thadguidry on 22 May 2020

👍1

Closed by #2431

wetneb on 16 Jun 2020

Was this page helpful?

0 / 5 - 0 ratings