Openrefine: Tab character in a value in a JSON file results in a truncated data import

Created on 9 Dec 2016  路  11Comments  路  Source: OpenRefine/OpenRefine

When importing JSON to OpenRefine through 'Create Project', if there is a tab inside a value in the JSON file, the import succeeds but only imports data up to the tab character. e.g. JSON:

{ "rows": [ { "Content": "A" }, { "Content": " B" }, { "Content": "C" } ] }
(tab character is before the 'B')

Since tab is an illegal character in this position (should be escaped as \t) it makes sense for the import to fail - but failing silently is a problem.

Not sure what the best solution is - three possibilities:

  1. Import should fail completely with an error
  2. Import should succeed with the data truncated as currently but with feedback to the user that the truncation has occurred and why
  3. Import should succeed with full data, converting illegal character to escaped version
JSON enhancement import

Most helpful comment

This currently does produce an error:

Illegal unquoted character ((CTRL-CHAR, code 9)): has to be escaped using backslash to be included in string value

although it doesn't happen until you make your record selection (ie not in initial preview).

Is there ever a time that we don't want to be permissive in parsing? ie Do we really need an import option for this? It's an invalid JSON file, but we're a data wrangling tool, not a validation tool.

All 11 comments

I think 3 would be most appropriate given OpenRefine is a data cleansing tool. And we should try to cleanse a bit of that. Folks expect it not to fail for a String value and we can just escape for those cases.

Hi, I am now working with this issue and had make some progress.

Just double check if I understand the expected effect correct: file with { "rows": [ { "Content": "A" }, { "Content": " B" }, { " Content": "C" } ] } should be able to get loaded and there are two types of Content, one with a tab and one do not. And there is a tab before value B .

I am not sure what you mean by "two types of Content": I would expect A, B and C to end up in the same column.

I see, so unquoted control chars (like \t, \n) should be ignored during the parsing right?

I would not ignore these unescaped control characters - see the 3 options mentioned by Owen above.

This currently does produce an error:

Illegal unquoted character ((CTRL-CHAR, code 9)): has to be escaped using backslash to be included in string value

although it doesn't happen until you make your record selection (ie not in initial preview).

Is there ever a time that we don't want to be permissive in parsing? ie Do we really need an import option for this? It's an invalid JSON file, but we're a data wrangling tool, not a validation tool.

We clean messy data. We DO NOT clean messy formats? I guess that's what we should adhere to since we are dependent on a large ecosystem of helper libraries for parsing and validating before importing the data.

In general, I take back my comment above that we should offer an option to cleanse unescaped values to clean messy JSON, since I was not aware that this was in fact an invalid or unknown format parsing error for JSON (Thanks to @tfmorris for helping me understand now on this)

Let's close this issue if we all agree and that our existing error message is adequate enough for this.
Or if we need to work further on a fix regarding @ostephens possible solution 3:

Import should fail completely with an error

I think option 3, permissive parsing, always enabled with no option setting is the right answer here.

  1. Import should succeed with full data, converting illegal character to escaped version

More generally, I'd really like to avoid the situation where someone works on a PR that addresses an issue only to have the code review devolved into a labored discussion as to whether the original issue was valid or not. Let's get that resolved up front and not flip flop on our decisions. We have a number of open PRs where this is happening and it's got to be discouraging to new contributors.

@tfmorris Ah, so you agreed with my original feeling, OK. I thought you were disagreeing. The flip-flopping happens because I read one thing instead of hearing you voice the comment. So I think about it more, because I value your opinion @tfmorris and think around it. No problem. I'm very used to code review opening up more questions back to the original design issue #. I think new contributors should be comfortable with some of that; they'll navigate that in the real world soon enough. But agree that the issues are where we should do the collaborating on the design or on the dev mailing list.

Closed by #2431

Was this page helpful?
0 / 5 - 0 ratings

Related issues

kushthedude picture kushthedude  路  3Comments

dantexier picture dantexier  路  4Comments

stellasia picture stellasia  路  4Comments

davidegiunchidiennea picture davidegiunchidiennea  路  3Comments

thadguidry picture thadguidry  路  3Comments