OpenRefine: CSV files with BOM don't correctly strip BOM on import

Created on 31 Aug 2017 · 19 comments · Source: OpenRefine/OpenRefine

Steps to reproduce:
1) Import CSV file testUNID.zip
2) Set encoding to UTF-8
3) Open project
4) Click "Edit column -> Add column based on this column"
5) In "New column name" type "Foo"
6) In "Expression" type cells["UNID"].value

Value is not loaded because the column name is prepended with a Byte Order Mark (BOM).
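The failure above comes down to a string comparison: the importer keeps the BOM (U+FEFF) in the parsed header, so the stored column name no longer equals the visible name "UNID". A minimal illustrative Java sketch (class and method names are hypothetical, not OpenRefine code):

```java
public class BomLookup {
    // A column lookup is ultimately an exact string match on the header.
    public static boolean matches(String parsedHeader, String requested) {
        return parsedHeader.equals(requested);
    }

    public static void main(String[] args) {
        String parsed = "\uFEFF" + "UNID"; // what the importer produced
        System.out.println(matches(parsed, "UNID"));                    // false
        System.out.println(matches(parsed.replace("\uFEFF", ""), "UNID")); // true
    }
}
```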

Labels: CSV/TSV, bug, import, Medium

All 19 comments

I have no problem with this file and this GREL formula.

[screenshot: GREL formula working, 2017-08-31]

wrong file - sorry. Try this one: testUNID.zip

I believe it's because UTF-8 with BOM

> I believe it's because UTF-8 with BOM

Yep, there is an invisible character just before the word UNID, so cells['UNID'].value doesn't work. If you rename the column UNID, it works.

But it's rather a bug.
The user shouldn't have to know the details of the encoding and shouldn't be forced to change the column name.

@eximius313 Did OpenRefine guess the correct encoding in the importer options at the bottom of the Preview as UTF-8 with BOM, or just UTF-8? Did you adjust or change the encoding options at the bottom of the importer Preview?

Just investigating this for the moment

I tested the supplied file with and without UTF-8 set on import and found the same result both ways. Potentially related to a known issue in Java which will not be fixed: http://bugs.java.com/view_bug.do?bug_id=4508058

Potentially we could use Apache Commons IO (http://commons.apache.org/proper/commons-io/javadocs/api-release/index.html) to detect the BOM and handle it appropriately.
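Commons IO's BOMInputStream wraps exactly this detect-and-skip logic. For illustration, here is a stdlib-only sketch of the same technique for the UTF-8 case (hypothetical helper, not OpenRefine's actual importer code): peek at the first three bytes and push them back if they are not the UTF-8 BOM (EF BB BF).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

public class Utf8BomStripper {
    // Returns a stream with a leading UTF-8 BOM (EF BB BF) removed, if present.
    public static InputStream strip(InputStream in) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = pb.readNBytes(head, 0, 3);
        boolean bom = n == 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF;
        if (!bom && n > 0) pb.unread(head, 0, n); // not a BOM: push bytes back
        return pb;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "\uFEFFUNID".getBytes(StandardCharsets.UTF_8);
        InputStream in = strip(new ByteArrayInputStream(data));
        System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8)); // UNID
    }
}
```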

I don't think there is much we can do here other than escaping a column name.
Perhaps something like

\u00A0UNID

Would that help you catch something next time @eximius313 ?

Maybe two aspects to this:

1) Displaying any special characters in Column names - this might overlap with #1286 ?
2) Making sure BOM is never treated as part of the first column name on import?

I've not looked at the code, but it feels like the latter could be a relatively easy thing to check for on import and fix?

@ostephens
RE 1 - No it won't overlap with #1286 since that will be a CSS style applied, not a value replacement. For this issue, I would rather do a value replacement by just escaping hidden characters on a created column name as in my example and then the user can see clearly and can rename it.

RE 2 No, that can cause a few other problems. We try really hard to give the users all their data that they import, even on Column names. They might want to clean up their source generation or whatever, so let's be a good citizen and just inform them visually, casually. So let's just give them their data...but slightly tweak for hidden characters in Column names. Once #1286 lands then they also will see hidden characters in cell values as well. But for column names, let's just escape text, instead of CSS styling.

@thadguidry how about making sure the BOM is never treated as part of the first column name on import, plus a proper warning that it happened?
I don't see any scenario why I would want to have invalid characters in my column names

@eximius313 "invalid" means different things to different people and machines. That's why.

Maybe someone before thought it was nice to have a black heart because they LOVE something about a column name. http://graphemica.com/%E2%9D%A4

But that black heart will actually display. The issue is when things are hidden and do not display.

I am suggesting that it is a bad practice to hide things from users...when they are hidden...you don't know or cannot see if it has an impact. So this is more about showing users those things that are hidden.

The best way to do that in any programming language is to un-hide those characters, and that's always done with _escaping_ into displayable character sequences, like \u00A0 or whatever.
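The escaping idea suggested above could look something like the following sketch (illustrative only; the class, method name, and choice of which characters count as "hidden" are assumptions): invisible characters in a column name are replaced with a visible \uXXXX escape so the user can see and rename them.

```java
public class HiddenCharEscaper {
    // Replace invisible characters (Unicode format chars such as the BOM
    // U+FEFF, plus the no-break space U+00A0) with a visible \uXXXX escape.
    public static String escapeHidden(String name) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            boolean hidden = Character.getType(c) == Character.FORMAT
                    || c == '\u00A0';
            if (hidden) sb.append(String.format("\\u%04X", (int) c));
            else sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeHidden("\uFEFF" + "UNID")); // \uFEFFUNID
        System.out.println(escapeHidden("UNID"));            // UNID
    }
}
```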

Furthermore, the "UTF-8 with BOM" option is missing inside the "Select Encoding" window in the "Configuring Parsing Options" section (see screenshot).

I think this is the first thing to add.

os: Windows 10 64bit | OpenRefine: v2.8 and v3.4 beta | java: JRE 1.8.0_251 x64
[screenshot: Select Encoding window]

@iomicifikko Sounds great! Can you add all that to a brand new issue for us? Thanks!

It seems highly unlikely that users want a BOM in their data, so I'm not sure we need an option to control this. The comment from 2017 by @ostephens is right on the money. https://github.com/OpenRefine/OpenRefine/issues/1241#issuecomment-335403657

We should use Apache Commons IO's BOMInputStream to strip it for the UTF cases (and take it into account for encoding guessing, if we don't do so already).

> It seems highly unlikely that users want a BOM in their data, so I'm not sure we need an option to control this. The comment from 2017 by @ostephens is right on the money. #1241 (comment)

True

> We should use commons.apache.org.BOMInputStream to strip it for the UTF cases (and take it into account for encoding guessing, if we don't do so already).

Encoding guessing is great, but I don't know whether there is any chance of the software making the wrong choice.
OpenRefine lets the user choose between, for instance, x-UTF-16LE-BOM and UTF-16LE during parsing configuration, and the same for UTF-32. But there is no similar option for UTF-8.
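The BOM-based part of encoding guessing mentioned above can be sketched as a simple prefix check (illustrative helper, not OpenRefine's importer code). One subtlety: the UTF-32LE BOM (FF FE 00 00) starts with the UTF-16LE BOM (FF FE), so UTF-32 must be tested first.

```java
public class BomSniffer {
    // Guess a charset name from leading BOM bytes; returns null if no BOM.
    public static String sniff(byte[] head) {
        if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF)
            return "UTF-8";
        // UTF-32 BOMs must be checked before UTF-16 (FF FE is a prefix of FF FE 00 00).
        if (head.length >= 4 && (head[0] & 0xFF) == 0xFF
                && (head[1] & 0xFF) == 0xFE && head[2] == 0 && head[3] == 0)
            return "UTF-32LE";
        if (head.length >= 4 && head[0] == 0 && head[1] == 0
                && (head[2] & 0xFF) == 0xFE && (head[3] & 0xFF) == 0xFF)
            return "UTF-32BE";
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE)
            return "UTF-16LE";
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF)
            return "UTF-16BE";
        return null;
    }

    public static void main(String[] args) {
        System.out.println(sniff(new byte[]{(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'U'})); // UTF-8
        System.out.println(sniff(new byte[]{'U', 'N', 'I', 'D'}));                         // null
    }
}
```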

@thadguidry : thanks for the link but maybe it is better to wait a little longer, the discussion is still evolving.

@iomicifikko Sure. But in this issue, can you tell us if you have the same issue as the original poster, or a slightly different one? It would still be nice to know exactly what your problem is, which wasn't apparent in your comments above.

You're right, sorry. Same problem as the original poster.

@iomicifikko That list of encodings comes from Java and is the complete list that they support. It does not include UTF-8 with BOM, because that's not something they support. BTW, x-UTF-16LE-BOM is redundant with UTF-16, which allows for a BOM.

The reason that Java doesn't support UTF-8 with BOM is that the BOM isn't recommended in that case, because UTF-8 is defined on a byte level and doesn't change with byte order. Unfortunately, a major platform (cough, Windows, cough) decided that its bundled text editor was going to go around polluting files with this, so, as a practical matter, apps need to deal with it.

Any UTF-8, UTF-16, or UTF-32 file that begins with a BOM should always have it removed. For UTF-16 and UTF-32, Java will do that automatically. For UTF-8, we'll implement that ourselves using Apache Commons.
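The asymmetry described above is easy to demonstrate: Java's UTF-16 decoder consumes a leading BOM, while the UTF-8 decoder passes it through as U+FEFF (a small self-contained check, not OpenRefine code):

```java
import java.nio.charset.Charset;

public class BomDecoding {
    static String decode(byte[] bytes, String charset) {
        return new String(bytes, Charset.forName(charset));
    }

    public static void main(String[] args) {
        byte[] utf8  = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'A'}; // BOM + "A"
        byte[] utf16 = {(byte) 0xFE, (byte) 0xFF, 0, 'A'};           // BOM + "A"
        // UTF-8 decoding keeps the BOM as U+FEFF; UTF-16 decoding consumes it.
        System.out.println(decode(utf8, "UTF-8").length());   // 2 ("\uFEFFA")
        System.out.println(decode(utf16, "UTF-16").length()); // 1 ("A")
    }
}
```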

@tfmorris thanks for the explanation, all clear.
