OpenRefine: HTML/CSV export corrupts UTF-8 characters outside the Basic Multilingual Plane (BMP), i.e. code points of U+10000 and above

Created on 6 Jun 2017  ·  27 Comments  ·  Source: OpenRefine/OpenRefine

OpenRefine 2.7 rc2

After importing a UTF-8 file and exporting it again as UTF-8, some characters are garbled.

Characters as displayed in OpenRefine:
有限会社なべ茶屋あさ𡌛

Characters after export (garbled):
有限会社なべ茶屋あさ����

Other characters that are garbled on export:
𣘺𣳾

CSTSV bug encoding export import High

All 27 comments

@Yuutakasan Hmm, that's weird. What is interesting is that the export shows 4 question marks for just 1 character. Note that the character itself is 4 bytes in UTF-8 (the maximum UTF-8 uses for a single code point); the dump below is 6 bytes long only because it includes a leading and a trailing newline (\x0A).

𡌛 = \x0A\xF0\xA1\x8C\x9B\x0A

Further interesting is that when I copy and paste your last character into a single OpenRefine cell, I actually get a different character...

ጛ = \xE1\x8C\x9B

instead of

𡌛 = \xF0\xA1\x8C\x9B
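
The byte sequences quoted above can be checked directly. The following is a minimal Python sketch (not part of OpenRefine) that encodes both characters as UTF-8. It also shows one arithmetic relationship between them: keeping only the low 16 bits of U+2131B gives U+131B. Whether such a truncation is what actually happened on paste is an assumption; nothing in this thread confirms it.

```python
# Encode the two characters from the comment above as UTF-8.
original = "\U0002131B"   # the expected character, outside the BMP
pasted = "\u131B"         # the character that appeared after pasting

original_bytes = original.encode("utf-8")
pasted_bytes = pasted.encode("utf-8")

print(original_bytes.hex(" "))  # f0 a1 8c 9b
print(pasted_bytes.hex(" "))    # e1 8c 9b

# Keeping only the low 16 bits of the code point maps one onto the other.
# (That this explains the paste behaviour is an assumption.)
truncated = chr(ord(original) & 0xFFFF)
print(truncated == pasted)      # True
```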

@jackyq2015 Can you debug this?

I will attach a sample file for reference.

import file
import.txt
export file
export.txt

These characters are actually used in the names of corporations registered in Japan:

有限会社なべ茶屋あさ𡌛
株式会社石𣘺組
有限会社𣳾新商事
𣳾幸1合同会社

@Yuutakasan When I export your import.txt file, I get:

有限会社なべ茶屋あさ𡌛

You are probably not using a viewer like Notepad++ or similar that can show that last character as being \xED\xA1\x84\xED\xBC\x9B?

But regardless, it's a bug somewhere, because during export we somehow change the bytes

from
\xF0\xA1\x8C\x9B

to
\xED\xA1\x84\xED\xBC\x9B
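
The six bytes seen after export are what results when the two UTF-16 surrogates of the character are each encoded as if they were separate code points, instead of being combined first. The Python sketch below reproduces this arithmetic. It illustrates the byte pattern only and makes no claim about where in OpenRefine's export code the conversion happens; the helper name `to_surrogate_pair` is hypothetical.

```python
def to_surrogate_pair(code_point):
    """Split a code point >= U+10000 into UTF-16 high and low surrogates."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return high, low


high, low = to_surrogate_pair(0x2131B)
print(hex(high), hex(low))  # 0xd844 0xdf1b

# Correct UTF-8: the whole code point is encoded as one 4-byte sequence.
correct = chr(0x2131B).encode("utf-8")

# Broken output: each surrogate is encoded on its own as a 3-byte sequence.
# "surrogatepass" lets Python emit these otherwise invalid bytes.
broken = (chr(high) + chr(low)).encode("utf-8", errors="surrogatepass")

print(correct.hex(" "))  # f0 a1 8c 9b
print(broken.hex(" "))   # ed a1 84 ed bc 9b
```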

I currently use EmEditor; I will try Notepad++.
EmEditor
https://www.emeditor.com/

I was able to reproduce the same phenomenon.

Thank you, @thadguidry.
I tested with characters from several code point ranges.
It seems that any character with a code point of U+10000 or higher (that is, outside the BMP) is garbled.

Import file:
import-test-sample.txt

Export file:
export-test-sample-txt.txt
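
To see which characters in a value fall in the affected range, the code points can be compared against U+FFFF, the last code point of the BMP. This is a small Python sketch with a hypothetical helper name, `non_bmp_chars`; the sample strings are the company names quoted earlier in this thread.

```python
def non_bmp_chars(text):
    """Return (character, 'U+XXXX') pairs for code points above U+FFFF."""
    return [(ch, f"U+{ord(ch):04X}") for ch in text if ord(ch) > 0xFFFF]


names = [
    "有限会社なべ茶屋あさ\U0002131B",  # last character written as an escape
    "株式会社石𣘺組",
    "有限会社𣳾新商事",
]
for name in names:
    # Prints each name followed by the characters that lie outside the BMP.
    print(name, non_bmp_chars(name))
```

Characters reported by this helper are stored as surrogate pairs in UTF-16 based runtimes such as Java, which is why they are the ones worth checking after an export.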

@Yuutakasan Thanks, we'll have to let @jackyq2015 look into this specifically. My hunch is that we might not actually be storing it correctly in the cell, so this line https://github.com/OpenRefine/OpenRefine/blob/master/main/src/com/google/refine/exporters/CsvExporter.java#L108 might be giving back the wrong data in the first place. Otherwise it's an issue in csvwriter itself, here: https://github.com/OpenRefine/OpenRefine/blob/master/main/src/com/google/refine/exporters/CsvExporter.java#L114

Can you try adding -Dfile.encoding=UTF-8 to the Java command options to enforce the encoding?

OK. I will try it!

@jackyq2015
Can I test this by adding the "-Dfile.encoding=UTF-8" setting to the openrefine.l4j.ini file?

@Yuutakasan Yes, but you can also test it by adding it to the refine.ini file and starting refine.bat, or refine.sh if you're on Linux. Just uncomment the JAVA_OPTIONS= line.

Since there was no change, I was worried about whether the setting had really taken effect.
I will run it once more and share the result later.

@Yuutakasan Given your description, your file is not being decoded properly as UTF-8; that's why I asked you to enforce it. Please note that the system cannot detect the encoding of an arbitrary stream with 100% accuracy. Libraries such as icu4j can help improve the accuracy. There is actually a PR (not merged yet) that introduces it. If you want to get your hands dirty, you can create your own branch, merge the PR into it, and give it a try.

@jackyq2015 @thadguidry
Sorry for the late reply. I tried the settings you suggested the other day.

  1. I downloaded OpenRefine 2.8.
  2. I changed the settings as follows.
    openrefine.l4j.zip
  3. I imported the data that had been garbled before.
    import-test-sample.zip
    It is displayed correctly.
  4. The exported file is garbled.
    export-test-sample-txt.zip

I think this is a string conversion error at export time, not an encoding detection bug at import time.
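
One way to check the exported bytes independently of any editor is to look for UTF-8-encoded surrogates (lead byte 0xED followed by a byte in the range 0xA0 to 0xBF), which never occur in valid UTF-8. The following Python sketch does this. The function name and the sample byte strings are illustrative; for a real check, the contents of the actual exported file would be passed in instead.

```python
import re

# A surrogate (U+D800..U+DFFF) encoded as 3 bytes: ED, then A0..BF, then 80..BF.
ENCODED_SURROGATE = re.compile(rb"\xed[\xa0-\xbf][\x80-\xbf]")


def find_encoded_surrogates(data):
    """Return the byte offsets at which an encoded surrogate starts."""
    return [m.start() for m in ENCODED_SURROGATE.finditer(data)]


# Valid UTF-8 for two hiragana characters followed by U+2131B.
good = "あさ\U0002131B".encode("utf-8")
# The same text with the last character written as two encoded surrogates,
# using the byte sequence reported earlier in this thread.
bad = "あさ".encode("utf-8") + b"\xed\xa1\x84\xed\xbc\x9b"

print(find_encoded_surrogates(good))  # []
print(find_encoded_surrogates(bad))   # [6, 9]

# For a real file (the path is a placeholder):
# with open("export.txt", "rb") as f:
#     print(find_encoded_surrogates(f.read()))
```

If the imported file yields an empty list and the exported file does not, that supports the view that the damage is introduced at export time.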

Can you please add the encoding switch I provided above and try again?

@jackyq2015
The test above was run with the following settings.

configfile
https://github.com/OpenRefine/OpenRefine/files/1492661/openrefine.l4j.zip

openrefine.l4j.ini
***********************
-Xms256M
-Xmx1024M
-Djava.net.useSystemProxies=true
-Dfile.encoding="UTF-8"

Hopefully this has been fixed, but we should confirm for the 3.4 release.

From my limited testing, it looks like XLSX export is OK (at least for Numbers on my Mac), CSV is totally corrupted, and HTML broken for the higher code points as shown above.

@Yuutakasan Sorry for the long delay. The fix for this should make it into 3.4.

Thank you. I'll test it.

@tfmorris
I've tested it, and it still seems to produce garbled text.
@wetneb Could you please reopen this issue?

Using OpenRefine 3.4 beta

  1. Import import-test-sample.txt (no problems, same as before).
    import-test-sample.txt
  2. Configure parsing options (no problems, same as before).
  3. Create project (no problems, same as before).
  4. Export (garbled characters in some formats).
    Previously I tested only HTML and CSV, but this time I also tested the other formats.
    export openrefine 3.4.zip

① Export TSV (garbled)

② Export CSV (garbled)

③ Export HTML (garbled, both in a text editor and in Chrome)

④ Export Excel (not garbled)

⑤ Export Excel 2007+ (not garbled)

⑥ Export ODF spreadsheet (not garbled)

⑦ Export SQL (not garbled)

⑧ Export spreadsheet (not garbled)
https://docs.google.com/spreadsheets/d/12zaOy_Mh9d-85Cv7pVXYV_TTc9gJJALR8gbXi-aAvuQ/edit?usp=sharing

@Yuutakasan this has not been fixed in 3.4 beta - that version was released before this fix. For a version that we expect not to have the issue, try this one:
https://github.com/OpenRefine/OpenRefine-nightly-releases/releases/tag/3.4-beta-148-gf88c0e3

@wetneb Thanks. I'll re-test.

@wetneb @tfmorris
I have confirmed that the garbling has been resolved. Thank you very much.

Using openrefine-win-3.4-beta-148-gf88c0e3
https://github.com/OpenRefine/OpenRefine-nightly-releases/releases/tag/3.4-beta-148-gf88c0e3

export openrefine 3.4-beta-148-gf88c0e3.zip

Formats tested:
export tsv
export csv
export html
export excel
export excel2007+
export odf
export spreadsheet
https://docs.google.com/spreadsheets/d/1TFTVPR-H-CDrGDC7yw252qc3QoBmOJVS9Uztu9VD7Eo/edit?usp=sharing

Excellent. Thank you very much for testing @Yuutakasan
