OpenRefine 2.7 rc2
After reading a UTF-8 file and exporting it as a UTF-8 file, the characters came out garbled.
displayed characters
有限会社なべ茶屋あさ𡌛
Exported garbled characters
有限会社なべ茶屋あさ����
Other sample characters that are garbled on export
𣘺𣳾
@Yuutakasan Hmm, that's weird. What is interesting is that the export shows 4 bytes (4 question marks) for just 1 character. The dump I get for that last character is 6 bytes, but the first and last are \x0A line feeds, so the character itself is 4 bytes (the maximum UTF-8 uses per character).
𡌛 = \x0A\xF0\xA1\x8C\x9B\x0A
Further interesting is that when I copy and paste your last character into a single OpenRefine cell, I actually get a different character...
ጛ = \xE1\x8C\x9B
instead of
𡌛 = \xF0\xA1\x8C\x9B
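For anyone who wants to check these byte sequences, here is a minimal Python sketch. It decodes nothing from OpenRefine itself; it only shows that the two byte strings above correspond to U+2131B and U+131B. The idea that the pasted character is the original with everything above the low 16 bits dropped is an assumption for illustration, not a confirmed diagnosis.

```python
# Characters written as escapes so the snippet does not depend on editor encoding.
original = "\U0002131B"  # the character from the report, outside the BMP
pasted = "\u131B"        # the character that appeared after copy and paste


def utf8_hex(text):
    """Return the UTF-8 bytes of text as \\xNN escapes."""
    return "".join(f"\\x{b:02X}" for b in text.encode("utf-8"))


print(f"U+{ord(original):04X}", utf8_hex(original))
print(f"U+{ord(pasted):04X}", utf8_hex(pasted))

# Assumption: only the low 16 bits of the code point survived the paste.
truncated = chr(ord(original) & 0xFFFF)
print("low 16 bits match pasted character:", truncated == pasted)
```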
@jackyq2015 Can you debug this ?
I will attach a sample file for reference.
import file
import.txt
export file
export.txt
This character is actually used in the name of a corporation registered in Japan.
@Yuutakasan When I export your import.txt file... I get
有限会社なべ茶屋あさ𡌛
You are probably not using a viewer like Notepad++ or similar that can show that last character as being \xED\xA1\x84\xED\xBC\x9B ?
But regardless... it's a bug somewhere, because somehow during export we change the bytes...
from
\xF0\xA1\x8C\x9B
to
\xED\xA1\x84\xED\xBC\x9B
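The 6-byte sequence is what you get when each half of the UTF-16 surrogate pair is encoded to UTF-8 on its own (a CESU-8 style encoding) instead of encoding the whole code point. The Python sketch below reproduces both byte strings; `cesu8_like` is a hypothetical helper written for this illustration, and whether OpenRefine's export path actually does this is an assumption.

```python
def cesu8_like(text):
    """Encode every UTF-16 code unit separately as UTF-8.

    For characters above U+FFFF this yields two 3-byte sequences
    (one per surrogate), which is not valid UTF-8.
    """
    utf16 = text.encode("utf-16-be")
    units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]
    # "surrogatepass" lets Python emit bytes for lone surrogates.
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)


ch = "\U0002131B"
print("correct UTF-8 :", ch.encode("utf-8"))
print("per-surrogate :", cesu8_like(ch))
```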
I currently use EmEditor; I will try using Notepad++.
EmEditor
https://www.emeditor.com/
I was able to reproduce the same phenomenon.
Thank you, @thadguidry.
I tried testing with multiple character codes.
It seems that any character with a code point of U+10000 or above (that is, outside the Basic Multilingual Plane) gets garbled.
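To find which characters in a test file fall into that range, something like the following Python sketch can help. `supplementary_chars` is a hypothetical helper name, and the sample string is the company name from the top of this report.

```python
def supplementary_chars(text):
    """Return (character, code point label) for every character at or above U+10000."""
    return [(ch, f"U+{ord(ch):04X}") for ch in text if ord(ch) >= 0x10000]


# Company name from the report; the last character is written as an escape.
sample = "有限会社なべ茶屋あさ\U0002131B"
for ch, label in supplementary_chars(sample):
    print(label, ch)
```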
import txt
import-test-sample.txt
export txt
export-test-sample-txt.txt
@Yuutakasan Thanks, we'll have to let @jackyq2015 look into this specifically. My hunch is that we might not actually be storing it correctly in the cell, and so this https://github.com/OpenRefine/OpenRefine/blob/master/main/src/com/google/refine/exporters/CsvExporter.java#L108 might be giving back the wrong data in the first place. Otherwise it's an issue in csvwriter itself here https://github.com/OpenRefine/OpenRefine/blob/master/main/src/com/google/refine/exporters/CsvExporter.java#L114
Can you try adding -Dfile.encoding=UTF-8 to the java command options to enforce the encoding?
OK. I will try it!
@jackyq2015
Can I test by adding "-Dfile.encoding = UTF-8" setting to the openrefine.l4j.ini file?
@Yuutakasan Yes, but you can also test it by adding it to the refine.ini file and starting refine.bat, or refine.sh if you're on Linux. Just uncomment the JAVA_OPTIONS= line.
Since nothing visibly changed, I was worried whether the setting had really taken effect.
I have run it once; I will share the result later.
@Yuutakasan Given your description, your file is not being decoded properly as UTF-8; that's why I asked you to enforce the encoding. Please note that no system can detect the encoding of an arbitrary byte stream with 100% accuracy. Libraries such as icu4j can help improve the accuracy, and there is a PR (not merged yet) that introduces it. If you want to get your hands dirty, you can create your own branch, merge the PR into it, and give it a try.
@jackyq2015 @thadguidry
Sorry for being late. I tried the settings I got the other day.
It is displayed normally
①
②
4. The exported file is garbled.
export-test-sample-txt.zip
I think this is a string conversion mistake at export time, not an encoding detection bug at import time.
Other export pattern
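One way to check an exported file for this kind of damage without relying on what an editor displays is to scan the raw bytes for UTF-8-encoded surrogates: in valid UTF-8 the byte \xED is never followed by a byte in the range \xA0 to \xBF. A rough Python sketch, with `find_encoded_surrogates` as a hypothetical helper name:

```python
import re

# \xED followed by \xA0-\xBF encodes U+D800..U+DFFF, which valid UTF-8 never contains.
SURROGATE_BYTES = re.compile(rb"\xED[\xA0-\xBF][\x80-\xBF]")


def find_encoded_surrogates(data):
    """Return the byte offsets of every encoded surrogate found in data."""
    return [m.start() for m in SURROGATE_BYTES.finditer(data)]


good = "あさ\U0002131B".encode("utf-8")
bad = "あさ".encode("utf-8") + b"\xED\xA1\x84\xED\xBC\x9B"
print("good export:", find_encoded_surrogates(good))
print("bad export :", find_encoded_surrogates(bad))
```

To run it on a real export, read the file in binary mode (for example `open(path, "rb").read()`) and pass the bytes in; a non-empty result suggests the corruption happened when the file was written.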
Excel
import-test-sample-xlsx.zip
Can you please add the encoding switch I provided above and try again?
@jackyq2015
The above processing is executed with the following settings.
configfile
https://github.com/OpenRefine/OpenRefine/files/1492661/openrefine.l4j.zip
openrefine.l4j.ini
***********************
-Xms256M
-Xmx1024M
-Djava.net.useSystemProxies=true
-Dfile.encoding="UTF-8"
Hopefully this has been fixed, but we should confirm for the 3.4 release.
From my limited testing, it looks like XLSX export is OK (at least for Numbers on my Mac), CSV is totally corrupted, and HTML broken for the higher code points as shown above.
@Yuutakasan Sorry for the long delay. The fix for this should make it into 3.4.
Thank you. I'll test.
@tfmorris
I've tested it and it still seems to cause garbled text.
@wetneb Could you please reopen this issue?
Using OpenRefine 3.4 beta
1. Import import-test-sample.txt (imports without problems, as before)
import-test-sample.txt
①export tsv (Garbled characters)
②export csv (Garbled characters)
③export html (Garbled characters)
View in a text editor
View in Chrome
④export excel (NOT Garbled characters)
⑤export excel2007+ (NOT Garbled characters)
⑥export ODF SpreadSheet (NOT Garbled characters)
⑦export SQL (NOT Garbled characters)
⑧export SpreadSheet (NOT Garbled characters)
https://docs.google.com/spreadsheets/d/12zaOy_Mh9d-85Cv7pVXYV_TTc9gJJALR8gbXi-aAvuQ/edit?usp=sharing
@Yuutakasan this has not been fixed in 3.4 beta - that version was released before this fix. For a version that we expect not to have the issue, try this one:
https://github.com/OpenRefine/OpenRefine-nightly-releases/releases/tag/3.4-beta-148-gf88c0e3
@wetneb Thanks. I'll re-test.
@wetneb @tfmorris
I have confirmed that the garbling has been resolved. Thank you very much.
Using openrefine-win-3.4-beta-148-gf88c0e3
https://github.com/OpenRefine/OpenRefine-nightly-releases/releases/tag/3.4-beta-148-gf88c0e3
export openrefine 3.4-beta-148-gf88c0e3.zip
export tsv
export csv
export html
export excel
export excel2007+
export odf
export spreadsheet
https://docs.google.com/spreadsheets/d/1TFTVPR-H-CDrGDC7yw252qc3QoBmOJVS9Uztu9VD7Eo/edit?usp=sharing
Excellent. Thank you very much for testing @Yuutakasan