OpenRefine messes up encoding when exporting as CSV

Created on 2 Jan 2015  ·  15 Comments  ·  Source: OpenRefine/OpenRefine

I am able to import UTF-8 encoded French text into OpenRefine. When exporting as CSV, OpenRefine adds garbage to the exported file. Exporting as TSV or Excel works fine, but CSV produces garbage characters.

CSTSV bug encoding export import

All 15 comments

If you have a chance to describe your workaround or misunderstanding, that might help the next user with the same issue...

I think this encoding issue still persists. I thought I could fix it by changing the encoding to UTF-8. I can contain the issue with a cell-level transformation that replaces the garbage characters. Out of 400k rows, about 150 rows always come out with garbage characters when I export to CSV. I can share the data if you would like to see it.

David

I am having the same issues. Is there any update on this bug? Data seems to export correctly to Excel but there is the limit of 65536 rows.

@patd0000 We update the issue tracker whenever there's a change in status, so there is no new news. It's likely to be an easy fix, but someone needs to find the time to look at it. Pull requests accepted, of course. :-)

Thanks Tom. I wonder how anyone can process international data if the character sets don't work.

Added the new tag "encoding" to this issue; it will be addressed together with the other issues carrying that tag in the next release.

~Hi, I'm also facing a similar issue. The CSV writer part must be set to the default encoding instead of Unicode. If someone can guide me to where the relevant code lines are, I might be able to do a PR.~
EDIT: My bad. I had to set the encoding to UTF-8 at the file import stage.

@answerquest yes! It would be great if you could make a PR for this.

@answerquest Tom is not working on the project anymore.
CSVWriter is provided by the OpenCSV package (see the import at the top of the file).
By the way the version of this library that we use would benefit from being updated to a newer version - that would be my first step, I think.

@answerquest If you use the straight "comma-separated value" export option, it should default to using UTF-8 encoding - this is set in https://github.com/OpenRefine/OpenRefine/blob/5a0304f3636386af506c47f3c73a38129e478a40/main/src/com/google/refine/commands/project/ExportRowsCommand.java#L104

If the data in your project isn't utf-8 encoded you need to use the 'custom tabular exporter' - in that you can set the encoding you want to use for the export

Obviously if your text is utf-8 encoded but it isn't exporting correctly with the default csv export we need to investigate further - it would be great to get a sample project we can look at to investigate/test with in that case
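To illustrate what "setting the encoding explicitly on export" means, here is a minimal Python sketch. It is not OpenRefine's code (the exporter linked above is Java); the function name `export_csv` and its parameters are hypothetical. The point is only that the writer is handed an explicit encoding and never falls back to the platform default.

```python
import csv
import io


def export_csv(rows, encoding="utf-8"):
    """Serialize rows to CSV bytes using an explicitly chosen encoding.

    Hypothetical helper for illustration; it does not mirror OpenRefine's API.
    """
    buffer = io.BytesIO()
    # newline="" lets the csv module control line endings itself.
    text = io.TextIOWrapper(buffer, encoding=encoding, newline="")
    csv.writer(text).writerows(rows)
    text.flush()
    data = buffer.getvalue()
    text.detach()  # release the buffer without closing it
    return data


if __name__ == "__main__":
    rows = [["name", "char"], ["en dash", "\u2013"], ["ellipsis", "\u2026"]]
    print(export_csv(rows))
```

Passing a different `encoding` value plays the role of the encoding option in the custom tabular exporter: the same rows, serialized to different bytes.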

Hi @ostephens thanks for the workaround suggestions. I've created a test file with the troublesome chars in one column for testing:

name,char
left quote,“
right quote,”
left single quote,‘
right single quote,’
en dash,–
em dash,—
hyphen,-
ellipsis,…

I just loaded it at my end (a Lubuntu OS) and exported as csv and there was no problem in the output. But the people who worked on the earlier file whose screenshot I shared had worked on win10 OS, so I'll have them test it and share the results.

@ostephens OK, cancel all my last posts. It turns out that on Windows we have to set the encoding to UTF-8 when we import the file, otherwise it is read as ANSI. On Linux it was UTF-8 by default. I found this out by running the test file on my colleague's Windows computer. There is no problem with the CSV exporter, so I was barking up the wrong tree, my bad. I'm going to ~strike~ my comments above.
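For anyone hitting the same symptom, the garbage can be reproduced outside OpenRefine. The Python sketch below assumes the Windows "ANSI" code page is Windows-1252 (common on Western-locale systems, but an assumption here) and decodes UTF-8 bytes with it; the helper name `misread_as_cp1252` is made up for this example.

```python
def misread_as_cp1252(text):
    """Encode text as UTF-8, then decode those bytes as Windows-1252.

    This mimics importing a UTF-8 file with the wrong encoding selected.
    Bytes that Windows-1252 leaves undefined become U+FFFD.
    """
    return text.encode("utf-8").decode("cp1252", errors="replace")


if __name__ == "__main__":
    # A few of the characters from the test file posted above.
    samples = {
        "left quote": "\u201c",
        "en dash": "\u2013",
        "ellipsis": "\u2026",
    }
    for name, char in samples.items():
        print(name, repr(char), "->", repr(misread_as_cp1252(char)))
```

Each curly quote or dash turns into a three-character sequence starting with "â€", which matches the kind of garbage reported in this thread. Once the text has been imported that way, exporting it as UTF-8 faithfully preserves the garbage, so the export step looks guilty when it is not.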

@answerquest Oh really? If that is happening on Windows, then that is a bug on our part. We should be defaulting to UTF-8 whenever the user does not pick the encoding on the exporter. We fixed that and made UTF-8 the default a very long time ago.

@ostephens Thoughts on why UTF-8 is not the default across all the OSes we support? Can you find the original issue for that?

I think #2713 fixes the last of the bugs (famous last words) in this space, but it only affected characters outside of the Basic Multilingual Plane (i.e. Unicode code points above U+FFFF). We definitely set the encoding explicitly on output, regardless of operating system, so there shouldn't be an issue on Windows either.
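As a quick illustration of what "outside the Basic Multilingual Plane" means (this does not describe the change made in #2713), the Python sketch below checks whether a character's code point is above U+FFFF and counts how many UTF-16 code units it needs. Java strings are UTF-16, so such characters occupy two `char` values (a surrogate pair). The helper names are hypothetical.

```python
def is_outside_bmp(ch):
    """True if the single character ch has a code point above U+FFFF."""
    return ord(ch) > 0xFFFF


def utf16_units(ch):
    """Number of 16-bit code units needed to encode ch in UTF-16."""
    return len(ch.encode("utf-16-be")) // 2


if __name__ == "__main__":
    for ch in ["\u2014", "\U0001F600"]:  # em dash, an emoji
        print(
            f"U+{ord(ch):04X}",
            "outside BMP:", is_outside_bmp(ch),
            "UTF-16 units:", utf16_units(ch),
            "UTF-8 bytes:", len(ch.encode("utf-8")),
        )
```

The quotes, dashes and ellipsis in the test file above are all inside the BMP, so they would not have been affected by a bug limited to code points above U+FFFF.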

I'm going to close this since it's 5 years old and I believe it's fixed. If anyone comes up with a current example that doesn't work, please create a new issue.
