Hello
Could you please add an option to fwrite to choose the character encoding, such as UTF-8?
I already know that fread has that option.
I'm having encoding issues on Windows. Windows encoding is a real pain.
+1
I have the same issue on Windows. It would be great to have an encoding option in fwrite.
I have the same problem when I export a data.table.
If I use fwrite, the string Côte d'Ivoire becomes Cأ´te d'Ivoire, but if I use write.csv it works perfectly.
Thanks
Any progress on this issue? I think it's necessary for the Windows platform.
+1
I also have the same encoding problem on Windows. I hope it can be solved.
Hi @mattdowle, any progress on this issue?
Yes, please! fread can handle encodings, but if the file is saved with fwrite, Latin characters in column names get mangled, and the one solution I have found is to re-encode the column names with iconv, which is a rather brittle workaround.
If I fread with UTF-8 the contents are right, but the headers are not. If I fread with latin1, the headers are right, but the contents are not.
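For reference, a minimal sketch of that brittle iconv workaround (the file name, and the assumption that the header comes out latin1 while the body is UTF-8, are mine, not from the thread):
library(data.table)
# read the body as UTF-8; the header may still carry the wrong encoding
DT <- fread("some_file.csv", encoding = "UTF-8")
# brittle fix: re-encode the column names from latin1 to UTF-8
setnames(DT, iconv(names(DT), from = "latin1", to = "UTF-8"))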
I am also having this issue.
I also have this encoding problem; I hope an option can be added.
Any progress on this?
@skanskan @JhossePaul @kzmlbyrk @msgoussi @amjiuzi @EspenRosenquist @igorstorm @y41u42002 @nesscx @pachamaltese @elisendavila @lz1nwm @kongdd @byapparov @dpprdan @AndrewsOR @alexeyknorre @bendae19 @BastienFR @bobSpacewalk @szugat @alexiaaslan @yaakovfeldman @MathieuMarauri @franknarf1 @themeo @rsaporta @kuzmenkov111 @lucasmation @sindribaldur
Could anyone please offer a reproducible example? It's easier to work through a solution with one in hand. Thanks.
Is the problem that fwrite is emitting non-UTF-8 files? Is there any problem with fwrite writing _only_ UTF-8 files?
Non-UTF-8 files are a nightmare to deal with... if we can avoid it, I'd prefer that every file written by fwrite is UTF-8 (or maybe UTF-16), rather than increase the chances that fwrite contributes to the data headaches of downstream users dealing with hard-to-read files in obscure encodings.
I haven't used it for some time, but writing just UTF-8 would be good for me.
Though maybe other old-fashioned users prefer ASCII.
Yes. UTF-8 by default is better for everyone.
@MichaelChirico you can use the datos package from CRAN and try to save datos::encuesta, which has many columns with characters such as \u00e9.
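As a sketch, that reproducible example might look like this (assuming the datos package is installed; the output path is arbitrary):
library(data.table)
# datos::encuesta contains accented characters such as é in its values
fwrite(datos::encuesta, "encuesta.csv")
# re-read and compare with the original to spot mangled accents
fread("encuesta.csv")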
Personally, I don't think having "UTF-8" by default is a good idea. Everybody here understands enough about encoding to use it, but most people do not. I'm thinking of my French fellows who will fwrite a file and try to read it again, only to see all their accents become:
           v1 v2
 1:    Parler  1
 2:  français  2
 3:         Ã  3
 4:    Québec  4
 5:      dans  5
 6:        un  6
 7:     hôtel  7
 8:     amène  8
 9:       son  9
10:       lot 10
11:        de 11
12: problèmes 12
They will have to troubleshoot it, which is bad for beginners. Most people want the encoding to default to their system's. Few actually use both Windows and Linux, or share files between Linux and Windows. What we really need is an option that lets us choose at least between the system encoding and UTF-8.
For the reproducible example, I created something basic:
# a table that would cause trouble for encoding
dd <- data.frame(v1 = c("Parler", "français", "à", "Québec", "dans", "un",
                        "hôtel", "amène", "son", "lot", "de", "problèmes"),
                 v2 = 1:12)
library(data.table)
# the basic fwrite
fwrite(dd, "desktop/crap/ex_fwrite_win.csv")
# the equivalent base R
write.table(dd, "desktop/crap/ex_writetable_win.csv",
            sep = ";", col.names = TRUE, row.names = FALSE, quote = FALSE)
# the base R behaviour that we want in fwrite
write.table(dd, "desktop/crap/ex_writetable_win_utf8.csv",
            sep = ";", col.names = TRUE, row.names = FALSE, quote = FALSE,
            fileEncoding = "UTF-8")
# what would happen to somebody who is clueless about encoding
# if the default fwrite encoding were UTF-8
fread("desktop/crap/ex_writetable_win_utf8.csv")
@BastienFR, you might find this blog post by Yihui Xie useful:
https://yihui.name/en/2018/11/biggest-regret-knitr/
@tdeenes, thanks, I already knew about that blog post! Yihui Xie makes really good points and I agree with him. Don't get me wrong, I really prefer UTF-8 and everything I do now is in UTF-8. However, having worked with R for almost 18 years, I have to (sadly) say that I only started understanding encoding and managing it properly a couple of years ago... I used to use Notepad's search-and-replace to fix my problems!
So I feel for people who use R just a little and don't know about it. If I take base R as a reference, I think its way of doing things is totally fine, and I doubt R core has the same regrets as Yihui Xie.
What we need is an option. R is great because you have options and flexibility.
@BastienFR, I am not against having an option for character encoding. I just want to say that such an extra parameter in fwrite should certainly default to UTF-8. As someone coming from a non-standard orthography (Hungarian, with our special ű), using Linux, and often facing data files created by encoding-unconscious Windows users, I would like to push those users towards using proper character encoding instead of making it easier for them to follow a bad habit.
In my case UTF-8 is not the default of my system (Windows); it's just the default of my RStudio settings, and I want it like that to avoid problems when sharing files with workmates. Each of us uses a different operating system, and we have agreed on using UTF-8.
In other situations you may not know what encoding was used, and then you need to try several until you find the right one.
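A hedged sketch of that trial-and-error approach (the file name and candidate list are mine; fread's encoding argument only accepts "unknown", "UTF-8" and "Latin-1"):
library(data.table)
# try each candidate and eyeball the first few rows
for (enc in c("UTF-8", "Latin-1", "unknown")) {
  cat("encoding =", enc, "\n")
  print(fread("mystery.csv", encoding = enc, nrows = 5))
}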
It seems to me the pushback to enforcing fwrite->UTF-8 is actually an issue for fread more than fwrite.
fread's encoding option currently defaults to 'unknown'. Automatic encoding detection is hard (see Zawgyi) & almost certainly beyond the scope of data.table... I don't know enough about the issue to say whether it's possible to detect _a specific encoding_ with any generality.
I'm not sure if it makes sense to say that eventually we'd set encoding='UTF-8' by default _in fread_, but AIUI that would make it easier for fwrite to _always_ write UTF-8 (no option).
As I see it, fread has the much harder job because it has to take data from any source, any program, maybe entered manually in Notepad on Windows in Naypyitaw, and turn it into rectangular data. For fwrite the data is already rectangular. It's in R, so base+data.table has already handled most of the details of regularizing. We have full control over what the final product looks like, without much (if any) guesswork.
As it stands, I guess it's inevitable we'll have an encoding parameter for fwrite, but the default will certainly be UTF-8. data.table is a drop in the ocean of _all_ CSVs written around the world, but I still think it's irresponsible to make it easy to write non-UTF-8 files.
Perhaps we could force users specifying encoding != 'UTF-8' to solve a Project Euler question first 😛
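In the meantime, a minimal sketch of forcing UTF-8 output today (fwrite writes string bytes as-is, so converting everything to UTF-8 first is an assumption-laden workaround, not an official API; dd is the data frame from the reproducible example above):
library(data.table)
dt <- as.data.table(dd)
# convert character columns and column names to UTF-8 before writing
chr_cols <- names(dt)[vapply(dt, is.character, logical(1))]
dt[, (chr_cols) := lapply(.SD, enc2utf8), .SDcols = chr_cols]
setnames(dt, enc2utf8(names(dt)))
fwrite(dt, "out_utf8.csv")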
@skanskan I use Windows every day. I know it's easier to view native-encoding csv files on Windows, especially since people tend to use Excel to preview csv files. However, I still support writing UTF-8 csv files on Windows whenever possible. Non-UTF-8 files cause much more pain in the long term than the small easier-to-preview convenience is worth. Besides, by adding a BOM you can view UTF-8 CSV files correctly in Excel; see
readr::write_excel_csv().
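For example (a hedged sketch; the BOM bytes are the standard UTF-8 signature, but the file paths are arbitrary and this assumes the strings in dd are already UTF-8 in R):
library(data.table)
# readr's helper writes a UTF-8 CSV with a BOM so Excel detects it
readr::write_excel_csv(dd, "for_excel.csv")
# roughly equivalent by hand: write the three BOM bytes, then append
con <- file("for_excel2.csv", open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
close(con)
fwrite(dd, "for_excel2.csv", append = TRUE, col.names = TRUE)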
But I agree that people who are not familiar with encoding issues may prefer the native encoding. So one idea is to have fread() and fwrite() use the native encoding by default, with two new functions fread8() and fwrite8() to read/write UTF-8.