Hello
Could you please add an option to fwrite to choose the character encoding, such as UTF-8?
I already know that fread has that option.
I'm having encoding issues on Windows. Windows encoding is a real pain.
+1
I have the same issue on Windows. It would be great to have an encoding option in fwrite.
I have the same problem when I export a data.table.
If I use fwrite, the string Côte d'Ivoire becomes Cأ´te d'Ivoire, but if I use write.csv it works perfectly.
Thanks
Any progress on this issue? I think it's necessary for the Windows platform.
+1
I also have the same encoding problem on Windows. I hope it can be solved.
Hi @mattdowle, any progress on this issue?
Yes, please! fread can handle encodings, but if the file is saved with fwrite, Latin characters in column names get mangled, and the one solution I have found is to re-encode the column names with iconv, which is a rather brittle workaround.
If I fread with UTF-8 the contents are right, but the headers are not. If I fread with latin1, the headers are right, but the contents are not.
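For reference, a minimal sketch of that brittle iconv workaround (the file name, and the assumption that the header comes out latin1 while the body is UTF-8, are mine, not from the thread):
library(data.table)
# read the body as UTF-8; the header may still carry the wrong encoding
DT <- fread("some_file.csv", encoding = "UTF-8")
# brittle fix: re-encode the column names from latin1 to UTF-8
setnames(DT, iconv(names(DT), from = "latin1", to = "UTF-8"))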
I am also having this issue.
I also have this encoding problem; I hope an option can be added.
Any progress on this?
@skanskan @JhossePaul @kzmlbyrk @msgoussi @amjiuzi @EspenRosenquist @igorstorm @y41u42002 @nesscx @pachamaltese @elisendavila @lz1nwm @kongdd @byapparov @dpprdan @AndrewsOR @alexeyknorre @bendae19 @BastienFR @bobSpacewalk @szugat @alexiaaslan @yaakovfeldman @MathieuMarauri @franknarf1 @themeo @rsaporta @kuzmenkov111 @lucasmation @sindribaldur
Could anyone please offer a reproducible example? It's easier to work through a solution with one in hand. Thanks.
Is the problem that fwrite is emitting non-UTF-8 files? Is there any problem with fwrite writing _only_ UTF-8 files?
Non-UTF-8 files are a nightmare to deal with... if we can avoid it, I'd prefer that every file written by fwrite is UTF-8 (or maybe UTF-16), rather than increase the chances that fwrite contributes to the data headaches of downstream users dealing with hard-to-read files in obscure encodings.
I haven't used it for some time, but writing just UTF-8 would be good for me.
Though maybe other old-fashioned users prefer ASCII.
Yes. UTF-8 by default is better for everyone.
@MichaelChirico you can use the datos package from CRAN and try to save datos::encuesta, which has many columns with characters such as \u00e9.
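As a sketch, that reproducible example might look like this (assuming the datos package is installed; the output path is arbitrary):
library(data.table)
# datos::encuesta contains accented characters such as é in its values
fwrite(datos::encuesta, "encuesta.csv")
# re-read and compare with the original to spot mangled accents
fread("encuesta.csv")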
Personally, I don't think having "UTF-8" by default is a good idea. Everybody here understands enough about encoding to use it, but most people do not. I'm thinking of my French fellows who will fwrite a file and try to read it again, only to see all their accents become:
           v1 v2
 1:    Parler  1
 2:  français  2
 3:         Ã  3
 4:    Québec  4
 5:      dans  5
 6:        un  6
 7:     hôtel  7
 8:     amène  8
 9:       son  9
10:       lot 10
11:        de 11
12: problèmes 12
They will have to troubleshoot it, which is bad for beginners. Most people want the encoding to default to their system's. Few actually use both Windows and Linux, or share files between Linux and Windows. What we really need is an option that lets us choose at least between the system encoding and UTF-8.
For the reproducible example, I created something basic:
# a table that would cause trouble for encoding
dd <- data.frame(v1 = c("Parler", "français", "à", "Québec", "dans", "un",
                        "hôtel", "amène", "son", "lot", "de", "problèmes"),
                 v2 = 1:12)
library(data.table)
# the basic fwrite
fwrite(dd, "desktop/crap/ex_fwrite_win.csv")
# the equivalent base R
write.table(dd, "desktop/crap/ex_writetable_win.csv",
            sep = ";", col.names = TRUE, row.names = FALSE, quote = FALSE)
# the base R behaviour that we want in fwrite
write.table(dd, "desktop/crap/ex_writetable_win_utf8.csv",
            sep = ";", col.names = TRUE, row.names = FALSE, quote = FALSE,
            fileEncoding = "UTF-8")
# what would happen to somebody who is clueless about encoding
# if the default fwrite encoding were UTF-8
fread("desktop/crap/ex_writetable_win_utf8.csv")
@BastienFR, you might find this blog post by Yihui Xie useful:
https://yihui.name/en/2018/11/biggest-regret-knitr/
@tdeenes, thanks, I already knew about that blog post! Yihui Xie makes really good points and I agree with him. Don't get me wrong, I really prefer UTF-8 and everything I do now is in UTF-8. However, having worked with R for almost 18 years, I have to (sadly) say that I only started understanding encoding and managing it properly a couple of years ago... I used to use Notepad's search-and-replace to fix my problems!
So I feel for people who use R just a little and don't know about it. If I take base R as a reference, I think its way of doing things is totally fine, and I doubt R core has the same regrets as Yihui Xie.
What we need is an option. R is great because you have options and flexibility.
@BastienFR, I am not against having an option for character encoding. I just want to say that such an extra parameter in fwrite should certainly default to UTF-8. As someone coming from a non-standard orthography (Hungarian, with our special ű), using Linux, and often facing data files created by encoding-unconscious Windows users, I would like to push those users towards using proper character encoding instead of making it easier for them to follow a bad habit.
In my case UTF-8 is not the default of my system (Windows); it's just the default of my RStudio settings, and I want it like that to avoid problems when sharing files with workmates. Each of us uses a different operating system, and we have agreed on using UTF-8.
In other situations you may not know what encoding was used, and then you need to try several until you find the right one.
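A hedged sketch of that trial-and-error approach (the file name and candidate list are mine; fread's encoding argument only accepts "unknown", "UTF-8" and "Latin-1"):
library(data.table)
# try each candidate and eyeball the first few rows
for (enc in c("UTF-8", "Latin-1", "unknown")) {
  cat("encoding =", enc, "\n")
  print(fread("mystery.csv", encoding = enc, nrows = 5))
}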
It seems to me the pushback to enforcing fwrite->UTF-8 is actually an issue for fread more than fwrite.
fread's encoding option currently defaults to 'unknown'. Automatic encoding detection is hard (see Zawgyi) & almost certainly beyond the scope of data.table... I don't know enough about the issue to say whether it's possible to detect _a specific encoding_ with any generality.
I'm not sure if it makes sense to say that eventually we'd set encoding='UTF-8' by default _in fread_, but AIUI that would make it easier for fwrite to _always_ write UTF-8 (no option).
As I see it, fread has the much harder job because it has to take data from any source, any program, maybe entered manually in Notepad on Windows in Naypyitaw, and turn it into rectangular data. For fwrite the data is already rectangular. It's in R, so base+data.table has already handled most of the details of regularizing. We have full control over what the final product looks like, without much (if any) guesswork.
As it stands, I guess it's inevitable we'll have an encoding parameter for fwrite, but the default will certainly be UTF-8. data.table is a drop in the ocean of _all_ CSVs written around the world, but I still think it's irresponsible to make it easy to write non-UTF-8 files.
Perhaps we could force users specifying encoding != 'UTF-8' to solve a Project Euler question first 😛
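In the meantime, a minimal sketch of forcing UTF-8 output today (fwrite writes string bytes as-is, so converting everything to UTF-8 first is an assumption-laden workaround, not an official API; dd is the data frame from the reproducible example above):
library(data.table)
dt <- as.data.table(dd)
# convert character columns and column names to UTF-8 before writing
chr_cols <- names(dt)[vapply(dt, is.character, logical(1))]
dt[, (chr_cols) := lapply(.SD, enc2utf8), .SDcols = chr_cols]
setnames(dt, enc2utf8(names(dt)))
fwrite(dt, "out_utf8.csv")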
@skanskan I use Windows every day. I know it's easier to view native-encoding csv files on Windows, especially since people tend to use Excel to preview csv files. However, I still support writing UTF-8 csv files on Windows whenever possible. Non-UTF-8 files cause much more pain in the long term than the small easier-to-preview convenience is worth. Besides, by adding a BOM you can view UTF-8 CSV files correctly in Excel; see
readr::write_excel_csv().
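For example (a hedged sketch; the BOM bytes are the standard UTF-8 signature, but the file paths are arbitrary and this assumes the strings in dd are already UTF-8 in R):
library(data.table)
# readr's helper writes a UTF-8 CSV with a BOM so Excel detects it
readr::write_excel_csv(dd, "for_excel.csv")
# roughly equivalent by hand: write the three BOM bytes, then append
con <- file("for_excel2.csv", open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
close(con)
fwrite(dd, "for_excel2.csv", append = TRUE, col.names = TRUE)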
But I agree that people who are not familiar with encoding issues may prefer the native encoding. So one idea is to have fread() and fwrite() use the native encoding by default, with two new functions fread8() and fwrite8() to read/write UTF-8.