Data.table: Automatic detection of dec=',' in Europe

Created on 18 Oct 2017 · 22 Comments · Source: Rdatatable/data.table

I'm not sure what base and readr do in this regard, but currently in fread, dec='.' by default and needs to be set manually to ',' in Europe for numerics with a comma as the decimal separator. It could instead be detected automatically, like sep already is. Please +1 this issue if you'd like this.

Further, dec could be automatically detected per-column for files where some numeric columns use ',' and other columns use '.'. But does anyone need that?
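To make the whole-file idea concrete, here is a minimal sketch of one possible heuristic. It is not fread's implementation and not what any PR does: guess_dec() is a hypothetical helper that counts how many fields look numeric under each candidate decimal mark. It uses base R only, so it runs without data.table installed; the guessed value could equally be passed to fread(dec=...).

```r
# Hypothetical helper, not part of data.table: guess the decimal mark of a
# delimited text sample by counting which candidate makes more fields numeric.
guess_dec <- function(lines, sep = ";") {
  fields <- unlist(strsplit(lines, sep, fixed = TRUE))
  # A field "looks numeric" under a mark if it is: digits, the mark, digits.
  looks_numeric <- function(x, mark) {
    grepl(paste0("^-?[0-9]+[", mark, "][0-9]+$"), x)
  }
  n_dot   <- sum(looks_numeric(fields, "."))
  n_comma <- sum(looks_numeric(fields, ","))
  # Fall back to '.' on ties, which mirrors the current default.
  if (n_comma > n_dot) "," else "."
}

sample_lines <- c("a;b", "1,5;2,5", "3,25;4,75")
dec <- guess_dec(sample_lines)
df <- read.table(text = sample_lines, sep = ";", dec = dec, header = TRUE)
str(df)
```

The sketch assumes the separator is already known and differs from both candidate marks; a real implementation would have to detect sep and dec together.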

enhancement fread top request

All 22 comments

It would be enough, if dec would be chosen correctly for the whole file.

+1 for auto-dec. dec by column would not be needed. I don’t think a mixed-dec csv could be written or read by Excel.

pls.

fread() also sometimes ignores a manually set dec=','. This happens randomly and quite infrequently, and is therefore hard to reproduce. Restarting the R session fixes the issue. I've been having this problem for years, but only occasionally, so I did not report it before.

It sounds like a different issue than the one here

Any sample files for this issue?

fread("a;b\n1,5;2,5", sep=";", dec=",")

A real-world example would be better 😄

I saw a 🇫🇷 government website using sep=';' as well, is that common in such files? Or sep='\t' maybe?

AFAIR this is how excel produces csv files in France and Poland.

@cderv @dmpe @thohan88 @GabijaSakalyte @Boyoron @IndreSakalauskaite @Amygdalae @AndriusJasinevicius @dvaitkus @Katazyna-Stankevic @labutytegreta @rasainsodaite @ievajuozapaityte @gertrudam @RPrakapaite @Grazvile @bugampo @raugulis @Ignnn @iurbon @EvitaJ @rutele13 @jstonkus @pociuteagne @Kaamile @LinaAnu @zyginta @evelina11101 @silvimi @Auguste11 @1075353 @Andrealek @esadausk @vaiiva @supermenas @ramintares @viktorija-romovaite @egle-lele @RokasStat @ema-malinauskaite @1611003 @zyle1 @tokotrienoliai @DanasKl @danielius-mockus @Gabriele-gif @domasrupkus @GerdaSkin @emyliuxe @GegznaV @clarkdk @s-fleck

Sorry for the wide ping. I have a PR addressing this issue in #4482 -- it would be great if anyone could provide some "real world" sample CSVs rather than testing on my toy examples. Thanks in advance!

@MichaelChirico, here are some examples of data: data.zip. Inside the ZIP:

kojos.csv – data with various measurements of leg parts. Three columns with European numbers.
16.0001.trt – spectroscopic data created by the software of a spectrometer. The data of interest start at line 9. Two semicolon-separated columns with European numbers.
ezerai – two tab-separated columns with European numbers and UTF-8 encoding. Data from Wikipedia. (I do not expect fread() to read this dataset ezerai correctly with the default settings).

Wouldn't it be more logical to automatically choose \t tab as sep when it is present instead of some other character? (See the example dataset ezerai)

AFAIR excel uses sep=";" when writing to csv (in those countries where dec=",").

@GegznaV fread may choose \t -- there is some logic to determine what fread "thinks" is the correct separator among ,|;\t and space, see here

Thanks a bunch for the data sets Vilmantas.

  • fread('16.0001.trt') fails because of the extraneous info in the first 6 lines (auto-skip logic not up to the task); fread('16.0001.trt', skip=7L) and fread('16.0001.trt', skip=8L) both work automatically (both get the column names wrong, which are on line 7 but have a subheader column on line 8)
  • fread('kojos.csv') works great
  • fread('ezerai') is not working "automagically". The issue is, both , and \t as sep lead to 3 columns (which matches the header), and the logic for selecting sep is agnostic to column types -- there is a priority order, and , comes first. I filed https://github.com/Rdatatable/data.table/issues/4487 -- I'm hopeful fread's logic could actually detect sep='\t' on its own. fread('ezerai', sep='\t') works.
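As a small illustration of the workaround in the last bullet, the sketch below passes both sep and dec explicitly. The inline text is a made-up stand-in, not the real ezerai data, and the column names are assumptions for the example only.

```r
library(data.table)

# Toy stand-in for a tab-separated file with comma decimals
# (hypothetical values, not the real ezerai data).
txt <- "name\tarea\tdepth\nA\t12,5\t3,2\nB\t7,25\t10,0\n"

# With sep and dec given explicitly, fread does not have to choose
# between ',' and '\t' as the separator.
dt <- fread(text = txt, sep = "\t", dec = ",")
str(dt)
```

For a file on disk, the same two arguments would be passed along with the path, as in fread('ezerai', sep='\t') above.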

Remember not to try to handle every possible input. For example

both get the column names wrong, which are on line 7 but have a subheader column on line 8

It sounds like fread would need to skip the first 6 lines, read the seventh, skip the 8th, and then read the rest. Handling that is doable, but it imposes maintenance overhead, can introduce new bugs, etc. It might be better to provide a more general interface where skip can be a vector, so the user needs to understand what is wrong with their file and can then just use skip=c(1:6,8). I recall someone already asked for the possibility to read particular lines of a file.

Yea I don't think there's an automatic way on this one that's not a fragile house of cards to support. skip=vector would be awesome anyway.
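Since a vector-valued skip is only a proposal in this thread and not an existing fread feature, a similar effect can be sketched today by dropping the unwanted lines before parsing. The lines below are a made-up stand-in for an instrument export (not the real 16.0001.trt), and the column names are assumptions.

```r
library(data.table)

# Toy stand-in: 6 lines of preamble, a header on line 7, a sub-header
# (units) on line 8, then semicolon-separated data with comma decimals.
raw <- c(
  paste("preamble line", 1:6),
  "wavelength;intensity",
  "nm;a.u.",
  "400,5;0,12",
  "401,0;0,15"
)

# Emulate the proposed skip = c(1:6, 8) by removing those lines first,
# then hand the remaining lines to fread via its text argument.
drop <- c(1:6, 8)
dt <- fread(text = raw[-drop], sep = ";", dec = ",")
str(dt)
```

For a real file, raw would come from readLines() on the path. This reads the whole file into memory first, so it suits small files better than large ones.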

However, if we return to the original issue, Automatic detection of dec=',' in Europe, it seems that PR #4482 does what is expected.

When could one expect these changes to be on CRAN?

@GegznaV probably not very soon. Note that we provide Windows binaries, so Rtools/compilation is not needed. If you are on R 3.6 you can just

install.packages("data.table", repos="https://Rdatatable.gitlab.io/data.table", type="win.binary")

If you are on another version you can try

install.packages("https://rdatatable.gitlab.io/data.table/bin/windows/contrib/3.6/data.table_1.12.9.zip", repos=NULL)

Note that those 3.6 binaries will soon move to 4.0.

OK, can I expect the new version of data.table on CRAN somewhere around mid-August (before the new school year in September)? Or are your dates even further out?

We don't have any fixed release dates. The next version on CRAN might end up being just a patch release without new features like this one.
