Data.table: Automatic detection of dec=',' in Europe

Created on 18 Oct 2017 · 22 comments · Source: Rdatatable/data.table

I'm not sure what base R and readr do in this regard, but currently in fread, dec=='.' by default and has to be set manually to ',' in Europe for numerics that use a comma as the decimal separator. It could instead be detected automatically, as sep already is. Please +1 this issue if you'd like this.

Further, dec could be automatically detected per-column for files where some numeric columns use ',' and other columns use '.'. But does anyone need that?
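
As a rough illustration of what whole-file detection could mean, here is a minimal base R sketch. It is not fread's implementation: `guess_dec()` is a hypothetical helper, and it assumes the field separator is already known and that numbers carry no thousands separators.

```r
# Hypothetical helper, not part of data.table: guess the decimal mark from a
# sample of lines, assuming the field separator `sep` is already known.
guess_dec <- function(lines, sep = ";") {
  fields <- unlist(strsplit(lines, sep, fixed = TRUE))
  # Count fields that look like "digits,digits" versus "digits.digits"
  n_comma <- sum(grepl("^[+-]?[0-9]+,[0-9]+$", fields))
  n_dot   <- sum(grepl("^[+-]?[0-9]+\\.[0-9]+$", fields))
  # Fall back to "." unless comma-style numbers are strictly more frequent
  if (n_comma > n_dot) "," else "."
}

guess_dec(c("a;b", "1,5;2,5"))  # comma-decimal sample
guess_dec(c("a;b", "1.5;2.5"))  # dot-decimal sample
```

A single vote over the whole sample corresponds to the per-file variant; a per-column variant would run the same count separately for each field position.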

enhancement fread top request

mattdowle

👍50

All 22 comments

It would be enough, if dec would be chosen correctly for the whole file.

GegznaV on 20 Oct 2017

👍3

+1 for auto-dec. dec by column would not be needed. I don’t think a mixed dec csv could be written or read by Excel.

clarkdk on 22 Oct 2017

pls.

Boyoron on 23 Nov 2018

fread() also sometimes ignores a manually set dec = ','. This happens randomly and quite infrequently, so it is hard to reproduce. Restarting the R session fixes it. I've been having this problem for years, but only occasionally, so I did not report it before.

s-fleck on 11 Nov 2019

It sounds like a different issue than the one here

jangorecki on 11 Nov 2019

Any sample files for this issue?

MichaelChirico on 22 May 2020

fread("a;b\n1,5;2,5", sep=";", dec=",")

jangorecki on 22 May 2020

A real-world example would be better 😄

I saw a 🇫🇷 government website using sep=';' as well, is that common in such files? Or sep='\t' maybe?

MichaelChirico on 22 May 2020

AFAIR this is how Excel produces csv files in France and Poland.

jangorecki on 22 May 2020

Here are some nightmarishly bad ones 😂

https://datos.gob.es/en/catalogo/ea0010587-porcentaje-de-gastos-en-i-d-respecto-al-pib-a-precios-de-mercado-por-comunidades-autonomas-serie-2000-2017-estadistica-sobre-actividades-en-i-d-en-el-sector-empresas-identificador-api-t14-p057-a2017-l0-02007-px

MichaelChirico on 22 May 2020

@cderv @dmpe @thohan88 @GabijaSakalyte @Boyoron @IndreSakalauskaite @Amygdalae @AndriusJasinevicius @dvaitkus @Katazyna-Stankevic @labutytegreta @rasainsodaite @ievajuozapaityte @gertrudam @RPrakapaite @Grazvile @bugampo @raugulis @Ignnn @iurbon @EvitaJ @rutele13 @jstonkus @pociuteagne @Kaamile @LinaAnu @zyginta @evelina11101 @silvimi @Auguste11 @1075353 @Andrealek @esadausk @vaiiva @supermenas @ramintares @viktorija-romovaite @egle-lele @RokasStat @ema-malinauskaite @1611003 @zyle1 @tokotrienoliai @DanasKl @danielius-mockus @Gabriele-gif @domasrupkus @GerdaSkin @emyliuxe @GegznaV @clarkdk @s-fleck

Sorry for the wide ping. I have a PR addressing this issue in #4482 -- it would be great if anyone could provide some "real world" sample CSVs rather than testing on my toy examples. Thanks in advance!

MichaelChirico on 24 May 2020

@MichaelChirico, here are some examples of data: data.zip. Inside the ZIP:

kojos.csv – data with various measurements of leg parts. Three columns with European numbers.
16.0001.trt – spectroscopic data created by the software of a spectrometer. The data of interest start at line 9. Two semicolon-separated columns with European numbers.
ezerai – two tab-separated columns with European numbers and UTF-8 encoding. Data from Wikipedia. (I do not expect fread() to read this dataset ezerai correctly with the default settings).

GegznaV on 25 May 2020

Wouldn't it be more logical to automatically choose \t tab as sep when it is present instead of some other character? (See the example dataset ezerai)

GegznaV on 25 May 2020

AFAIR Excel uses sep=";" when writing to csv (in those countries where dec=",").

jangorecki on 25 May 2020
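
For comparison with base R, which the opening post asks about: base R does not detect dec either, but `read.csv2()` ships with sep=";" and dec="," as its defaults, matching the Excel convention described above. A small sketch using the toy input from earlier in the thread:

```r
# read.csv2() defaults to sep = ";" and dec = ",", so no arguments are needed
# for the semicolon/comma convention.
txt <- "a;b\n1,5;2,5"
df <- read.csv2(text = txt)
str(df)  # both columns are parsed as numeric
```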

@GegznaV fread may choose \t -- there is some logic to determine what fread "thinks" is the correct separator among , | ; \t and the other candidates, see here

MichaelChirico on 25 May 2020

Thanks a bunch for the data sets Vilmantas.

fread('16.0001.trt') fails because of the extraneous info in the first 6 lines (auto-skip logic not up to the task); fread('16.0001.trt', skip=7L) and fread('16.0001.trt', skip=8L) both work automatically (both get the column names wrong, which are on line 7 but have a subheader column on line 8)
fread('kojos.csv') works great
fread('ezerai') is not working "automagically". The issue is, both , and \t as sep lead to 3 columns (which matches the header), and the logic for selecting sep is agnostic to column types -- there is a priority order, and , comes first. I filed https://github.com/Rdatatable/data.table/issues/4487 -- I'm hopeful fread's logic could actually detect sep='\t' on its own. fread('ezerai', sep='\t') works.

MichaelChirico on 25 May 2020
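
The tie described for ezerai can be pictured with a simplified base R sketch. This is not fread's actual detection code: `pick_sep()` is a hypothetical helper, the candidate order is an assumption chosen so that "," comes first, and the two rows are made-up stand-ins for the real file.

```r
# Simplified illustration only: choose the separator that yields the most
# fields, requiring the same field count on every line.
pick_sep <- function(lines, candidates = c(",", "\t", "|", ";")) {
  n_fields <- sapply(candidates, function(s) {
    counts <- lengths(strsplit(lines, s, fixed = TRUE))
    # An inconsistent field count disqualifies the candidate
    if (length(unique(counts)) == 1L) counts[1L] else 0L
  })
  # which.max() returns the first maximum, so ties go to the earlier candidate
  candidates[which.max(n_fields)]
}

# Made-up rows: tab-separated, with comma as the decimal mark
rows <- c("lake1\t44,79\t33,3", "lake2\t12,05\t60,5")
pick_sep(rows)
```

Both "," and "\t" split every row into three fields, so a rule that ignores column types cannot tell them apart and the priority order decides.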

Remember not to try to handle every possible input. For example

both get the column names wrong, which are on line 7 but have a subheader column on line 8

It sounds like fread would need to skip the first 6 lines, read the seventh, skip the 8th, and then read the rest. Handling that is doable, but it imposes maintenance overhead, can introduce new bugs, etc. It might be better to provide a more general interface where skip can be a vector, so the user needs to understand what is wrong with their file and then just pass skip=c(1:6,8). I recall someone already asked for the possibility to read particular lines of a file.

jangorecki on 25 May 2020
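
Until something like a vector-valued skip exists, the same effect can be approximated by filtering lines first and passing the rest through fread's `text` argument. The sketch below is a workaround under stated assumptions, not a proposed implementation: `keep_lines()` is a hypothetical helper, and the demo file contents and column names are invented to mirror the layout described for the .trt file.

```r
# Hypothetical helper (not a data.table function): read a file's lines and
# drop the ones listed in `drop`, mimicking the proposed skip=c(1:6, 8).
keep_lines <- function(path, drop) {
  lines <- readLines(path, warn = FALSE)
  lines[setdiff(seq_along(lines), drop)]
}

# Invented demo file: 6 lines of metadata, header on line 7, subheader on line 8
tmp <- tempfile(fileext = ".trt")
writeLines(c(paste("instrument info", 1:6),
             "wavelength;intensity",
             "nm;a.u.",
             "400,5;0,12",
             "401,0;0,15"), tmp)

kept <- keep_lines(tmp, drop = c(1:6, 8))

# Parse the remaining lines with fread if data.table is installed
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::fread(text = kept, sep = ";", dec = ",")
  print(dt)
}
```

This reads the whole file into memory before parsing, so it is only a reasonable stopgap for small files.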

Yea I don't think there's an automatic way on this one that's not a fragile house of cards to support. skip=vector would be awesome anyway.

MichaelChirico on 25 May 2020

However, if we return to the original issue, "Automatic detection of dec=',' in Europe", it seems that PR #4482 does what is expected.

When could one expect these changes to be on CRAN?

GegznaV on 25 May 2020

@GegznaV probably not very soon. Note that we provide Windows binaries, so Rtools/compilation is not needed. If you are on R 3.6 you can just

install.packages("data.table", repos="https://Rdatatable.gitlab.io/data.table", type="win.binary")

If you are on another version you can try

install.packages("https://rdatatable.gitlab.io/data.table/bin/windows/contrib/3.6/data.table_1.12.9.zip", repos=NULL)

Note that those 3.6 binaries will soon move to 4.0.

jangorecki on 25 May 2020

Ok, can I expect the new version of data.table on CRAN somewhere around mid-August (before the new school year in September)? Or are your dates even further out?

GegznaV on 25 May 2020

We don't have any fixed release dates. The next version on CRAN might end up being just a patch release without new features like this.

jangorecki on 25 May 2020
