fread fails to properly read a well-formed CSV:
library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2017-09-29
# Simulate a data.table with many lines in string
example <- data.table(column1 = 1:3, column2 = c("text", "text\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nmany new lines\n\n\n\n\n\n", "text"))
# Write it to a CSV
fwrite(example, file = "example.csv")
# Read it back
fread("example.csv")
# Warning message:
# In fread("example.csv") :
# Found the last consistent line but text exists afterwards (discarded): <<many new lines>>
# base::read.csv behaves well
read.csv("example.csv")
Warning is thrown by this line in fread.c: https://github.com/Rdatatable/data.table/blob/master/src/fread.c#L1603. Interestingly, reducing the number of \n by one remedies the issue.
The problem persists in dev-1.10.5 (built 2017-09-29) across architectures but is not present in release-1.10.4.
Please set verbose = TRUE and include the output. Does setting sep = ',' fix the issue?
@MichaelChirico Thank you! As you can see, it's not the separator-guessing routine that is faulty in the dev build:
fread("example.csv", sep = ",", verbose = T)
# Read 2 rows x 2 columns from 153 bytes file in 00:00.032 wall clock time
# Thread buffers were grown 0 times (if all 1 threads each grew once, this figure would be 1)
# Final type counts
# 0 : drop
# 0 : bool8
# 0 : bool8
# 0 : bool8
# 0 : bool8
# 1 : int32
# 0 : int64
# 0 : float64
# 0 : float64
# 0 : float64
# 1 : string
# =============================
# 0.000s ( 0%) Memory map 0.000GB file
# 0.032s (100%) sep=',' ncol=2 and header detection
# 0.000s ( 0%) Column type detection using 3 sample rows
# 0.000s ( 0%) Allocation of 2 rows x 2 cols (0.000GB)
# 0.000s ( 0%) Reading 1 chunks of 1.000MB (101475 rows) using 1 threads
# = 0.000s ( 0%) Finding first non-embedded \n after each jump
# + 0.000s ( 0%) Parse to row-major thread buffers
# + 0.000s ( 0%) Transpose
# + 0.000s ( 0%) Waiting
# 0.000s ( 0%) Rereading 0 columns due to out-of-sample type exceptions
# 0.032s Total
# column1 column2
# 1: 1 text
# 2: 2 "text
# Warning message:
# In fread("example.csv", sep = ",", verbose = T) :
# Found the last consistent line but text exists afterwards (discarded): <<many new lines>>
It's actually because of this line: https://github.com/Rdatatable/data.table/blob/master/src/fread.c#L492
It even has a TODO note.
The idea is that if there is a field that starts with a " but doesn't end with one -- we wouldn't want to scan the entire file in a futile effort to find a matching quote... Instead the parser bails early, after not finding the closing quote in 100 lines. Exposing that "100" to the user as asked by the TODO note would be an easy fix. It may also be possible to do something smarter here, but it is unclear how useful that would be to the users.
@st-pasha Thank you for this clarification! Please note that my example has "text", then 94 newlines, then "many new lines", then 6 newlines (100 newlines in total, which is why removing a single \n remedies the issue). So a simple restart of the counter after anything but \n might be one such heuristic.
Thanks @memoryfull and all here.
Persists in dev as of now.
> fread("example.csv")
column1 column2
1: 1 text
2: 2 "text
Warning message:
In fread(f) :
Stopped early on line 4. Expected 2 fields but found 0. Consider fill=TRUE and
comment.char=. First discarded non-empty line: <<many new lines>>
It's picking quote rule 2 (the healing one) based on the biggest contiguous consistent block of non-blank lines in the first 100 lines, and on the field starting with the quote "text\n. That is not right here. We need to expose quoteRule so the user can control it, and improve the guesser.
> fread(f,verbose=TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 8 threads (omp_get_max_threads()=8, nth=8)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file /tmp/RtmptHxM1i/filedd76472aa33
File opened, size = 153 bytes.
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<column1,column2>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep automatically ...
sep=',' with 3 lines of 2 fields using quote rule 2
Detected 2 columns on line 1. This line is either column names or first data row. Line starts as: <<column1,column2>>
Quote rule picked = 2
fill=false and the most number of columns found is 2
[07] Detect column types, good nrow estimate and whether first row is column names
Number of sampling jump points = 2 because (152 bytes from row 1 to eof) / (2 * 142 jump0size) == 0
A line with too-few fields (0/2) was found on line 3 of sample jump 0.
Type codes (jump 000) : 5A Quote rule 2
'header' determined to be true due to column 1 containing a string on row 1 and a lower type (int32) in the rest of the 2 sample rows
All rows were sampled since file is small so we know nrow=2 exactly
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 5A
[10] Allocate memory for the datatable
Allocating 2 column slots (2 - 0 dropped) with 2 rows
[11] Read the data
jumps=[0..1), chunk_size=1048576, total_size=136
Restarting team from jump 0. nSwept==0 quoteRule==3
jumps=[0..1), chunk_size=1048576, total_size=136
Read 2 rows x 2 columns from 153 bytes file in 00:00.000 wall clock time
[12] Finalizing the datatable
Type counts:
1 : int32 '5'
1 : string 'A'
=============================
0.000s ( 22%) Memory map 0.000GB file
0.000s ( 51%) sep=',' ncol=2 and header detection
0.000s ( 5%) Column type detection using 2 sample rows
0.000s ( 8%) Allocation of 2 rows x 2 cols (0.000GB) of which 2 (100%) rows used
0.000s ( 14%) Reading 1 chunks (0 swept) of 1.000MB (-2147483648 rows) using 1 threads
+ 0.000s ( 1%) Parse to row-major thread buffers (grown 0 times)
+ 0.000s ( 1%) Transpose
+ 0.000s ( 12%) Waiting
0.000s ( 0%) Rereading 0 columns due to out-of-sample type exceptions
0.000s Total
column1 column2
1: 1 text
2: 2 "text
Warning message:
In fread(f, verbose = TRUE) :
Stopped early on line 4. Expected 2 fields but found 0. Consider fill=TRUE and
comment.char=. First discarded non-empty line: <<many new lines>>
Also running into this issue trying to fread the train.csv and test.csv from the latest Kaggle competition (https://www.kaggle.com/c/avito-demand-prediction/data) in case you need some more datasets to test on.