Data.table: fread with many new lines in string

Created on 29 Sep 2017  ·  7 Comments  ·  Source: Rdatatable/data.table

fread fails to properly read a well-formed CSV:

library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2017-09-29

# Simulate a data.table with many lines in string
example <- data.table(column1 = 1:3, column2 = c("text", "text\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nmany new lines\n\n\n\n\n\n", "text"))
# Write it to a CSV
fwrite(example, file = "example.csv")

# Read it back
fread("example.csv")
# Warning message:
# In fread("example.csv") :
#  Found the last consistent line but text exists afterwards (discarded): <<many new lines>>

# base::read.csv behaves well
read.csv("example.csv")

The warning is thrown by this line in fread.c: https://github.com/Rdatatable/data.table/blob/master/src/fread.c#L1603. Interestingly, reducing the number of \n by one remedies the issue.

The problem persists in dev-1.10.5 (built 2017-09-29) across architectures but is not present in release-1.10.4.
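For reference, the file fwrite produces here is well-formed, RFC 4180-style CSV: the multiline value is a single quoted field, and a conformant parser recovers every record no matter how many physical lines the field spans. A minimal illustration (in Python rather than R, purely to show the format itself is valid; it does not exercise fread):

```python
import csv
import io

# Rebuild the same shape of file: a quoted field containing 94
# newlines, the text "many new lines", then 6 more newlines.
field = "text" + "\n" * 94 + "many new lines" + "\n" * 6
buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
writer.writerow(["column1", "column2"])
writer.writerow([1, "text"])
writer.writerow([2, field])   # quoted automatically (contains \n)
writer.writerow([3, "text"])

# A conformant CSV parser returns 4 records (header + 3 rows),
# even though the file has over 100 physical lines.
rows = list(csv.reader(io.StringIO(buf.getvalue())))
assert len(rows) == 4
assert rows[2][1] == field
```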

fread


All 7 comments

Please set verbose = TRUE and include the output. Does setting sep = ',' fix the issue?


@MichaelChirico Thank you! As you can see, it's not the separator-guessing routine that is at fault in the dev build:

fread("example.csv", sep = ",", verbose = T)
# Read 2 rows x 2 columns from 153 bytes file in 00:00.032 wall clock time
# Thread buffers were grown 0 times (if all 1 threads each grew once, this figure would be 1)
# Final type counts
#          0 : drop     
#          0 : bool8    
#          0 : bool8    
#          0 : bool8    
#          0 : bool8    
#          1 : int32    
#          0 : int64    
#          0 : float64  
#          0 : float64  
#          0 : float64  
#          1 : string   
# =============================
#    0.000s (  0%) Memory map 0.000GB file
#    0.032s (100%) sep=',' ncol=2 and header detection
#    0.000s (  0%) Column type detection using 3 sample rows
#    0.000s (  0%) Allocation of 2 rows x 2 cols (0.000GB)
#    0.000s (  0%) Reading 1 chunks of 1.000MB (101475 rows) using 1 threads
#    =    0.000s (  0%) Finding first non-embedded \n after each jump
#    +    0.000s (  0%) Parse to row-major thread buffers
#    +    0.000s (  0%) Transpose
#    +    0.000s (  0%) Waiting
#    0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
#    0.032s        Total
#    column1 column2
# 1:       1    text
# 2:       2   "text
# Warning message:
# In fread("example.csv", sep = ",", verbose = T) :
#   Found the last consistent line but text exists afterwards (discarded): <<many new lines>>

It's actually because of this line: https://github.com/Rdatatable/data.table/blob/master/src/fread.c#L492
It even has a TODO note.

The idea is that if there is a field that starts with a " but doesn't end with one -- we wouldn't want to scan the entire file in a futile effort to find a matching quote... Instead the parser bails early, after not finding the closing quote in 100 lines. Exposing that "100" to the user as asked by the TODO note would be an easy fix. It may also be possible to do something smarter here, but it is unclear how useful that would be to the users.

@st-pasha Thank you for this clarification! Note that my example has "text", then 94 newlines, then "many new lines", then 6 more newlines. So simply restarting the counter after anything other than \n might be one such heuristic.
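The two counting strategies can be sketched side by side. This is an illustrative Python model of the logic described above, not the actual C code in fread.c (the function name, the reset_on_text flag, and the exact off-by-one behaviour at the limit are inventions for illustration):

```python
def find_closing_quote(s, start, max_lines=100, reset_on_text=False):
    """Scan s for the quote matching the opening quote at s[start].
    Bail out after max_lines newlines; if reset_on_text is set,
    restart the counter whenever a non-newline character is seen
    (the heuristic proposed above). Returns the index of the
    closing quote, or -1 on bail-out."""
    newlines = 0
    for i in range(start + 1, len(s)):
        c = s[i]
        if c == '"':
            return i
        if c == "\n":
            newlines += 1
            if newlines > max_lines:
                return -1      # give up: assume the quote is unmatched
        elif reset_on_text:
            newlines = 0       # saw real content; keep scanning
    return -1

# 94 newlines, some text, 94 more: 188 newlines in total.
field = '"text' + "\n" * 94 + "many new lines" + "\n" * 94 + '"'

find_closing_quote(field, 0)                       # plain counter bails
find_closing_quote(field, 0, reset_on_text=True)   # finds the final quote
```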

Thanks @memoryfull and all here.
Persists in dev as of now.

>   fread("example.csv")
   column1 column2
1:       1    text
2:       2   "text
Warning message:
In fread(f) :
  Stopped early on line 4. Expected 2 fields but found 0. Consider fill=TRUE and
  comment.char=. First discarded non-empty line: <<many new lines>>

It's picking quote rule 2 (the healing one) based on the biggest contiguous consistent block of non-blank lines in the first 100 lines and the field starting with quote "text\n. Which is not right here. Need to expose quoteRule to be controlled by user, and improve the guesser.
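The practical effect of a quote rule that stops honouring quotes can be mimicked outside fread. Python's csv.QUOTE_NONE is only an analogue of fread's internal quote rules, not the same code, but it reproduces the symptom seen above: row 2's second field comes back as the literal "text with its opening quote attached.

```python
import csv
import io

data = 'column1,column2\n1,text\n2,"text\nmany new lines\n"\n3,text\n'

# Quotes honoured: the embedded newlines stay inside one field.
ok = list(csv.reader(io.StringIO(data)))
assert [r[0] for r in ok] == ["column1", "1", "2", "3"]

# Quotes ignored: every physical line becomes its own record, and
# row 2's second field is the literal '"text', as in the output above.
raw = list(csv.reader(io.StringIO(data), quoting=csv.QUOTE_NONE))
assert raw[2] == ["2", '"text']
```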

>   fread(f,verbose=TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 8 threads (omp_get_max_threads()=8, nth=8)
  NAstrings = [<<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file /tmp/RtmptHxM1i/filedd76472aa33
  File opened, size = 153 bytes.
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<column1,column2>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep automatically ...
  sep=','  with 3 lines of 2 fields using quote rule 2
  Detected 2 columns on line 1. This line is either column names or first data row. Line starts as: <<column1,column2>>
  Quote rule picked = 2
  fill=false and the most number of columns found is 2
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 2 because (152 bytes from row 1 to eof) / (2 * 142 jump0size) == 0
  A line with too-few fields (0/2) was found on line 3 of sample jump 0. 
  Type codes (jump 000)    : 5A  Quote rule 2
  'header' determined to be true due to column 1 containing a string on row 1 and a lower type (int32) in the rest of the 2 sample rows
  All rows were sampled since file is small so we know nrow=2 exactly
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 5A
[10] Allocate memory for the datatable
  Allocating 2 column slots (2 - 0 dropped) with 2 rows
[11] Read the data
  jumps=[0..1), chunk_size=1048576, total_size=136
  Restarting team from jump 0. nSwept==0 quoteRule==3
  jumps=[0..1), chunk_size=1048576, total_size=136
Read 2 rows x 2 columns from 153 bytes file in 00:00.000 wall clock time
[12] Finalizing the datatable
  Type counts:
         1 : int32     '5'
         1 : string    'A'
=============================
   0.000s ( 22%) Memory map 0.000GB file
   0.000s ( 51%) sep=',' ncol=2 and header detection
   0.000s (  5%) Column type detection using 2 sample rows
   0.000s (  8%) Allocation of 2 rows x 2 cols (0.000GB) of which 2 (100%) rows used
   0.000s ( 14%) Reading 1 chunks (0 swept) of 1.000MB (-2147483648 rows) using 1 threads
   +    0.000s (  1%) Parse to row-major thread buffers (grown 0 times)
   +    0.000s (  1%) Transpose
   +    0.000s ( 12%) Waiting
   0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
   0.000s        Total
   column1 column2
1:       1    text
2:       2   "text
Warning message:
In fread(f, verbose = TRUE) :
  Stopped early on line 4. Expected 2 fields but found 0. Consider fill=TRUE and
  comment.char=. First discarded non-empty line: <<many new lines>>

#2600 is related to this.

Also running into this issue trying to fread the train.csv and test.csv from the latest Kaggle competition (https://www.kaggle.com/c/avito-demand-prediction/data) in case you need some more datasets to test on.
