Data.table: fread(strip.white = FALSE) does not retain leading whitespace in first column

Created on 21 Sep 2017 · 2 Comments · Source: Rdatatable/data.table

When the first (or only) column contains leading whitespace and strip.white = FALSE, fread nonetheless trims that leading whitespace.

fread(input = "ab,x\n  cd,x ", sep = ",", strip.white = FALSE, header = FALSE)
#    V1 V2
# 1: ab  x
# 2: cd x 
# Expected
#       V1 V2
# 1:   ab  x
# 2:   cd x

Leading whitespace is correctly retained when present in other columns:

fread(input = "y,ab,x\ny,  cd,x ", sep = ",", strip.white = FALSE, header = FALSE)
# OK
#    V1   V2 V3
# 1:  y   ab  x
# 2:  y   cd x

# Compared with no whitespace
fread(input = "y,ab,x\ny,cd,x ", sep = ",", strip.white = FALSE, header = FALSE)
#    V1 V2 V3
# 1:  y ab  x
# 2:  y cd x
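
Printed data.table output right-aligns character columns, so retained leading whitespace can be hard to tell apart from padding. The following is a minimal sketch of one way to check it directly; `leading_ws` is a hypothetical helper written for this illustration, not part of data.table.

```r
library(data.table)

# Hypothetical helper (not a data.table function): count the leading
# spaces/tabs in each element of a character vector.
leading_ws <- function(x) {
  nchar(x) - nchar(sub("^[ \t]+", "", x))
}

dt <- fread(input = "y,ab,x\ny,  cd,x ", sep = ",",
            strip.white = FALSE, header = FALSE)

# Non-zero entries mark values that kept their leading whitespace.
lapply(dt, leading_ws)

# Quoting each value makes leading and trailing whitespace visible.
encodeString(dt$V2, quote = '"')
```

Running the same check on the first-column example shows whether a given build is affected without relying on the printed alignment.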

Session info and verbose output

N.B. I did not test against the most recent build as it was failing at the time I found this issue. My dev version of data.table is only a couple of weeks old.

`sessionInfo()`

> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252   
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                      
[5] LC_TIME=English_Australia.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.5    RevoUtilsMath_10.0.0

loaded via a namespace (and not attached):
[1] compiler_3.4.1   RevoUtils_10.0.5 tools_3.4.1

`fread(verbose = TRUE)`

> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-09-06 01:46:07 UTC; travis
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com

> fread("ab,x\n cd,x", sep = ",", strip.white = FALSE, header = FALSE, verbose = TRUE)
Input contains a \n (or is ""). Taking this to be text input (not a filename)
[1] Check arguments
  Using 12 threads (omp_get_max_threads()=12, nth=12)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
[2] Opening the file
`input` argument is given, interpreting as raw text to read
[3] Detect and skip BOM
[4] Detect end-of-line character(s)
  Detected eol as \n only, the UNIX and Mac standard.
[6] Skipping initial rows if needed
  Positioned on line 1 starting: <<ab,x>>
[7] Detect separator, quoting rule, and ncolumns
  Using supplied sep ','
  sep=','  with 2 lines of 2 fields using quote rule 0
  Detected 2 columns on line 1. This line is either column names or first data row. Line starts as: <<ab,x>>
  Quote rule picked = 0
[8] Determine column names
  'header' changed by user from 'auto' to false
[9] Detect column types
  Number of sampling jump points = 1 because (10 bytes from row 1 to eof) / (2 * 10 jump0size) == 0
  Type codes (jump 000)    : 66  Quote rule 0
  =====
  Sampled 2 rows (handled \n inside quoted fields) at 1 jump points
  Bytes from first data row on line 1 to the end of last row: 10
  Line length: mean=5.00 sd=0.00 min=5 max=5
  Estimated number of rows: 10 / 5.00 = 2
  Initial alloc = 2 rows (2 + 0%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  All rows were sampled since file is small so we know nrow=2 exactly
  =====
[10] Apply user overrides on column types
After 0 type and 0 drop user overrides : 66
[11] Allocate memory for the datatable
  Allocating 2 column slots (2 - 0 dropped) with 2 rows
[12] Read the data
[13] Finalizing the datatable
Read 2 rows x 2 columns from 10 bytes file in 00:00.001 wall clock time
Thread buffers were grown 0 times (if all 1 threads each grew once, this figure would be 1)
Final type counts
         0 : drop     
         0 : bool8    
         0 : int32    
         0 : int32    
         0 : int64    
         0 : float64  
         2 : string   
Read 2 rows. Exactly what was estimated and allocated up front
=============================
   0.000s (  0%) Memory map 0.000GB file
   0.000s (  0%) sep=',' ncol=2 and header detection
   0.000s (  0%) Column type detection using 2 sample rows
   0.000s (  0%) Allocation of 2 rows x 2 cols (0.000GB)
   0.001s (100%) Reading 1 chunks of 1.000MB (209715 rows) using 1 threads
   =    0.000s (  0%) Finding first non-embedded \n after each jump
   +    0.000s (  0%) Parse to row-major thread buffers
   +    0.000s (  0%) Transpose
   +    0.001s (100%) Waiting
   0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
   0.001s        Total
   V1 V2
1: ab  x
2: cd  x

bug fread

HughParsonage

All 2 comments

Cannot reproduce with the latest version of data.table:

> str(fread(input = "ab,x\n  cd,x ", sep = ",", strip.white = FALSE, header = FALSE))
Classes ‘data.table’ and 'data.frame':  2 obs. of  2 variables:
 $ V1: chr  "ab" "  cd"
 $ V2: chr  "x" "x "
 - attr(*, ".internal.selfref")=<externalptr>

The problem appears to have been resolved already.
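
To confirm this on a particular installation, a small regression check along these lines could be run. It is a sketch only: the expected values are taken from the `str()` output above, and the check is assumed to fail with an error on a build that still trims the first column.

```r
library(data.table)

# Record which build is being tested.
packageVersion("data.table")

dt <- fread(input = "ab,x\n  cd,x ", sep = ",",
            strip.white = FALSE, header = FALSE)

# Expected values come from the str() output in the comment above;
# stopifnot() raises an error if leading or trailing whitespace was stripped.
stopifnot(
  identical(dt$V1, c("ab", "  cd")),
  identical(dt$V2, c("x", "x "))
)
```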