When strip.white = FALSE and the first (or only) column contains leading whitespace, fread nonetheless trims that whitespace.
fread(input = "ab,x\n cd,x ", sep = ",", strip.white = FALSE, header = FALSE)
# V1 V2
# 1: ab x
# 2: cd x
# Expected (V1 in row 2 should keep its leading space, i.e. " cd")
# V1 V2
# 1: ab x
# 2: cd x
Leading whitespace is correctly retained when present in other columns:
fread(input = "y,ab,x\ny, cd,x ", sep = ",", strip.white = FALSE, header = FALSE)
# OK
# V1 V2 V3
# 1: y ab x
# 2: y cd x
# Compared with no whitespace
fread(input = "y,ab,x\ny,cd,x ", sep = ",", strip.white = FALSE, header = FALSE)
# V1 V2 V3
# 1: y ab x
# 2: y cd x
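The printed tables make it hard to see whether the leading space survived, so the two cases above can also be compared programmatically. This is a minimal sketch, assuming data.table is installed; the names `first_col` and `other_col` are introduced here for illustration and are not part of the original report.

```r
library(data.table)

# Hypothetical object names; the inputs are the same strings used above.
first_col <- fread(input = "ab,x\n cd,x ", sep = ",", strip.white = FALSE, header = FALSE)
other_col <- fread(input = "y,ab,x\ny, cd,x ", sep = ",", strip.white = FALSE, header = FALSE)

# encodeString() wraps each value in quotes so leading/trailing spaces are visible.
encodeString(first_col$V1, quote = '"')
encodeString(other_col$V2, quote = '"')

# TRUE when the leading space is retained, FALSE when it has been trimmed.
identical(first_col$V1, c("ab", " cd"))
identical(other_col$V2, c("ab", " cd"))
```

On an affected build the first `identical()` call would be expected to return FALSE while the second returns TRUE, which isolates the problem to the first column.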
N.B. I did not test against the most recent build as it was failing at the time I found this issue. My dev version of data.table is only a couple of weeks old.
sessionInfo()
> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
[5] LC_TIME=English_Australia.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.5 RevoUtilsMath_10.0.0
loaded via a namespace (and not attached):
[1] compiler_3.4.1 RevoUtils_10.0.5 tools_3.4.1
fread(verbose = TRUE)
> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-09-06 01:46:07 UTC; travis
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
> fread("ab,x\n cd,x", sep = ",", strip.white = FALSE, header = FALSE, verbose = TRUE)
Input contains a \n (or is ""). Taking this to be text input (not a filename)
[1] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 1
[2] Opening the file
`input` argument is given, interpreting as raw text to read
[3] Detect and skip BOM
[4] Detect end-of-line character(s)
Detected eol as \n only, the UNIX and Mac standard.
[6] Skipping initial rows if needed
Positioned on line 1 starting: <<ab,x>>
[7] Detect separator, quoting rule, and ncolumns
Using supplied sep ','
sep=',' with 2 lines of 2 fields using quote rule 0
Detected 2 columns on line 1. This line is either column names or first data row. Line starts as: <<ab,x>>
Quote rule picked = 0
[8] Determine column names
'header' changed by user from 'auto' to false
[9] Detect column types
Number of sampling jump points = 1 because (10 bytes from row 1 to eof) / (2 * 10 jump0size) == 0
Type codes (jump 000) : 66 Quote rule 0
=====
Sampled 2 rows (handled \n inside quoted fields) at 1 jump points
Bytes from first data row on line 1 to the end of last row: 10
Line length: mean=5.00 sd=0.00 min=5 max=5
Estimated number of rows: 10 / 5.00 = 2
Initial alloc = 2 rows (2 + 0%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
All rows were sampled since file is small so we know nrow=2 exactly
=====
[10] Apply user overrides on column types
After 0 type and 0 drop user overrides : 66
[11] Allocate memory for the datatable
Allocating 2 column slots (2 - 0 dropped) with 2 rows
[12] Read the data
[13] Finalizing the datatable
Read 2 rows x 2 columns from 10 bytes file in 00:00.001 wall clock time
Thread buffers were grown 0 times (if all 1 threads each grew once, this figure would be 1)
Final type counts
0 : drop
0 : bool8
0 : int32
0 : int32
0 : int64
0 : float64
2 : string
Read 2 rows. Exactly what was estimated and allocated up front
=============================
0.000s ( 0%) Memory map 0.000GB file
0.000s ( 0%) sep=',' ncol=2 and header detection
0.000s ( 0%) Column type detection using 2 sample rows
0.000s ( 0%) Allocation of 2 rows x 2 cols (0.000GB)
0.001s (100%) Reading 1 chunks of 1.000MB (209715 rows) using 1 threads
= 0.000s ( 0%) Finding first non-embedded \n after each jump
+ 0.000s ( 0%) Parse to row-major thread buffers
+ 0.000s ( 0%) Transpose
+ 0.001s (100%) Waiting
0.000s ( 0%) Rereading 0 columns due to out-of-sample type exceptions
0.001s Total
V1 V2
1: ab x
2: cd x
Cannot reproduce with the latest version of data.table:
> str(fread(input = "ab,x\n cd,x ", sep = ",", strip.white = FALSE, header = FALSE))
Classes ‘data.table’ and 'data.frame': 2 obs. of 2 variables:
$ V1: chr "ab" " cd"
$ V2: chr "x" "x "
- attr(*, ".internal.selfref")=<externalptr>
The problem must have been resolved already.
Confirming the fix. Could it be related to the sep=NULL feature?