Copying from my question on Stack Overflow:
I'm having trouble creating a reproducible example and can't share the data, but I think I've stumbled upon a bug in fread(). Trying to read my 1.658GB TSV file encoded in Latin-1 produces the following error:
Error in fread("POANG.txt", header = TRUE, sep = "\t", sep2 = NULL, encoding = "Latin-1", :
Jump 949 did not finish counting rows exactly where jump 950 found its first good line start: prevEnd(0x14e51d6dc)<<>> != thisStart(prevEnd+180966)<<4908565 01 0 1 0 1999 1 TNMAT NMAC09 015 015 15.>>
The problematic line is line 11129896, which contains a NUL byte (shown as <0x00> in Sublime Text and as ^@ in Vi; I can't copy it here). If I set skip = 11129895, fread throws the same error but now on "jump 0"; if I set skip = 11129896 it works, but nrows = 11129895 still throws the same error. With the character removed, the file reads as it should. Maybe fread() is not supposed to support reading files with these encoding issues, but at least it would be great if the error were more informative. It took me quite a while to understand what was going on and to find the offending line.
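For reference, a minimal sketch (not part of the original report) of how one might locate embedded NUL bytes in a large file from R by scanning it in chunks with readBin(); the function name find_nuls and the chunk size are purely illustrative:
# Scan a file in chunks and return the 1-based byte offsets of every 0x00.
find_nuls <- function(path, chunk_size = 64 * 1024^2) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  offset <- 0
  hits <- numeric(0)
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break
    hits <- c(hits, offset + which(chunk == as.raw(0)))
    offset <- offset + length(chunk)
  }
  hits
}
# find_nuls("POANG.txt")   # byte positions of the offending NUL(s)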
The verbose output of fread() is:
> dt <- fread(
... "POANG.txt",
... header = TRUE,
... sep = "\t",
... sep2 = NULL,
... encoding = "Latin-1", #same as ISO-8859-1
... na.strings = NULL,
... check.names = TRUE,
... verbose = TRUE,
... )
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 4 threads (omp_get_max_threads()=4, nth=4)
No NAstrings provided.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file POANG.txt
File opened, size = 1.658GB (1780385879 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<LopNr AterPNr SenPNr foralder >>
[06] Detect separator, quoting rule, and ncolumns
Using supplied sep '\t'
sep=0x9 with 100 lines of 15 fields using quote rule 0
Detected 15 columns on line 1. This line is either column names or first data row. Line starts as: <<LopNr AterPNr SenPNr foralder >>
Quote rule picked = 0
fill=false and the most number of columns found is 15
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to true
Number of sampling jump points = 101 because (1780385877 bytes from row 1 to eof) / (2 * 9642 jump0size) == 92324
Type codes (jump 000) : 5111115510101055710 Quote rule 0
Type codes (jump 100) : 5111115510101055710 Quote rule 0
=====
Sampled 10054 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 1 to the end of last row: 1780385877
Line length: mean=88.42 sd=14.68 min=58 max=176
Estimated number of rows: 1780385877 / 88.42 = 20135934
Initial alloc = 30148798 rows (20135934 + 49%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 5111115510101055710
[10] Allocate memory for the datatable
Allocating 15 column slots (15 - 0 dropped) with 30148798 rows
[11] Read the data
jumps=[0..1700), chunk_size=1047285, total_size=1780385770
Read 55%. ETA 00:00
[12] Finalizing the datatable
Read 11179334 rows x 15 columns from 1.658GB (1780385879 bytes) file in 00:10.121 wall clock time
Thread buffers were grown 0 times (if all 4 threads each grew once, this figure would be 4)
Final type counts
0 : drop
5 : bool8
0 : bool8
0 : bool8
0 : bool8
5 : int32
0 : int64
1 : float64
0 : float64
0 : float64
4 : string
Error in fread("POANG.txt", header = TRUE, sep = "\t", sep2 = NULL, encoding = "Latin-1", :
Jump 949 did not finish counting rows exactly where jump 950 found its first good line start: prevEnd(0x14e51d6dc)<<>> != thisStart(prevEnd+180966)<<4908565 01 0 1 0 1999 1 TNMAT NMAC09 015 015 15.>>
And sessionInfo():
R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin17.0.0 (64-bit)
Running under: macOS High Sierra 10.13
Matrix products: default
BLAS/LAPACK: /usr/local/Cellar/openblas/0.2.20/lib/libopenblasp-r0.2.20.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dtplyr_0.0.2 data.table_1.10.5 dplyr_0.7.4 purrr_0.2.4
[5] readr_1.1.1 tidyr_0.7.2 tibble_1.3.4 ggplot2_2.2.1
[9] tidyverse_1.1.1 colorout_1.1-2
loaded via a namespace (and not attached):
[1] Rcpp_0.12.13 cellranger_1.1.0 compiler_3.4.2 plyr_1.8.4
[5] bindr_0.1 forcats_0.2.0 tools_3.4.2 lubridate_1.6.0
[9] jsonlite_1.5 nlme_3.1-131 gtable_0.2.0 lattice_0.20-35
[13] pkgconfig_2.0.1 rlang_0.1.2 psych_1.7.8 parallel_3.4.2
[17] haven_1.1.0 bindrcpp_0.2 xml2_1.1.1 stringr_1.2.0
[21] httr_1.3.1 hms_0.3 grid_3.4.2 glue_1.1.1
[25] R6_2.2.2 readxl_1.0.0 foreign_0.8-69 modelr_0.1.1
[29] reshape2_1.4.2 magrittr_1.5 scales_0.5.0 rvest_0.3.2
[33] assertthat_0.2.0 mnormt_1.5-5 colorspace_1.3-2 stringi_1.1.5
[37] lazyeval_0.2.0 munsell_0.4.3 broom_0.4.2
I am also affected by this. For large files, replacing <0x00> is often infeasible or too slow. I assume the <0x00> bytes are created by a database export.
readr::read_delim throws a warning for those bytes but does not stop parsing (see the readr issue history).
Would that be a possible workaround for fread as well?
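For anyone needing an interim workaround, a hedged sketch of the readr route mentioned above (assuming the same tab-separated, Latin-1 file; readr will guess column types with its own defaults and warn about the embedded NULs rather than stop):
library(readr)
# read_delim() warns on embedded NULs but continues parsing the remaining fields.
d <- read_delim(
  "POANG.txt",
  delim  = "\t",
  locale = locale(encoding = "latin1")
)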
@adamaltmejd and @djbirke, would you mind trying your files again using 1.12.3 from GitHub? Embedded NUL should be fixed now.
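In case it helps others who want to test, one possible way to get a development build from GitHub is via the remotes package (a sketch only; a compiler toolchain is required and other installation routes exist):
# Install the development version of data.table from GitHub, then restart R
# before re-reading the file.
install.packages("remotes")
remotes::install_github("Rdatatable/data.table")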
@mattdowle That's great news, thanks a lot! We decided to bite the bullet and manually replace any <0x00> bytes in our files. I no longer have access to the original version of the files, so unfortunately I cannot test the bugfix on them.
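For completeness, a minimal sketch (with illustrative names and file paths) of that kind of chunked replacement, writing a NUL-free copy rather than editing the file in place:
# Copy infile to outfile, dropping every 0x00 byte along the way.
strip_nuls <- function(infile, outfile, chunk_size = 64 * 1024^2) {
  inc  <- file(infile,  open = "rb")
  outc <- file(outfile, open = "wb")
  on.exit({ close(inc); close(outc) })
  repeat {
    chunk <- readBin(inc, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break
    writeBin(chunk[chunk != as.raw(0)], outc)
  }
}
# strip_nuls("POANG.txt", "POANG_clean.txt")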
Thanks for fixing this! In my case the file is on a secure server where I cannot update data.table myself, but I'm pretty sure the fix would work if it does in the other test cases :).