Data.table: fread error reading Latin-1 file containing NUL byte <0x00>

Created on 23 Oct 2017 · 4 comments · Source: Rdatatable/data.table

Copying from my question on SO:

I'm having trouble creating a reproducible example and can't share the data, but I think I've stumbled upon a bug in fread(). Trying to read my 1.658 GB TSV file encoded in Latin-1 produces the following error:

Error in fread("POANG.txt", header = TRUE, sep = "\t", sep2 = NULL, encoding = "Latin-1",  :
Jump 949 did not finish counting rows exactly where jump 950 found its first good line start: prevEnd(0x14e51d6dc)<<>> != thisStart(prevEnd+180966)<<4908565    01  0   1   0   1999    1   TNMAT       NMAC09  015 015 15.>>

The problematic line is line 11129896, which contains a NUL byte, displayed as <0x00> in Sublime Text and as ^@ in Vim (I can't copy it here). If I set skip = 11129895, fread throws the same error but now on "jump 0"; if I set skip = 11129896 it works, but nrows = 11129895 still throws the same error. With the character removed, the file reads as it should. Maybe fread() is not supposed to support files with these encoding issues, but at the least it would be great if the error were more informative. It took me quite a while to understand what was going on and to find the offending line.
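For anyone hitting the same error: the offending line can be located without opening a multi-gigabyte file in an editor. A rough sketch, assuming a POSIX shell with tr and grep; `sample.tsv` is a made-up stand-in for the real file:

```shell
# Make a small stand-in file with a NUL byte embedded on line 3
# (hypothetical data; the real file in this issue is POANG.txt).
printf 'a\tb\nc\td\ne\000f\tg\n' > sample.tsv

# NUL cannot be matched by most greps directly, so map each NUL to a
# rarely used control character (SUB, octal 032) and grep for that,
# printing the line number of every hit.
tr '\000' '\032' < sample.tsv | grep -n "$(printf '\032')"
```

The first field of each grep hit is the 1-based line number to pass to skip/nrows or to inspect by hand.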

The verbose output of fread() is:

> dt <- fread(
...         "POANG.txt",
...         header = TRUE,
...         sep = "\t",
...         sep2 = NULL,
...         encoding = "Latin-1", #same as ISO-8859-1
...         na.strings = NULL,
...         check.names = TRUE,
...         verbose = TRUE,
...     )
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 4 threads (omp_get_max_threads()=4, nth=4)
  No NAstrings provided.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file POANG.txt
  File opened, size = 1.658GB (1780385879 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<LopNr    AterPNr SenPNr  foralder    >>
[06] Detect separator, quoting rule, and ncolumns
  Using supplied sep '\t'
  sep=0x9  with 100 lines of 15 fields using quote rule 0
  Detected 15 columns on line 1. This line is either column names or first data row. Line starts as: <<LopNr    AterPNr SenPNr  foralder    >>
  Quote rule picked = 0
  fill=false and the most number of columns found is 15
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to true
  Number of sampling jump points = 101 because (1780385877 bytes from row 1 to eof) / (2 * 9642 jump0size) == 92324
  Type codes (jump 000)    : 5111115510101055710  Quote rule 0
  Type codes (jump 100)    : 5111115510101055710  Quote rule 0
  =====
  Sampled 10054 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 1 to the end of last row: 1780385877
  Line length: mean=88.42 sd=14.68 min=58 max=176
  Estimated number of rows: 1780385877 / 88.42 = 20135934
  Initial alloc = 30148798 rows (20135934 + 49%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 5111115510101055710
[10] Allocate memory for the datatable
  Allocating 15 column slots (15 - 0 dropped) with 30148798 rows
[11] Read the data
  jumps=[0..1700), chunk_size=1047285, total_size=1780385770
Read 55%. ETA 00:00
[12] Finalizing the datatable
Read 11179334 rows x 15 columns from 1.658GB (1780385879 bytes) file in 00:10.121 wall clock time
Thread buffers were grown 0 times (if all 4 threads each grew once, this figure would be 4)
Final type counts
         0 : drop
         5 : bool8
         0 : bool8
         0 : bool8
         0 : bool8
         5 : int32
         0 : int64
         1 : float64
         0 : float64
         0 : float64
         4 : string
Error in fread("POANG.txt", header = TRUE, sep = "\t", sep2 = NULL, encoding = "Latin-1",  :
  Jump 949 did not finish counting rows exactly where jump 950 found its first good line start: prevEnd(0x14e51d6dc)<<>> != thisStart(prevEnd+180966)<<4908565  01  0   1   0   1999    1   TNMAT       NMAC09  015 015 15.>>

And sessionInfo():

R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin17.0.0 (64-bit)
Running under: macOS High Sierra 10.13

Matrix products: default
BLAS/LAPACK: /usr/local/Cellar/openblas/0.2.20/lib/libopenblasp-r0.2.20.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] dtplyr_0.0.2      data.table_1.10.5 dplyr_0.7.4       purrr_0.2.4
 [5] readr_1.1.1       tidyr_0.7.2       tibble_1.3.4      ggplot2_2.2.1
 [9] tidyverse_1.1.1   colorout_1.1-2

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.13     cellranger_1.1.0 compiler_3.4.2   plyr_1.8.4
 [5] bindr_0.1        forcats_0.2.0    tools_3.4.2      lubridate_1.6.0
 [9] jsonlite_1.5     nlme_3.1-131     gtable_0.2.0     lattice_0.20-35
[13] pkgconfig_2.0.1  rlang_0.1.2      psych_1.7.8      parallel_3.4.2
[17] haven_1.1.0      bindrcpp_0.2     xml2_1.1.1       stringr_1.2.0
[21] httr_1.3.1       hms_0.3          grid_3.4.2       glue_1.1.1
[25] R6_2.2.2         readxl_1.0.0     foreign_0.8-69   modelr_0.1.1
[29] reshape2_1.4.2   magrittr_1.5     scales_0.5.0     rvest_0.3.2
[33] assertthat_0.2.0 mnormt_1.5-5     colorspace_1.3-2 stringi_1.1.5
[37] lazyeval_0.2.0   munsell_0.4.3    broom_0.4.2

All 4 comments

I am also affected by this. For large files, replacing <0x00> is often infeasible or too slow. I assume the <0x00> bytes are created by a database export.

readr::read_delim throws a warning for such bytes but does not stop parsing (see readr's issue history).

Would that be a possible workaround for fread as well?
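Until a fix lands, stripping the NULs in a single streaming pass is usually fast enough even for multi-gigabyte files, since tr never loads the file into memory. A sketch, using a tiny hypothetical stand-in for the affected file:

```shell
# Hypothetical stand-in for the affected file, with one embedded NUL.
printf 'col1\tcol2\nfoo\000bar\tbaz\n' > POANG.txt

# Delete every NUL byte in one streaming pass; nothing else in the
# file changes, so line count and encoding are preserved.
tr -d '\000' < POANG.txt > POANG_clean.txt
```

The cleaned copy can then be read with fread() as usual. Some data.table versions also accept a shell command as input (the cmd argument), which would avoid writing a second copy of the file; whether that is available depends on the installed version.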

@adamaltmejd and @djbirke would you mind trying your files again please using 1.12.3 from GitHub: embedded NUL should be fixed now.

@mattdowle That's great news, thanks a lot! We decided to bite the bullet and manually replace any <0x00> bytes in our files. I no longer have access to the original version of the files, so unfortunately I cannot test the bugfix on them.

Thanks for fixing this! In my case the file is on a secure server where I cannot update data.table myself, but I'm pretty sure it would work if it does in other test cases :). Thanks!
