In the most recent version of data.table, I am trying to read a large file (all rows and 14 columns from a data table of size 53,880,721 x 37; 6.3GiB), and I get a string of stack imbalance errors that eventually results in RStudio crashing. This seems to be the same symptom as issue #2139. Version 1.10.4.3 works without any problems, albeit orders of magnitude more slowly. It is hard to provide a reproducible example without a huge file; if necessary, I will create one.
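If a generated file would help, something along these lines could produce a similarly shaped (if smaller) CSV. The first three column names are taken from the verbose log below; the fourth column and the row count are placeholders, not the real file's schema.

```r
library(data.table)

# Hypothetical generator for a wide mixed-type CSV; scale n up toward
# the real file's 53,880,721 rows to approach the 6.3GiB size.
set.seed(1)
n <- 1e5
DT <- data.table(
  Orig_Year = sample(2000:2017, n, replace = TRUE),
  Orig_Qtr  = sample(1:4, n, replace = TRUE),
  LoanID    = sprintf("ID%09d", sample.int(n)),
  Amount    = round(runif(n, 1e4, 5e5), 2)   # placeholder numeric column
)
fwrite(DT, "large_file.csv")
A <- fread("large_file.csv", verbose = TRUE)
```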
> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-10-31 21:13:04 UTC; travis
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
A <- fread('large_file.csv')
Read 28%. ETA 00:00 Error in fread("large_file.csv") :
unprotect_ptr: pointer not found
Warning: stack imbalance in '<-', 28 then 29
Warning: stack imbalance in '$', 34 then 33
Warning: stack imbalance in '$', 19 then 20
Error: unprotect_ptr: pointer not found
Warning: stack imbalance in 'lapply', 126 then 125
Warning: stack imbalance in 'lapply', 113 then 114
Warning: stack imbalance in 'lapply', 98 then 102
Warning: stack imbalance in 'lapply', 84 then 86
Warning: stack imbalance in 'lapply', 72 then 73
Warning: stack imbalance in 'lapply', 61 then 60
Warning: stack imbalance in 'lapply', 50 then 49
Warning: stack imbalance in '$', 49 then 50
Warning: stack imbalance in '$', 34 then 35
Error during wrapup: R_Reprotect: only 20 protected items, can't reprotect index 20
# Output of sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.5
loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2
Thanks! Could you pass verbose=TRUE to that fread call, rerun and post the full output please.
Sure.
This time it stopped at 82% instead of 28%. It continues after the imbalance warning, and then I get an "RStudio R Session has stopped working" error box.
> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-10-31 21:13:04 UTC; travis
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.5
loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2
> A <- fread('large_file.csv', verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 40 threads (omp_get_max_threads()=40, nth=40)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file large_file.csv
File opened, size = 6.347GB (6814991444 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 100 lines of 37 fields using quote rule 0
Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
Quote rule picked = 0
fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
Number of sampling jump points = 101 because (6814991442 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264044
'header' determined to be true due to column 1 containing a string on row 1 and a lower type (int32) on row 2
Type codes (jump 000) : 51101071551015107111111111111771111177715 Quote rule 0
Type codes (jump 001) : 511010715510151071111111111117711111777110 Quote rule 0
Type codes (jump 013) : 511010755510151071111111111117711111777110 Quote rule 0
Type codes (jump 028) : 5110107555101510711011111111117711111777110 Quote rule 0
Type codes (jump 042) : 5510107555101510711011111111117711111777110 Quote rule 0
Type codes (jump 068) : 55101075551010510711011111111117711111777110 Quote rule 0
Type codes (jump 082) : 55101075551010510711011111111177711111777110 Quote rule 0
Type codes (jump 091) : 55101075551010510711055555555777711111777510 Quote rule 0
Type codes (jump 100) : 55101075551010510711055555555777711111777510 Quote rule 0
=====
Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 1 to the end of last row: 6814991442
Line length: mean=126.51 sd=8.07 min=100 max=359
Estimated number of rows: 6814991442 / 126.51 = 53867158
Initial alloc = 61748467 rows (53867158 + 14%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 55101075551010510711055555555777711111777510
[10] Allocate memory for the datatable
Allocating 37 column slots (37 - 0 dropped) with 61748467 rows
[11] Read the data
jumps=[0..6520), chunk_size=1045244, total_size=6814991083
Read 62%. ETA 00:00 Warning: stack imbalance in '$', 38 then 36
Warning: stack imbalance in '$', 23 then 24
Read 68%. ETA 00:00 Warning: stack imbalance in '$', 23 then 24
Read 82%. ETA 00:00
Thanks for this output @aadler. Could you try once more, this time with showProgress=FALSE i.e. fread('large_file.csv', verbose = TRUE, showProgress=FALSE). All I can think currently is that it's related to printing the ETA to the console on Windows. Only thread 0 does that Rprintf which I thought was safe, but maybe it isn't. If it works with showProgress=FALSE but not with TRUE, and that's repeatable several times, then I know I'm barking up the right tree. Please also use latest dev 1.10.5 just to be sure.
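To check that the difference is repeatable, the two variants could be run back-to-back a few times; a sketch, assuming the file is in the working directory:

```r
library(data.table)
# Compare showProgress = FALSE vs TRUE over several trials; if the
# stack imbalance only ever appears with TRUE, the ETA printing to the
# console is implicated.
for (i in 1:3) {
  A <- fread("large_file.csv", verbose = TRUE, showProgress = FALSE)
  B <- fread("large_file.csv", verbose = TRUE, showProgress = TRUE)
}
```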
In R's printutils.c, Rprintf calls Rvprintf, which contains this at line 917:
static int printcount = 0;
if (++printcount > 100) {
    R_CheckUserInterrupt();
    printcount = 0;
}
and in freadR.c at line 481 there is this comment:
// Had crashes with R_CheckUserInterrupt() even when called only from
// master thread, to overcome.
So barking up this tree looks promising. I'll replace the call to Rprintf() with REprintf() to avoid its call to R_CheckUserInterrupt() and ask you to try again.
It may feel like it has something to do with fread's rereading, but the reread is also when fread runs longer and so prints more ETA messages, reaching the 100 count in core R; there isn't a reread reported in your output this time, for example. The crash / stack imbalance may also appear random because it depends on how many lines have been printed to the console before fread is called (printcount's value).
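The quoted printutils.c logic can be mimicked in plain R to see why the trigger point moves around; this is an illustrative model only, not R's actual code path:

```r
# Model of Rvprintf's throttle: the interrupt check fires only once per
# 100 print calls, so which ETA message trips it depends on how much
# output preceded the fread call.
printcount <- 0L            # stands in for R's static counter
checks <- 0L
mock_Rprintf <- function() {
  printcount <<- printcount + 1L
  if (printcount > 100L) {
    checks <<- checks + 1L  # stands in for R_CheckUserInterrupt()
    printcount <<- 0L
  }
}
for (i in 1:250) mock_Rprintf()
checks  # 2: the check fired on the 101st and 202nd calls
```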
@aadler Change made and looks ok. (It's just a red cross because the progress meter isn't getting test coverage from the smoke tests.) Please go ahead and try very latest dev.
> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-11-09 04:24:28 UTC; travis
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.5
loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2
You were correct, it seems!
> A <- fread('large_file.csv', verbose = TRUE, showProgress=FALSE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 40 threads (omp_get_max_threads()=40, nth=40)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 0
0/1 column will be read as boolean
[02] Opening the file
Opening file large_file.csv
File opened, size = 6.347GB (6814991444 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 100 lines of 37 fields using quote rule 0
Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
Quote rule picked = 0
fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
Number of sampling jump points = 101 because (6814991442 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264044
'header' determined to be true due to column 1 containing a string on row 1 and a lower type (int32) on row 2
Type codes (jump 000) : 51101071551015107111111111111771111177715 Quote rule 0
Type codes (jump 001) : 511010715510151071111111111117711111777110 Quote rule 0
Type codes (jump 013) : 511010755510151071111111111117711111777110 Quote rule 0
Type codes (jump 028) : 5110107555101510711011111111117711111777110 Quote rule 0
Type codes (jump 042) : 5510107555101510711011111111117711111777110 Quote rule 0
Type codes (jump 068) : 55101075551010510711011111111117711111777110 Quote rule 0
Type codes (jump 082) : 55101075551010510711011111111177711111777110 Quote rule 0
Type codes (jump 091) : 55101075551010510711055555555777711111777510 Quote rule 0
Type codes (jump 100) : 55101075551010510711055555555777711111777510 Quote rule 0
=====
Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 1 to the end of last row: 6814991442
Line length: mean=126.51 sd=8.07 min=100 max=359
Estimated number of rows: 6814991442 / 126.51 = 53867158
Initial alloc = 61748467 rows (53867158 + 14%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 55101075551010510711055555555777711111777510
[10] Allocate memory for the datatable
Allocating 37 column slots (37 - 0 dropped) with 61748467 rows
[11] Read the data
jumps=[0..6520), chunk_size=1045244, total_size=6814991083
[12] Finalizing the datatable
Read 53880720 rows x 37 columns from 6.347GB (6814991444 bytes) file in 00:56.763 wall clock time
Thread buffers were grown 0 times (if all 40 threads each grew once, this figure would be 40)
Final type counts
0 : drop
5 : bool8
0 : bool8
0 : bool8
0 : bool8
14 : int32
0 : int64
10 : float64
0 : float64
0 : float64
8 : string
Rereading 2 columns due to out-of-sample type exceptions.
Column 17 ("XXXX") bumped from 'int32' to 'string' due to <<C>> on row 1040
Column 14 ("XXXX") bumped from 'bool8' to 'float64' due to <<12473.21>> on row 77204
[11] Read the data
jumps=[0..6520), chunk_size=1045244, total_size=6814991083
[12] Finalizing the datatable
Reread 53880720 rows x 2 columns in 00:20.442
Read 53880720 rows. Exactly what was estimated and allocated up front
=============================
0.000s ( 0%) Memory map 6.347GB file
0.549s ( 1%) sep=',' ncol=37 and header detection
0.016s ( 0%) Column type detection using 10049 sample rows
15.322s ( 20%) Allocation of 53880720 rows x 37 cols (12.192GB)
61.318s ( 79%) Reading 6520 chunks of 0.997MB (8261 rows) using 40 threads
= 0.057s ( 0%) Finding first non-embedded \n after each jump
+ 3.851s ( 5%) Parse to row-major thread buffers
+ 38.294s ( 50%) Transpose
+ 19.116s ( 25%) Waiting
20.442s ( 26%) Rereading 2 columns due to out-of-sample type exceptions
77.204s Total
and a second time for testing
> cclasses <- c(rep('integer', 2L), 'character', 'Date', 'numeric',
+ rep('integer', 3L), rep('character', 2L),
+ 'integer', 'Date', rep('numeric', 2L), 'Date',
+ rep('numeric', 12L), rep('integer', 5),
+ rep('numeric', 3L), 'integer', 'character')
> A <- fread('large_file.csv', verbose = TRUE, showProgress=FALSE, colClasses = cclasses)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 40 threads (omp_get_max_threads()=40, nth=40)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 0
0/1 column will be read as boolean
[02] Opening the file
Opening file large_file.csv
File opened, size = 6.347GB (6814991444 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 100 lines of 37 fields using quote rule 0
Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
Quote rule picked = 0
fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
Number of sampling jump points = 101 because (6814991442 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264044
'header' determined to be true due to column 1 containing a string on row 1 and a lower type (int32) on row 2
Type codes (jump 000) : 51101071551015107111111111111771111177715 Quote rule 0
Type codes (jump 001) : 511010715510151071111111111117711111777110 Quote rule 0
Type codes (jump 013) : 511010755510151071111111111117711111777110 Quote rule 0
Type codes (jump 028) : 5110107555101510711011111111117711111777110 Quote rule 0
Type codes (jump 042) : 5510107555101510711011111111117711111777110 Quote rule 0
Type codes (jump 068) : 55101075551010510711011111111117711111777110 Quote rule 0
Type codes (jump 082) : 55101075551010510711011111111177711111777110 Quote rule 0
Type codes (jump 091) : 55101075551010510711055555555777711111777510 Quote rule 0
Type codes (jump 100) : 55101075551010510711055555555777711111777510 Quote rule 0
=====
Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 1 to the end of last row: 6814991442
Line length: mean=126.51 sd=8.07 min=100 max=359
Estimated number of rows: 6814991442 / 126.51 = 53867158
Initial alloc = 61748467 rows (53867158 + 14%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 14 type and 0 drop user overrides : 55101075551010510771077777777777755555777510
[10] Allocate memory for the datatable
Allocating 37 column slots (37 - 0 dropped) with 61748467 rows
[11] Read the data
jumps=[0..6520), chunk_size=1045244, total_size=6814991083
[12] Finalizing the datatable
Read 53880720 rows x 37 columns from 6.347GB (6814991444 bytes) file in 00:31.641 wall clock time
Thread buffers were grown 0 times (if all 40 threads each grew once, this figure would be 40)
Final type counts
0 : drop
0 : bool8
0 : bool8
0 : bool8
0 : bool8
12 : int32
0 : int64
17 : float64
0 : float64
0 : float64
8 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 17 ("XXXX") bumped from 'float64' to 'string' due to <<C>> on row 1040
[11] Read the data
jumps=[0..6520), chunk_size=1045244, total_size=6814991083
[12] Finalizing the datatable
Reread 53880720 rows x 1 columns in 00:33.233
Read 53880720 rows. Exactly what was estimated and allocated up front
=============================
0.000s ( 0%) Memory map 6.347GB file
0.165s ( 0%) sep=',' ncol=37 and header detection
0.016s ( 0%) Column type detection using 10049 sample rows
10.867s ( 17%) Allocation of 53880720 rows x 37 cols (14.262GB)
53.827s ( 83%) Reading 6520 chunks of 0.997MB (8261 rows) using 40 threads
= 0.004s ( 0%) Finding first non-embedded \n after each jump
+ 2.944s ( 5%) Parse to row-major thread buffers
+ 18.906s ( 29%) Transpose
+ 31.973s ( 49%) Waiting
33.233s ( 51%) Rereading 1 columns due to out-of-sample type exceptions
64.874s Total
@mattdowle I ran it three times without showProgress=FALSE and it completed!
> A <- fread('large_file.csv', verbose = TRUE, colClasses = cclasses)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 40 threads (omp_get_max_threads()=40, nth=40)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file large_file.csv
File opened, size = 6.347GB (6814991444 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 100 lines of 37 fields using quote rule 0
Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
Quote rule picked = 0
fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
Number of sampling jump points = 101 because (6814991442 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264044
'header' determined to be true due to column 1 containing a string on row 1 and a lower type (int32) on row 2
Type codes (jump 000) : 51101071551015107111111111111771111177715 Quote rule 0
Type codes (jump 001) : 511010715510151071111111111117711111777110 Quote rule 0
Type codes (jump 013) : 511010755510151071111111111117711111777110 Quote rule 0
Type codes (jump 028) : 5110107555101510711011111111117711111777110 Quote rule 0
Type codes (jump 042) : 5510107555101510711011111111117711111777110 Quote rule 0
Type codes (jump 068) : 55101075551010510711011111111117711111777110 Quote rule 0
Type codes (jump 082) : 55101075551010510711011111111177711111777110 Quote rule 0
Type codes (jump 091) : 55101075551010510711055555555777711111777510 Quote rule 0
Type codes (jump 100) : 55101075551010510711055555555777711111777510 Quote rule 0
=====
Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 1 to the end of last row: 6814991442
Line length: mean=126.51 sd=8.07 min=100 max=359
Estimated number of rows: 6814991442 / 126.51 = 53867158
Initial alloc = 61748467 rows (53867158 + 14%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 14 type and 0 drop user overrides : 55101075551010510771077777777777755555777510
[10] Allocate memory for the datatable
Allocating 37 column slots (37 - 0 dropped) with 61748467 rows
[11] Read the data
jumps=[0..6520), chunk_size=1045244, total_size=6814991083
Read 99%. ETA 00:00
[12] Finalizing the datatable
Read 53880720 rows x 37 columns from 6.347GB (6814991444 bytes) file in 00:35.513 wall clock time
Thread buffers were grown 0 times (if all 40 threads each grew once, this figure would be 40)
Final type counts
0 : drop
0 : bool8
0 : bool8
0 : bool8
0 : bool8
12 : int32
0 : int64
17 : float64
0 : float64
0 : float64
8 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 17 ("XXXX") bumped from 'float64' to 'string' due to <<C>> on row 1040
[11] Read the data
jumps=[0..6520), chunk_size=1045244, total_size=6814991083
[12] Finalizing the datatable
Reread 53880720 rows x 1 columns in 00:19.218
Read 53880720 rows. Exactly what was estimated and allocated up front
=============================
0.006s ( 0%) Memory map 6.347GB file
0.162s ( 0%) sep=',' ncol=37 and header detection
0.016s ( 0%) Column type detection using 10049 sample rows
11.688s ( 21%) Allocation of 53880720 rows x 37 cols (14.262GB)
42.859s ( 78%) Reading 6520 chunks of 0.997MB (8261 rows) using 40 threads
= 0.023s ( 0%) Finding first non-embedded \n after each jump
+ 3.818s ( 7%) Parse to row-major thread buffers
+ 20.892s ( 38%) Transpose
+ 18.127s ( 33%) Waiting
19.218s ( 35%) Rereading 1 columns due to out-of-sample type exceptions
54.731s Total
> A <- fread('large_file.csv', verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 40 threads (omp_get_max_threads()=40, nth=40)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file large_file.csv
File opened, size = 6.347GB (6814991444 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 100 lines of 37 fields using quote rule 0
Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
Quote rule picked = 0
fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
Number of sampling jump points = 101 because (6814991442 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264044
'header' determined to be true due to column 1 containing a string on row 1 and a lower type (int32) on row 2
Type codes (jump 000) : 51101071551015107111111111111771111177715 Quote rule 0
Type codes (jump 001) : 511010715510151071111111111117711111777110 Quote rule 0
Type codes (jump 013) : 511010755510151071111111111117711111777110 Quote rule 0
Type codes (jump 028) : 5110107555101510711011111111117711111777110 Quote rule 0
Type codes (jump 042) : 5510107555101510711011111111117711111777110 Quote rule 0
Type codes (jump 068) : 55101075551010510711011111111117711111777110 Quote rule 0
Type codes (jump 082) : 55101075551010510711011111111177711111777110 Quote rule 0
Type codes (jump 091) : 55101075551010510711055555555777711111777510 Quote rule 0
Type codes (jump 100) : 55101075551010510711055555555777711111777510 Quote rule 0
=====
Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 1 to the end of last row: 6814991442
Line length: mean=126.51 sd=8.07 min=100 max=359
Estimated number of rows: 6814991442 / 126.51 = 53867158
Initial alloc = 61748467 rows (53867158 + 14%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 55101075551010510711055555555777711111777510
[10] Allocate memory for the datatable
Allocating 37 column slots (37 - 0 dropped) with 61748467 rows
[11] Read the data
jumps=[0..6520), chunk_size=1045244, total_size=6814991083
Read 99%. ETA 00:00
[12] Finalizing the datatable
Read 53880720 rows x 37 columns from 6.347GB (6814991444 bytes) file in 00:23.552 wall clock time
Thread buffers were grown 0 times (if all 40 threads each grew once, this figure would be 40)
Final type counts
0 : drop
5 : bool8
0 : bool8
0 : bool8
0 : bool8
14 : int32
0 : int64
10 : float64
0 : float64
0 : float64
8 : string
Rereading 2 columns due to out-of-sample type exceptions.
Column 17 ("XXXX") bumped from 'int32' to 'string' due to <<C>> on row 1040
Column 14 ("XXXX") bumped from 'bool8' to 'float64' due to <<12473.21>> on row 77204
[11] Read the data
jumps=[0..6520), chunk_size=1045244, total_size=6814991083
[12] Finalizing the datatable
Reread 53880720 rows x 2 columns in 00:20.825
Read 53880720 rows. Exactly what was estimated and allocated up front
=============================
0.000s ( 0%) Memory map 6.347GB file
0.173s ( 0%) sep=',' ncol=37 and header detection
0.016s ( 0%) Column type detection using 10049 sample rows
3.212s ( 7%) Allocation of 53880720 rows x 37 cols (12.192GB)
40.976s ( 92%) Reading 6520 chunks of 0.997MB (8261 rows) using 40 threads
= 0.015s ( 0%) Finding first non-embedded \n after each jump
+ 3.793s ( 9%) Parse to row-major thread buffers
+ 18.783s ( 42%) Transpose
+ 18.384s ( 41%) Waiting
20.825s ( 47%) Rereading 2 columns due to out-of-sample type exceptions
44.377s Total
> A <- fread('large_file.csv')
Read 53880720 rows x 37 columns from 6.347GB (6814991444 bytes) file in 00:32.848 wall clock time
Rereading 2 columns due to out-of-sample type exceptions.
Reread 53880720 rows x 2 columns in 00:18.507
Relief! Thanks @aadler!