Data.table: Is there anything to do to speed up reading nanotime in fread

Created on 12 Apr 2020 · 11 Comments · Source: Rdatatable/data.table

Hello, I was trying to get a feel for how efficiently fread reads nanotime. I did the very naive comparison below against kdb.

When not reading nanotimes, fread approximately matches kdb

library(data.table)
library(nanotime)

N <- 1e6 
set.seed(1)
l <- sample(letters, size = N, replace = TRUE)
w <- replicate(expr = paste(sample(letters, size = 5L), collapse = ""), n = N)
n <- nanotime("1970-01-01T00:00:00.000000001+00:00") + 30 * 365 * 86400 * 1e9 * abs(runif(N))
r <- rnorm(N)
dt <- data.table(l = l, w = w, n = n, r = r)
fwrite(dt, "/tmp/dt.txt")

system.time(
    dt2 <- fread("/tmp/dt.txt", showProgress = FALSE)                                  
)

The first 5 runs give the following

   user  system elapsed                                                                                                
  2.352   0.004   1.373                                                                                                
   user  system elapsed                                                                                                
  2.187   0.006   1.110                                                                                                
   user  system elapsed                                                                                                
  1.708   0.011   0.867                                                                                                
   user  system elapsed                                                                                                
  1.693   0.004   0.856                                                                                                
   user  system elapsed                                                                                                
  1.681   0.006   0.850

kdb

q)\t data:("SSSF";enlist",")0:`:/tmp/dt.txt
redacted
q)\t data:("SSSF";enlist",")0:`:/tmp/dt.txt
redacted
q)\t data:("SSSF";enlist",")0:`:/tmp/dt.txt
redacted

When reading nanotimes, fread is slower, while kdb is approximately as fast as when reading symbols

system.time(
    dt2 <- fread("/tmp/dt.txt", colClasses = c("n" = "nanotime"), showProgress = FALSE)
)

timings are:

   user  system elapsed                                                                                                
  2.127   0.001   1.260                                                                                                
   user  system elapsed                                                                                                
  2.368   0.004   1.383                                                                                                
   user  system elapsed                                                                                                
  2.312   0.006   1.346                                                                                                
   user  system elapsed                                                                                                
  2.357   0.011   1.381                                                                                                
   user  system elapsed                                                                                                
  2.313   0.006   1.351

kdb

q)\t data:("SSPF";enlist",")0:`:/tmp/dt.txt
redacted
q)\t data:("SSPF";enlist",")0:`:/tmp/dt.txt
redacted
q)\t data:("SSPF";enlist",")0:`:/tmp/dt.txt
redacted
q)\t data:("SSPF";enlist",")0:`:/tmp/dt.txt
redacted
q)\t data:("SSPF";enlist",")0:`:/tmp/dt.txt
redacted

I know 5 runs is probably insufficient and that mmap is tricky, so the above results might be useless, but the point is: is there something that can be done on the user side to speed things up, or is it just that nanotime is not as efficient at parsing strings as kdb is?
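One rough way to get past the five-run limitation is to wrap the read in a small helper that repeats it and summarises the elapsed times. This is only an illustrative sketch: `time_fread` is a hypothetical helper (not part of data.table), and the small temporary file stands in for /tmp/dt.txt.

```r
library(data.table)

# Hypothetical helper (not part of data.table): repeat fread() and
# collect the elapsed wall-clock time of each run.
time_fread <- function(path, reps = 20L, ...) {
  vapply(
    seq_len(reps),
    function(i) system.time(fread(path, showProgress = FALSE, ...))[["elapsed"]],
    numeric(1)
  )
}

# Small stand-in file; replace with the real path to benchmark
tmp <- tempfile(fileext = ".csv")
fwrite(data.table(l = letters, r = seq_along(letters) / 10), tmp)

elapsed <- time_fread(tmp, reps = 5L)
summary(elapsed)
```

Extra arguments such as `colClasses` are passed through `...` to fread, so the same helper can time both variants of the read.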

session

R version 3.6.2 (2019-12-12)                                                                                           
Platform: x86_64-redhat-linux-gnu (64-bit)                                                                             
Running under: Fedora 31 (Workstation Edition)                                                                                                                                                                                                
Matrix products: default                                                                                               
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so                                                                              
attached base packages:                                                                                                
[1] stats     graphics  grDevices utils     datasets  methods   base                                                                                                                                                                          
other attached packages:                                                                                               
[1] nanotime_0.2.4.5.3 data.table_1.12.9  nvimcom_0.9-83                                                                                                                                                                                    
loaded via a namespace (and not attached):                                                                             
[1] zoo_1.8-7       bit_1.1-15.2    compiler_3.6.2  tools_3.6.2     RcppCCTZ_0.2.7  Rcpp_1.0.4.6    bit64_0.9-7        
[8] grid_3.6.2      lattice_0.20-38

benchmark fread

statquant

Most helpful comment

Not sure what magic you did but it is now 3x faster to load in R on my laptop (0.7s vs 2.1s) and much faster (> 2x) than "the one who must not be named"
(for loading up to millisecond POSIXct resolution it is good enough)

statquant on 22 May 2020

❤3

All 11 comments

This does not address your question at all, but I just wanted to mention it.
Please check whether publishing benchmarks of kdb conflicts with their license. Many (if not most) closed-source projects unfortunately have this kind of restriction in their license. As a result we are unable to publish data.table benchmarks against them.

jangorecki on 12 Apr 2020

Addressing your question. You could try to unclass nanotime before writing to csv, and apply class back after reading from csv into R. This is what #1656 is about.
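A minimal sketch of that round trip, assuming a nanotime version that provides `as.integer64()` for nanotime objects and a `nanotime()` method for integer64 (both are assumptions about the installed packages, and the sample data is made up for illustration):

```r
library(data.table)
library(nanotime)
library(bit64)

# Three timestamps one nanosecond apart (illustrative data)
n <- nanotime("2020-04-12T00:00:00.000000001+00:00") + as.integer64(0:2)
dt <- data.table(n = n, r = c(0.1, 0.2, 0.3))

tmp <- tempfile(fileext = ".csv")

# Write the raw nanoseconds-since-epoch count instead of ISO 8601 text
fwrite(data.table(n = as.integer64(dt$n), r = dt$r), tmp)

# Read the column back as integer64, then re-apply the nanotime class
back <- fread(tmp, colClasses = c(n = "integer64"))
back[, n := nanotime(n)]
```

The file then holds plain integers, so fread does no timestamp parsing at all; the cost is that other readers of the file must also interpret the column as nanoseconds since the epoch.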

jangorecki on 12 Apr 2020

Hello @jangorecki, kdb notoriously forbids it (as I guess you knew), but given they do similar things against data.table (read https://kx.com/blog/kdb-interface-r/) I think it is only fair... Anyway, I redacted the results.

statquant on 12 Apr 2020

👍1

@jangorecki thanks for your suggestion. I would indeed write int64 if the only reader were R, but I have several readers (kdb is one of them), so this is unfortunately out of the question. Are there plans for fread to have a special parser for nanotime (given your suggestion, I am guessing there is a call to nanotime somewhere in fread.[R|c])?

statquant on 12 Apr 2020

@statquant can you try your timings again on the fread-iso8601 branch? Something tells me we won't be able to get the precision you're after with double storage, in any case.
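The double-storage concern can be illustrated with plain arithmetic: a double has a 53-bit significand, so near 2020 epoch values the gap between adjacent representable numbers is much larger than one nanosecond. A small base-R sketch, where `ulp` is a hypothetical helper and the timestamp is simply the date this issue was opened:

```r
# 2020-04-12 00:00:00 UTC as seconds since the epoch
sec <- 1586649600

# Hypothetical helper: gap between adjacent doubles near x,
# given the 53-bit significand of an IEEE 754 double
ulp <- function(x) 2^(floor(log2(x)) - 52)

ulp(sec)        # seconds stored as double: 2^-22 s, roughly 238 ns
ulp(sec * 1e9)  # nanoseconds stored as double: 256 ns

# Adding one nanosecond is lost in both representations
sec + 1e-9 == sec            # TRUE
sec * 1e9 + 1 == sec * 1e9   # TRUE
```

So a double-backed timestamp resolves to roughly a quarter of a microsecond at best for current dates, whereas nanotime avoids this by storing the nanosecond count in a 64-bit integer.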

MichaelChirico on 21 May 2020

statquant on 22 May 2020

❤3

Love to hear it! 😎

MichaelChirico on 22 May 2020

@statquant I just looked at the kdb site link you posted. I don't know what "R experts" they have (Louise Totten?), but in their benchmarks they time as.data.frame rather than the actual operation in question.

jangorecki on 22 May 2020

@MichaelChirico sorry if I seem cheeky, but given what you've done for POSIXct, would pushing towards nanoseconds and casting to nanotime require a lot of additional work?

statquant on 22 May 2020

@MichaelChirico @statquant I thought the same, that would address this issue well.

jangorecki on 22 May 2020

👍1

I don't mind others breaking a license agreement, but the fact is that once a company can claim a loss due to a practice that breaks the license agreement, they could easily win the case in court. The deciding factor is probably how much the lawyers will cost and how much loss they can claim.
Unfortunately this is common practice among closed-source software and applies to many other tools; kdb+ is just one of them.

jangorecki on 22 May 2020
