Hello, I was trying to get a feel for how efficient fread is at reading nanotime. I did the very naive comparison below against kdb.
Without reading nanotimes, fread approximately matches kdb:
R
library(data.table)
library(nanotime)
N <- 1e6
set.seed(1)
l <- sample(letters, size = N, replace = TRUE)
w <- replicate(expr = paste(sample(letters, size = 5L), collapse = ""), n = N)
n <- nanotime("1970-01-01T00:00:00.000000001+00:00") + 30 * 365 * 86400 * 1e9 * abs(runif(N))
r <- rnorm(N)
dt <- data.table(l = l, w = w, n = n, r = r)
fwrite(dt, "/tmp/dt.txt")
system.time(
dt2 <- fread("/tmp/dt.txt", showProgress = FALSE)
)
The first 5 runs give the following:
user system elapsed
2.352 0.004 1.373
user system elapsed
2.187 0.006 1.110
user system elapsed
1.708 0.011 0.867
user system elapsed
1.693 0.004 0.856
user system elapsed
1.681 0.006 0.850
kdb
q)\t data:("SSSF";enlist",")0:`:/tmp/dt.txt
redacted
q)\t data:("SSSF";enlist",")0:`:/tmp/dt.txt
redacted
q)\t data:("SSSF";enlist",")0:`:/tmp/dt.txt
redacted
When reading nanotimes, fread is slower, while kdb is approximately as fast as when reading symbols:
system.time(
dt2 <- fread("/tmp/dt.txt", colClasses = c("n" = "nanotime"), showProgress = FALSE)
)
timings are:
user system elapsed
2.127 0.001 1.260
user system elapsed
2.368 0.004 1.383
user system elapsed
2.312 0.006 1.346
user system elapsed
2.357 0.011 1.381
user system elapsed
2.313 0.006 1.351
kdb
q)\t data:("SSPF";enlist",")0:`:/tmp/dt.txt
redacted
q)\t data:("SSPF";enlist",")0:`:/tmp/dt.txt
redacted
q)\t data:("SSPF";enlist",")0:`:/tmp/dt.txt
redacted
q)\t data:("SSPF";enlist",")0:`:/tmp/dt.txt
redacted
q)\t data:("SSPF";enlist",")0:`:/tmp/dt.txt
redacted
I know 5 runs is probably insufficient and that mmap is tricky, so the above results might be useless, but the point is: is there something that can be done on the user side to speed things up, or is it just that nanotime is not as efficient at parsing strings as kdb is?
session
R version 3.6.2 (2019-12-12)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora 31 (Workstation Edition)
Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] nanotime_0.2.4.5.3 data.table_1.12.9 nvimcom_0.9-83
loaded via a namespace (and not attached):
[1] zoo_1.8-7 bit_1.1-15.2 compiler_3.6.2 tools_3.6.2 RcppCCTZ_0.2.7 Rcpp_1.0.4.6 bit64_0.9-7
[8] grid_3.6.2 lattice_0.20-38
This does not address your question at all, but I just wanted to mention it.
Please check whether publishing benchmarks of kdb conflicts with their license. Many (if not most) closed-source projects unfortunately have this kind of restriction in their license. As a result, we are unable to publish data.table benchmarks against them.
To address your question: you could try to unclass nanotime before writing to CSV, and apply the class back after reading from CSV into R. This is what #1656 is about.
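A minimal sketch of that round trip, in case it helps. It assumes nanotime's `as.integer64()` method and the `nanotime()` constructor for integer64 input behave as documented; the sample values, column names and temporary file are illustrative only, and this is not necessarily what #1656 will end up doing.

```r
library(data.table)
library(bit64)
library(nanotime)

# Illustrative data: three timestamps stored as nanoseconds since the epoch.
ns <- as.integer64("1500000000000000000") + 0:2
dt <- data.table(id = 1:3, n = nanotime(ns))

# Drop the nanotime class before writing, so the file holds plain integers.
dt_out <- copy(dt)[, n := as.integer64(n)]
tmp <- tempfile(fileext = ".csv")
fwrite(dt_out, tmp)

# Read the integers back as integer64, then restore the nanotime class.
dt_in <- fread(tmp, integer64 = "integer64", showProgress = FALSE)
dt_in[, n := nanotime(n)]
```

The trade-off is that the file then contains raw integers rather than ISO 8601 strings, so every other reader has to know the column is nanoseconds since the epoch.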
Hello @jangorecki, kdb notoriously forbids it (as I guess you knew), but given they do similar things against data.table (read https://kx.com/blog/kdb-interface-r/) I think it is only fair... Anyway, I redacted the results.
@jangorecki thanks for your suggestion. I would indeed write int64 if the only reader was R, but I have several readers (kdb is one of them), so this is unfortunately out of the question. Does fread plan to have a special parser for nanotime? (Given your suggestion, I am guessing there is a call to nanotime somewhere in fread.[R|c].)
@statquant can you try your timings again on the fread-iso8601 branch? Something tells me we won't be able to get the precision you're after with double storage, in any case.
Not sure what magic you did but it is now 3x faster to load in R on my laptop (0.7s vs 2.1) and much faster (> 2x) than "the one who must not be named"
(for loading up to millisec POSIXct resolution it is good enough)
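To illustrate the earlier remark about double storage: POSIXct keeps seconds since the epoch in a double, and near a present-day timestamp adjacent doubles are a few hundred nanoseconds apart. A small base R sketch (the value 1.6e9 is just an illustrative timestamp, roughly 2020):

```r
# Roughly a 2020 timestamp, in seconds since the epoch, stored as a double.
secs <- 1.6e9

# Spacing between adjacent doubles at this magnitude (2^30 <= secs < 2^31).
spacing <- 2^30 * .Machine$double.eps

# Adding one nanosecond is below half the spacing, so it is rounded away.
plus_ns <- secs + 1e-9
lost_ns <- plus_ns == secs

# Adding one microsecond is above the spacing, so it changes the value.
plus_us <- secs + 1e-6
kept_us <- plus_us != secs
```

Here `spacing` comes out at about 2.4e-7 seconds, which is why microsecond or millisecond resolution survives in a double while nanosecond resolution does not.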
Love to hear it! 😎
@statquant I just looked at the link to the kdb site you posted. I don't know what "R experts" they have (Louise Totten?), but in their benchmarks they time as.data.frame rather than the actual operation in question.
@MichaelChirico sorry if I seem cheeky, but given what you've done for POSIXct, would pushing towards nanoseconds and casting to nanotime require a lot of additional work?
@MichaelChirico @statquant I thought the same, that would address this issue well.
I don't mind others breaking a license agreement, but the fact is that once a company can claim a loss due to a practice that breaks the license agreement, it could easily win the case in court. The deciding factor is probably how much the lawyers will cost and how much loss they can claim.
Unfortunately it is a common practice among closed-source software and applies to many other tools; kdb+ is just one of them.