New function to round IDate and ITime together into an integer-stored period interval, carrying attributes for the unit and the number of units per period. An origin argument also allows storing smaller values in the integer, so it is easy to keep them below 100k, which according to ?IDateTime gives an additional speed up.
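To illustrate the idea, here is a minimal base-R sketch of an integer-stored period index, not the actual data.table implementation; `to_period`, the unit sizes and the attribute names are my own assumptions:

```r
# Hypothetical sketch: store a timestamp as an integer count of
# fixed-size periods (amount * unit seconds) since an origin.
to_period <- function(x, amount = 15L, unit_secs = 60L,
                      origin = as.POSIXct("1970-01-01", tz = "UTC"),
                      type = c("floor", "ceiling")) {
  type <- match.arg(type)
  secs <- as.numeric(x) - as.numeric(origin)
  n <- secs / (amount * unit_secs)
  p <- if (type == "floor") floor(n) else ceiling(n)
  structure(as.integer(p), unit = "mins", amount = amount, origin = origin)
}

x <- as.POSIXct("1970-01-01 00:31:00", tz = "UTC")
to_period(x)                     # 2: third 15-minute period (0-based), floored
to_period(x, type = "ceiling")   # 3
```

Because the stored value is a plain integer with attributes, grouping and joins on it stay in fast integer code paths.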
Some benchmark: SO.
man online: https://jangorecki.github.io/drat/iperiod/
install.packages("data.table", repos = "https://jangorecki.github.io/drat/iperiod")
@tshort FYI
First benchmark looks promising, on tick data in 22M rows.
# Unit: seconds
# expr min lq mean median uq max neval
# xts 6.916758 7.028135 7.267980 7.261469 7.507824 7.632221 4
# dt 1.432376 1.435704 1.454868 1.446551 1.474032 1.493994 4
edit:
the code was in a gist
which is not available now, but it may eventually start working again if GitHub support manages to recover the deleted gist.
The following package https://github.com/lsilvest/nanoival (now part of nanotime) is planning to address this feature, offering better precision than my implementation. I haven't looked into its internals yet, so I am not sure how different it is from my work on an interval type: https://github.com/Rdatatable/data.table/compare/master...jangorecki:iunit
@lsilvest I have seen the recent development in nanotime but cannot find an implementation of a period type that would correspond to my implementation mentioned here. nanoival looks nice, but its features have to compromise its performance. If a user is interested in truncating POSIXct (or nanotime) to time intervals, like 1 hour, then what would be the fastest way to do it in nanotime?
nanoival is a different concept and cannot be used as a replacement for IPeriod. The main role of nanoival is to construct complex nanotime subsetting (and because of the set operations defined on nanoival it is possible to capture simultaneity quite nicely among other things).
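A rough base-R sketch of that set-operation idea, with intervals as parallel start/end vectors; this is an illustration only, not the nanoival API, which uses proper interval types with open/closed ends and nanosecond precision:

```r
# Two half-open intervals [start, end) and a set of time points;
# ask which points fall inside any interval.
starts <- as.POSIXct(c("2020-01-01 09:00:00", "2020-01-01 10:00:00"), tz = "UTC")
ends   <- as.POSIXct(c("2020-01-01 09:30:00", "2020-01-01 10:30:00"), tz = "UTC")
pts    <- as.POSIXct(c("2020-01-01 09:15:00", "2020-01-01 09:45:00"), tz = "UTC")

s <- as.numeric(starts)
e <- as.numeric(ends)
inside <- vapply(as.numeric(pts), function(p) any(s <= p & p < e), logical(1))
inside  # TRUE FALSE
```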
From a quick look at the IPeriod examples, it seems it is mainly used for the concept of aggregation. My view is that aggregation is better seen as a mapping from one time series to another, so it's more intuitive to keep the original point in time type (e.g. POSIXct or nanotime) rather than creating a new one.
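The mapping view of aggregation can be sketched in base R while keeping POSIXct as the bucket type (sample data made up for illustration):

```r
x <- as.POSIXct(c("2020-01-01 09:03:10", "2020-01-01 09:07:55",
                  "2020-01-01 09:16:02"), tz = "UTC")
price <- c(100, 101, 99)

# map each timestamp to the start of its 15-minute bucket; the key is
# still a POSIXct, no new time type is introduced
bucket <- as.POSIXct((as.numeric(x) %/% 900) * 900,
                     origin = "1970-01-01", tz = "UTC")
tapply(price, format(bucket, "%H:%M"), mean)
# 09:00 -> 100.5, 09:15 -> 99
```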
In addition to aggregation, the other very closely linked concept is alignment. @eddelbuettel and I have envisioned further developments to the nanotime package for generalised aggregation and alignment (see the experimental package dtts.utils).
I think we can provide something very similar to IPeriod with some of the code we already have, and it would be great to work with you to make sure we make nanotime as useful as possible when used with data.table.
I have created eddelbuettel/nanotime#64 to keep track of this enhancement.
@lsilvest Thanks. Yes, you understood the IPeriod class correctly; it is only for aggregation.
Great to hear that! Looking forward to this functionality. Closing this issue then, as it fits much better in the nanotime package.
First benchmark looks promising, on tick data in 22M rows.
# Unit: seconds
#  expr      min       lq     mean   median       uq      max neval
#   xts 6.916758 7.028135 7.267980 7.261469 7.507824 7.632221     4
#    dt 1.432376 1.435704 1.454868 1.446551 1.474032 1.493994     4
I know this was 5 years ago, but I'm curious what xts functionality you used here. I'd really appreciate if you could give me the command to replicate these results. I'd like to see if there's a way to improve xts' performance.
@joshuaulrich I contacted GitHub support; I hope they will recover my deleted gist where I kept that code
https://gist.github.com/jangorecki/12a6a219b203cddc9a34
Thanks, I appreciate it! I would also really appreciate it if you mention me when you benchmark xts' performance. I'm always looking for ways to make xts faster.
@joshuaulrich I got reply from GH support
We only keep backups on our servers for a short period of time after they're deleted, so unfortunately we won't be able to restore this for you.
There is a script in https://stackoverflow.com/a/33137917/2490497 but I think it does not address the timings I posted above (and that you asked about), as the input object there is of a different class.
It does indeed seem that the different class of the input is the major factor behind such a speed difference.
If I had an IDateTime class object (which is just a list of two integers), then it might be faster than xts, but xts works on POSIXct directly. On the POSIXct class the IPeriod approach is much slower, mostly because of inefficient ITime object creation, which goes through the POSIXlt class, so there is some room for improvement there. It is a pity I cannot retrieve the old script, as I am not able to reproduce similar timings even using IDateTime.
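One possible improvement is to derive the seconds-of-day integer with modulo arithmetic instead of going through POSIXlt. A sketch, assuming UTC and whole-second POSIXct values; `fast_itime` is my own name, not a data.table function:

```r
fast_itime <- function(x) {
  stopifnot(inherits(x, "POSIXct"))
  # seconds since midnight UTC, no POSIXlt round trip
  as.integer(as.numeric(x) %% 86400)
}

x <- as.POSIXct("1970-01-02 01:00:30", tz = "UTC")
fast_itime(x)  # 3630 seconds into the day
```

For non-UTC time zones the UTC offset would have to be added before the modulo, which is where the cost of a correct general implementation comes back in.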
wget https://cran.r-project.org/src/contrib/Archive/zoo/zoo_1.7-10.tar.gz
R CMD INSTALL zoo_1.7-10.tar.gz
wget https://cran.r-project.org/src/contrib/Archive/xts/xts_0.9-7.tar.gz
R CMD INSTALL xts_0.9-7.tar.gz
wget https://cran.r-project.org/src/contrib/Archive/chron/chron_2.3-44.tar.gz
R CMD INSTALL chron_2.3-44.tar.gz
R -q -e 'install.packages("data.table", repos="https://jangorecki.github.io/drat/iperiod", method="curl")'
# get tick data
wget https://api.bitcoincharts.com/v1/csv/krakenEUR.csv.gz
gzip -d krakenEUR.csv.gz
wc -l krakenEUR.csv
#36310761 krakenEUR.csv ## 36M rows
library(data.table)
library(xts)
Sys.setenv(TZ="UTC")
d = fread("krakenEUR.csv")
x = as.POSIXct(d[[1L]], origin="1970-01-01")
system.time(
xt <- align.time(x, 15 * 60)
)
# user system elapsed
# 1.484 0.436 1.922
id = function(x) setattr(as.integer(as.numeric(x)%/%86400L), "class", c("IDate", "Date"))
it = function(x) as.ITime(unclass(x), origin="1970-01-01")
system.time(
dt <- as.IPeriod(x = id(x), unit = "mins", amount = 15L, origin = "1970-01-01", type = "ceiling", itime = it(x))
)
# user system elapsed
# 8.004 1.704 9.708
i = id(x)
t = it(x) ## this is actually the bottleneck because it goes via POSIXlt
system.time(
dt2 <- as.IPeriod(x = i, unit = "mins", amount = 15L, origin = "1970-01-01", type = "ceiling", itime = t)
)
# user system elapsed
# 1.308 0.272 1.580
sessionInfo()
#R version 3.1.0 (2014-04-10)
#Platform: x86_64-unknown-linux-gnu (64-bit)
#
#locale:
# [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
# [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
# [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
# [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
# [9] LC_ADDRESS=C LC_TELEPHONE=C
#[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#
#attached base packages:
#[1] stats graphics grDevices utils datasets methods base
#
#other attached packages:
#[1] xts_0.9-7 zoo_1.7-10 data.table_1.9.7
#
#loaded via a namespace (and not attached):
#[1] chron_2.3-44 grid_3.1.0 lattice_0.20-29
I am looking forward to https://github.com/eddelbuettel/nanotime/issues/64, which will probably move all of that to C++
Thanks Jan, I really appreciate you taking the time to find that for me!
Tried out something similar to the Stack Overflow code, and I get different results, I suppose because of code changes since the original performance numbers. I find that xts, shhhhi and IPeriod are within the same ballpark. Here is the code I ran:
library(rbenchmark)
library(data.table)
library(xts)
library(nanotime)
shhhhi <- function(your.time){
strptime("1970-01-01", "%Y-%m-%d", tz="UTC") + round(as.numeric(your.time)/900)*900
}
PLapointe <- function(a){
as.POSIXlt(round(as.double(a)/(15*60))*(15*60),origin=(as.POSIXlt('1970-01-01')))
}
shhhhi_nanotime <- function(your.time){
nanotime("1970-01-01", tz="UTC") + round(as.integer64(unclass(your.time))/900e9)*900e9
}
dt <- fread("~/temp/.btceUSD.csv")
x = as.POSIXct(dt[[1L]], tz="UTC", origin="1970-01-01")
idt <- IDateTime(x)
idate <- idt$idate
itime <- idt$itime
nn <- as.nanotime(x)
benchmark(
shhhhi = shhhhi(x),
PLapointe = PLapointe(x),
xts = align.time(x, 15*60),
ip = periodize(idate, itime, "mins", 15),
"ip_posix" = as.POSIXct(periodize(idate, itime, "mins", 15), tz="UTC"),
"nanotime (period)" = ceilingtogrid(nn, as.nanoperiod("00:15:00"), "UTC"),
"nanotime (duration)" = ceilingtogrid(nn, as.nanoduration("00:15:00"), "UTC"),
"nanotime (shhhhi)" = shhhhi_nanotime(x),
replications=10,
columns = c("test", "replications", "elapsed", "relative"))
And here are the results I get:
test replications elapsed relative
4 ip 10 6.009 1.086
5 ip_posix 10 11.571 2.091
7 nanotime (duration) 10 6.724 1.215
6 nanotime (period) 10 7.927 1.433
8 nanotime (shhhhi) 10 11.408 2.062
2 PLapointe 10 45.142 8.159
1 shhhhi 10 6.657 1.203
3 xts 10 5.533 1.000
So xts has slightly better performance than IPeriod when the latter is not dealing with POSIXct. xts simply uses the modulo operator:
align.time.POSIXct <- function(x, n=60, ...) {
if(n <= 0) stop("'n' must be positive")
structure(unclass(x) + (n - unclass(x) %% n),class=c("POSIXct","POSIXt"))
}
whereas IPeriod uses a representation that is the number of unit*amount intervals since the origin (i.e. in this case the number of 15-minute periods), and goes through one multiplication and one division, similar to shhhhi but without the rounding:
iunits_in_day = as.integer(iunits_in_day)
idate_n = (unclass(x) - as.integer(as.Date(origin))) * iunits_in_day
itime_n = unclass(itime) %/% secs_in_iunit
as.IPeriod(idate_n + itime_n, unit, amount, origin, type)
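A worked numeric sketch of that arithmetic for 15-minute periods ("mins", amount 15), with values chosen for illustration:

```r
iunits_in_day <- 96L                # 24*60/15 fifteen-minute units per day
secs_in_iunit <- 900L               # 15*60 seconds per unit
idate <- 18262L                     # days since 1970-01-01 (2020-01-01)
itime <- 32590L                     # seconds into the day (09:03:10)

idate_n <- idate * iunits_in_day    # whole days expressed in 15-minute units
itime_n <- itime %/% secs_in_iunit  # completed units within the day
idate_n + itime_n                   # integer period index since origin: 1753188
```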
The experimental version in nanotime takes a different approach: it constructs a grid first and then aligns to that grid, paying a price for the grid construction, but this comes with some flexibility, as it can handle an arbitrary aggregation length with days and months, and timezone correctness when operating on a nanoperiod.
That said, I have reservations on all these functions and I will elaborate further in eddelbuettel/nanotime#64.