New function to round IDate and ITime together into an integer-stored period interval, carrying attributes for the unit and the number of units per period. An origin argument also allows storing smaller values in the integer, so it is easy to keep them below 100k, which according to ?IDateTime gives an additional speed up.
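To illustrate the idea, here is a minimal base-R sketch of an integer-stored period index, not the actual data.table implementation; `to_period`, the unit sizes and the attribute names are my own assumptions:

```r
# Hypothetical sketch: store a timestamp as an integer count of
# fixed-size periods (amount * unit seconds) since an origin.
to_period <- function(x, amount = 15L, unit_secs = 60L,
                      origin = as.POSIXct("1970-01-01", tz = "UTC"),
                      type = c("floor", "ceiling")) {
  type <- match.arg(type)
  secs <- as.numeric(x) - as.numeric(origin)
  n <- secs / (amount * unit_secs)
  p <- if (type == "floor") floor(n) else ceiling(n)
  structure(as.integer(p), unit = "mins", amount = amount, origin = origin)
}

x <- as.POSIXct("1970-01-01 00:31:00", tz = "UTC")
to_period(x)                     # 2: third 15-minute period (0-based), floored
to_period(x, type = "ceiling")   # 3
```

Because the stored value is a plain integer with attributes, grouping and joins on it stay in fast integer code paths.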
Some benchmark: SO.
man online: https://jangorecki.github.io/drat/iperiod/
install.packages("data.table", repos = "https://jangorecki.github.io/drat/iperiod")
@tshort FYI
First benchmark looks promising, on tick data in 22M rows.
# Unit: seconds
# expr min lq mean median uq max neval
# xts 6.916758 7.028135 7.267980 7.261469 7.507824 7.632221 4
# dt 1.432376 1.435704 1.454868 1.446551 1.474032 1.493994 4
edit:
the code was in a gist
which is not available now, but it may eventually start working again if GitHub support manages to recover the deleted gist.
The following package https://github.com/lsilvest/nanoival (now part of nanotime) is planning to address this feature, offering better precision than my implementation. I haven't looked into its internals yet, so I am not sure how different it is from my work on an interval type: https://github.com/Rdatatable/data.table/compare/master...jangorecki:iunit
@lsilvest I have seen the recent development in nanotime but cannot find an implementation of a period type that would correspond to my implementation mentioned here. nanoival looks nice, but its features have to compromise its performance. If a user is interested in truncating POSIXct (or nanotime) to time intervals, like 1 hour, then what would be the fastest way to do it in nanotime?
nanoival is a different concept and cannot be used as a replacement for IPeriod. The main role of nanoival is to construct complex nanotime subsetting (and because of the set operations defined on nanoival it is possible to capture simultaneity quite nicely among other things).
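A rough base-R sketch of that set-operation idea, with intervals as parallel start/end vectors; this is an illustration only, not the nanoival API, which uses proper interval types with open/closed ends and nanosecond precision:

```r
# Two half-open intervals [start, end) and a set of time points;
# ask which points fall inside any interval.
starts <- as.POSIXct(c("2020-01-01 09:00:00", "2020-01-01 10:00:00"), tz = "UTC")
ends   <- as.POSIXct(c("2020-01-01 09:30:00", "2020-01-01 10:30:00"), tz = "UTC")
pts    <- as.POSIXct(c("2020-01-01 09:15:00", "2020-01-01 09:45:00"), tz = "UTC")

s <- as.numeric(starts)
e <- as.numeric(ends)
inside <- vapply(as.numeric(pts), function(p) any(s <= p & p < e), logical(1))
inside  # TRUE FALSE
```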
From a quick look at the IPeriod examples, it seems it is mainly used for the concept of aggregation. My view is that aggregation is better seen as a mapping from one time series to another, so it's more intuitive to keep the original point in time type (e.g. POSIXct or nanotime) rather than creating a new one.
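The mapping view of aggregation can be sketched in base R while keeping POSIXct as the bucket type (sample data made up for illustration):

```r
x <- as.POSIXct(c("2020-01-01 09:03:10", "2020-01-01 09:07:55",
                  "2020-01-01 09:16:02"), tz = "UTC")
price <- c(100, 101, 99)

# map each timestamp to the start of its 15-minute bucket; the key is
# still a POSIXct, no new time type is introduced
bucket <- as.POSIXct((as.numeric(x) %/% 900) * 900,
                     origin = "1970-01-01", tz = "UTC")
tapply(price, format(bucket, "%H:%M"), mean)
# 09:00 -> 100.5, 09:15 -> 99
```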
In addition to aggregation, the other very closely linked concept is alignment. @eddelbuettel and I have envisioned further developments to the nanotime package for generalised aggregation and alignment (see the experimental package dtts.utils).
I think we can provide something very similar to IPeriod with some of the code we already have, and it would be great to work with you to make sure we make nanotime as useful as possible when used with data.table.
I have created eddelbuettel/nanotime#64 to keep track of this enhancement.
@lsilvest Thanks. Yes, you understood the IPeriod class correctly; it is only for aggregation.
Great to hear that! Looking forward to this functionality. Closing this issue then, as it fits much better in the nanotime package.
First benchmark looks promising, on tick data in 22M rows.
# Unit: seconds
#  expr      min       lq     mean   median       uq      max neval
#   xts 6.916758 7.028135 7.267980 7.261469 7.507824 7.632221     4
#    dt 1.432376 1.435704 1.454868 1.446551 1.474032 1.493994     4
I know this was 5 years ago, but I'm curious what xts functionality you used here. I'd really appreciate if you could give me the command to replicate these results. I'd like to see if there's a way to improve xts' performance.
@joshuaulrich I contacted GitHub support; I hope they will recover my deleted gist where I kept that code
https://gist.github.com/jangorecki/12a6a219b203cddc9a34
Thanks, I appreciate it! I would also really appreciate it if you mention me when you benchmark xts' performance. I'm always looking for ways to make xts faster.
@joshuaulrich I got reply from GH support
We only keep backups on our servers for a short period of time after they're deleted, so unfortunately we won't be able to restore this for you.
There is a script in https://stackoverflow.com/a/33137917/2490497 but I think it does not address the timings I posted above (and that you asked about), as the input object there is of a different class.
It does indeed seem that the different class of the input is the major factor behind such a speed difference.
If I had an IDateTime class object (which is just a list of two integers), then it might be faster than xts, but xts works on POSIXct directly. On the POSIXct class the IPeriod approach is much slower, mostly because of inefficient ITime object creation, which goes through the POSIXlt class, so there is some room for improvement there. It is a pity I cannot retrieve the old script, as I am not able to reproduce similar timings even using IDateTime.
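One possible improvement is to derive the seconds-of-day integer with modulo arithmetic instead of going through POSIXlt. A sketch, assuming UTC and whole-second POSIXct values; `fast_itime` is my own name, not a data.table function:

```r
fast_itime <- function(x) {
  stopifnot(inherits(x, "POSIXct"))
  # seconds since midnight UTC, no POSIXlt round trip
  as.integer(as.numeric(x) %% 86400)
}

x <- as.POSIXct("1970-01-02 01:00:30", tz = "UTC")
fast_itime(x)  # 3630 seconds into the day
```

For non-UTC time zones the UTC offset would have to be added before the modulo, which is where the cost of a correct general implementation comes back in.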
wget https://cran.r-project.org/src/contrib/Archive/zoo/zoo_1.7-10.tar.gz
R CMD INSTALL zoo_1.7-10.tar.gz
wget https://cran.r-project.org/src/contrib/Archive/xts/xts_0.9-7.tar.gz
R CMD INSTALL xts_0.9-7.tar.gz
wget https://cran.r-project.org/src/contrib/Archive/chron/chron_2.3-44.tar.gz
R CMD INSTALL chron_2.3-44.tar.gz
R -q -e 'install.packages("data.table", repos="https://jangorecki.github.io/drat/iperiod", method="curl")'
# get tick data
wget https://api.bitcoincharts.com/v1/csv/krakenEUR.csv.gz
gzip -d krakenEUR.csv.gz
wc -l krakenEUR.csv
#36310761 krakenEUR.csv ## 36M rows
library(data.table)
library(xts)
Sys.setenv(TZ="UTC")
d = fread("krakenEUR.csv")
x = as.POSIXct(d[[1L]], origin="1970-01-01")
system.time(
xt <- align.time(x, 15 * 60)
)
# user system elapsed
# 1.484 0.436 1.922
id = function(x) setattr(as.integer(as.numeric(x)%/%86400L), "class", c("IDate", "Date"))
it = function(x) as.ITime(unclass(x), origin="1970-01-01")
system.time(
dt <- as.IPeriod(x = id(x), unit = "mins", amount = 15L, origin = "1970-01-01", type = "ceiling", itime = it(x))
)
# user system elapsed
# 8.004 1.704 9.708
i = id(x)
t = it(x) ## this is actually the bottleneck because it goes via POSIXlt
system.time(
dt2 <- as.IPeriod(x = i, unit = "mins", amount = 15L, origin = "1970-01-01", type = "ceiling", itime = t)
)
# user system elapsed
# 1.308 0.272 1.580
sessionInfo()
#R version 3.1.0 (2014-04-10)
#Platform: x86_64-unknown-linux-gnu (64-bit)
#
#locale:
# [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
# [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
# [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
# [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
# [9] LC_ADDRESS=C LC_TELEPHONE=C
#[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#
#attached base packages:
#[1] stats graphics grDevices utils datasets methods base
#
#other attached packages:
#[1] xts_0.9-7 zoo_1.7-10 data.table_1.9.7
#
#loaded via a namespace (and not attached):
#[1] chron_2.3-44 grid_3.1.0 lattice_0.20-29
I am looking forward to https://github.com/eddelbuettel/nanotime/issues/64, which will probably move all of that to C++
Thanks Jan, I really appreciate you taking the time to find that for me!
Tried out something similar to the Stack Overflow code, and I get different results, I suppose because of code changes since the original performance numbers. I find that xts, shhhhi and IPeriod are within the same ballpark. Here is the code I ran:
library(rbenchmark)
library(data.table)
library(xts)
library(nanotime)
shhhhi <- function(your.time){
strptime("1970-01-01", "%Y-%m-%d", tz="UTC") + round(as.numeric(your.time)/900)*900
}
PLapointe <- function(a){
as.POSIXlt(round(as.double(a)/(15*60))*(15*60),origin=(as.POSIXlt('1970-01-01')))
}
shhhhi_nanotime <- function(your.time){
nanotime("1970-01-01", tz="UTC") + round(as.integer64(unclass(your.time))/900e9)*900e9
}
dt <- fread("~/temp/.btceUSD.csv")
x = as.POSIXct(dt[[1L]], tz="UTC", origin="1970-01-01")
idt <- IDateTime(x)
idate <- idt$idate
itime <- idt$itime
nn <- as.nanotime(x)
benchmark(
shhhhi = shhhhi(x),
PLapointe = PLapointe(x),
xts = align.time(x, 15*60),
ip = periodize(idate, itime, "mins", 15),
"ip_posix" = as.POSIXct(periodize(idate, itime, "mins", 15), tz="UTC"),
"nanotime (period)" = ceilingtogrid(nn, as.nanoperiod("00:15:00"), "UTC"),
"nanotime (duration)" = ceilingtogrid(nn, as.nanoduration("00:15:00"), "UTC"),
"nanotime (shhhhi)" = shhhhi_nanotime(x),
replications=10,
columns = c("test", "replications", "elapsed", "relative"))
And here are the results I get:
test replications elapsed relative
4 ip 10 6.009 1.086
5 ip_posix 10 11.571 2.091
7 nanotime (duration) 10 6.724 1.215
6 nanotime (period) 10 7.927 1.433
8 nanotime (shhhhi) 10 11.408 2.062
2 PLapointe 10 45.142 8.159
1 shhhhi 10 6.657 1.203
3 xts 10 5.533 1.000
So xts has slightly better performance than IPeriod when the latter is not dealing with POSIXct. xts simply uses the modulo operator:
align.time.POSIXct <- function(x, n=60, ...) {
if(n <= 0) stop("'n' must be positive")
structure(unclass(x) + (n - unclass(x) %% n),class=c("POSIXct","POSIXt"))
}
whereas IPeriod uses a representation that is the number of unit*amount intervals since the origin (i.e. in this case the number of 15-minute periods), and goes through one multiplication and one division, similar to shhhhi but without the rounding:
iunits_in_day = as.integer(iunits_in_day)
idate_n = (unclass(x) - as.integer(as.Date(origin))) * iunits_in_day
itime_n = unclass(itime) %/% secs_in_iunit
as.IPeriod(idate_n + itime_n, unit, amount, origin, type)
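A worked numeric sketch of that arithmetic for 15-minute periods ("mins", amount 15), with values chosen for illustration:

```r
iunits_in_day <- 96L                # 24*60/15 fifteen-minute units per day
secs_in_iunit <- 900L               # 15*60 seconds per unit
idate <- 18262L                     # days since 1970-01-01 (2020-01-01)
itime <- 32590L                     # seconds into the day (09:03:10)

idate_n <- idate * iunits_in_day    # whole days expressed in 15-minute units
itime_n <- itime %/% secs_in_iunit  # completed units within the day
idate_n + itime_n                   # integer period index since origin: 1753188
```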
The experimental version in nanotime takes a different approach: it constructs a grid first and then aligns to that grid, paying a price for the grid construction, but this comes with some flexibility, as it can handle an arbitrary aggregation length with days and months, and timezone correctness when operating on a nanoperiod.
That said, I have reservations on all these functions and I will elaborate further in eddelbuettel/nanotime#64.