Hello,
I caught myself at joining two data.tables based on timestamp, but with different time-zones (unfortunately).
The joined values were automatically shifted based on first data.table's time-zone...but without warning - or anything else.
Is it possible to throw warning if something like that is happening?
Example:
library(data.table)
library(lubridate)
dt_a <- data.table(time = seq.POSIXt(from = ymd_hms("2020-10-20 06:00:00"),
to = ymd_hms("2020-10-20 10:00:00"),
by = "1 hour"),
value = 1:5)
dt_b <- data.table(time = seq.POSIXt(from = ymd_hms("2020-10-20 06:00:00"),
to = ymd_hms("2020-10-20 10:00:00"),
by = "1 hour"),
value = 1:5)
dt_b[, time := lubridate::force_tz(time, tzone = "CET", roll = F)]
dt_a[dt_b,
on = .(time),
value_b := i.value]
Result:
dt_a
time value value_b
1: 2020-10-20 06:00:00 1 3
2: 2020-10-20 07:00:00 2 4
3: 2020-10-20 08:00:00 3 5
4: 2020-10-20 09:00:00 4 NA
5: 2020-10-20 10:00:00 5 NA
Thanks
Thank you for reporting. It could be useful to know how data.frame behaves on that matter as well.
FWIW, do recall that timezones are somewhat "artificial" in the sense that they are "only" a printing attribute.
Once the class of the object is POSIXct, data from _all_ timezones is represented as "UTC epoch seconds" (seconds since Jan 1 1970 UTC). The timezone is just used to create the string representation.
So IINM the join will be done correctly (in that observations at identical points in time [i.e., identical displacements from epoch time] will be matched even if they occurred in different time zones).
Coercion / ignoring attributes is a bit more of a general issue for data.table, see #4541, #4415, #3911, #3087... probably a few others
The trouble here is that the lubridate function actually alters the values. Interspersing OP's code with some 'naked' printing:
R> dt_b <- data.table(time = seq.POSIXt(from = ymd_hms("2020-10-20 06:00:00"), to = ymd_hms("2020-10-20 10:00:00"), by = "1 hour"), value = 1:5)
R> dt_b[,time]
[1] "2020-10-20 06:00:00 UTC" "2020-10-20 07:00:00 UTC" "2020-10-20 08:00:00 UTC" "2020-10-20 09:00:00 UTC" "2020-10-20 10:00:00 UTC"
R> as.numeric(dt_b[, time])
[1] 1603173600 1603177200 1603180800 1603184400 1603188000
R> dt_b[, time := lubridate::force_tz(time, tzone = "CET", roll = F)]
R> dt_b[,time]
[1] "2020-10-20 06:00:00 CEST" "2020-10-20 07:00:00 CEST" "2020-10-20 08:00:00 CEST" "2020-10-20 09:00:00 CEST" "2020-10-20 10:00:00 CEST"
R> as.numeric(dt_b[, time])
[1] 1603166400 1603170000 1603173600 1603177200 1603180800
!R> dt_a[dt_b, on = .(time), value_b := i.value]
R>
So here the actual numeric holding the value _was_ shifted. On purpose. Should that really trigger a warning?
Computing with dates and times is tricky. Doubly so in the presence of timezones. Most 'professional' work I have seen in many years settles on _one_ TZ, often UTC or whereever the head office is.
@jangorecki the data.frame's way result is same as data.table's...so this is not really data.table issue...sorry
Interesting problem here, when time is used as string and when as numeric, and nobody knows what is used when...for example during extraction of hour attribute the string is used:
> lubridate::hour(dt_a$time)
[1] 6 7 8 9 10
> lubridate::hour(dt_b$time)
[1] 6 7 8 9 10
I can't understand this issue, frankly speaking. As @eddelbuettel commented, lubridate::force_tz() will alter the value, and your example looks expected for me. A more appropriated example, in my opinion, should use lubridate::with_tz():
library(data.table)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:data.table':
#>
#> hour, isoweek, mday, minute, month, quarter, second, wday, week,
#> yday, year
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
dt_a <- data.table(
time = seq.POSIXt(from = ymd_hms("2020-10-20 06:00:00", tz = "UTC"),
to = ymd_hms("2020-10-20 10:00:00", tz = "UTC"),
by = "1 hour"),
value = 1:5
)
dt_a
#> time value
#> 1: 2020-10-20 06:00:00 1
#> 2: 2020-10-20 07:00:00 2
#> 3: 2020-10-20 08:00:00 3
#> 4: 2020-10-20 09:00:00 4
#> 5: 2020-10-20 10:00:00 5
dt_b <- copy(dt_a)
dt_b[, time := lubridate::with_tz(time, tzone = "Asia/Shanghai")]
dt_b
#> time value
#> 1: 2020-10-20 14:00:00 1
#> 2: 2020-10-20 15:00:00 2
#> 3: 2020-10-20 16:00:00 3
#> 4: 2020-10-20 17:00:00 4
#> 5: 2020-10-20 18:00:00 5
dt_a[dt_b,
on = .(time),
value_b := i.value]
dt_a
#> time value value_b
#> 1: 2020-10-20 06:00:00 1 1
#> 2: 2020-10-20 07:00:00 2 2
#> 3: 2020-10-20 08:00:00 3 3
#> 4: 2020-10-20 09:00:00 4 4
#> 5: 2020-10-20 10:00:00 5 5
Created on 2020-10-24 by the reprex package (v0.3.0)
I think we can close this issue. We should not expect posixct to merge to shifted posixct. They should be aligned before merging.
Most helpful comment
The trouble here is that the
lubridatefunction actually alters the values. Interspersing OP's code with some 'naked' printing:So here the actual
numericholding the value _was_ shifted. On purpose. Should that really trigger a warning?Computing with dates and times is tricky. Doubly so in the presence of timezones. Most 'professional' work I have seen in many years settles on _one_ TZ, often UTC or whereever the head office is.