Interval objects in a data frame / tibble column get "jumbled" when the data frame is being sorted with dplyr::arrange
. I hope I'm not reading the documentation wrong, but I don't see why the Intervals should be affected in any way when sorting -- particularly in such a strange and unpredictable way.
library(dplyr)
library(lubridate)
# We create a tibble with columns for two dates (t1, t2) and the interval between them (p)
# I'm using a tibble here, but I get the same issue with data.frame
# I'm using a date here, but I get the same issue with datetime
n = 3
df <- tibble(
x = seq(n),
t1 = today() - duration(sample(n), "days"),
t2 = today() + duration(sample(n), "days"),
p = interval(t1, t2))
# looks as expected
df %>% head()
# # A tibble: 4 脳 4
# x t1 t2 p
# <int> <date> <date> <S4: Interval>
# 1 1 2017-02-07 2017-02-09 2017-02-07 UTC--2017-02-09 UTC
# 2 2 2017-02-05 2017-02-11 2017-02-05 UTC--2017-02-11 UTC
# 3 3 2017-02-04 2017-02-12 2017-02-04 UTC--2017-02-12 UTC
# 4 4 2017-02-06 2017-02-10 2017-02-06 UTC--2017-02-10 UTC
# when using arange to sort the data frame by any column, the intervals in p get messed up.
# The interval's start and end dates are different than before and do not match t1 or t2
df %>% arrange(t1) %>% head()
# # A tibble: 4 脳 4
# x t1 t2 p
# <int> <date> <date> <S4: Interval>
# 1 3 2017-02-04 2017-02-12 2017-02-07 UTC--2017-02-15 UTC
# 2 2 2017-02-05 2017-02-11 2017-02-05 UTC--2017-02-11 UTC
# 3 4 2017-02-06 2017-02-10 2017-02-04 UTC--2017-02-08 UTC
# 4 1 2017-02-07 2017-02-09 2017-02-06 UTC--2017-02-08 UTC
df %>% arrange(t2) %>% head()
# # A tibble: 4 脳 4
# x t1 t2 p
# <int> <date> <date> <S4: Interval>
# 1 1 2017-02-07 2017-02-09 2017-02-07 UTC--2017-02-09 UTC
# 2 4 2017-02-06 2017-02-10 2017-02-05 UTC--2017-02-09 UTC
# 3 2 2017-02-05 2017-02-11 2017-02-04 UTC--2017-02-10 UTC
# 4 3 2017-02-04 2017-02-12 2017-02-06 UTC--2017-02-14 UTC
df %>% arrange(p) %>% head()
# # A tibble: 4 脳 4
# x t1 t2 p
# <int> <date> <date> <S4: Interval>
# 1 1 2017-02-07 2017-02-09 2017-02-07 UTC--2017-02-09 UTC
# 2 4 2017-02-06 2017-02-10 2017-02-05 UTC--2017-02-09 UTC
# 3 2 2017-02-05 2017-02-11 2017-02-04 UTC--2017-02-10 UTC
# 4 3 2017-02-04 2017-02-12 2017-02-06 UTC--2017-02-14 UTC
df %>% arrange(-x) %>% head()
# # A tibble: 4 脳 4
# x t1 t2 p
# <int> <date> <date> <S4: Interval>
# 1 4 2017-02-06 2017-02-10 2017-02-07 UTC--2017-02-11 UTC
# 2 3 2017-02-04 2017-02-12 2017-02-05 UTC--2017-02-13 UTC
# 3 2 2017-02-05 2017-02-11 2017-02-04 UTC--2017-02-10 UTC
# 4 1 2017-02-07 2017-02-09 2017-02-06 UTC--2017-02-08 UTC
# if we keep t1 the same for all rows and sort by t2 it works as expected
# (sorting by t1 also works - in that case the order isn't changed)
df %>%
mutate(t1 = today(), p = interval(t1, t2)) %>%
arrange(t2) %>%
head()
# # A tibble: 4 脳 4
# x t1 t2 p
# <int> <date> <date> <S4: Interval>
# 1 1 2017-02-08 2017-02-09 2017-02-08 UTC--2017-02-09 UTC
# 2 4 2017-02-08 2017-02-10 2017-02-08 UTC--2017-02-10 UTC
# 3 2 2017-02-08 2017-02-11 2017-02-08 UTC--2017-02-11 UTC
# 4 3 2017-02-08 2017-02-12 2017-02-08 UTC--2017-02-12 UTC
# However, if t2 is fixed and we sort by t1, the problem occurs
df %>%
mutate(t2 = today(), p = interval(t1, t2)) %>%
arrange(t1) %>%
head()
# # A tibble: 4 脳 4
# x t1 t2 p
# <int> <date> <date> <S4: Interval>
# 1 3 2017-02-04 2017-02-08 2017-02-07 UTC--2017-02-11 UTC
# 2 2 2017-02-05 2017-02-08 2017-02-05 UTC--2017-02-08 UTC
# 3 4 2017-02-06 2017-02-08 2017-02-04 UTC--2017-02-06 UTC
# 4 1 2017-02-07 2017-02-08 2017-02-06 UTC--2017-02-07 UTC
Problem occurs on lubridate 1.6.0.9009; dplyr 0.5.0
dplyr/tibble doesn't support S4 objects like intervals. See #2221.
Is this lubridate/dplyr/tibble issue clearly documented in any of the package vignettes or help functions? This seems likely to come as a nasty surprise to many users.
For example, this issue isn't brought up anywhere in the _R for Data Science_ chapter on lubridate, which --given the heavy emphasis of the book on using tibbles--would lull many new users into a false sense of security with using interval objects in tibbles.
For reference, here's the smallest reproducible example I could make using dplyr 0.7 and lubridate 1.6.0:
library(lubridate)
library(dplyr)
# Create a dataframe of start dates to be used in intervals
df <- data_frame(Start_Date = as_date(c("1988-08-29",
"2017-06-02", "2017-06-02"),
tz = "UTC"))
# Create a single end date to be used in intervals
End_Date <- as_date("2016-06-05", tz = "UTC")
# Create intervals twice:
# the first time before arranging and filtering rows,
# the second time after arranging and filtering
df %>%
mutate(Interval_1 = interval(start = Start_Date, end = End_Date, tzone = "UTC")) %>%
arrange(Start_Date) %>%
slice(3:4) %>%
mutate(Interval_2 = interval(start = Start_Date, end = End_Date, tzone = "UTC"))
In the result, you can see that the same operation on the same start and end date
produces different intervals, as a result of arrange and slice.
````
Start_Date Interval_1 Interval_2
1 2017-06-02 1988-08-29 UTC--1988-09-01 UTC 2017-06-02 UTC--2017-06-05 UTC
2 2017-06-02 1988-09-28 UTC--1988-10-01 UTC 2017-06-02 UTC--2017-06-05 UTC
```
I have to agree with bschneidr as I just spent way too much time trying to make a column containing 3 different intervals across 1000+ records. Incompatibility with S4 can't be assumed to be general knowledge. I think I am going to send this nugget out to the twittersphere.
Most helpful comment
I have to agree with bschneidr as I just spent way too much time trying to make a column containing 3 different intervals across 1000+ records. Incompatibility with S4 can't be assumed to be general knowledge. I think I am going to send this nugget out to the twittersphere.