Lubridate: Intervals change when sorting data frame with dplyr::arrange

Created on 8 Feb 2017 · 3Comments · Source: tidyverse/lubridate

Interval objects in a data frame / tibble column get "jumbled" when the data frame is being sorted with dplyr::arrange. I hope I'm not reading the documentation wrong, but I don't see why the Intervals should be affected in any way when sorting -- particularly in such a strange and unpredictable way.

library(dplyr)
library(lubridate)

# We create a tibble with columns for two dates (t1, t2) and the interval between them (p)
# I'm using a tibble here, but I get the same issue with data.frame
# I'm using a date here, but I get the same issue with datetime
n = 3
df <- tibble( 
  x = seq(n),
  t1 = today() - duration(sample(n), "days"),
  t2 = today() + duration(sample(n), "days"),
  p = interval(t1, t2))

# looks as expected
df %>% head()

# # A tibble: 4 × 4
#       x         t1         t2                              p
#   <int>     <date>     <date>                 <S4: Interval>
# 1     1 2017-02-07 2017-02-09 2017-02-07 UTC--2017-02-09 UTC
# 2     2 2017-02-05 2017-02-11 2017-02-05 UTC--2017-02-11 UTC
# 3     3 2017-02-04 2017-02-12 2017-02-04 UTC--2017-02-12 UTC
# 4     4 2017-02-06 2017-02-10 2017-02-06 UTC--2017-02-10 UTC

# when using arange to sort the data frame by any column, the intervals in p get messed up. 
# The interval's start and end dates are different than before and do not match t1 or t2
df %>% arrange(t1) %>% head()

# # A tibble: 4 × 4
#       x         t1         t2                              p
#   <int>     <date>     <date>                 <S4: Interval>
# 1     3 2017-02-04 2017-02-12 2017-02-07 UTC--2017-02-15 UTC
# 2     2 2017-02-05 2017-02-11 2017-02-05 UTC--2017-02-11 UTC
# 3     4 2017-02-06 2017-02-10 2017-02-04 UTC--2017-02-08 UTC
# 4     1 2017-02-07 2017-02-09 2017-02-06 UTC--2017-02-08 UTC

df %>% arrange(t2) %>% head()

# # A tibble: 4 × 4
#       x         t1         t2                              p
#   <int>     <date>     <date>                 <S4: Interval>
# 1     1 2017-02-07 2017-02-09 2017-02-07 UTC--2017-02-09 UTC
# 2     4 2017-02-06 2017-02-10 2017-02-05 UTC--2017-02-09 UTC
# 3     2 2017-02-05 2017-02-11 2017-02-04 UTC--2017-02-10 UTC
# 4     3 2017-02-04 2017-02-12 2017-02-06 UTC--2017-02-14 UTC

df %>% arrange(p) %>% head()

# # A tibble: 4 × 4
#       x         t1         t2                              p
#   <int>     <date>     <date>                 <S4: Interval>
# 1     1 2017-02-07 2017-02-09 2017-02-07 UTC--2017-02-09 UTC
# 2     4 2017-02-06 2017-02-10 2017-02-05 UTC--2017-02-09 UTC
# 3     2 2017-02-05 2017-02-11 2017-02-04 UTC--2017-02-10 UTC
# 4     3 2017-02-04 2017-02-12 2017-02-06 UTC--2017-02-14 UTC

df %>% arrange(-x) %>% head()

# # A tibble: 4 × 4
#      x         t1         t2                              p
#  <int>     <date>     <date>                 <S4: Interval>
# 1     4 2017-02-06 2017-02-10 2017-02-07 UTC--2017-02-11 UTC
# 2     3 2017-02-04 2017-02-12 2017-02-05 UTC--2017-02-13 UTC
# 3     2 2017-02-05 2017-02-11 2017-02-04 UTC--2017-02-10 UTC
# 4     1 2017-02-07 2017-02-09 2017-02-06 UTC--2017-02-08 UTC

# if we keep t1 the same for all rows and sort by t2 it works as expected 
# (sorting by t1 also works - in that case the order isn't changed)
df %>% 
  mutate(t1 = today(), p = interval(t1, t2)) %>% 
  arrange(t2) %>% 
  head()

# # A tibble: 4 × 4
#       x         t1         t2                              p
#   <int>     <date>     <date>                 <S4: Interval>
# 1     1 2017-02-08 2017-02-09 2017-02-08 UTC--2017-02-09 UTC
# 2     4 2017-02-08 2017-02-10 2017-02-08 UTC--2017-02-10 UTC
# 3     2 2017-02-08 2017-02-11 2017-02-08 UTC--2017-02-11 UTC
# 4     3 2017-02-08 2017-02-12 2017-02-08 UTC--2017-02-12 UTC

# However, if t2 is fixed and we sort by t1, the problem occurs
df %>% 
  mutate(t2 = today(), p = interval(t1, t2)) %>% 
  arrange(t1) %>% 
  head()

# # A tibble: 4 × 4
#     x         t1         t2                              p
#   <int>     <date>     <date>                 <S4: Interval>
# 1     3 2017-02-04 2017-02-08 2017-02-07 UTC--2017-02-11 UTC
# 2     2 2017-02-05 2017-02-08 2017-02-05 UTC--2017-02-08 UTC
# 3     4 2017-02-06 2017-02-08 2017-02-04 UTC--2017-02-06 UTC
# 4     1 2017-02-07 2017-02-08 2017-02-06 UTC--2017-02-07 UTC

Problem occurs on lubridate 1.6.0.9009; dplyr 0.5.0

Source

Teebusch

👍2

Most helpful comment

I have to agree with bschneidr as I just spent way too much time trying to make a column containing 3 different intervals across 1000+ records. Incompatibility with S4 can't be assumed to be general knowledge. I think I am going to send this nugget out to the twittersphere.

Chicago-Dave on 9 Mar 2018

👍4

All 3 comments

dplyr/tibble doesn't support S4 objects like intervals. See #2221.

vspinu on 8 Feb 2017

Is this lubridate/dplyr/tibble issue clearly documented in any of the package vignettes or help functions? This seems likely to come as a nasty surprise to many users.

For example, this issue isn't brought up anywhere in the _R for Data Science_ chapter on lubridate, which --given the heavy emphasis of the book on using tibbles--would lull many new users into a false sense of security with using interval objects in tibbles.

For reference, here's the smallest reproducible example I could make using dplyr 0.7 and lubridate 1.6.0:

library(lubridate)
library(dplyr)

# Create a dataframe of start dates to be used in intervals
df <- data_frame(Start_Date = as_date(c("1988-08-29",
                                        "2017-06-02", "2017-06-02"),
                                      tz = "UTC"))

# Create a single end date to be used in intervals
End_Date <- as_date("2016-06-05", tz = "UTC")

# Create intervals twice:
# the first time before arranging and filtering rows,
# the second time after arranging and filtering
df %>%
  mutate(Interval_1 = interval(start = Start_Date, end = End_Date, tzone = "UTC")) %>%
  arrange(Start_Date) %>%
  slice(3:4) %>%
  mutate(Interval_2 = interval(start = Start_Date, end = End_Date, tzone = "UTC"))

In the result, you can see that the same operation on the same start and end date
produces different intervals, as a result of arrange and slice.
````

A tibble: 2 x 3

Start_Date Interval_1 Interval_2

1 2017-06-02 1988-08-29 UTC--1988-09-01 UTC 2017-06-02 UTC--2017-06-05 UTC
2 2017-06-02 1988-09-28 UTC--1988-10-01 UTC 2017-06-02 UTC--2017-06-05 UTC
```