Dplyr: Unexpected lubridate interval behavior when filtering dataframe via dplyr

Created on 14 Nov 2017 · 9 Comments · Source: tidyverse/dplyr

Background
I posted this on SO as well, since I'm not entirely certain this is a bug - perhaps just my ignorance.

I am creating a lubridate interval vector using a dataset similar to that which is available in the following chunk:

    dat <- readr::read_rds(url("https://github.com/mienkoja/stack_stash/blob/master/dat.rds?raw=true"))

    dat_mod <- dplyr::mutate(dat, interval = lubridate::interval(lubridate::ymd(start_combo)
                                                                 ,lubridate::ymd(stop_combo)))

Problem

If I try to filter the object via dplyr, I get an unexpected interval returned:

    dplyr::filter(dat_mod, id == 7000)

    ## A tibble: 1 x 4
    #     id start_combo stop_combo                       interval
    #  <int>       <dbl>      <dbl>                 <S4: Interval>
    #1  7000    20170802   20170816 2016-10-11 UTC--2016-10-25 UTC

My own investigation

Every interval (viewed via dplyr::filter) appears to be the same length as the expected interval, but all anchored (incorrectly) to 2016-10-11 UTC.

I suspect this is a dplyr::filter bug as I can search by bracket notation and get the expected result.

    dat_mod[dat_mod$id == 7000, ]

    ## A tibble: 1 x 4
    #     id start_combo stop_combo                       interval
    #  <int>       <dbl>      <dbl>                 <S4: Interval>
    #1  7000    20170802   20170816 2017-08-02 UTC--2017-08-16 UTC

Also, this does not appear to be an artifact of what is getting displayed in the output. If I assign the filtered object to a new object, the incorrect interval is preserved:

    dat_mod2 <- dplyr::filter(dat_mod, id == 7000)
    dat_mod2
    ## A tibble: 1 x 4
    #     id start_combo stop_combo                       interval
    #  <int>       <dbl>      <dbl>                 <S4: Interval>
    #1  7000    20170802   20170816 2016-10-11 UTC--2016-10-25 UTC

All 9 comments

Hi, I guess this is the same kind of bug as #2568, as suggested on SO, and #1581 is an even closer match. If you call str() on the interval column, you can see that filter() failed to subset the @start slot. This is because the S4 method of [ for the Interval class is not called properly inside filter().

Let's set our hope on vctrs...

    library(dplyr)  # provides %>% and pull()

    # filter(): .Data has 1 element, but @start still has all 7632
    dplyr::filter(dat_mod, .data$id == 7000) %>%
      pull(interval) %>%
      str()
    #> Formal class 'Interval' [package "lubridate"] with 3 slots
    #>   ..@ .Data: num 1209600
    #>   ..@ start: POSIXct[1:7632], format: "2016-10-11" NA NA NA ...
    #>   ..@ tzone: chr "UTC"

    # `[`: both .Data and @start are subset to the matching row
    dat_mod[dat_mod$id == 7000, ] %>%
      pull(interval) %>%
      str()
    #> Formal class 'Interval' [package "lubridate"] with 3 slots
    #>   ..@ .Data: num 1209600
    #>   ..@ start: POSIXct[1:1], format: "2017-08-02"
    #>   ..@ tzone: chr "UTC"
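
The mismatch above can be reproduced without lubridate or dplyr. What follows is only a minimal sketch: `ToyInterval` is a hypothetical stand-in class (not lubridate's real definition) that, like Interval, keeps its lengths in `.Data` and a parallel `start` slot. It assumes that skipping the class's `[` method amounts to subsetting `.Data` alone.

```r
library(methods)

# Hypothetical stand-in for an Interval-like class: numeric lengths in .Data
# plus a parallel `start` slot that must be subset together with them.
setClass("ToyInterval", contains = "numeric", slots = c(start = "numeric"))

# The S4 `[` method keeps .Data and `start` aligned.
setMethod("[", "ToyInterval", function(x, i, j, ..., drop = TRUE) {
  new("ToyInterval", x@.Data[i], start = x@start[i])
})

iv <- new("ToyInterval", c(10, 20, 30), start = c(100, 200, 300))

# Going through `[` dispatches to the method above: one length, one start.
good <- iv[3]

# Simulates subsetting only the underlying vector: `start` keeps all three
# values, so the single remaining length is paired with the first start.
bad <- new("ToyInterval", iv@.Data[3], start = iv@start)

length(good@start)  # 1
length(bad@start)   # 3
```

In `bad`, the one remaining length lines up with `start[1]`, which is the same pattern as every filtered interval being anchored to 2016-10-11 in the output above.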

Thanks @yutannihilation!

Since this seems to be a much more pervasive issue (and one that does not appear to admit of a simple solution, given how long it's been since some of these issues were posted), perhaps we could, for now, just get a warning based on the class of the columns?

Maybe something like this would work for my current problem?

    filter <- function(.data, ...) {
      # Count the columns whose class vector includes "Interval"
      num_interval_cols <- sum(unlist(lapply(.data, class)) == "Interval")
      if (num_interval_cols > 0) {
        warning("S4 Interval class not currently supported inside filter(). Results may not be accurate.")
      }
      UseMethod("filter")
    }

Thanks!

"Maybe something like this would work for my current problem?"

I think so. Another idea is to add a row ID column, filter, and then subset the original data.frame by the row IDs that remain in the result. I don't think my code is great, but SO is a good place to ask for a better version of it :)

    library(magrittr)  # for %>%

    myfilter <- function(d, ...) {
      preds <- rlang::quos(...)

      result <- d %>%
        # add row IDs to distinguish rows in the result
        tibble::rowid_to_column(var = "rowid") %>%
        dplyr::filter(!!! preds)

      # overwrite S4 cols with data properly subsetted by `[`
      cols_S4 <- colnames(result)[purrr::map_lgl(result, ~ isS4(x = .))]
      result[, cols_S4] <- d[result$rowid, cols_S4]

      # remove row ID column
      dplyr::select(result, -rowid)
    }

Would the tidyverse team entertain a pure R solution to this? I believe the filter method is written in C - correct?

Ah, if you took the code above to be a "solution", it isn't one. I meant it as a temporary workaround until the vctrs package does the right thing.

Thanks. Do you think #2432 would fix this issue as well?

I'm not quite sure how #2432 will be implemented, but I think so. The root cause is the lack of a good way to dispatch the correct method for non-base types, which is the same root cause as in the other issues listed there.

I'll close this now as a duplicate of #2432.

At the moment, we have a workaround in place to essentially refuse to deal with interval objects.

    > dplyr::filter(dat_mod, id == 7000)
    Error in filter_impl(.data, quo) :
      Column `interval` classes Period and Interval from lubridate are currently not supported.

This is of course temporary until we deal with #2432, but at least this is less surprising.
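
While filter() refuses Interval columns, one way to sidestep the error is to filter first, when the data frame holds only base-type columns, and build the interval afterwards. This is only a sketch and not something the dplyr maintainers recommended in this thread. `dat_small` is a hypothetical stand-in for `dat`: only the id 7000 row comes from the issue, and the other row is made up for illustration.

```r
# Hypothetical stand-in for `dat`; the first row is invented, the second
# matches the id 7000 row shown in the issue.
dat_small <- data.frame(
  id = c(6999L, 7000L),
  start_combo = c(20161011, 20170802),
  stop_combo = c(20161025, 20170816)
)

# Filter while the frame holds only base-type columns ...
row_7000 <- dplyr::filter(dat_small, id == 7000)

# ... then build the S4 Interval column on the already-filtered rows.
row_7000 <- dplyr::mutate(
  row_7000,
  interval = lubridate::interval(
    lubridate::ymd(start_combo),
    lubridate::ymd(stop_combo)
  )
)
```

The cost is that the interval cannot be used inside the filter condition itself; conditions on the period have to be written against `start_combo` and `stop_combo` instead.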

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/
