Background
I posted this on SO as well, since I'm not entirely certain this is a bug - perhaps just my ignorance.
I am creating a lubridate interval vector from a dataset similar to the one loaded in the following chunk:
dat <- readr::read_rds(url("https://github.com/mienkoja/stack_stash/blob/master/dat.rds?raw=true"))
dat_mod <- dplyr::mutate(dat, interval = lubridate::interval(lubridate::ymd(start_combo)
,lubridate::ymd(stop_combo)))
Problem
If I try to filter the object via dplyr, I get an unexpected interval returned:
dplyr::filter(dat_mod, id == 7000)
## A tibble: 1 x 4
# id start_combo stop_combo interval
# <int> <dbl> <dbl> <S4: Interval>
#1 7000 20170802 20170816 2016-10-11 UTC--2016-10-25 UTC
My own investigation
Every interval (viewed via dplyr::filter) appears to be the same length as the expected interval, but all anchored (incorrectly) to 2016-10-11 UTC.
I suspect this is a dplyr::filter bug, because I can subset with bracket notation and get the expected result.
dat_mod[dat_mod$id == 7000, ]
## A tibble: 1 x 4
# id start_combo stop_combo interval
# <int> <dbl> <dbl> <S4: Interval>
#1 7000 20170802 20170816 2017-08-02 UTC--2017-08-16 UTC
Also, this does not appear to be an artifact of what is getting displayed in the output. If I assign the filtered object to a new object, the incorrect interval is preserved:
dat_mod2 <- dplyr::filter(dat_mod, id == 7000)
dat_mod2
## A tibble: 1 x 4
# id start_combo stop_combo interval
# <int> <dbl> <dbl> <S4: Interval>
#1 7000 20170802 20170816 2016-10-11 UTC--2016-10-25 UTC
Hi, I guess this is the same kind of bug as #2568, as suggested on SO, and #1581 is an even closer match. If you run str() on the interval column, you can see that filter() failed to subset the @start slot. This is because the S4 method of [ for the Interval class is not properly called inside filter().
Let's set our hope on vctrs...
library(dplyr)  # for %>% and pull()
dplyr::filter(dat_mod, .data$id == 7000) %>%
pull(interval) %>%
str()
#> Formal class 'Interval' [package "lubridate"] with 3 slots
#> ..@ .Data: num 1209600
#> ..@ start: POSIXct[1:7632], format: "2016-10-11" NA NA NA ...
#> ..@ tzone: chr "UTC"
dat_mod[dat_mod$id == 7000, ] %>%
pull(interval) %>%
str
#> Formal class 'Interval' [package "lubridate"] with 3 slots
#> ..@ .Data: num 1209600
#> ..@ start: POSIXct[1:1], format: "2017-08-02"
#> ..@ tzone: chr "UTC"
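To make the mechanism concrete, here is a minimal sketch that does not need dplyr at all. It assumes only that lubridate is installed; the two dates are made up for illustration. It contrasts lubridate's own [ method, which subsets both the numeric lengths and the @start slot, with what you get when only the underlying numeric vector is subset, which is roughly the state the str() output above shows.

```r
library(lubridate)

# Two made-up intervals, both 14 days long (illustrative data only)
iv <- interval(ymd(c("2017-01-01", "2017-08-02")),
               ymd(c("2017-01-15", "2017-08-16")))

# Proper subsetting: the S4 `[` method keeps .Data and @start aligned
good <- iv[2]
length(good@start)   # one start per remaining element
int_start(good)

# Mimic the bug: subset only the underlying numeric vector by hand,
# leaving @start untouched (an assumption about what filter() effectively did)
bad <- iv
bad@.Data <- iv@.Data[2]
length(bad@start)    # still one start per ORIGINAL element
bad@start[1]         # the first row's start, which is what gets displayed
```

Because both intervals have the same length, the broken object still looks plausible, just anchored to the wrong start date.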
Thanks @yutannihilation!
Since this seems to be a much more pervasive issue (and one that does not appear to admit of a simple solution, given how long it's been since some of these issues were posted), perhaps we could just (for now) get a warning based on the class of the columns?
Maybe something like this would work for my current problem?
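filter <- function(.data, ...) {
# warn if any column is a lubridate Interval, which filter() subsets incorrectly
has_interval <- any(vapply(.data, inherits, logical(1), what = "Interval"))
if (has_interval) {
warning("S4 Interval class not currently supported inside filter(). Results may not be accurate.")
}
# delegate to dplyr explicitly so dispatch does not depend on dplyr being attached
dplyr::filter(.data, ...)
}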
Thanks!
Maybe something like this would work for my current problem?
I think so. Another idea is to add a row ID column and then subset the original data frame by the row IDs that survive in the result. My code is probably not great, though; SO is a good place to ask for a better version :)
library(magrittr)  # provides %>%
myfilter <- function(d, ...) {
preds <- rlang::quos(...)
result <- d %>%
# add row IDs to distinguish rows in the result
tibble::rowid_to_column(var = "rowid") %>%
dplyr::filter(!!! preds)
# overwrite S4 cols by data properly subsetted by `[`
cols_S4 <- colnames(result)[purrr::map_lgl(result, ~ isS4(x = .))]
result[, cols_S4] <- d[result$rowid, cols_S4]
# remove row ID column
dplyr::select(result, -rowid)
}
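A simpler variant of the same idea, as a rough sketch: filter while the data frame still holds only plain columns, and build the interval afterwards, so filter() never has to subset an S4 column. The toy data below is made up (only the column names and the id 7000 row mirror the example above), and it assumes dplyr and lubridate are installed.

```r
library(lubridate)

# Made-up stand-in for `dat`; only the column names follow the original example
dat_toy <- data.frame(
  id          = c(1L, 7000L),
  start_combo = c(20161011, 20170802),
  stop_combo  = c(20161025, 20170816)
)

# Filter first, while every column is a plain atomic vector ...
kept <- dplyr::filter(dat_toy, id == 7000)

# ... then construct the interval from the rows that survived
kept <- dplyr::mutate(
  kept,
  interval = interval(ymd(start_combo), ymd(stop_combo))
)

int_start(kept$interval)
int_end(kept$interval)
```

This only helps when the interval can be rebuilt from other columns; if the interval is the only copy of the data, the row ID approach above is still needed.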
Would the tidyverse team entertain a pure R solution to this? I believe the filter method is written in C - correct?
Ah, in case you took the code above as a "solution": it isn't. I meant it as a temporary workaround until the vctrs package does the right thing.
Thanks. Do you think #2432 would fix this issue as well?
I'm not quite sure how #2432 will be implemented, but I think so. The root cause is the lack of a good way to dispatch the correct method for non-base types, which is the same root cause as the other issues listed there.
I'll close this now as a duplicate of #2432
At the moment, we have a workaround in place to essentially refuse to deal with interval objects.
> dplyr::filter(dat_mod, id == 7000)
Error in filter_impl(.data, quo) :
Column `interval` classes Period and Interval from lubridate are currently not supported.
This is of course temporary until we deal with #2432, but at least this is less surprising.