base::ifelse() is notorious slow and will change the attributes somehow.
ifelse(TRUE, as.Date('2019-06-25'), as.Date('2019-06-25'))
# [1] 18072
dplyr implements dplyr::if_else() which I think is good. However, I always reluctant to use dplyr because of it's heavy dependencies.
So I'm proposing a feature request here... I think both data.table::if_else() and data.table::iif() are good name candidates (though the former conflicts with dplyr::if_else())
alternatively fifelse
For comparison, hutils::if_else
from the description:
Lightweight
dplyr::if_elsewith the virtues and vices that come from such an approach.
I have shared with Matt a prototype implemented in C .
It has good performance without using openmp. (Up to x10 faster than base::ifelse and dplyr::if_else)
I also have a version that uses openmp which is useful when inputs are large.
I have tried to stick to the philosophy of the data.table package as much as I am aware of it.
The prototype is compatible with R 3.1.0 and has no dependency.
Now, base::ifelse and dplyr::if_else have different behaviours.
I am trying to list all the use cases of a "fifelse" function and check which behaviours is wanted and which is not.Below you will find some examples.
Example 1: (base::ifelse does not keep dates format)
dates <- as.Date(c("2011-01-01","2011-01-02","2011-01-03","2011-01-04","2011-01-05"))
dplyr::if_else(dates == "2011-01-01", dates - 1, dates)
#[1] "2010-12-31" "2011-01-02" "2011-01-03" "2011-01-04" "2011-01-05"
base::ifelse(dates == "2011-01-01", dates - 1, dates)
#[1] 14974 14976 14977 14978 14979
Which behaviour do we want? My prototype currently outputs the same as dplyr which in
my view makes more sense but I can add a flag to have the behaviour of base::ifelse as well?
Example 2: (base::ifelse does not keep factor class)
v <- factor(sample(letters[1:5], 10, replace = TRUE))
dplyr::if_else(v %in% c("a","b","c"), v, factor(NA))
#[1] <NA> c b <NA> <NA> b c c c <NA>
# Levels: b c d e
base::ifelse(v %in% c("a","b","c"), v, factor(NA))
#[1] NA 2 1 NA NA 1 2 2 2 NA
Again, dplyr::if_else makes more sense to me but nothing prevents us from having
R base behaviour with a flag?
Example 3:
z <- factor(1:3, labels = letters[1:3])
dplyr::if_else(z == "a", z[[2]], list(z))
#Error: `false` must be a `factor` object, not a list
#Call `rlang::last_error()` to see a backtrace
base::ifelse(z == "a", z[[2]], list(z))
#[[1]]
#[1] 2
#
#[[2]]
#[1] a b c
#Levels: a b c
#
#[[3]]
#[1] a b c
#Levels: a b c
In the above example, dplyr::if_else returns an error however base::ifelse returns a list of integer and factors which I do not believe is wright. I am not sure what makes more sense from the below options? (please feel free to suggest something)
_Option 1:_( equivalent to dplyr::if_else(z == "a", list(z[[2]]), list(z)) )
#[[1]]
#[1] b
#Levels: a b c
#
#[[2]]
#[1] a b c
#Levels: a b c
#
#[[3]]
#[1] a b c
#Levels: a b c
_Option 2:_
#[[1]]
#[1] 2
#
#[[2]]
#[1] 1 2 3
#
#[[3]]
#[1] 1 2 3
_Option 3:_
#[[1]]
#[1] b
#Levels: a b c
#
#[[2]]
#[1] b
#Levels: a b c
#
#[[3]]
#[1] c
#Levels: a b c
_Option 4:_
#[[1]]
#[1] 2
#
#[[2]]
#[1] 2
#
#[[3]]
#[1] 3
Option 1 can be turned into option 2 with a flag, so does option 3 with 4.
Or do we want all 4 options?
Example 4:
The below one is a mystery to me and I did not investigate why it has this behaviour...
(dplyr::if_else returns an error)
base::ifelse(c(-5L:5L) < 0, list(1), list(2, 3))
# [[1]]
# [1] 1
#
# [[2]]
# [1] 1
#
# [[3]]
# [1] 1
#
# [[4]]
# [1] 1
#
# [[5]]
# [1] 1
#
# [[6]]
# [1] 3
#
# [[7]]
# [1] 2
#
# [[8]]
# [1] 3
#
# [[9]]
# [1] 2
#
# [[10]]
# [1] 3
#
# [[11]]
# [1] 2
Please let me know your thoughts.
Example 1 should definitely retain Date class. Ditto Example 2, but there's an extremely thorny issue of when !identical(levels(true_vals), levels(false_vals)). We dodged this in coalesce by forcing the levels of all replacement options to be identical, but this is _slightly_ restrictive and _slightly_ inconvenient.
Not to mention that, in a grouping context we have to be extra careful...
Example 3 looks similar?
Example 4 also doesn't make sense to me.
I'm not sure how much attention we should devote to handling list output just yet, in any case.
Please proceed to file a PR -- that way all members could have a glance at the code and help move things forward from there.
Related: #673 and #2677 (of which this current issue is ~nearly~ a duplicate)
Please note also the discussion in #2677 about possibly patching ifelse into base R and the related discussion on r-devel about why _not_ to do that so hastily, it should help inform the discussion of edge cases for any data.table implementation
Thank you for these useful links. It shows the need for a faster and more robust implementation. I will study them and share my code. Please be patient as I am working on this in my spare time. Thank you.
You may also want consider vctrs::if_else(), https://vctrs.r-lib.org/articles/stability.html#ifelse, specifically for the thinking about what types ifelse "should" return.
Give that a try and let me know your thoughts: devtools::install_github("2005m/data.table",ref = "fifelse")
FYI I am not using openmp (in this version).
Critics, feedback and recommendations are welcome.
Please find below some benchmarks.
@hadley , sorry I did not look at vctrs::if_else, but feel free to provide benchmarks...
x <- c(-5e4L:5e4L) < 0
len <- length(x)
microbenchmark::microbenchmark(
data.table::fifelse(x, 1L, 0L),
ifelse(x, 1L, 0L),
dplyr::if_else(x, 1L, 0L),
hutils::if_else(x, 1L, 0L),
times = 100L
)
# Unit: microseconds
# expr min lq mean median uq max neval
# data.table::fifelse(x, 1L, 0L) 151.380 190.0810 376.5134 264.4880 345.0960 7378.278 100
# ifelse(x, 1L, 0L) 2271.984 2340.8315 3499.8378 3129.3760 3502.9080 18044.582 100
# dplyr::if_else(x, 1L, 0L) 2422.935 2460.3535 4257.2123 3459.5040 4307.9165 25538.319 100
# hutils::if_else(x, 1L, 0L) 610.652 643.3655 838.1112 729.7465 925.8135 1919.191 100
microbenchmark::microbenchmark(
data.table::fifelse(x, rep(1,len), rep(0,len)),
ifelse(x, rep(1,len), rep(0,len)),
dplyr::if_else(x, rep(1,len), rep(0,len)),
hutils::if_else(x, rep(1,len), rep(0,len)),
times = 100L
)
# Unit: microseconds
# expr min lq mean median uq max neval
# data.table::fifelse(x, rep(1, len), rep(0, len)) 668.382 710.2895 1336.228 746.210 1281.813 14152.75 100
# ifelse(x, rep(1, len), rep(0, len)) 2812.932 2841.3690 4172.893 2892.256 5575.617 19014.44 100
# dplyr::if_else(x, rep(1, len), rep(0, len)) 3899.104 3930.7490 6652.899 3973.726 7177.721 82113.81 100
# hutils::if_else(x, rep(1, len), rep(0, len)) 1364.559 1392.1410 1942.565 1419.723 1830.886 5875.17 100
microbenchmark::microbenchmark(
data.table::fifelse(x, TRUE, FALSE),
ifelse(x, TRUE, FALSE),
dplyr::if_else(x, TRUE, FALSE),
hutils::if_else(x, TRUE, FALSE),
times = 100L
)
# Unit: microseconds
# expr min lq mean median uq max neval
# data.table::fifelse(x, TRUE, FALSE) 150.953 164.6365 327.2636 195.6395 337.826 4358.804 100
# ifelse(x, TRUE, FALSE) 1945.704 1980.7695 2669.5270 2010.2765 2760.974 19098.682 100
# dplyr::if_else(x, TRUE, FALSE) 2425.502 2444.3170 4363.1057 2475.7480 4892.696 69946.111 100
# hutils::if_else(x, TRUE, FALSE) 606.803 639.7305 829.6400 658.9740 934.152 2741.091 100
microbenchmark::microbenchmark(
data.table::fifelse(x, "TRUE", "FALSE"),
ifelse(x, "TRUE", "FALSE"),
dplyr::if_else(x, "TRUE", "FALSE"),
hutils::if_else(x, "TRUE", "FALSE"),
times = 100L
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# data.table::fifelse(x, "TRUE", "FALSE") 1.038706 1.071206 1.205268 1.089809 1.147752 5.927340 100
# ifelse(x, "TRUE", "FALSE") 36.581362 37.241832 41.123822 37.802878 39.733828 74.862540 100
# dplyr::if_else(x, "TRUE", "FALSE") 5.152052 5.236294 7.133082 5.385750 6.709898 76.003022 100
# hutils::if_else(x, "TRUE", "FALSE") 1.286303 1.323079 1.574507 1.345315 1.459065 5.457805 100
dates <- as.Date(rep(c("2011-01-01","2011-01-02","2011-01-03",
"2011-01-04","2011-01-05"),1e4))
# WARNING: base ifelse does not produce expected result below
microbenchmark::microbenchmark(
data.table::fifelse(dates == "2011-01-01", dates - 1, dates),
ifelse(dates == "2011-01-01", dates - 1, dates),
dplyr::if_else(dates == "2011-01-01", dates - 1, dates),
hutils::if_else(dates == "2011-01-01", dates - 1, dates),
times = 100L
)
# Unit: microseconds
# expr min lq mean median uq max neval
# data.table::fifelse(dates == "2011-01-01", dates - 1, dates) 785.552 840.5015 2452.754 882.837 912.7705 68726.52 100
# ifelse(dates == "2011-01-01", dates - 1, dates) 2696.617 2728.4755 3849.316 2764.609 2801.3855 82019.74 100
# dplyr::if_else(dates == "2011-01-01", dates - 1, dates) 2686.782 2722.2745 3870.595 2772.307 2845.2175 69898.22 100
# hutils::if_else(dates == "2011-01-01", dates - 1, dates) 1113.114 1140.4815 4477.248 1187.521 1204.4120 180626.27 100
v <- rep(factor(sample(letters[1:5], 10, replace = TRUE)),1e4)
# WARNING: base ifelse does not produce expected result below
# hutils::if_else excluded as Error
microbenchmark::microbenchmark(
data.table::fifelse(v %in% c("a","b","c"), v, factor(NA)),
ifelse(v %in% c("a","b","c"), v, factor(NA)),
dplyr::if_else(v %in% c("a","b","c"), v, factor(NA)),
times = 100L
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# data.table::fifelse(v %in% c("a", "b", "c"), v, factor(NA)) 7.087920 7.130255 7.752777 7.151637 7.294678 12.50852 100
# ifelse(v %in% c("a", "b", "c"), v, factor(NA)) 7.214926 7.250846 7.852812 7.294037 7.517257 11.98126 100
# dplyr::if_else(v %in% c("a", "b", "c"), v, factor(NA)) 9.196122 9.269673 13.034460 9.358833 12.553208 88.21562 100
# hutils::if_else excluded as Error
microbenchmark::microbenchmark(
data.table::fifelse(x, list(1),list(2)),
ifelse(x, list(1),list(2)),
dplyr::if_else(x, list(1),list(2)),
times = 100L
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# data.table::fifelse(x, list(1), list(2)) 2.632473 2.724840 4.616436 2.868095 3.658351 40.04899 100
# ifelse(x, list(1), list(2)) 6.379341 6.474061 7.311804 6.523452 7.431518 12.85960 100
# dplyr::if_else(x, list(1), list(2)) 5.710961 5.763131 7.263085 5.784512 6.538633 28.54668 100
# hutils::if_else excluded as Error
microbenchmark::microbenchmark(
data.table::fifelse(x, list(FALSE),list(rep(TRUE,len))),
ifelse(x, list(FALSE),list(rep(TRUE,len))),
dplyr::if_else(x, list(FALSE),list(rep(TRUE,len))),
times = 100L
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# data.table::fifelse(x, list(FALSE), list(rep(TRUE, len))) 1.826824 1.853978 2.149058 1.865310 1.909784 7.311997 100
# ifelse(x, list(FALSE), list(rep(TRUE, len))) 6.612826 6.678039 7.626645 6.700062 7.266668 18.867336 100
# dplyr::if_else(x, list(FALSE), list(rep(TRUE, len))) 5.928195 5.988063 7.035108 6.031895 7.762929 15.979142 100
df <- do.call(rbind, replicate(2000, mtcars, simplify=FALSE))
microbenchmark::microbenchmark(
data.table::fifelse(df$mpg < 19.5, TRUE, FALSE),
ifelse(df$mpg < 19.5, TRUE, FALSE),
dplyr::if_else(df$mpg < 19.5, TRUE, FALSE),
hutils::if_else(df$mpg < 19.5, TRUE, FALSE),
times = 100L
)
# Unit: microseconds
# expr min lq mean median uq max neval
# data.table::fifelse(df$mpg < 19.5, TRUE, FALSE) 506.311 526.6235 933.7458 555.2740 685.7005 16243.417 100
# ifelse(df$mpg < 19.5, TRUE, FALSE) 1639.523 1667.3180 2232.3723 1682.7135 2004.9305 19018.289 100
# dplyr::if_else(df$mpg < 19.5, TRUE, FALSE) 2008.993 2035.9335 3488.2319 2069.7155 2824.0500 81675.495 100
# hutils::if_else(df$mpg < 19.5, TRUE, FALSE) 811.210 826.1760 995.5208 854.1855 892.6725 6705.621 100
microbenchmark::microbenchmark(
data.table::fifelse(df$mpg < 19.5, df$cyl, df$drat),
ifelse(df$mpg < 19.5, df$cyl, df$drat),
dplyr::if_else(df$mpg < 19.5, df$cyl, df$drat),
hutils::if_else(df$mpg < 19.5, df$cyl, df$drat),
times = 100L
)
# Unit: microseconds
# expr min lq mean median uq max neval
# data.table::fifelse(df$mpg < 19.5, df$cyl, df$drat) 569.172 674.796 893.9038 699.385 750.2725 5530.930 100
# ifelse(df$mpg < 19.5, df$cyl, df$drat) 1976.921 2024.815 2736.2667 2041.279 2115.4720 16723.641 100
# dplyr::if_else(df$mpg < 19.5, df$cyl, df$drat) 2779.577 2841.155 4848.6506 2896.533 4860.1970 76190.750 100
# hutils::if_else(df$mpg < 19.5, df$cyl, df$drat) 1078.904 1146.041 1300.5216 1183.887 1206.3365 5951.715 100
Looks interesting.
Maybe we could generalise that into fcase, then fifelse is just a special case of fcase, same internals could handle both cases.
@2005m when opening PR, ideally do from a branch in Rdatatable org, so we all can easily push into it.
@jangorecki good point, was also thinking of how an API for CASE WHEN might look like in data.table, as IF in SQL is AFAIK usually just implemented as a wrapper around CASE.
I guess @mattdowle would have to grant branching access for the branch to be done within Rdatatable.
@jangorecki good point, was also thinking of how an API for CASE WHEN might look like in data.table, as IF in SQL is AFAIK usually just implemented as a wrapper around CASE
Currently, the data.table way is chaining [i, j with :=] for each case. This is actually a pretty good user-friendly interface, though I do not know about the potential impacts on performance.
Closed by #3678 (forgot to use 'closes' prefix in PR comment)
Most helpful comment
alternatively
fifelse