data.table::if_else() or data.table::iif()

Created on 25 Jun 2019 · 12Comments · Source: Rdatatable/data.table

base::ifelse() is notorious slow and will change the attributes somehow.

ifelse(TRUE, as.Date('2019-06-25'), as.Date('2019-06-25'))
# [1] 18072

dplyr implements dplyr::if_else() which I think is good. However, I always reluctant to use dplyr because of it's heavy dependencies.

So I'm proposing a feature request here... I think both data.table::if_else() and data.table::iif() are good name candidates (though the former conflicts with dplyr::if_else())

feature request

Source

shrektan

👍4

Most helpful comment

alternatively fifelse

jangorecki on 26 Jun 2019

👍5

All 12 comments

alternatively fifelse

jangorecki on 26 Jun 2019

👍5

For comparison, hutils::if_else

from the description:

Lightweight dplyr::if_else with the virtues and vices that come from such an approach.

HughParsonage on 29 Jun 2019

I have shared with Matt a prototype implemented in C .
It has good performance without using openmp. (Up to x10 faster than base::ifelse and dplyr::if_else)
I also have a version that uses openmp which is useful when inputs are large.
I have tried to stick to the philosophy of the data.table package as much as I am aware of it.
The prototype is compatible with R 3.1.0 and has no dependency.

Now, base::ifelse and dplyr::if_else have different behaviours.
I am trying to list all the use cases of a "fifelse" function and check which behaviours is wanted and which is not.Below you will find some examples.

Example 1: (base::ifelse does not keep dates format)

dates <- as.Date(c("2011-01-01","2011-01-02","2011-01-03","2011-01-04","2011-01-05"))

dplyr::if_else(dates == "2011-01-01", dates - 1, dates)
#[1] "2010-12-31" "2011-01-02" "2011-01-03" "2011-01-04" "2011-01-05"

base::ifelse(dates == "2011-01-01", dates - 1, dates)
#[1] 14974 14976 14977 14978 14979

Which behaviour do we want? My prototype currently outputs the same as dplyr which in
my view makes more sense but I can add a flag to have the behaviour of base::ifelse as well?

Example 2: (base::ifelse does not keep factor class)

v <- factor(sample(letters[1:5], 10, replace = TRUE))

dplyr::if_else(v %in% c("a","b","c"), v, factor(NA))
#[1] <NA> c    b    <NA> <NA> b    c    c    c    <NA>
# Levels: b c d e

base::ifelse(v %in% c("a","b","c"), v, factor(NA))
#[1] NA  2  1 NA NA  1  2  2  2 NA

Again, dplyr::if_else makes more sense to me but nothing prevents us from having
R base behaviour with a flag?

Example 3:

z <- factor(1:3, labels = letters[1:3])

dplyr::if_else(z == "a", z[[2]], list(z))
#Error: `false` must be a `factor` object, not a list
#Call `rlang::last_error()` to see a backtrace

base::ifelse(z == "a", z[[2]], list(z))
#[[1]]
#[1] 2
# 
#[[2]]
#[1] a b c
#Levels: a b c
# 
#[[3]]
#[1] a b c
#Levels: a b c

In the above example, dplyr::if_else returns an error however base::ifelse returns a list of integer and factors which I do not believe is wright. I am not sure what makes more sense from the below options? (please feel free to suggest something)

_Option 1:_( equivalent to dplyr::if_else(z == "a", list(z[[2]]), list(z)) )

#[[1]]
#[1] b
#Levels: a b c
# 
#[[2]]
#[1] a b c
#Levels: a b c
# 
#[[3]]
#[1] a b c
#Levels: a b c

_Option 2:_

#[[1]]
#[1] 2
# 
#[[2]]
#[1] 1 2 3
# 
#[[3]]
#[1] 1 2 3

_Option 3:_

#[[1]]
#[1] b
#Levels: a b c
# 
#[[2]]
#[1] b
#Levels: a b c
# 
#[[3]]
#[1] c
#Levels: a b c

_Option 4:_

#[[1]]
#[1] 2
# 
#[[2]]
#[1] 2
# 
#[[3]]
#[1] 3

Option 1 can be turned into option 2 with a flag, so does option 3 with 4.
Or do we want all 4 options?

Example 4:
The below one is a mystery to me and I did not investigate why it has this behaviour...
(dplyr::if_else returns an error)

base::ifelse(c(-5L:5L) < 0, list(1), list(2, 3))
# [[1]]
# [1] 1
# 
# [[2]]
# [1] 1
# 
# [[3]]
# [1] 1
# 
# [[4]]
# [1] 1
# 
# [[5]]
# [1] 1
# 
# [[6]]
# [1] 3
# 
# [[7]]
# [1] 2
# 
# [[8]]
# [1] 3
# 
# [[9]]
# [1] 2
# 
# [[10]]
# [1] 3
# 
# [[11]]
# [1] 2

Please let me know your thoughts.

2005m on 1 Jul 2019

👍1

Example 1 should definitely retain Date class. Ditto Example 2, but there's an extremely thorny issue of when !identical(levels(true_vals), levels(false_vals)). We dodged this in coalesce by forcing the levels of all replacement options to be identical, but this is _slightly_ restrictive and _slightly_ inconvenient.

Not to mention that, in a grouping context we have to be extra careful...

Example 3 looks similar?

Example 4 also doesn't make sense to me.

I'm not sure how much attention we should devote to handling list output just yet, in any case.

Please proceed to file a PR -- that way all members could have a glance at the code and help move things forward from there.

MichaelChirico on 1 Jul 2019

Related: #673 and #2677 (of which this current issue is ~nearly~ a duplicate)

Please note also the discussion in #2677 about possibly patching ifelse into base R and the related discussion on r-devel about why _not_ to do that so hastily, it should help inform the discussion of edge cases for any data.table implementation

MichaelChirico on 1 Jul 2019

Thank you for these useful links. It shows the need for a faster and more robust implementation. I will study them and share my code. Please be patient as I am working on this in my spare time. Thank you.

2005m on 1 Jul 2019

You may also want consider vctrs::if_else(), https://vctrs.r-lib.org/articles/stability.html#ifelse, specifically for the thinking about what types ifelse "should" return.

hadley on 1 Jul 2019

Give that a try and let me know your thoughts: devtools::install_github("2005m/data.table",ref = "fifelse")
FYI I am not using openmp (in this version).
Critics, feedback and recommendations are welcome.
Please find below some benchmarks.
@hadley , sorry I did not look at vctrs::if_else, but feel free to provide benchmarks...

x   <- c(-5e4L:5e4L) < 0
len <- length(x)

microbenchmark::microbenchmark(
  data.table::fifelse(x, 1L, 0L),
  ifelse(x, 1L, 0L),
  dplyr::if_else(x, 1L, 0L),
  hutils::if_else(x, 1L, 0L),
  times = 100L
)
# Unit: microseconds
#                            expr      min        lq      mean    median        uq       max neval
#  data.table::fifelse(x, 1L, 0L)  151.380  190.0810  376.5134  264.4880  345.0960  7378.278   100
#               ifelse(x, 1L, 0L) 2271.984 2340.8315 3499.8378 3129.3760 3502.9080 18044.582   100
#       dplyr::if_else(x, 1L, 0L) 2422.935 2460.3535 4257.2123 3459.5040 4307.9165 25538.319   100
#      hutils::if_else(x, 1L, 0L)  610.652  643.3655  838.1112  729.7465  925.8135  1919.191   100

microbenchmark::microbenchmark(
  data.table::fifelse(x, rep(1,len), rep(0,len)),
  ifelse(x, rep(1,len), rep(0,len)),
  dplyr::if_else(x, rep(1,len), rep(0,len)),
  hutils::if_else(x, rep(1,len), rep(0,len)),
  times = 100L
)
# Unit: microseconds
#                                              expr      min        lq     mean   median       uq      max neval
#  data.table::fifelse(x, rep(1, len), rep(0, len))  668.382  710.2895 1336.228  746.210 1281.813 14152.75   100
#               ifelse(x, rep(1, len), rep(0, len)) 2812.932 2841.3690 4172.893 2892.256 5575.617 19014.44   100
#       dplyr::if_else(x, rep(1, len), rep(0, len)) 3899.104 3930.7490 6652.899 3973.726 7177.721 82113.81   100
#      hutils::if_else(x, rep(1, len), rep(0, len)) 1364.559 1392.1410 1942.565 1419.723 1830.886  5875.17   100

microbenchmark::microbenchmark(
  data.table::fifelse(x, TRUE, FALSE),
  ifelse(x, TRUE, FALSE),
  dplyr::if_else(x, TRUE, FALSE),
  hutils::if_else(x, TRUE, FALSE),
  times = 100L
)
# Unit: microseconds
#                                 expr      min        lq      mean    median       uq       max neval
#  data.table::fifelse(x, TRUE, FALSE)  150.953  164.6365  327.2636  195.6395  337.826  4358.804   100
#               ifelse(x, TRUE, FALSE) 1945.704 1980.7695 2669.5270 2010.2765 2760.974 19098.682   100
#       dplyr::if_else(x, TRUE, FALSE) 2425.502 2444.3170 4363.1057 2475.7480 4892.696 69946.111   100
#      hutils::if_else(x, TRUE, FALSE)  606.803  639.7305  829.6400  658.9740  934.152  2741.091   100

microbenchmark::microbenchmark(
  data.table::fifelse(x, "TRUE", "FALSE"),
  ifelse(x, "TRUE", "FALSE"),
  dplyr::if_else(x, "TRUE", "FALSE"),
  hutils::if_else(x, "TRUE", "FALSE"),
  times = 100L
)
# Unit: milliseconds
#                                     expr       min        lq      mean    median        uq       max neval
#  data.table::fifelse(x, "TRUE", "FALSE")  1.038706  1.071206  1.205268  1.089809  1.147752  5.927340   100
#               ifelse(x, "TRUE", "FALSE") 36.581362 37.241832 41.123822 37.802878 39.733828 74.862540   100
#       dplyr::if_else(x, "TRUE", "FALSE")  5.152052  5.236294  7.133082  5.385750  6.709898 76.003022   100
#      hutils::if_else(x, "TRUE", "FALSE")  1.286303  1.323079  1.574507  1.345315  1.459065  5.457805   100

dates <- as.Date(rep(c("2011-01-01","2011-01-02","2011-01-03",
                       "2011-01-04","2011-01-05"),1e4))

# WARNING: base ifelse does not produce expected result below
microbenchmark::microbenchmark(
  data.table::fifelse(dates == "2011-01-01", dates - 1, dates),
  ifelse(dates == "2011-01-01", dates - 1, dates),
  dplyr::if_else(dates == "2011-01-01", dates - 1, dates),
  hutils::if_else(dates == "2011-01-01", dates - 1, dates),
  times = 100L
)
# Unit: microseconds
#                                                          expr      min        lq     mean   median        uq       max neval
#  data.table::fifelse(dates == "2011-01-01", dates - 1, dates)  785.552  840.5015 2452.754  882.837  912.7705  68726.52   100
#               ifelse(dates == "2011-01-01", dates - 1, dates) 2696.617 2728.4755 3849.316 2764.609 2801.3855  82019.74   100
#       dplyr::if_else(dates == "2011-01-01", dates - 1, dates) 2686.782 2722.2745 3870.595 2772.307 2845.2175  69898.22   100
#      hutils::if_else(dates == "2011-01-01", dates - 1, dates) 1113.114 1140.4815 4477.248 1187.521 1204.4120 180626.27   100

v <- rep(factor(sample(letters[1:5], 10, replace = TRUE)),1e4)

# WARNING: base ifelse does not produce expected result below
# hutils::if_else excluded as Error
microbenchmark::microbenchmark(
  data.table::fifelse(v %in% c("a","b","c"), v, factor(NA)),
  ifelse(v %in% c("a","b","c"), v, factor(NA)),
  dplyr::if_else(v %in% c("a","b","c"), v, factor(NA)),
  times = 100L
)
# Unit: milliseconds
#                                                         expr      min       lq      mean   median        uq      max neval
#  data.table::fifelse(v %in% c("a", "b", "c"), v, factor(NA)) 7.087920 7.130255  7.752777 7.151637  7.294678 12.50852   100
#               ifelse(v %in% c("a", "b", "c"), v, factor(NA)) 7.214926 7.250846  7.852812 7.294037  7.517257 11.98126   100
#       dplyr::if_else(v %in% c("a", "b", "c"), v, factor(NA)) 9.196122 9.269673 13.034460 9.358833 12.553208 88.21562   100

# hutils::if_else excluded as Error
microbenchmark::microbenchmark(
  data.table::fifelse(x, list(1),list(2)),
  ifelse(x, list(1),list(2)),
  dplyr::if_else(x, list(1),list(2)),
  times = 100L
)
# Unit: milliseconds
#                                      expr      min       lq     mean   median       uq      max neval
#  data.table::fifelse(x, list(1), list(2)) 2.632473 2.724840 4.616436 2.868095 3.658351 40.04899   100
#               ifelse(x, list(1), list(2)) 6.379341 6.474061 7.311804 6.523452 7.431518 12.85960   100
#       dplyr::if_else(x, list(1), list(2)) 5.710961 5.763131 7.263085 5.784512 6.538633 28.54668   100

# hutils::if_else excluded as Error
microbenchmark::microbenchmark(
  data.table::fifelse(x, list(FALSE),list(rep(TRUE,len))),
  ifelse(x, list(FALSE),list(rep(TRUE,len))),
  dplyr::if_else(x, list(FALSE),list(rep(TRUE,len))),
  times = 100L
)
# Unit: milliseconds
#                                                       expr      min       lq     mean   median       uq       max neval
#  data.table::fifelse(x, list(FALSE), list(rep(TRUE, len))) 1.826824 1.853978 2.149058 1.865310 1.909784  7.311997   100
#               ifelse(x, list(FALSE), list(rep(TRUE, len))) 6.612826 6.678039 7.626645 6.700062 7.266668 18.867336   100
#       dplyr::if_else(x, list(FALSE), list(rep(TRUE, len))) 5.928195 5.988063 7.035108 6.031895 7.762929 15.979142   100

df <- do.call(rbind, replicate(2000, mtcars, simplify=FALSE))

microbenchmark::microbenchmark(
  data.table::fifelse(df$mpg < 19.5, TRUE, FALSE),
  ifelse(df$mpg < 19.5, TRUE, FALSE),
  dplyr::if_else(df$mpg < 19.5, TRUE, FALSE),
  hutils::if_else(df$mpg < 19.5, TRUE, FALSE),
  times = 100L
)
# Unit: microseconds
#                                             expr      min        lq      mean    median        uq       max neval
#  data.table::fifelse(df$mpg < 19.5, TRUE, FALSE)  506.311  526.6235  933.7458  555.2740  685.7005 16243.417   100
#               ifelse(df$mpg < 19.5, TRUE, FALSE) 1639.523 1667.3180 2232.3723 1682.7135 2004.9305 19018.289   100
#       dplyr::if_else(df$mpg < 19.5, TRUE, FALSE) 2008.993 2035.9335 3488.2319 2069.7155 2824.0500 81675.495   100
#      hutils::if_else(df$mpg < 19.5, TRUE, FALSE)  811.210  826.1760  995.5208  854.1855  892.6725  6705.621   100

microbenchmark::microbenchmark(
  data.table::fifelse(df$mpg < 19.5, df$cyl, df$drat),
  ifelse(df$mpg < 19.5, df$cyl, df$drat),
  dplyr::if_else(df$mpg < 19.5, df$cyl, df$drat),
  hutils::if_else(df$mpg < 19.5, df$cyl, df$drat),
  times = 100L
)
# Unit: microseconds
#                                                 expr      min       lq      mean   median        uq       max neval
#  data.table::fifelse(df$mpg < 19.5, df$cyl, df$drat)  569.172  674.796  893.9038  699.385  750.2725  5530.930   100
#               ifelse(df$mpg < 19.5, df$cyl, df$drat) 1976.921 2024.815 2736.2667 2041.279 2115.4720 16723.641   100
#       dplyr::if_else(df$mpg < 19.5, df$cyl, df$drat) 2779.577 2841.155 4848.6506 2896.533 4860.1970 76190.750   100
#      hutils::if_else(df$mpg < 19.5, df$cyl, df$drat) 1078.904 1146.041 1300.5216 1183.887 1206.3365  5951.715   100

2005m on 2 Jul 2019

Looks interesting.
Maybe we could generalise that into fcase, then fifelse is just a special case of fcase, same internals could handle both cases.
@2005m when opening PR, ideally do from a branch in Rdatatable org, so we all can easily push into it.

jangorecki on 2 Jul 2019

@jangorecki good point, was also thinking of how an API for CASE WHEN might look like in data.table, as IF in SQL is AFAIK usually just implemented as a wrapper around CASE.

I guess @mattdowle would have to grant branching access for the branch to be done within Rdatatable.

MichaelChirico on 2 Jul 2019

@jangorecki good point, was also thinking of how an API for CASE WHEN might look like in data.table, as IF in SQL is AFAIK usually just implemented as a wrapper around CASE

Currently, the data.table way is chaining [i, j with :=] for each case. This is actually a pretty good user-friendly interface, though I do not know about the potential impacts on performance.