Not sure what the status of ifelse() speed in R is these days. Years ago it used to be really slow and I avoided it. Is that still the case?
I've seen this idiom quite a bit in the wild:
DT[, newCol := ifelse(existingCol == 42, 1L, 2L)]
Under verbose=TRUE there's a message about the plonk (the whole column being replaced in one go). It's slow and RAM-inefficient, I assume (I haven't tested).
Better would be using i, as per #2676:
DT[existingCol==42, newCol:=1L, fill=2L]
Nested ifelse(ifelse(ifelse())) is also quite common. We could either detect ifelse() usage, warn about it and point to a document on how to avoid it, or optimize it automatically. And if base ifelse() is still slow, look into why and try to improve it there.
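For reference, a minimal sketch of the kind of rewrite meant here, using only syntax that exists today (the fill= argument above is the #2676 proposal): assign the default first, then update only the matching subset in i, so no full-length temporary vectors get built.
library(data.table)
DT <- data.table(existingCol = sample(c(42L, 7L), 1e6, replace = TRUE))
# ifelse() idiom: builds full-length yes/no/result vectors, then plonks the whole column
DT[, newCol := ifelse(existingCol == 42L, 1L, 2L)]
# equivalent two-step update: default first, then overwrite the matching rows via i
DT[, newCol := 2L]
DT[existingCol == 42L, newCol := 1L]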
Just re-ran Ricardo's benchmark here and yes, it's still slow.
Isn't this just another use case for implementing more flexible nomatch (#857)?
The multiple-ifelse case should probably correspond to a join, IMO. Not sure if there's a concise way we can kludge that into existing [ code intuitively.
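One possible shape of that join, as a sketch (column names made up here): express the nested-ifelse mapping as a small lookup table and do an update join, with a plain := for the default.
library(data.table)
DT  <- data.table(code = sample(c("a", "b", "c", "z"), 1e6, replace = TRUE))
# nested-ifelse style:
#   DT[, label := ifelse(code == "a", "apple", ifelse(code == "b", "banana", "other"))]
# join style: put the mapping in a lookup table and update by reference
map <- data.table(code = c("a", "b"), label = c("apple", "banana"))
DT[, label := "other"]                    # default for non-matches
DT[map, label := i.label, on = "code"]    # update join overwrites the matches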
I wrote an alternative to dplyr::if_else (itself quite a bit faster than ifelse) that might be a more stringent benchmark.
library(microbenchmark)
library(magrittr)
library(ggplot2)  # autoplot() on microbenchmark results needs ggplot2
cnd <- sample(c(TRUE, FALSE), size = 100e3, replace = TRUE)
yes <- sample(letters, size = 100e3, replace = TRUE)
no  <- sample(letters, size = 100e3, replace = TRUE)
na  <- sample(letters, size = 100e3, replace = TRUE)
# my_check wasn't defined in the original post; a typical definition just
# checks that every expression returns an identical result
my_check <- function(values) all(sapply(values[-1], identical, values[[1]]))
microbenchmark(dplyr  = dplyr::if_else(cnd, yes, no, na),
               base   = ifelse(cnd, yes, no),
               hutils = hutils::if_else(cnd, yes, no, na),
               check  = my_check) %T>%
  print %>%
  autoplot

Unit: milliseconds
   expr       min        lq      mean    median        uq       max neval cld
  dplyr  6.202732  6.931278  7.584452  7.541011  8.322263 10.836334   100  b
   base 27.949490 28.777124 29.843440 29.234310 30.110884 47.372671   100   c
 hutils  1.816998  2.041676  2.192027  2.135944  2.211239  5.622667   100 a
Hugh - that's fantastic! Would you consider proposing your hutils::if_else() as a patch to base R?
Really? Haha, I only wrote it because I liked dplyr::if_else's safety but dplyr's compile times were too long to justify importing the whole package. I actually never expected it to be faster.
There are a few limitations -- namely, I'm not sure how it works on more exotic classes, e.g. if_else(matrix, list, list). If you think it's worth pursuing, I'll definitely offer it as a patch.
It's really, really worth pursuing -- yes. R is kinda famous for its slow ifelse() and that desperately needs to be fixed. But yes indeed -- to have any chance of being accepted it would have to be a complete drop-in replacement. That's the reason fread and fwrite haven't been proposed, for example; and in data.table's case, of course, it's because we wanted [ to work differently from base R's [.data.frame. But ifelse() speed is more like sorting ... it should just be fast in base. And how many of the 10,000+ CRAN packages use base::ifelse()? Probably quite a lot. Any automatic improvement there, with no code changes needed by package maintainers, would surely be welcomed, and it could even speed up CRAN testing.
Scale up the data size though so that the times are many seconds or minutes. As you know, I tend to dismiss any chart that is on a milliseconds scale.
One of the R manuals explains how to build R and run its tests. If you've already done that by the time of your very first communication, it has a much better chance.
Longer version with a 2 billion-length cnd etc.
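(The setup for these timings wasn't shown; presumably it's the earlier snippet scaled up, something like the hypothetical reconstruction below -- note the three input vectors alone occupy tens of GB of RAM.)
# hypothetical reconstruction of the long-vector inputs, not from the original post
N   <- 2e9
cnd <- rep(sample(c(TRUE, FALSE), 1e6, replace = TRUE), N / 1e6)
yes <- rep(sample(letters, 1e6, replace = TRUE), N / 1e6)
no  <- rep(sample(letters, 1e6, replace = TRUE), N / 1e6)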
> system.time(hutils::if_else(cnd, yes, no))
   user  system elapsed
  55.26    4.19   59.75
> system.time(hutils::if_else(cnd, yes, no))
   user  system elapsed
  55.47    4.07   59.55
> system.time(dplyr::if_else(cnd, yes, no))
   user  system elapsed
 105.97   52.11  159.17
> system.time(dplyr::if_else(cnd, yes, no))
   user  system elapsed
 109.05   59.56  168.74
> system.time(ifelse(cnd, yes, no))
   user  system elapsed
 467.92   51.15  520.66
While I never intended it to be faster when I wrote it, I'm confident hutils::if_else is always faster than dplyr::if_else, and I include that assertion as a unit test, skipping it on CRAN only because it takes a while to run (https://github.com/HughParsonage/hutils/blob/15e62bdf235dc6084f1006824a20a42fab12ce9c/tests/testthat/test_1-0-0_if_else.R#L149).
Similarly with hutils::coalesce vs dplyr::coalesce. (Again, nothing wrong or even slow with the original -- I just needed it in situations where compile time was important.)
Regarding inclusion in base R, a discussion seems to crop up on R-devel every so often (e.g. "ifelse() woes ... can we agree on an ifelse2()?"), but good points are raised there. Hadley's point is that even if we didn't care about backwards compatibility, the "correct" result of ifelse(c(TRUE, FALSE), factor(c("a", "b")), factor(c("c", "b"))) is unclear. That gives me a bit of pause about the implementation, but my current thought is to use hutils::if_else as the superstructure and, instead of raising an error when there is some type mismatch, fall back to base::ifelse, thus preserving backwards compatibility while gaining performance and a missing argument.
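To make the ambiguity concrete (a small illustration, not from the thread itself): base::ifelse drops the factor attributes and works on the underlying integer codes, which is part of why there is no obviously "correct" answer.
f_yes <- factor(c("a", "b"))
f_no  <- factor(c("c", "b"))
# returns bare integer codes rather than a factor or a character vector;
# whether codes, labels, or an error is "correct" is exactly the open question
ifelse(c(TRUE, FALSE), f_yes, f_no)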
We could optimize ifelse() calls into fifelse().
To reduce risk, we could "optimize" it to call both functions and compare the outputs; if they differ, ask the user to report it. Then, after some time, we could drop the extra overhead of the two calls and leave only fifelse.
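A rough sketch of that transition idea (the wrapper name is made up here, and a real comparison would have to tolerate the attribute-dropping differences between the two functions):
ifelse_checked <- function(test, yes, no) {
  fast <- data.table::fifelse(test, yes, no)
  slow <- base::ifelse(test, yes, no)
  # during the transition, flag any disagreement for a bug report;
  # identical() is stricter than a production check would need to be
  if (!identical(fast, slow))
    warning("fifelse() and ifelse() results differ; please report to data.table")
  fast
}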