Not sure what the status of ifelse() speed in R is these days. Years ago it used to be really slow and I avoided it. Is that still the case?
I've seen this idiom quite a bit in the wild:
DT[, newCol := ifelse(existingCol == 42, 1L, 2L)]
Under verbose=TRUE there's a message about the plonk (the whole column being replaced in one go). It's slow and RAM-inefficient, I assume (I haven't tested).
Better would be using i, as per #2676:
DT[existingCol==42, newCol:=1L, fill=2L]
Nested ifelse(ifelse(ifelse())) is also quite common. We could either detect ifelse() usage, warn about it and point to a document on how to avoid it, or optimize it automatically. And if base ifelse() is still slow, look into why and try to improve it there.
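For reference, a minimal sketch of the kind of rewrite meant here, using only syntax that exists today (the fill= argument above is the #2676 proposal): assign the default first, then update only the matching subset in i, so no full-length temporary vectors get built.
library(data.table)
DT <- data.table(existingCol = sample(c(42L, 7L), 1e6, replace = TRUE))
# ifelse() idiom: builds full-length yes/no/result vectors, then plonks the whole column
DT[, newCol := ifelse(existingCol == 42L, 1L, 2L)]
# equivalent two-step update: default first, then overwrite the matching rows via i
DT[, newCol := 2L]
DT[existingCol == 42L, newCol := 1L]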
Just re-ran Ricardo's benchmark here and yes, it's still slow.
Isn't this just another use case for implementing more flexible nomatch (#857)?
The multiple-ifelse case should probably correspond to a join, IMO. Not sure if there's a concise way we can kludge that into existing [ code intuitively.
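One possible shape of that join, as a sketch (column names made up here): express the nested-ifelse mapping as a small lookup table and do an update join, with a plain := for the default.
library(data.table)
DT  <- data.table(code = sample(c("a", "b", "c", "z"), 1e6, replace = TRUE))
# nested-ifelse style:
#   DT[, label := ifelse(code == "a", "apple", ifelse(code == "b", "banana", "other"))]
# join style: put the mapping in a lookup table and update by reference
map <- data.table(code = c("a", "b"), label = c("apple", "banana"))
DT[, label := "other"]                    # default for non-matches
DT[map, label := i.label, on = "code"]    # update join overwrites the matches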
I wrote an alternative to dplyr::if_else (itself quite a bit faster than ifelse) that might be a more stringent benchmark.
library(microbenchmark)
library(magrittr)
library(ggplot2)  # autoplot() on microbenchmark results needs ggplot2
cnd <- sample(c(TRUE, FALSE), size = 100e3, replace = TRUE)
yes <- sample(letters, size = 100e3, replace = TRUE)
no  <- sample(letters, size = 100e3, replace = TRUE)
na  <- sample(letters, size = 100e3, replace = TRUE)
# my_check wasn't defined in the original post; a typical definition just
# checks that every expression returns an identical result
my_check <- function(values) all(sapply(values[-1], identical, values[[1]]))
microbenchmark(dplyr  = dplyr::if_else(cnd, yes, no, na),
               base   = ifelse(cnd, yes, no),
               hutils = hutils::if_else(cnd, yes, no, na),
               check  = my_check) %T>%
  print %>%
  autoplot

Unit: milliseconds
   expr       min        lq      mean    median        uq       max neval cld
  dplyr  6.202732  6.931278  7.584452  7.541011  8.322263 10.836334   100  b
   base 27.949490 28.777124 29.843440 29.234310 30.110884 47.372671   100   c
 hutils  1.816998  2.041676  2.192027  2.135944  2.211239  5.622667   100 a
Hugh - that's fantastic! Would you consider proposing your hutils::if_else() as a patch to base R?
Really? Haha, I only wrote it because I liked dplyr::if_else's safety but dplyr's compile times were too long to justify importing the whole package. I actually never expected it to be faster.
There are a few limitations -- namely, I'm not sure how it works on more exotic classes, e.g. if_else(matrix, list, list). If you think it's worth pursuing, I'll definitely offer it as a patch.
It's really, really worth pursuing -- yes. R is kinda famous for its slow ifelse() and that desperately needs to be fixed. But yes indeed -- to have any chance of being accepted it would have to be a complete drop-in replacement. That's the reason fread and fwrite haven't been proposed, for example; and in data.table's case, of course, it's because we wanted [ to work differently from base R's [.data.frame. But ifelse() speed is more like sorting ... it should just be fast in base. And how many of the 10,000+ CRAN packages use base::ifelse()? Probably quite a lot. Any automatic improvement there, with no code changes needed by package maintainers, would surely be welcomed, and it could even speed up CRAN testing.
Scale up the data size though so that the times are many seconds or minutes. As you know, I tend to dismiss any chart that is on a milliseconds scale.
One of the R manuals explains how to build R and run its tests. If you've already done that by the time of your very first communication, it has a much better chance.
Longer version with a 2 billion-length cnd etc.
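(The setup for these timings wasn't shown; presumably it's the earlier snippet scaled up, something like the hypothetical reconstruction below -- note the three input vectors alone occupy tens of GB of RAM.)
# hypothetical reconstruction of the long-vector inputs, not from the original post
N   <- 2e9
cnd <- rep(sample(c(TRUE, FALSE), 1e6, replace = TRUE), N / 1e6)
yes <- rep(sample(letters, 1e6, replace = TRUE), N / 1e6)
no  <- rep(sample(letters, 1e6, replace = TRUE), N / 1e6)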
> system.time(hutils::if_else(cnd, yes, no))
   user  system elapsed
  55.26    4.19   59.75
> system.time(hutils::if_else(cnd, yes, no))
   user  system elapsed
  55.47    4.07   59.55
> system.time(dplyr::if_else(cnd, yes, no))
   user  system elapsed
 105.97   52.11  159.17
> system.time(dplyr::if_else(cnd, yes, no))
   user  system elapsed
 109.05   59.56  168.74
> system.time(ifelse(cnd, yes, no))
   user  system elapsed
 467.92   51.15  520.66
While I never intended it to be faster when I wrote it, I'm confident hutils::if_else is always faster than dplyr::if_else, and I include that assertion as a unit test, skipping it on CRAN only because it takes a while to run (https://github.com/HughParsonage/hutils/blob/15e62bdf235dc6084f1006824a20a42fab12ce9c/tests/testthat/test_1-0-0_if_else.R#L149).
Similarly with hutils::coalesce vs dplyr::coalesce. (Again, nothing wrong or even slow with the original -- I just needed it in situations where compile time was important.)
Regarding inclusion in base R, a discussion seems to crop up on R-devel every so often (e.g. "ifelse() woes ... can we agree on an ifelse2()?"), but good points are raised there. Hadley's point is that even if we didn't care about backwards compatibility, the "correct" result of ifelse(c(TRUE, FALSE), factor(c("a", "b")), factor(c("c", "b"))) is unclear. That gives me a bit of pause about the implementation, but my current thought is to use hutils::if_else as the superstructure and, instead of raising an error when there is some type mismatch, fall back to base::ifelse, thus preserving backwards compatibility while gaining performance and a missing argument.
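To make the ambiguity concrete (a small illustration, not from the thread itself): base::ifelse drops the factor attributes and works on the underlying integer codes, which is part of why there is no obviously "correct" answer.
f_yes <- factor(c("a", "b"))
f_no  <- factor(c("c", "b"))
# returns bare integer codes rather than a factor or a character vector;
# whether codes, labels, or an error is "correct" is exactly the open question
ifelse(c(TRUE, FALSE), f_yes, f_no)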
We could optimize ifelse() calls into fifelse().
To reduce risk, we could "optimize" it to call both functions and compare the outputs; if they differ, ask the user to report it. Then, after some time, we could drop the extra overhead of the two calls and leave only fifelse.
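A rough sketch of that transition idea (the wrapper name is made up here, and a real comparison would have to tolerate the attribute-dropping differences between the two functions):
ifelse_checked <- function(test, yes, no) {
  fast <- data.table::fifelse(test, yes, no)
  slow <- base::ifelse(test, yes, no)
  # during the transition, flag any disagreement for a bug report;
  # identical() is stricter than a production check would need to be
  if (!identical(fast, slow))
    warning("fifelse() and ifelse() results differ; please report to data.table")
  fast
}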