Related/follow-up to #3657.
`case_when` is a common tool in SQL, e.g. for building labels based on conditions of flags, such as cutting age groups:
```sql
case when age < 18 then '0-18'
     when age < 35 then '18-35'
     when age < 65 then '35-65'
     else '65+'
end as age_label
```
Our comrades at dplyr have implemented this as `case_when` with an interface like:

```r
dplyr::case_when(
  age < 18 ~ '0-18',
  age < 35 ~ '18-35',
  age < 65 ~ '35-65',
  TRUE ~ '65+'
)
```
Using & interpreting formulas seems pretty natural for R -- the only other thing I can think of would be something like `on` syntax (`case_when('age<18' = '0-18', ...)`).
As for the back end, dplyr is doing a two-pass for loop at the R level, which e.g. requires evaluating `age < 65` for _all_ rows (whereas this is unnecessary for any row already labelled '0-18' or '18-35'). I guess we can do much better with a C implementation. It will be interesting to contrast a proper lazy implementation with a parallel version (since IINM doing it in parallel will require first evaluating all the "LHS" values, e.g. as Jan encountered recently in `frollapply`).
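To make the lazy-vs-eager distinction concrete, here is a hypothetical pure-R sketch (not the proposed C implementation) of a lazy strategy: keep the set of still-unassigned rows and evaluate each subsequent condition only on that shrinking subset.

```r
# Hypothetical sketch of lazy case-when evaluation: each condition is only
# evaluated for rows that no earlier condition has already matched.
lazy_case <- function(n, conds, values, default) {
  # conds: list of functions idx -> logical, evaluated only on 'idx' rows
  out <- rep(default, n)
  todo <- seq_len(n)                    # rows not yet assigned
  for (i in seq_along(conds)) {
    if (!length(todo)) break
    hit <- conds[[i]](todo)             # condition evaluated on subset only
    out[todo[hit]] <- values[[i]]
    todo <- todo[!hit]                  # shrink the working set
  }
  out
}

age <- c(10, 40, 70, 25)
lazy_case(length(age),
          list(function(i) age[i] < 18,
               function(i) age[i] < 35,
               function(i) age[i] < 65),
          list('0-18', '18-35', '35-65'),
          '65+')
# "0-18" "35-65" "65+" "18-35"
```

Rows labelled by an early condition drop out of `todo`, so later conditions never touch them -- the property a parallel implementation would have to give up.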
Note that normally I'm doing `case_when` stuff with a look-up join, but it's not straightforward to implement a `case_when` as a join _in general_ -- though I think the two are isomorphic, backing out the corresponding join is, I think, a hard problem. E.g. in the example here, the order of truth values matters, since implicit in the later conditions is `(x & !any(y))` for each condition `x` preceded by conditions `y`. This case is straightforward to cast as a join on `cut(age)`, perhaps just using `roll`, but things can be much more complicated when several variables are involved in the truth conditions. So I don't think this route is likely to be fruitful.
I actually prefer syntax like `fcase(test1, value1, test2, value2, ..., default)`. It's easier to read when the tests and values are quite complicated. It would probably be easier to code from, too.
One way forward would be to S3 it and build a `(test1, value1, ...)` call to dispatch for `fcase.formula` (personally I find the formula style a bit more natural to read).
Yes, it's easier to read, especially when the tests & values are short and inlined. But I'm not sure whether the formula style will have "significant" overhead, since it requires extra parsing. If not, I'm OK with both styles...
Question: how does this improve on (or differ from) subassignment by reference? You can do an equivalent to a case when using the following:

```r
DT <- data.table(age = 0:100)
DT[, age_label := "65+"]
DT[age < 65, age_label := "35-65"]
DT[age < 35, age_label := "18-35"]
DT[age < 18, age_label := "0-18"]
```

for which we already get auto-indexing by default.
Hugh see the last paragraph. I had a fuller explanation written out but it disappeared when I pressed Comment 😢
Your example is a special case. `fcase` is supposed to be more generic and not limited to use within a data.table object.
I am thinking it would be more efficient if we don't have to evaluate all the "LHS" values, but I need to think about how I would solve that problem. What do you think of something like this?

```r
fcase(variable, default, test1, value1, ..., testN, valueN)
```
@2005m I don't quite follow the intuition of that function signature?
I have exploratory work on the `fcase` branch; the R API is:

```r
fcase = function(..., default=NA) .External(Cfcase, ..., default)
```

My thinking is to implement @shrektan's suggestion first as it'll be pretty straightforward (I just need to figure out how to deal with `DOTSXP` in C 😅; note `.External` is necessary to accept `...`). Then later build the logic to interpret formulas, which can map into the simpler implementation.
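A sketch of what that formula-to-positional mapping could look like at the R level (hypothetical helper name; real dispatch would presumably live in an S3 method):

```r
# Hypothetical helper: split each `lhs ~ rhs` formula into its condition
# and value, producing the alternating (test1, value1, ...) argument list.
formulas_to_pairs <- function(...) {
  fs <- list(...)
  unlist(lapply(fs, function(f) list(f[[2L]], f[[3L]])),  # lhs, rhs of `~`
         recursive = FALSE)
}

pairs <- formulas_to_pairs(age < 18 ~ '0-18', age < 35 ~ '18-35')
# pairs[[1]] is the unevaluated call `age < 18`, pairs[[2]] is '0-18', etc.;
# each condition could then be evaluated in the caller's environment.
```

Formulas capture their sides unevaluated, so `age` need not exist when the pairs are built.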
I think we can pass a list object to C, e.g.,

```r
fcase = function(..., default = NA) {
  .Call(Cfcase, list(...), default)
}
```

```c
SEXP fcase(SEXP x, SEXP def) {  /* 'default' is a reserved word in C */
  ...
}
```

In addition, `?.External` says the `...` accepted is up to 65:

> ... arguments to be passed to the compiled code. Up to 65 for .Call.
```r
fcase(variable, default, test1, value1, ..., testN, valueN)
```

With this construction one could test for `missing(..1)` at the R level and then evaluate `fcase` recursively.
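A pure-R sketch of that recursion (an assumed helper, using `...length()` from R >= 3.5.0 in place of the `missing(..1)` test; note every test is still evaluated eagerly here):

```r
# Hypothetical recursive case-when: peel off one (test, value) pair per
# call, recurse on the rest, then let the earlier pair overwrite the result.
fcase_r <- function(..., default = NA) {
  test <- ..1
  if (...length() <= 2L) {
    out <- rep(default, length(test))   # innermost pair: start from default
  } else {
    out <- do.call(fcase_r, c(list(...)[-(1:2)], list(default = default)))
  }
  out[test] <- ..2                      # earlier tests take priority
  out
}

x <- c(5, 15, 25)
fcase_r(x < 10, 0L, x < 20, 10L, default = 99L)
# 0 10 99
```

Because the outer call assigns last, earlier (test, value) pairs win where conditions overlap, matching SQL's first-match-wins semantics.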
@MichaelChirico, I was thinking to use it like this:

```r
fcase(age, '65+', '<18', '0-18', '<35', '18-35')
```

Since it only depends on a single variable `age`, maybe it suits `fswitch()` as a wrapper around `fcase()`?
The fewer checks at the R level the better. I also prefer `.Call` rather than `.External`.
We should first agree on the API. As @shrektan said, `fswitch` can be a wrapper to `fcase`, similarly `fifelse`; if we all agree on that we can then focus on the `fcase` API. Once the API is established then maybe an R prototype? This will be for 1.13.0 so there is plenty of time.
As of now we have:

Positional (test, value) pairs with a trailing default:

```r
fcase(...)
fcase(..., default)
fcase(when1, value1, ..., default)
fcase(age < 18, '0-18', age < 35, '18-35', age < 65, '35-65', '65+')
```

Formulas:

```r
fcase(...)
fcase(..., default)
fcase(x, ..., default)
fcase(age < 18 ~ '0-18', age < 35 ~ '18-35', age < 65 ~ '35-65', TRUE ~ '65+')
```

Lists of conditions and values:

```r
fcase(when, value, default)
fcase(list(age < 18, age < 35, age < 65),
      list('0-18', '18-35', '35-65'),
      '65+')
```

Any other suggestions?
Some food for thought:
https://coolbutuseless.github.io/2018/09/06/strict-case_when/
The author of that post seems to be asking for a lookup table -- a problem we already solve by update on join. Case when has a somewhat different goal, as explained in the comments above.
Question: can `value1` be the same length as `when1`, or is it always length 1?
Also, I think we need to decide which of the above function interfaces you want before coding anything. So far I think there are different ways of coding this: some simple (i.e. evaluating each `when`) and some more complex (i.e. not evaluating each `when`)...
In SQL it's usually the same length (`value1` and `when1` are both columns, e.g.), so I would expect that to work.
Maybe this is relevant. I just built a version on top of your very quick `fifelse()`, with a few checks and structure created in R, but which otherwise allows `fifelse()` to run the show. It can be found in the very developmental tidyfast package at https://github.com/TysonStanley/tidyfast. Would this very simple approach be useful?
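For context, building a case-when on top of `fifelse()` can be sketched as folding the (test, value) pairs right-to-left into nested calls, so the first matching test wins (a rough sketch, not tidyfast's actual code):

```r
# Hypothetical sketch: nest fifelse() calls so earlier pairs take priority.
case_via_fifelse <- function(tests, values, default) {
  out <- default
  for (i in rev(seq_along(tests)))
    out <- data.table::fifelse(tests[[i]], values[[i]], out)
  out
}

x <- c(5, 15, 99)
case_via_fifelse(list(x < 10, x < 20), list(0L, 10L), 99L)
# 0 10 99
```

The innermost `fifelse()` handles the last pair and the default; each wrapping call overrides it wherever its own test is TRUE.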
Thank you @TysonStanley. I actually finished writing the function in C. I was hoping to do a pull request tonight; I just need to finish writing the tests. I'll have a look at your function.
@2005m That's awesome! I'm excited to see it rolled out. Is the syntax similar to `dplyr::case_when()`?
And yes, take a look and let me know what you think. It's a pretty simple approach since I could rely on `fifelse()`.
Here is a sneak peek:

```r
x = sample(1:100, 3e7, replace = TRUE) # 114 Mb
data.table::setDTthreads(1L)
microbenchmark::microbenchmark(
  dplyr::case_when(
    x < 10L ~ 0L,
    x < 20L ~ 10L,
    x < 30L ~ 20L,
    x < 40L ~ 30L,
    x < 50L ~ 40L,
    x < 60L ~ 50L,
    x > 60L ~ 60L
  ),
  tidyfast::dt_case_when(
    x < 10L ~ 0L,
    x < 20L ~ 10L,
    x < 30L ~ 20L,
    x < 40L ~ 30L,
    x < 50L ~ 40L,
    x < 60L ~ 50L,
    x > 60L ~ 60L
  ),
  data.table::fcase(
    x < 10L, 0L,
    x < 20L, 10L,
    x < 30L, 20L,
    x < 40L, 30L,
    x < 50L, 40L,
    x < 60L, 50L,
    x > 60L, 60L
  ),
  times = 5L)
# Unit: seconds
#                    expr   min    lq  mean median    uq   max neval
#        dplyr::case_when 11.69 11.80 11.83  11.81 11.92 11.94     5
#  tidyfast::dt_case_when  2.18  2.23  2.32   2.38  2.39  2.41     5
#       data.table::fcase  1.87  1.91  2.02   2.02  2.05  2.26     5
```
The syntax is different from `dplyr::case_when` but we can probably change it. Note my computer is very old; the code is compiled with gcc 4.9, flag -O2. The function will support integer64 and nanotime. I still need to work on it, though. I think the PR won't be for tonight.
Thanks for the sneak peek. That performance is fun to see. And the syntax looks just as friendly either way. Personally I like the formulas, but for most cases it probably doesn't matter much.
Are you planning on supporting other vector types in the future?
Does it evaluate every single case, or only those that need to be reached to provide the answer?
@jangorecki, yes, it evaluates all cases and that is why I am not happy with it. I am not happy with the performance either. If it can't be improved, I think @TysonStanley's approach is better because it is simpler and the timings are similar.
@TysonStanley, I also prefer the formulas but that is probably subjective... Regarding other vector types, I don't know; it is up to the team.
We are not in a hurry. I think laziness should be crucial, otherwise it doesn't bring anything new (other than the API) compared to using a lookup table. Don't try to resolve everything in the first iteration. When you feel ready, submit a PR to get feedback. The final state can take multiple iterations or a follow-up PR(s).
Does `fifelse` evaluate lazily (i.e. only the cases it needs to)? I need to take a look at the code but thought I'd ask.
Yes it does:

```r
fifelse(TRUE, 1, stop("a"))
# Error in fifelse(TRUE, 1, stop("a")) : a
```
@jangorecki , I hope to be able to share my code this weekend. I need to write the Rd file and add more tests.