Related/follow-up to #3657.
`case_when` is a common tool in SQL, e.g. for building labels based on conditions of flags, such as cutting age groups:
```sql
case when age < 18 then '0-18'
     when age < 35 then '18-35'
     when age < 65 then '35-65'
     else '65+'
end as age_label
```
Our comrades at dplyr have implemented this as `case_when` with an interface like:

```r
dplyr::case_when(
  age < 18 ~ '0-18',
  age < 35 ~ '18-35',
  age < 65 ~ '35-65',
  TRUE ~ '65+'
)
```
Using & interpreting formulas seems pretty natural for R -- the only other thing I can think of would be something like `on` syntax (`case_when('age<18' = '0-18', ...)`).
As for the back end, dplyr is doing a two-pass for loop at the R level, which e.g. requires evaluating `age < 65` for _all_ rows (whereas this is unnecessary for any row already labelled '0-18' or '18-35'). I guess we can do much better with a C implementation. It will be interesting to contrast a proper lazy implementation with a parallel version (since IINM doing it in parallel will require first evaluating all the "LHS" values, e.g. as Jan encountered recently in `frollapply`).
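To make the lazy-vs-eager distinction concrete, here is a hypothetical pure-R sketch (not the proposed C implementation) of a lazy strategy: keep the set of still-unassigned rows and evaluate each subsequent condition only on that shrinking subset.

```r
# Hypothetical sketch of lazy case-when evaluation: each condition is only
# evaluated for rows that no earlier condition has already matched.
lazy_case <- function(n, conds, values, default) {
  # conds: list of functions idx -> logical, evaluated only on 'idx' rows
  out <- rep(default, n)
  todo <- seq_len(n)                    # rows not yet assigned
  for (i in seq_along(conds)) {
    if (!length(todo)) break
    hit <- conds[[i]](todo)             # condition evaluated on subset only
    out[todo[hit]] <- values[[i]]
    todo <- todo[!hit]                  # shrink the working set
  }
  out
}

age <- c(10, 40, 70, 25)
lazy_case(length(age),
          list(function(i) age[i] < 18,
               function(i) age[i] < 35,
               function(i) age[i] < 65),
          list('0-18', '18-35', '35-65'),
          '65+')
# "0-18" "35-65" "65+" "18-35"
```

Rows labelled by an early condition drop out of `todo`, so later conditions never touch them -- the property a parallel implementation would have to give up.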
Note that normally I'm doing `case_when` stuff with a look-up join, but it's not straightforward to implement a `case_when` as a join _in general_ -- though I think the two are isomorphic, backing out the corresponding join is, I think, a hard problem. E.g. in the example here, the order of truth values matters, since implicit in the later conditions is `(x & !any(y))` for each condition `x` preceded by conditions `y`. This case is straightforward to cast as a join on `cut(age)`, perhaps just using `roll`, but things can be much more complicated when several variables are involved in the truth conditions. So I don't think this route is likely to be fruitful.
I actually prefer syntax like `fcase(test1, value1, test2, value2, ..., default)`. It's easier to read when the tests and values are quite complicated. It would probably be easier to code from, too.
One way forward would be to S3 it and build a `(test1, value1, ...)` call to dispatch for `fcase.formula` (personally I find the formula style a bit more natural to read).
Yes, it's easier to read, especially when the tests & values are short and inlined. But I'm not sure whether the formula style will have "significant" overhead, since it requires extra parsing. If not, I'm OK with both styles...
Question: how does this improve on (or differ from) subassignment by reference? You can do an equivalent to a case when using the following:

```r
DT <- data.table(age = 0:100)
DT[, age_label := "65+"]
DT[age < 65, age_label := "35-65"]
DT[age < 35, age_label := "18-35"]
DT[age < 18, age_label := "0-18"]
```

for which we already get auto-indexing by default.
Hugh see the last paragraph. I had a fuller explanation written out but it disappeared when I pressed Comment 😢
Your example is a special case. `fcase` is supposed to be more generic and not limited to use within a data.table object.
I am thinking it would be more efficient if we don't have to evaluate all the "LHS" values, but I need to think about how I would solve that problem. What do you think of something like this?

```r
fcase(variable, default, test1, value1, ..., testN, valueN)
```
@2005m I don't quite follow the intuition of that function signature?
I have exploratory work on the `fcase` branch; the R API is:

```r
fcase = function(..., default=NA) .External(Cfcase, ..., default)
```

My thinking is to implement @shrektan's suggestion first as it'll be pretty straightforward (I just need to figure out how to deal with `DOTSXP` in C 😅; note `.External` is necessary to accept `...`). Then later build the logic to interpret formulas, which can map into the simpler implementation.
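A sketch of what that formula-to-positional mapping could look like at the R level (hypothetical helper name; real dispatch would presumably live in an S3 method):

```r
# Hypothetical helper: split each `lhs ~ rhs` formula into its condition
# and value, producing the alternating (test1, value1, ...) argument list.
formulas_to_pairs <- function(...) {
  fs <- list(...)
  unlist(lapply(fs, function(f) list(f[[2L]], f[[3L]])),  # lhs, rhs of `~`
         recursive = FALSE)
}

pairs <- formulas_to_pairs(age < 18 ~ '0-18', age < 35 ~ '18-35')
# pairs[[1]] is the unevaluated call `age < 18`, pairs[[2]] is '0-18', etc.;
# each condition could then be evaluated in the caller's environment.
```

Formulas capture their sides unevaluated, so `age` need not exist when the pairs are built.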
I think we can pass a list object to C, e.g.,

```r
fcase = function(..., default = NA) {
  .Call(Cfcase, list(...), default)
}
```

```c
SEXP fcase(SEXP x, SEXP def) {  /* 'default' is a reserved word in C */
  ...
}
```

In addition, `?.External` says the `...` accepted is up to 65:

> ... arguments to be passed to the compiled code. Up to 65 for .Call.
```r
fcase(variable, default, test1, value1, ..., testN, valueN)
```

With this construction one could test for `missing(..1)` at the R level and then evaluate `fcase` recursively.
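A pure-R sketch of that recursion (an assumed helper, using `...length()` from R >= 3.5.0 in place of the `missing(..1)` test; note every test is still evaluated eagerly here):

```r
# Hypothetical recursive case-when: peel off one (test, value) pair per
# call, recurse on the rest, then let the earlier pair overwrite the result.
fcase_r <- function(..., default = NA) {
  test <- ..1
  if (...length() <= 2L) {
    out <- rep(default, length(test))   # innermost pair: start from default
  } else {
    out <- do.call(fcase_r, c(list(...)[-(1:2)], list(default = default)))
  }
  out[test] <- ..2                      # earlier tests take priority
  out
}

x <- c(5, 15, 25)
fcase_r(x < 10, 0L, x < 20, 10L, default = 99L)
# 0 10 99
```

Because the outer call assigns last, earlier (test, value) pairs win where conditions overlap, matching SQL's first-match-wins semantics.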
@MichaelChirico, I was thinking to use it like this:

```r
fcase(age, '65+', '<18', '0-18', '<35', '18-35')
```

Since it only depends on a single variable `age`, maybe it suits `fswitch()` as a wrapper around `fcase()`?
The fewer checks at the R level the better. I also prefer `.Call` rather than `.External`.
We should first agree on the API. As @shrektan said, `fswitch` can be a wrapper to `fcase`, similarly `fifelse`; if we all agree on that we can then focus on the `fcase` API. Once the API is established then maybe an R prototype? This will be for 1.13.0 so there is plenty of time.
As of now we have:

Positional (test, value) pairs with a trailing default:

```r
fcase(...)
fcase(..., default)
fcase(when1, value1, ..., default)
fcase(age < 18, '0-18', age < 35, '18-35', age < 65, '35-65', '65+')
```

Formulas:

```r
fcase(...)
fcase(..., default)
fcase(x, ..., default)
fcase(age < 18 ~ '0-18', age < 35 ~ '18-35', age < 65 ~ '35-65', TRUE ~ '65+')
```

Lists of conditions and values:

```r
fcase(when, value, default)
fcase(list(age < 18, age < 35, age < 65),
      list('0-18', '18-35', '35-65'),
      '65+')
```

Any other suggestions?
Some food for thought:
https://coolbutuseless.github.io/2018/09/06/strict-case_when/
The author of that post seems to be asking for a lookup table -- a problem we already solve by update on join. Case when has a somewhat different goal, as explained in the comments above.
Question: can `value1` be the same length as `when1`, or is it always length 1?
Also, I think we need to decide which of the above function interfaces you want before coding anything. So far I think there are different ways of coding this: some simple (i.e. evaluating each `when`) and some more complex (i.e. not evaluating each `when`)...
In SQL it's usually the same length (`value1` and `when1` are both columns, e.g.), so I would expect that to work.
Maybe this is relevant. I just built a version on top of your very quick `fifelse()`, with a few checks and structure created in R, but which otherwise allows `fifelse()` to run the show. It can be found in the very developmental tidyfast package at https://github.com/TysonStanley/tidyfast. Would this very simple approach be useful?
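For context, building a case-when on top of `fifelse()` can be sketched as folding the (test, value) pairs right-to-left into nested calls, so the first matching test wins (a rough sketch, not tidyfast's actual code):

```r
# Hypothetical sketch: nest fifelse() calls so earlier pairs take priority.
case_via_fifelse <- function(tests, values, default) {
  out <- default
  for (i in rev(seq_along(tests)))
    out <- data.table::fifelse(tests[[i]], values[[i]], out)
  out
}

x <- c(5, 15, 99)
case_via_fifelse(list(x < 10, x < 20), list(0L, 10L), 99L)
# 0 10 99
```

The innermost `fifelse()` handles the last pair and the default; each wrapping call overrides it wherever its own test is TRUE.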
Thank you @TysonStanley. I actually finished writing the function in C. I was hoping to do a pull request tonight; I just need to finish writing the tests. I'll have a look at your function.
@2005m That's awesome! I'm excited to see it rolled out. Is the syntax similar to `dplyr::case_when()`?
And yes, take a look and let me know what you think. It's a pretty simple approach since I could rely on `fifelse()`.
Here is a sneak peek:

```r
x = sample(1:100, 3e7, replace = TRUE) # 114 Mb
data.table::setDTthreads(1L)
microbenchmark::microbenchmark(
  dplyr::case_when(
    x < 10L ~ 0L,
    x < 20L ~ 10L,
    x < 30L ~ 20L,
    x < 40L ~ 30L,
    x < 50L ~ 40L,
    x < 60L ~ 50L,
    x > 60L ~ 60L
  ),
  tidyfast::dt_case_when(
    x < 10L ~ 0L,
    x < 20L ~ 10L,
    x < 30L ~ 20L,
    x < 40L ~ 30L,
    x < 50L ~ 40L,
    x < 60L ~ 50L,
    x > 60L ~ 60L
  ),
  data.table::fcase(
    x < 10L, 0L,
    x < 20L, 10L,
    x < 30L, 20L,
    x < 40L, 30L,
    x < 50L, 40L,
    x < 60L, 50L,
    x > 60L, 60L
  ),
  times = 5L)
# Unit: seconds
#                    expr   min    lq  mean median    uq   max neval
#        dplyr::case_when 11.69 11.80 11.83  11.81 11.92 11.94     5
#  tidyfast::dt_case_when  2.18  2.23  2.32   2.38  2.39  2.41     5
#       data.table::fcase  1.87  1.91  2.02   2.02  2.05  2.26     5
```
The syntax is different from `dplyr::case_when` but we can probably change it. Note my computer is very old; the code is compiled with gcc 4.9, flag -O2. The function will support integer64 and nanotime. I still need to work on it, though. I think the PR won't be for tonight.
Thanks for the sneak peek. That performance is fun to see. And the syntax looks just as friendly either way. Personally I like the formulas, but for most cases it probably doesn't matter much.
Are you planning on supporting other vector types in the future?
Does it evaluate every single case, or only those that need to be reached to provide the answer?
@jangorecki, yes, it evaluates all cases and that is why I am not happy with it. I am not happy with the performance either. If it can't be improved, I think @TysonStanley's approach is better because it is simpler and the timings are similar.
@TysonStanley, I also prefer the formulas but that is probably subjective... Regarding other vector types, I don't know; it is up to the team.
We are not in a hurry. I think laziness should be crucial, otherwise it doesn't bring anything new (other than the API) compared to using a lookup table. Don't try to resolve everything in the first iteration. When you feel ready, submit a PR to get feedback. The final state can take multiple iterations or a follow-up PR(s).
Does `fifelse` evaluate lazily (i.e. only the cases it needs to)? I need to take a look at the code but thought I'd ask.
Yes it does:

```r
fifelse(TRUE, 1, stop("a"))
# Error in fifelse(TRUE, 1, stop("a")) : a
```
@jangorecki , I hope to be able to share my code this weekend. I need to write the Rd file and add more tests.