Dplyr: Hybrid support for case_when

Created on 16 Mar 2016 · 12 Comments · Source: tidyverse/dplyr

We should talk about how to implement this - one generic approach (that might also work for lm()) is to construct an environment populated with promises that evaluate to the variables in the current group. Then if you call the function in the context of that environment, it should be able to use regular evaluation to get what it needs.
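A minimal base R sketch of this idea, for illustration only. `group_env()` and `add_promise()` are hypothetical helpers, not dplyr functions, and the example assumes the built-in `mtcars` data with a single hand-picked group:

```r
# Hypothetical helper: build an environment in which every column name is a
# promise for that column's values within one group.
group_env <- function(data, rows, parent = parent.frame()) {
  env <- new.env(parent = parent)
  add_promise <- function(name) {
    # The subsetting expression is only evaluated when `name` is first used.
    delayedAssign(name, data[[name]][rows], assign.env = env)
  }
  for (nm in names(data)) add_promise(nm)
  env
}

rows <- which(mtcars$cyl == 4)
env <- group_env(mtcars, rows)

# Regular R evaluation finds `disp` and `drat` through the promises.
res <- eval(quote(mean(disp / mean(drat))), env)
```

Because the call is evaluated by R itself, functions that do their own lookup of variables would see the group's columns without any special handling.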

feature

All 12 comments

@romainfrancois would be great if you could look into this please.

My approach here would be to implement a hybrid case_when handler. traverse_call() would walk the formulas and pick up the variables needed, for which proxies are then provided by existing mechanisms.

@krlmlr I think we should rewrite hybrid evaluation along the lines of what @hadley suggests here. My feeling is that it is unnecessarily complicated in its current state, and it has many quirks like this one that will keep showing up.

Hybrid evaluation should walk an expression from the inside out. This is important because currently the hybrid handlers are only called once for cases like this:

mtcars %>% mutate(
  mean(disp / mean(drat))
)

During the walk, when we see symbols corresponding to column names, we push the columns to the evaluation environment, whose parent is the calling environment. The columns are stored in shrinkable vectors to save on SEXP allocations. When we spot a handler, its arguments are R-evaluated and the handler is then called on the results. The output of the handler is inlined in the expression and we continue walking.
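The walk described above could be sketched roughly as follows. This is an illustration in plain R, not dplyr's C++ implementation: `handlers` and `walk()` are hypothetical names, the "handler" is an ordinary R function standing in for a hybrid handler, shrinkable vectors are not modelled, and calls with empty arguments such as `x[1, ]` are not supported:

```r
# Stand-in for hybrid handlers, keyed by function name (an assumption).
handlers <- list(mean = function(x) sum(x) / length(x))

walk <- function(expr, data, env) {
  if (is.symbol(expr)) {
    nm <- as.character(expr)
    # A symbol matching a column name: push the column into the eval env.
    if (nm %in% names(data)) assign(nm, data[[nm]], envir = env)
    return(expr)
  }
  if (!is.call(expr)) return(expr)  # constants are left alone

  # Inside out: process the arguments before looking at the call itself.
  args <- lapply(as.list(expr)[-1], walk, data = data, env = env)
  fn <- expr[[1]]
  if (is.symbol(fn) && as.character(fn) %in% names(handlers)) {
    # Arguments are R-evaluated, then the handler is called on the results.
    vals <- lapply(args, eval, envir = env)
    return(do.call(handlers[[as.character(fn)]], unname(vals)))
  }
  # Not a handler: rebuild the call with the (possibly inlined) arguments.
  as.call(c(list(fn), args))
}

env <- new.env(parent = globalenv())
res <- eval(walk(quote(mean(disp / mean(drat))), mtcars, env), env)
```

Here the handler runs twice, once for the inner `mean(drat)` and once for the outer call, and only `disp` and `drat` end up in the evaluation environment.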

This strategy would create some false positives when checking for column names, which would cause some unnecessary copies of columns into the shrinkable buffers. But at least the expressions are guaranteed to be evaluated correctly. For example, the following, which currently does not work, would work:

x <- list(am = 1)

mtcars %>%
  group_by(cyl) %>%
  summarise(x$am)

It should be possible to improve the logic and limit the false positives (essentially caused by functions that use special evaluation, like $) by having a list of functions whose arguments should not be interpreted as column names.

The nice thing is that even if we push too many columns in our eval env, the end result is still correct. In dplyr's evaluation model the column scope prevails, so ideally we'd have all columns in the evaluation environment, we just try to have as few as possible to avoid unnecessary copies. But using the R evaluation as late as possible makes sure that x$am scopes am from x and not the eval env.
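A small base R check of this point, assuming an evaluation environment that (unnecessarily) contains the `am` column and whose parent holds `x`:

```r
x <- list(am = 1)

# The eval env holds a column named `am`; its parent is the calling environment.
env <- new.env(parent = environment())
assign("am", mtcars$am, envir = env)

# `$` does not evaluate its second argument, so `am` is looked up inside `x`.
from_list <- eval(quote(x$am), env)

# A bare `am` still resolves to the column, so the column scope prevails.
from_column <- eval(quote(am), env)
```

The extra column costs a copy but does not change the result of `x$am`.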

> by having a list of functions whose arguments should not be interpreted as column names.

I was focused on $ but of course ~ is relevant here. Hmm could we just have our expression walker skip entirely the arguments of those functions? After all we don't want to call a hybrid handler for x$mean(col). On the other hand this last example shows that we still want the column scoping to apply within the arguments so we should continue parsing the expression to look for column names.

So we should have at least two categories of special functions: those whose arguments can be discarded entirely, such as ~, and those for which only the first level of arguments should be ignored, like $ or ::. This means:

  • for symbol arguments: don't push new columns in the eval env
  • for call arguments: don't call hybrid handlers.

So ~ would be a "full rescoper" and $ would be a "partial rescoper".
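The two categories, as proposed at this point in the thread, could be sketched with a walker that only collects column references. `collect_columns()`, `full_rescopers` and `partial_rescopers` are hypothetical names, and the suppression of hybrid handlers is not modelled here:

```r
full_rescopers <- c("~")           # arguments are skipped entirely
partial_rescopers <- c("$", "::")  # only first-level symbols are skipped

collect_columns <- function(expr, columns, skip_symbols = FALSE) {
  if (is.symbol(expr)) {
    nm <- as.character(expr)
    if (!skip_symbols && nm %in% columns) return(nm)
    return(character())
  }
  if (!is.call(expr)) return(character())

  fn <- expr[[1]]
  fn_name <- if (is.symbol(fn)) as.character(fn) else ""
  if (fn_name %in% full_rescopers) return(character())

  # For a partial rescoper, direct symbol arguments are ignored, but nested
  # calls are still walked with normal column scoping.
  skip <- fn_name %in% partial_rescopers
  found <- lapply(as.list(expr)[-1], collect_columns,
                  columns = columns, skip_symbols = skip)
  as.character(unique(unlist(found, use.names = FALSE)))
}

cols <- names(mtcars)
a <- collect_columns(quote(x$am + mean(disp)), cols)  # `am` is ignored
b <- collect_columns(quote(cyl ~ disp), cols)         # nothing collected
c <- collect_columns(quote(x[[am]]$cyl), cols)        # nested `am` is kept
```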

oops I think I was confused. I was assuming that with the new lazyeval, we'd only access columns with the .data pronoun. But of course we'd want the formulas in case_when() to refer to column names and only use the pronouns for disambiguating. So formulas should be traversed to collect column references.

I think we can use delayedAssign() to create an environment that only evaluates the variable expressions when needed. That way we don't need to handle all the exceptions; the hybrid evaluator can focus on what it knows how to handle.
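A minimal sketch of the laziness this relies on, using base R's `delayedAssign()`. The `lazy_column()` helper and the `forced` log are assumptions added purely to make the effect visible:

```r
env <- new.env()
forced <- character()

# Hypothetical helper: register a lazy column that records when it is forced.
lazy_column <- function(name) {
  delayedAssign(
    name,
    {
      forced <<- c(forced, name)  # log the first (and only) evaluation
      mtcars[[name]]
    },
    assign.env = env
  )
}
for (nm in names(mtcars)) lazy_column(nm)

res <- eval(quote(mean(disp / mean(drat))), env)
```

All eleven columns are registered, but only the two that the expression touches are ever materialised.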

oh, brilliant. I had completely forgotten you could do this.

> My feeling is that it is unnecessarily complicated in its current state, and it has many quirks like this one that will keep showing up.

@lionel-: Do you mean traverse_call()? It already seems to handle ~ (we'd need to override that for case_when()) and implement special cases for a few other functions. I have a fix for #1400 that consists of a small modification to traverse_call(), but I'm thinking about refactoring the method to separate tree traversal from node handling.

@hadley: Do you mean to replace the proxies created in traverse_call()? I think we'll want to instantiate the environment only once (in a grouped setting) for performance reasons. Why does it have to be delayed assignment?

> Do you mean traverse_call()?

Yes, to replace as much idiosyncratic logic as possible with simple evaluation.

The environment can stay the same across the whole evaluation procedure. The temporary buffers (shrinkable vectors) can be the same as well. Then we just refill and resize the buffers for each group. Delayed assignment makes sure we only instantiate buffers that are needed.
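As a rough illustration, the same environment can be reused for every group by re-registering the promises before each evaluation. This sketch is plain R: `bind_group()` is a hypothetical helper, and the shrinkable buffers are replaced by ordinary subsetting:

```r
env <- new.env()
columns <- names(mtcars)
groups <- split(seq_len(nrow(mtcars)), mtcars$cyl)

# Hypothetical helper: (re)bind one column lazily for the given rows.
bind_group <- function(name, rows) {
  force(rows)
  delayedAssign(name, mtcars[[name]][rows], assign.env = env)
}

expr <- quote(mean(disp / mean(drat)))
res <- vapply(groups, function(rows) {
  # Rebinding is cheap: no column is subset until the expression uses it.
  for (nm in columns) bind_group(nm, rows)
  eval(expr, env)
}, numeric(1))
```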

I might be missing the point. How would evaluation of iris %>% group_by(Species) %>% summarize(Petal.Width = mean(Petal.Width)) work with your approach, step by step?

> How would evaluation of iris %>% group_by(Species) %>% summarize(Petal.Width = mean(Petal.Width)) work with your approach, step by step?

This is not about hybrid handlers but about evaluating R expressions while emulating the dplyr scope: giving precedence to data frame columns and subsetting within groups.

Currently dplyr relies on the CallProxy and LazySubsets classes to create this scoping. Some of this code could be replaced by using proper R evaluation. This would be more reliable and would also reduce the code complexity.

I have started the bindr and bindrcpp packages; the latter can populate environments with active bindings that link back to C++ functions. The logic of the code is a bit mind-twisting, so it's easier to handle and test in a separate small package. Next, I'll be using that package to simplify hybrid evaluation, which should resolve this issue and #1400, #1452, #912, and help with #1012.
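The general mechanism can be tried out with base R's `makeActiveBinding()`, used here as a stand-in for the C++-backed bindings. Nothing below uses the API of the packages mentioned above; `state` and `make_binding()` are hypothetical names:

```r
env <- new.env()
state <- new.env()  # holds the row indices of the current group
groups <- split(seq_len(nrow(mtcars)), mtcars$cyl)

# Hypothetical helper: each column becomes an active binding whose value is
# computed from the current group every time it is read.
make_binding <- function(name) {
  force(name)
  makeActiveBinding(name, function() mtcars[[name]][state$rows], env)
}
for (nm in names(mtcars)) make_binding(nm)

expr <- quote(mean(disp / mean(drat)))
res <- vapply(groups, function(rows) {
  assign("rows", rows, envir = state)  # switch group; bindings stay in place
  eval(expr, env)
}, numeric(1))
```

Unlike promises, the bindings are created once and never need to be re-registered when the group changes.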
