Dplyr: summarise shouldn't force base R functions like sum and mean

Created on 19 Dec 2017 · 20Comments · Source: tidyverse/dplyr

The summarise function forces the use of base R functions, instead of the expected behaviour that it would use loaded ('active') functions from the own environment.

# (from another package:)
#' @inherit base::mean
#' @export
mean <- function(x, ..., na.rm = TRUE) {
  base::mean(x, ..., na.rm = na.rm)
}

library(dplyr)
df <- tibble(group = LETTERS[1:10], 
             score = c(runif(3, 0, 1), 
                       NA,
                       runif(6, 0, 1)))

mean(df$score) # now na.rm = TRUE by default
# [1] 0.5479927

df %>% summarise(m = mean(score))
# # A tibble: 1 x 1
#       m
#   <dbl>
# 1    NA

> df %>% summarise(m = mean(score, na.rm = TRUE))
# # A tibble: 1 x 1
#       m
#   <dbl>
# 1    0.5479927

Why not use functions from ones own environment? This mean function comes from another package, which I deliberately loaded, so I would expect it to work in all my code. But apparently, the summarise function is an exception. I think you should not force the use of base R functions like base::mean in dplyr.

You might think that package developers just shouldn't overwrite functions like mean, but even dplyr does this (like intersect, setdiff, setequal, union). That too gives other results than when only base R was loaded, just like you would expect. Hence, you loaded dplyr. Now I ask you to let summarise use the latest loaded functions, not just base R. Another example: if I would write a function to overwrite sum to solve #3189, it wouldn't be used by summarise at default, even when my package was loaded after dplyr.

bug

Source

msberends

👍1

Most helpful comment

Thanks for the suggestion, but it doesn't explain why dplyr forces the use of base R functions.

That's because of hybrid evaluation, we use C++ implementations to speed up computations. Maybe we could check that the symbol mean evaluates to the mean function from the base package and only enable hybrid evaluation in that case.

lionel- on 24 Dec 2017

👍2

All 20 comments

I don't know if this helps, but it appears you can pass in arbitrary functions as long as they don't have an already used name (such as mean()). I have no idea what the list of non-overidable functions is though.

suppressPackageStartupMessages(library(dplyr))

mean <- function(x, ..., na.rm = TRUE) {
  base::mean(x, ..., na.rm = na.rm)
}

mean2 <- mean

df <- tibble(score = c(1, NA, 3))


df %>% summarise(m = mean(score))
#> # A tibble: 1 x 1
#>       m
#>   <dbl>
#> 1    NA
df %>% summarise_at(, .vars="score", .funs = funs(mean))
#> # A tibble: 1 x 1
#>   score
#>   <dbl>
#> 1    NA

df %>% summarise(m = mean2(score))
#> # A tibble: 1 x 1
#>       m
#>   <dbl>
#> 1  2.00
df %>% summarise_at(, .vars="score", .funs = funs(mean2))
#> # A tibble: 1 x 1
#>   score
#>   <dbl>
#> 1  2.00

JohnMount on 19 Dec 2017

Thanks for the suggestion, but it doesn't explain why dplyr forces the use of base R functions.

Suddenly I thought of summarise_at, hoping that it would be the solution, but it isn't;

df %>% summarise_at(vars(score), mean)
# # A tibble: 1 x 1
#   score
#   <dbl>
# 1    NA

df %>% summarise_at(vars(score), funs(mean, .args = list(na.rm = TRUE)))
# # A tibble: 1 x 1
#       score
#       <dbl>
# 1 0.5479927

This of course works, but isn't really a solution or explanation too:

df %>% summarise(m = MyPackage::mean(score))
# # A tibble: 1 x 1
#           m
#       <dbl>
# 1 0.5479927

msberends on 24 Dec 2017

Thanks for the suggestion, but it doesn't explain why dplyr forces the use of base R functions.

lionel- on 24 Dec 2017

👍2

That would really help a lot! Makes more sense too :wink:

msberends on 25 Dec 2017

we use C++ implementations to speed up computations

I got that now, after reading http://r-pkgs.had.co.nz/src.html (which is a great intro btw).
Maybe you could edit the wrapper functions like:

sum <- function(x) {
  .Call('sum', PACKAGE = 'dplyr', x)
}

To:

sum <- function(x) {
  if (identical(sum, base::sum)) {
    .Call('sum', PACKAGE = 'dplyr', x)
  } else {
    sum
  }
}

Something like this? Or create a helper function for all cases?

msberends on 4 Jan 2018

Hybrid evaluation in dplyr doesn't work by masking so that wouldn't work.

lionel- on 4 Jan 2018

Agree to check if mean evaluates to base::mean before kicking in hybrid evaluation, as suggested by @lionel-.

krlmlr on 17 Jan 2018

👍1

Is this then going to be even more expensive than it is already ?

romainfrancois on 13 Mar 2018

This check would be run only once, not once per group.

krlmlr on 13 Mar 2018

Sure. So there are these :

void install_simple_handlers(HybridHandlerMap& handlers) {
  handlers[ Rf_install("mean") ] = simple_prototype<dplyr::Mean>;
  handlers[ Rf_install("var") ] = simple_prototype<dplyr::Var>;
  handlers[ Rf_install("sd") ] = simple_prototype<dplyr::Sd>;
  handlers[ Rf_install("sum") ] = simple_prototype<dplyr::Sum>;
}
void install_minmax_handlers(HybridHandlerMap& handlers) {
  handlers[Rf_install("min")] = minmax_prototype<true>;
  handlers[Rf_install("max")] = minmax_prototype<false>;
}
void install_in_handlers(HybridHandlerMap& handlers) {
  handlers[ Rf_install("%in%") ] = in_prototype;
}

Do we need the same for dplyr specific things, like these below, in case e.g. we want to "mask" row_number for example ?

void install_count_handlers(HybridHandlerMap& handlers) {
  handlers[ Rf_install("n") ] = count_prototype;
  handlers[ Rf_install("n_distinct") ] = count_distinct_prototype;
}
void install_nth_handlers(HybridHandlerMap& handlers) {
  handlers[ Rf_install("first") ] = first_prototype;
  handlers[ Rf_install("last") ] = last_prototype;
  handlers[ Rf_install("nth") ] = nth_prototype;
}
void install_window_handlers(HybridHandlerMap& handlers) {
  handlers[ Rf_install("row_number") ] = row_number_prototype;
  handlers[ Rf_install("ntile") ] = ntile_prototype;
  handlers[ Rf_install("min_rank") ] = rank_impl_prototype<dplyr::internal::min_rank_increment>;
  handlers[ Rf_install("percent_rank") ] = rank_impl_prototype<dplyr::internal::percent_rank_increment>;
  handlers[ Rf_install("dense_rank") ] = rank_impl_prototype<dplyr::internal::dense_rank_increment>;
  handlers[ Rf_install("cume_dist") ] = rank_impl_prototype<dplyr::internal::cume_dist_increment>;
}
void install_offset_handlers(HybridHandlerMap& handlers) {
  handlers[ Rf_install("lead") ] = leadlag_prototype<Lead>;
  handlers[ Rf_install("lag") ] = leadlag_prototype<Lag>;
}

I guess these don't need to change

void install_debug_handlers(HybridHandlerMap& handlers) {
  handlers[ Rf_install("verify_hybrid") ] = verify_hybrid_prototype;
  handlers[ Rf_install("verify_not_hybrid") ] = verify_not_hybrid_prototype;
}

romainfrancois on 13 Mar 2018

var and sd come from stats, not base

romainfrancois on 13 Mar 2018

I don't see why we should be changing any of this code. I thought we would test that evaluating the function's symbol in the user's environment matches what we expect (base, stats, ...).?

krlmlr on 13 Mar 2018

Sure, I was dumping this here to show which are dealt with by hybrid.
We don't have a mechanism right now to capture "what we expect"

romainfrancois on 13 Mar 2018

Maybe we can evaluate the symbol, and check if it is in either stats base or dplyr as these are the only 3 used

romainfrancois on 13 Mar 2018

I would rather compare the pointers with objects picked up from the namespaces. Perhaps it makes sense to cache these in a static.

lionel- on 13 Mar 2018

Do we need to change #3420 to account for the new behavior?

krlmlr on 13 Mar 2018

Yes that's better. We already have this static :

HybridHandlerMap& get_handlers() {
  static HybridHandlerMap handlers;

maybe HybridHandlerMap can be updated from

typedef dplyr_hash_map<SEXP, HybridHandler> HybridHandlerMap;

to something like

typedef dplyr_hash_map<SEXP, std::pair<HybridHandler,SEXP>> HybridHandlerMap;

so that we would do something like this pseudo code:

handlers[ Rf_install("mean") ] = make_pair( simple_prototype<dplyr::Mean>, base::mean )  ;

otherwise we can have another map for the environments, maybe that's less disruptive

romainfrancois on 13 Mar 2018

Instead of std::pair can we add a new class HybridHandler that includes the current typedef and the SEXP as members? I'd rather avoid two maps.

krlmlr on 13 Mar 2018

@krlmlr I see this going just before if (TYPEOF(fun_symbol) != SYMSXP) { so that's orthogonal to #3420 ... mhmm maybe not. anyway, I'll do this with ☕️.

Alright, I'll do this then.

romainfrancois on 13 Mar 2018

👍1

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/