Dplyr: summarise shouldn't force base R functions like sum and mean

Created on 19 Dec 2017  路  20Comments  路  Source: tidyverse/dplyr

The summarise function forces the use of base R functions, instead of the expected behaviour that it would use loaded ('active') functions from the own environment.

# (from another package:)
#' @inherit base::mean
#' @export
mean <- function(x, ..., na.rm = TRUE) {
  base::mean(x, ..., na.rm = na.rm)
}

library(dplyr)
df <- tibble(group = LETTERS[1:10], 
             score = c(runif(3, 0, 1), 
                       NA,
                       runif(6, 0, 1)))

mean(df$score) # now na.rm = TRUE by default
# [1] 0.5479927

df %>% summarise(m = mean(score))
# # A tibble: 1 x 1
#       m
#   <dbl>
# 1    NA

> df %>% summarise(m = mean(score, na.rm = TRUE))
# # A tibble: 1 x 1
#       m
#   <dbl>
# 1    0.5479927

Why not use functions from ones own environment? This mean function comes from another package, which I deliberately loaded, so I would expect it to work in all my code. But apparently, the summarise function is an exception. I think you should not force the use of base R functions like base::mean in dplyr.

You might think that package developers just shouldn't overwrite functions like mean, but even dplyr does this (like intersect, setdiff, setequal, union). That too gives other results than when only base R was loaded, just like you would expect. Hence, you loaded dplyr. Now I ask you to let summarise use the latest loaded functions, not just base R. Another example: if I would write a function to overwrite sum to solve #3189, it wouldn't be used by summarise at default, even when my package was loaded after dplyr.

bug

Most helpful comment

Thanks for the suggestion, but it doesn't explain why dplyr forces the use of base R functions.

That's because of hybrid evaluation, we use C++ implementations to speed up computations. Maybe we could check that the symbol mean evaluates to the mean function from the base package and only enable hybrid evaluation in that case.

All 20 comments

I don't know if this helps, but it appears you can pass in arbitrary functions as long as they don't have an already used name (such as mean()). I have no idea what the list of non-overidable functions is though.

suppressPackageStartupMessages(library(dplyr))

mean <- function(x, ..., na.rm = TRUE) {
  base::mean(x, ..., na.rm = na.rm)
}

mean2 <- mean

df <- tibble(score = c(1, NA, 3))


df %>% summarise(m = mean(score))
#> # A tibble: 1 x 1
#>       m
#>   <dbl>
#> 1    NA
df %>% summarise_at(, .vars="score", .funs = funs(mean))
#> # A tibble: 1 x 1
#>   score
#>   <dbl>
#> 1    NA

df %>% summarise(m = mean2(score))
#> # A tibble: 1 x 1
#>       m
#>   <dbl>
#> 1  2.00
df %>% summarise_at(, .vars="score", .funs = funs(mean2))
#> # A tibble: 1 x 1
#>   score
#>   <dbl>
#> 1  2.00

Thanks for the suggestion, but it doesn't explain why dplyr forces the use of base R functions.

Suddenly I thought of summarise_at, hoping that it would be the solution, but it isn't;

df %>% summarise_at(vars(score), mean)
# # A tibble: 1 x 1
#   score
#   <dbl>
# 1    NA

df %>% summarise_at(vars(score), funs(mean, .args = list(na.rm = TRUE)))
# # A tibble: 1 x 1
#       score
#       <dbl>
# 1 0.5479927

This of course works, but isn't really a solution or explanation too:

df %>% summarise(m = MyPackage::mean(score))
# # A tibble: 1 x 1
#           m
#       <dbl>
# 1 0.5479927

Thanks for the suggestion, but it doesn't explain why dplyr forces the use of base R functions.

That's because of hybrid evaluation, we use C++ implementations to speed up computations. Maybe we could check that the symbol mean evaluates to the mean function from the base package and only enable hybrid evaluation in that case.

That would really help a lot! Makes more sense too :wink:

we use C++ implementations to speed up computations

I got that now, after reading http://r-pkgs.had.co.nz/src.html (which is a great intro btw).
Maybe you could edit the wrapper functions like:

sum <- function(x) {
  .Call('sum', PACKAGE = 'dplyr', x)
}

To:

sum <- function(x) {
  if (identical(sum, base::sum)) {
    .Call('sum', PACKAGE = 'dplyr', x)
  } else {
    sum
  }
}

Something like this? Or create a helper function for all cases?

Hybrid evaluation in dplyr doesn't work by masking so that wouldn't work.

Agree to check if mean evaluates to base::mean before kicking in hybrid evaluation, as suggested by @lionel-.

Is this then going to be even more expensive than it is already ?

This check would be run only once, not once per group.

Sure. So there are these :

void install_simple_handlers(HybridHandlerMap& handlers) {
  handlers[ Rf_install("mean") ] = simple_prototype<dplyr::Mean>;
  handlers[ Rf_install("var") ] = simple_prototype<dplyr::Var>;
  handlers[ Rf_install("sd") ] = simple_prototype<dplyr::Sd>;
  handlers[ Rf_install("sum") ] = simple_prototype<dplyr::Sum>;
}
void install_minmax_handlers(HybridHandlerMap& handlers) {
  handlers[Rf_install("min")] = minmax_prototype<true>;
  handlers[Rf_install("max")] = minmax_prototype<false>;
}
void install_in_handlers(HybridHandlerMap& handlers) {
  handlers[ Rf_install("%in%") ] = in_prototype;
}

Do we need the same for dplyr specific things, like these below, in case e.g. we want to "mask" row_number for example ?

void install_count_handlers(HybridHandlerMap& handlers) {
  handlers[ Rf_install("n") ] = count_prototype;
  handlers[ Rf_install("n_distinct") ] = count_distinct_prototype;
}
void install_nth_handlers(HybridHandlerMap& handlers) {
  handlers[ Rf_install("first") ] = first_prototype;
  handlers[ Rf_install("last") ] = last_prototype;
  handlers[ Rf_install("nth") ] = nth_prototype;
}
void install_window_handlers(HybridHandlerMap& handlers) {
  handlers[ Rf_install("row_number") ] = row_number_prototype;
  handlers[ Rf_install("ntile") ] = ntile_prototype;
  handlers[ Rf_install("min_rank") ] = rank_impl_prototype<dplyr::internal::min_rank_increment>;
  handlers[ Rf_install("percent_rank") ] = rank_impl_prototype<dplyr::internal::percent_rank_increment>;
  handlers[ Rf_install("dense_rank") ] = rank_impl_prototype<dplyr::internal::dense_rank_increment>;
  handlers[ Rf_install("cume_dist") ] = rank_impl_prototype<dplyr::internal::cume_dist_increment>;
}
void install_offset_handlers(HybridHandlerMap& handlers) {
  handlers[ Rf_install("lead") ] = leadlag_prototype<Lead>;
  handlers[ Rf_install("lag") ] = leadlag_prototype<Lag>;
}

I guess these don't need to change

void install_debug_handlers(HybridHandlerMap& handlers) {
  handlers[ Rf_install("verify_hybrid") ] = verify_hybrid_prototype;
  handlers[ Rf_install("verify_not_hybrid") ] = verify_not_hybrid_prototype;
}

var and sd come from stats, not base

I don't see why we should be changing any of this code. I thought we would test that evaluating the function's symbol in the user's environment matches what we expect (base, stats, ...).?

Sure, I was dumping this here to show which are dealt with by hybrid.
We don't have a mechanism right now to capture "what we expect"

Maybe we can evaluate the symbol, and check if it is in either stats base or dplyr as these are the only 3 used

I would rather compare the pointers with objects picked up from the namespaces. Perhaps it makes sense to cache these in a static.

Do we need to change #3420 to account for the new behavior?

Yes that's better. We already have this static :

HybridHandlerMap& get_handlers() {
  static HybridHandlerMap handlers;

maybe HybridHandlerMap can be updated from

typedef dplyr_hash_map<SEXP, HybridHandler> HybridHandlerMap;

to something like

typedef dplyr_hash_map<SEXP, std::pair<HybridHandler,SEXP>> HybridHandlerMap;

so that we would do something like this pseudo code:

handlers[ Rf_install("mean") ] = make_pair( simple_prototype<dplyr::Mean>, base::mean )  ;

otherwise we can have another map for the environments, maybe that's less disruptive

Instead of std::pair can we add a new class HybridHandler that includes the current typedef and the SEXP as members? I'd rather avoid two maps.

@krlmlr I see this going just before if (TYPEOF(fun_symbol) != SYMSXP) { so that's orthogonal to #3420 ... mhmm maybe not. anyway, I'll do this with 鈽曪笍.

Alright, I'll do this then.

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

Was this page helpful?
0 / 5 - 0 ratings

Related issues

krlmlr picture krlmlr  路  22Comments

jtrecenti picture jtrecenti  路  19Comments

billdenney picture billdenney  路  19Comments

hadley picture hadley  路  18Comments

hadley picture hadley  路  15Comments