The summarise function forces the use of base R functions, instead of the expected behaviour that it would use loaded ('active') functions from the own environment.
# (from another package:)
#' @inherit base::mean
#' @export
mean <- function(x, ..., na.rm = TRUE) {
base::mean(x, ..., na.rm = na.rm)
}
library(dplyr)
df <- tibble(group = LETTERS[1:10],
score = c(runif(3, 0, 1),
NA,
runif(6, 0, 1)))
mean(df$score) # now na.rm = TRUE by default
# [1] 0.5479927
df %>% summarise(m = mean(score))
# # A tibble: 1 x 1
# m
# <dbl>
# 1 NA
> df %>% summarise(m = mean(score, na.rm = TRUE))
# # A tibble: 1 x 1
# m
# <dbl>
# 1 0.5479927
Why not use functions from ones own environment? This mean function comes from another package, which I deliberately loaded, so I would expect it to work in all my code. But apparently, the summarise function is an exception. I think you should not force the use of base R functions like base::mean in dplyr.
You might think that package developers just shouldn't overwrite functions like mean, but even dplyr does this (like intersect, setdiff, setequal, union). That too gives other results than when only base R was loaded, just like you would expect. Hence, you loaded dplyr. Now I ask you to let summarise use the latest loaded functions, not just base R. Another example: if I would write a function to overwrite sum to solve #3189, it wouldn't be used by summarise at default, even when my package was loaded after dplyr.
I don't know if this helps, but it appears you can pass in arbitrary functions as long as they don't have an already used name (such as mean()). I have no idea what the list of non-overidable functions is though.
suppressPackageStartupMessages(library(dplyr))
mean <- function(x, ..., na.rm = TRUE) {
base::mean(x, ..., na.rm = na.rm)
}
mean2 <- mean
df <- tibble(score = c(1, NA, 3))
df %>% summarise(m = mean(score))
#> # A tibble: 1 x 1
#> m
#> <dbl>
#> 1 NA
df %>% summarise_at(, .vars="score", .funs = funs(mean))
#> # A tibble: 1 x 1
#> score
#> <dbl>
#> 1 NA
df %>% summarise(m = mean2(score))
#> # A tibble: 1 x 1
#> m
#> <dbl>
#> 1 2.00
df %>% summarise_at(, .vars="score", .funs = funs(mean2))
#> # A tibble: 1 x 1
#> score
#> <dbl>
#> 1 2.00
Thanks for the suggestion, but it doesn't explain why dplyr forces the use of base R functions.
Suddenly I thought of summarise_at, hoping that it would be the solution, but it isn't;
df %>% summarise_at(vars(score), mean)
# # A tibble: 1 x 1
# score
# <dbl>
# 1 NA
df %>% summarise_at(vars(score), funs(mean, .args = list(na.rm = TRUE)))
# # A tibble: 1 x 1
# score
# <dbl>
# 1 0.5479927
This of course works, but isn't really a solution or explanation too:
df %>% summarise(m = MyPackage::mean(score))
# # A tibble: 1 x 1
# m
# <dbl>
# 1 0.5479927
Thanks for the suggestion, but it doesn't explain why dplyr forces the use of base R functions.
That's because of hybrid evaluation, we use C++ implementations to speed up computations. Maybe we could check that the symbol mean evaluates to the mean function from the base package and only enable hybrid evaluation in that case.
That would really help a lot! Makes more sense too :wink:
we use C++ implementations to speed up computations
I got that now, after reading http://r-pkgs.had.co.nz/src.html (which is a great intro btw).
Maybe you could edit the wrapper functions like:
sum <- function(x) {
.Call('sum', PACKAGE = 'dplyr', x)
}
To:
sum <- function(x) {
if (identical(sum, base::sum)) {
.Call('sum', PACKAGE = 'dplyr', x)
} else {
sum
}
}
Something like this? Or create a helper function for all cases?
Hybrid evaluation in dplyr doesn't work by masking so that wouldn't work.
Agree to check if mean evaluates to base::mean before kicking in hybrid evaluation, as suggested by @lionel-.
Is this then going to be even more expensive than it is already ?
This check would be run only once, not once per group.
Sure. So there are these :
void install_simple_handlers(HybridHandlerMap& handlers) {
handlers[ Rf_install("mean") ] = simple_prototype<dplyr::Mean>;
handlers[ Rf_install("var") ] = simple_prototype<dplyr::Var>;
handlers[ Rf_install("sd") ] = simple_prototype<dplyr::Sd>;
handlers[ Rf_install("sum") ] = simple_prototype<dplyr::Sum>;
}
void install_minmax_handlers(HybridHandlerMap& handlers) {
handlers[Rf_install("min")] = minmax_prototype<true>;
handlers[Rf_install("max")] = minmax_prototype<false>;
}
void install_in_handlers(HybridHandlerMap& handlers) {
handlers[ Rf_install("%in%") ] = in_prototype;
}
Do we need the same for dplyr specific things, like these below, in case e.g. we want to "mask" row_number for example ?
void install_count_handlers(HybridHandlerMap& handlers) {
handlers[ Rf_install("n") ] = count_prototype;
handlers[ Rf_install("n_distinct") ] = count_distinct_prototype;
}
void install_nth_handlers(HybridHandlerMap& handlers) {
handlers[ Rf_install("first") ] = first_prototype;
handlers[ Rf_install("last") ] = last_prototype;
handlers[ Rf_install("nth") ] = nth_prototype;
}
void install_window_handlers(HybridHandlerMap& handlers) {
handlers[ Rf_install("row_number") ] = row_number_prototype;
handlers[ Rf_install("ntile") ] = ntile_prototype;
handlers[ Rf_install("min_rank") ] = rank_impl_prototype<dplyr::internal::min_rank_increment>;
handlers[ Rf_install("percent_rank") ] = rank_impl_prototype<dplyr::internal::percent_rank_increment>;
handlers[ Rf_install("dense_rank") ] = rank_impl_prototype<dplyr::internal::dense_rank_increment>;
handlers[ Rf_install("cume_dist") ] = rank_impl_prototype<dplyr::internal::cume_dist_increment>;
}
void install_offset_handlers(HybridHandlerMap& handlers) {
handlers[ Rf_install("lead") ] = leadlag_prototype<Lead>;
handlers[ Rf_install("lag") ] = leadlag_prototype<Lag>;
}
I guess these don't need to change
void install_debug_handlers(HybridHandlerMap& handlers) {
handlers[ Rf_install("verify_hybrid") ] = verify_hybrid_prototype;
handlers[ Rf_install("verify_not_hybrid") ] = verify_not_hybrid_prototype;
}
var and sd come from stats, not base
I don't see why we should be changing any of this code. I thought we would test that evaluating the function's symbol in the user's environment matches what we expect (base, stats, ...).?
Sure, I was dumping this here to show which are dealt with by hybrid.
We don't have a mechanism right now to capture "what we expect"
Maybe we can evaluate the symbol, and check if it is in either stats base or dplyr as these are the only 3 used
I would rather compare the pointers with objects picked up from the namespaces. Perhaps it makes sense to cache these in a static.
Do we need to change #3420 to account for the new behavior?
Yes that's better. We already have this static :
HybridHandlerMap& get_handlers() {
static HybridHandlerMap handlers;
maybe HybridHandlerMap can be updated from
typedef dplyr_hash_map<SEXP, HybridHandler> HybridHandlerMap;
to something like
typedef dplyr_hash_map<SEXP, std::pair<HybridHandler,SEXP>> HybridHandlerMap;
so that we would do something like this pseudo code:
handlers[ Rf_install("mean") ] = make_pair( simple_prototype<dplyr::Mean>, base::mean ) ;
otherwise we can have another map for the environments, maybe that's less disruptive
Instead of std::pair can we add a new class HybridHandler that includes the current typedef and the SEXP as members? I'd rather avoid two maps.
@krlmlr I see this going just before if (TYPEOF(fun_symbol) != SYMSXP) { so that's orthogonal to #3420 ... mhmm maybe not. anyway, I'll do this with 鈽曪笍.
Alright, I'll do this then.
This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/
Most helpful comment
That's because of hybrid evaluation, we use C++ implementations to speed up computations. Maybe we could check that the symbol
meanevaluates to the mean function from the base package and only enable hybrid evaluation in that case.