When transitioning from by_row() to map() approach I've found that several dplyr/purrr/tidyr functions do not evaluate within the map() environment. For instance below I was expecting the value returned by n() in the map() example to match that of the by_row() version. Instead it returns the number of rows of the nested input Temp. This might be intended but I can't think of an obvious way to use dplyr::n() on nested tibbles via map().
library(dplyr); library(purrr); library(tidyr)
data(iris)
Temp <- iris %>%
group_by(Species) %>%
nest()
ByRow <- by_row(Temp, function(x){
x$data[[1]] %>%
filter(Sepal.Length >= 6) %>%
summarise(petal_length_avg = mean(Petal.Length),
obs = n())
}, .to = 'test')
ByRow$test
Map <- mutate(Temp, test = map(data, . %>%
filter(Sepal.Length >= 6) %>%
summarise(petal_length_avg = mean(Petal.Length),
obs = n())))
Map$test
Thanks. What is the expected result? Does it work better with #2190?
devtools::install_github("hadley/dplyr#2190")
@hadley: I don't understand the issue here.
I think this is a better example:
library(dplyr)
library(purrr)
df <- tibble(x = list(
tibble(y = 1:2),
tibble(y = 1:3),
tibble(y = 1:4)
))
nrows <- function(df) {
df %>% summarise(n = n()) %>% .[["n"]]
}
df %>%
mutate(
n1 = x %>% map_int(nrows),
n2 = x %>% map_int(. %>% summarise(n = n()) %>% .[["n"]])
)
#> # A tibble: 3 脳 3
#> x n1 n2
#> <list> <int> <int>
#> 1 <tibble [2 脳 1]> 2 3
#> 2 <tibble [3 脳 1]> 3 3
#> 3 <tibble [4 脳 1]> 4 3
n1 and n2 should be equivalent, but aren't, presumably because dplyr and purrr (or magrittr?) are fighting over what . means,
Same behavior with #2190.
This looks like a too eager substitution by hybrid evaluation. It walks the expression, encounters n() and replaces it with the number of rows:
df %>%
mutate(
n1 = x %>% map_int(nrows),
n2 = x %>% map_int(. %>% summarise(n = 3) %>% .[["n"]])
)
I don't know how to fix this, short of disabling hybrid evaluation if the expression cannot be evaluated in full by the hybrid evaluator. But this will break e.g. df %>% group_by(id) %>% summarize(n() - 1), unless we implement a regular version of n().
@hadley: Please advise.
Ah of course. In that case, n3 and n4 below are more worrying:
df %>%
mutate(
n1 = x %>% map_int(nrows),
n2 = x %>% map_int(. %>% summarise(n = n()) %>% .[["n"]]),
n3 = map_int(x, ~ summarise(., n = n())[["n"]]),
n4 = map_int(x, function(df) summarise(df, n = n())[["n"]])
)
#> # A tibble: 3 脳 5
#> x n1 n2 n3 n4
#> <list> <int> <int> <int> <int>
#> 1 <tibble [2 脳 1]> 2 3 3 3
#> 2 <tibble [3 脳 1]> 3 3 3 3
#> 3 <tibble [4 脳 1]> 4 3 3 3
I think we can probably leave off resolving this until the next version? I suspect it will require more re-thinking about how the hybrid evaluator works.
Adding to this issue, is there any reason why output1 is broken, but not output2 or output3 ?
library(tidyverse)
temp <- data_frame(id = 1:5)
temp$data <- 1:5 %>% map(~{
set.seed(.x)
data_frame(group = sample(1:2, size = .x, replace = TRUE))
})
temp
### output1 returns n= 5 everywhere
output1 <- temp %>% mutate(dd = map(data, function(X) {
X %>%
group_by(group) %>%
summarise(n = n())
})) %>% unnest(dd)
output1
### output2 returns the correct n
output2 <- temp %>% mutate(dd = map(data, function(X) {
X %>%
count(group)
})) %>% unnest(dd)
### output3 returns the correct n
ffs <- function(X){
X %>%
group_by(group) %>%
summarise(n = n())
}
output3 <- temp %>% mutate(dd = map(data, ffs)) %>% unnest(dd)
I'll add this version to the mix to create the Magritte lambda outside of the mutate:
nrows_magrittr_lambda <- . %>% summarise(n = n()) %>% .[["n"]]
I've added the hybrid label.
library(dplyr)
#>
#> Attachement du package : 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(purrr)
df <- tibble(x = list(
tibble(y = 1:2),
tibble(y = 1:3),
tibble(y = 1:4)
))
nrows <- function(df) {
df %>% summarise(n = n()) %>% .[["n"]]
}
nrows_magrittr_lambda <- . %>% summarise(n = n()) %>% .[["n"]]
trace( dplyr:::mutate.tbl_df, tracer = quote(print(dots)), at = 3 )
#> Tracing function "mutate.tbl_df" in package "dplyr
#> (not-exported)"
#> [1] "mutate.tbl_df"
mutate( df,
n1 = x %>% map_int(nrows),
n5 = map_int(x, nrows_magrittr_lambda),
n2 = x %>% map_int(. %>% summarise(n = n()) %>% .[["n"]]),
n3 = map_int(x, ~ summarise(., n = n())[["n"]]),
n4 = map_int(x, function(df) summarise(df, n = n())[["n"]])
)
#> Tracing mutate.tbl_df(df, n1 = x %>% map_int(nrows), n5 = map_int(x, .... step 3
#> $n1
#> <quosure>
#> expr: ^x %>% map_int(nrows)
#> env: global
#>
#> $n5
#> <quosure>
#> expr: ^map_int(x, nrows_magrittr_lambda)
#> env: global
#>
#> $n2
#> <quosure>
#> expr: ^x %>% map_int(. %>% summarise(n = n()) %>% .[["n"]])
#> env: global
#>
#> $n3
#> <quosure>
#> expr: ^map_int(x, ~summarise(., n = n())[["n"]])
#> env: global
#>
#> $n4
#> <quosure>
#> expr: ^map_int(x, function(df) summarise(df, n = n())[["n"]])
#> env: global
#> # A tibble: 3 x 6
#> x n1 n5 n2 n3 n4
#> <list> <int> <int> <int> <int> <int>
#> 1 <tibble [2 脳 1]> 2 2 3 3 3
#> 2 <tibble [3 脳 1]> 3 3 3 3 3
#> 3 <tibble [4 脳 1]> 4 4 3 3 3
Created on 2018-03-05 by the reprex package (v0.2.0).
I finally got it: The n() is the problem here, when used with nested data manipulation verbs.
Indeed. hybrid simplification is too eager. With this debugging:
Call GroupedHybridCall::simplify(const SlicingIndex& indices) const {
set_indices(indices);
Call call = clone(original_call);
Rprintf( "call: ") ; Rf_PrintValue(call) ;
while (simplified(call)) {}
clear_indices();
Rprintf( "simplified call: " ) ; Rf_PrintValue(call) ;
Rprintf("--------\n");
return call;
}
I get:
> library(dplyr)
> mutate( df,
+ n1 = x %>% map_int(nrows),
+ n5 = map_int(x, nrows_magrittr_lambda),
+
+ n2 = x %>% map_int(. %>% summarise(n = n()) %>% .[["n"]]),
+ n3 = map_int(x, ~ summarise(., n = n())[["n"]]),
+ n4 = map_int(x, function(df) summarise(df, n = n())[["n"]])
+ )
call: x %>% map_int(nrows)
simplified call: x %>% map_int(nrows)
--------
call: map_int(x, nrows_magrittr_lambda)
simplified call: map_int(x, nrows_magrittr_lambda)
--------
call: x %>% map_int(. %>% summarise(n = n()) %>% .[["n"]])
simplified call: x %>% map_int(. %>% summarise(n = 3L) %>% .[["n"]])
--------
call: map_int(x, ~summarise(., n = n())[["n"]])
simplified call: map_int(x, ~summarise(., n = 3L)[["n"]])
--------
call: map_int(x, function(df) summarise(df, n = n())[["n"]])
simplified call: map_int(x, function(df) summarise(df, n = 3L)[["n"]])
--------
# A tibble: 3 x 6
x n1 n5 n2 n3 n4
<list> <int> <int> <int> <int> <int>
1 <tibble [2 脳 1]> 2 2 3 3 3
2 <tibble [3 脳 1]> 3 3 3 3 3
3 <tibble [4 脳 1]> 4 4 3 3 3
This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/