Dplyr: Using n() in nested mutate()/summarize() calls gives unexpected results

Created on 19 Aug 2016 · 11 Comments · Source: tidyverse/dplyr

When transitioning from the by_row() approach to map(), I found that several dplyr/purrr/tidyr functions do not evaluate within the map() environment. For instance, in the example below I expected the value returned by n() in the map() version to match that of the by_row() version. Instead, it returns the number of rows of the nested input Temp. This might be intended, but I can't think of an obvious way to use dplyr::n() on nested tibbles via map().

library(dplyr); library(purrr); library(tidyr)
data(iris)

Temp <- iris %>%
  group_by(Species) %>%
  nest()

ByRow <- by_row(Temp, function(x){
  x$data[[1]] %>%
    filter(Sepal.Length >= 6) %>%
    summarise(petal_length_avg = mean(Petal.Length),
              obs = n())
}, .to = 'test')

ByRow$test

Map <- mutate(Temp, test = map(data, . %>%
                                 filter(Sepal.Length >= 6) %>%
                                 summarise(petal_length_avg = mean(Petal.Length),
                                           obs = n())))
Map$test

All 11 comments

Thanks. What is the expected result? Does it work better with #2190?

devtools::install_github("hadley/dplyr#2190")

@hadley: I don't understand the issue here.

I think this is a better example:

library(dplyr)
library(purrr)

df <- tibble(x = list(
  tibble(y = 1:2),
  tibble(y = 1:3),
  tibble(y = 1:4)
))

nrows <- function(df) {
  df %>% summarise(n = n()) %>% .[["n"]]
}

df %>%
  mutate(
    n1 = x %>% map_int(nrows),
    n2 = x %>% map_int(. %>% summarise(n = n()) %>% .[["n"]])
  )
#> # A tibble: 3 × 3
#>                  x    n1    n2
#>             <list> <int> <int>
#> 1 <tibble [2 × 1]>     2     3
#> 2 <tibble [3 × 1]>     3     3
#> 3 <tibble [4 × 1]>     4     3

n1 and n2 should be equivalent, but aren't, presumably because dplyr and purrr (or magrittr?) are fighting over what . means.

Same behavior with #2190.

This looks like overly eager substitution by hybrid evaluation. It walks the expression, encounters n(), and replaces it with the number of rows of the outer data frame (3 here), so the call is effectively evaluated as:

df %>%
  mutate(
    n1 = x %>% map_int(nrows),
    n2 = x %>% map_int(. %>% summarise(n = 3) %>% .[["n"]])
  )

I don't know how to fix this, short of disabling hybrid evaluation if the expression cannot be evaluated in full by the hybrid evaluator. But this will break e.g. df %>% group_by(id) %>% summarize(n() - 1), unless we implement a regular version of n().
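The substitution described above can be illustrated with a toy model in base R. This is only a sketch: dplyr's hybrid evaluator is implemented in C++, and `replace_n` is a hypothetical helper written for this illustration, not a dplyr function. It walks a quoted expression and replaces every `n()` it finds, at any nesting depth, with a fixed row count.

```r
# Toy model only: not dplyr's actual hybrid evaluator.
# `replace_n` is a hypothetical helper, not part of any package.
replace_n <- function(expr, n_rows) {
  if (!is.call(expr)) {
    return(expr)                      # symbols and constants are left alone
  }
  if (identical(expr, quote(n()))) {
    return(n_rows)                    # n() becomes a literal row count
  }
  # Recurse into every element of the call, however deeply nested.
  as.call(lapply(as.list(expr), replace_n, n_rows = n_rows))
}

outer_rows <- 3L  # number of rows of the outer data frame in the example
expr <- quote(map_int(x, ~ summarise(., n = n())[["n"]]))
simplified <- replace_n(expr, outer_rows)
print(simplified)
```

Because the walk does not know that the inner `n()` belongs to a different `summarise()` call on a different data frame, the inner call ends up with the outer row count baked in.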

@hadley: Please advise.

Ah of course. In that case, n3 and n4 below are more worrying:

df %>%
  mutate(
    n1 = x %>% map_int(nrows),
    n2 = x %>% map_int(. %>% summarise(n = n()) %>% .[["n"]]),
    n3 = map_int(x, ~ summarise(., n = n())[["n"]]),
    n4 = map_int(x, function(df) summarise(df, n = n())[["n"]])
  )
#> # A tibble: 3 × 5
#>                  x    n1    n2    n3    n4
#>             <list> <int> <int> <int> <int>
#> 1 <tibble [2 × 1]>     2     3     3     3
#> 2 <tibble [3 × 1]>     3     3     3     3
#> 3 <tibble [4 × 1]>     4     3     3     3

I think we can probably leave off resolving this until the next version? I suspect it will require more re-thinking about how the hybrid evaluator works.

Adding to this issue: is there any reason why output1 is broken, but not output2 or output3?

library(tidyverse)
temp <- data_frame(id = 1:5)

temp$data <- 1:5 %>% map(~{
   set.seed(.x)
   data_frame(group = sample(1:2, size = .x, replace = TRUE))
 })
temp

### output1 returns n= 5 everywhere
output1 <- temp %>% mutate(dd = map(data, function(X) {
   X %>% 
     group_by(group) %>%
     summarise(n = n())
 })) %>% unnest(dd) 
output1

### output2 returns the correct n
output2 <-  temp %>% mutate(dd = map(data, function(X) {
    X %>% 
     count(group)
  })) %>% unnest(dd)  


### output3  returns the correct n
ffs <- function(X){
    X %>% 
      group_by(group) %>%
      summarise(n = n()) 
  }
output3 <- temp %>% mutate(dd = map(data, ffs)) %>% unnest(dd)
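One reading consistent with the hybrid-evaluation explanation earlier in the thread: in output1 the n() call appears literally inside the expression passed to mutate(), so it is visible to the substitution, whereas in output3 the outer expression only contains the name ffs. A minimal sketch of that workaround for affected versions follows, assuming the df example from above; `count_rows` is a hypothetical helper name.

```r
library(dplyr)
library(purrr)

df <- tibble(x = list(
  tibble(y = 1:2),
  tibble(y = 1:3),
  tibble(y = 1:4)
))

# Defined outside mutate(), so the outer call only contains the name
# `count_rows` (a hypothetical helper), not a literal n().
count_rows <- function(d) {
  summarise(d, n = n())[["n"]]
}

result <- df %>%
  mutate(
    n_named = map_int(x, count_rows),  # n() hidden behind a function name
    n_base  = map_int(x, nrow)         # base R, no n() involved at all
  )

result
```

Both columns should give the per-tibble row counts (2, 3, 4). The nrow() variant avoids n() entirely, which may be the simpler choice when only a row count is needed.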

I'll add this version to the mix, which creates the magrittr lambda outside of the mutate():

nrows_magrittr_lambda <- . %>% summarise(n = n()) %>% .[["n"]]

I've added the hybrid label.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(purrr)

df <- tibble(x = list(
  tibble(y = 1:2),
  tibble(y = 1:3),
  tibble(y = 1:4)
))

nrows <- function(df) {
  df %>% summarise(n = n()) %>% .[["n"]]
}

nrows_magrittr_lambda <- . %>% summarise(n = n()) %>% .[["n"]]

trace( dplyr:::mutate.tbl_df, tracer = quote(print(dots)), at = 3 )
#> Tracing function "mutate.tbl_df" in package "dplyr
#> (not-exported)"
#> [1] "mutate.tbl_df"
mutate( df,
        n1 = x %>% map_int(nrows),
        n5 = map_int(x, nrows_magrittr_lambda),

        n2 = x %>% map_int(. %>% summarise(n = n()) %>% .[["n"]]),
        n3 = map_int(x, ~ summarise(., n = n())[["n"]]),
        n4 = map_int(x, function(df) summarise(df, n = n())[["n"]])
)
#> Tracing mutate.tbl_df(df, n1 = x %>% map_int(nrows), n5 = map_int(x,  .... step 3 
#> $n1
#> <quosure>
#>   expr: ^x %>% map_int(nrows)
#>   env:  global
#> 
#> $n5
#> <quosure>
#>   expr: ^map_int(x, nrows_magrittr_lambda)
#>   env:  global
#> 
#> $n2
#> <quosure>
#>   expr: ^x %>% map_int(. %>% summarise(n = n()) %>% .[["n"]])
#>   env:  global
#> 
#> $n3
#> <quosure>
#>   expr: ^map_int(x, ~summarise(., n = n())[["n"]])
#>   env:  global
#> 
#> $n4
#> <quosure>
#>   expr: ^map_int(x, function(df) summarise(df, n = n())[["n"]])
#>   env:  global
#> # A tibble: 3 x 6
#>   x                   n1    n5    n2    n3    n4
#>   <list>           <int> <int> <int> <int> <int>
#> 1 <tibble [2 × 1]>     2     2     3     3     3
#> 2 <tibble [3 × 1]>     3     3     3     3     3
#> 3 <tibble [4 × 1]>     4     4     3     3     3

Created on 2018-03-05 by the reprex package (v0.2.0).

I finally got it: The n() is the problem here, when used with nested data manipulation verbs.

Indeed, hybrid simplification is too eager. With this debugging code added:

Call GroupedHybridCall::simplify(const SlicingIndex& indices) const {
  set_indices(indices);
  Call call = clone(original_call);
  Rprintf( "call: ") ; Rf_PrintValue(call) ;
  while (simplified(call)) {}
  clear_indices();
  Rprintf( "simplified call: " ) ; Rf_PrintValue(call) ;
  Rprintf("--------\n");

  return call;
}

I get:

> library(dplyr)
> mutate( df,
+         n1 = x %>% map_int(nrows),
+         n5 = map_int(x, nrows_magrittr_lambda),
+ 
+         n2 = x %>% map_int(. %>% summarise(n = n()) %>% .[["n"]]),
+         n3 = map_int(x, ~ summarise(., n = n())[["n"]]),
+         n4 = map_int(x, function(df) summarise(df, n = n())[["n"]])
+ )
call: x %>% map_int(nrows)
simplified call: x %>% map_int(nrows)
--------
call: map_int(x, nrows_magrittr_lambda)
simplified call: map_int(x, nrows_magrittr_lambda)
--------
call: x %>% map_int(. %>% summarise(n = n()) %>% .[["n"]])
simplified call: x %>% map_int(. %>% summarise(n = 3L) %>% .[["n"]])
--------
call: map_int(x, ~summarise(., n = n())[["n"]])
simplified call: map_int(x, ~summarise(., n = 3L)[["n"]])
--------
call: map_int(x, function(df) summarise(df, n = n())[["n"]])
simplified call: map_int(x, function(df) summarise(df, n = 3L)[["n"]])
--------
# A tibble: 3 x 6
  x                   n1    n5    n2    n3    n4
  <list>           <int> <int> <int> <int> <int>
1 <tibble [2 × 1]>     2     2     3     3     3
2 <tibble [3 × 1]>     3     3     3     3     3
3 <tibble [4 × 1]>     4     4     3     3     3

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/
