This might be a won't fix, but I thought I'd float the balloon anyway.
Consider:
library(dplyr)
data_frame(
id = c(1,2,3,4,5,6),
value = c(1L,2L,3L,NA,NA,NA),
group = c("A","A","A","B","B","B")
) %>%
group_by(group) %>%
mutate(value = ifelse(is.na(value),0,value))
Which fails.
Versus:
data_frame(
id = c(1,2,3,4,5,6),
value = c(1L,2L,3L,4,NA,NA),
group = c("A","A","A","B","B","B")
) %>%
mutate(value = ifelse(is.na(value),0,value)) %>%
group_by(group)
Which does not fail.
As a matter of linguistic parsing one might not expect different outcomes based on ordering. Is there a reason dplyr can't coerce the returns from the group_by operation to share in the most general common data type (as done elsewhere in R) so that both functional orders yield successful results?
Thanks, confirmed. Works for me if I replace 0 by 0L in the failing example. Would you like to contribute a testthat test?
@krlmlr The test would be added to test-group-by-r.r, correct?
Is there a reason dplyr can't coerce the returns from the group_by operation to share in the most general common data type (as done elsewhere in R) so that both functional orders yield successful results?
Yes, if a grouped mutate uses different return types it's safer to throw an error; if this is really desired behavior the user can coerce. The only exception we'd like to allow is integer vs. double, and logical NAs, see also hadley/vctrs#7 and dplyr::if_else().
Test location looks good to me.
Also note that grouped mutate does not current coerce factor + character -> character (WARN) as expected. See PR #2249 for a demonstrative test.
Here's a simpler reprex, along with a suggestion that the code path is different for summarise():
library(dplyr, warn.conflicts = FALSE)
df <- tibble(
value = c(1L, NA),
group = c("A", "B")
)
df %>%
group_by(group) %>%
mutate(value = if (is.na(value)) 0 else 1L)
#> Error in mutate_impl(.data, dots): incompatible types, expecting a integer vector
df %>%
group_by(group) %>%
summarise(value = if (is.na(value)) 0 else 1L)
#> # A tibble: 2 脳 2
#> group value
#> <chr> <dbl>
#> 1 A 1
#> 2 B 0
Another reprex:
df <- tibble(
g = c(1, 2, 2),
x = c(1L, 2L, 3L)
)
df %>%
group_by(g) %>%
mutate(median(x))
@zeehio any ideas what's going on here? Despite my previous assertion, I'd really like to fix this bug for the next release.
This issue goes through a different code path than the Collecter.h case I was working at.
The error is raised here https://github.com/hadley/dplyr/blob/61f77bc1d2dbf496cf5b72a750a24f134d506a45/inst/include/dplyr/Gatherer.h#L81
The wrong underlying assumption in Gatherer.h is that the type returned by (in this reprex case) median(x) will be always the same for all the groups. This bug has never been fully addressed in the past https://github.com/hadley/dplyr/issues/489#issuecomment-52693935 as the solution proposed there was to fix whatever was inside mutate to always return the same type, and I feel that is a just workaround.
I think the best solution would be to use the Collecter.h code path inside gatherer. I can try to commit a pull request during the following two weeks if that is fine for you (I am quite busy this week).
Yes, that would be fantastic! It would fix a whole class of bugs.
result of the mutate operation returns non-NA.result filled with NA with length equal to the total number of rows. (If it is a list, create a list of that length).result of the first group in the vector.result and store it in the vector (or list).result of applying the mutate operation to any group can be NA.result is not NA, then it can be a list or a vector.NA results must be of the exact same type.result of applying the mutate operation to a group can be of a different type.result are combinable and of consistent length, the final output may be a vector column, otherwise it needs to be a list column.List.combine_all() the list if possible. If it is possible, return a vector column, otherwise return a list column.List and use combine_all to combine them. This would require storing an extra copy of the results and may be not-desirable.result of applying the mutate operation to a group can be of a different type. The only restriction is that either all the result are of type list (and then we return a list-column) or all the result are not a list (and then we combine them into a vector column or raise an error).Advantage with respect to the general case:
result for the first group whether we follow the vector path or the list path, avoiding the extra copy.Advantage with respect to the current implementation:
Collecter implementation.@hadley if the tradeoff solution is good for you I will go with it.
@zeehio in the tradeoff solution, could you spell out in a bit more detail what happens if the types are different in different groups?
Yes, we will use the same rules as in combine():
Ok, I don't quite see how this will work under the hood, but it sounds reasonable, and I'd love to review a PR 馃榿
@hadley I give you two PR instead of one. The second one includes the commits of the first one, because it depends on it, but after the first PR is merged the second PR will consist only of a single commit.
It was easier than expected to make it work, although my C++ and Rcpp still need some practice. If you find things done in a weird way, assume my Rcpp skills are not good and ask.