Dplyr: Integer not automatically coerced to double

Created on 7 Jun 2016  路  14Comments  路  Source: tidyverse/dplyr

This might be a won't fix, but I thought I'd float the balloon anyway.

Consider:

library(dplyr)
data_frame(
  id = c(1,2,3,4,5,6),
  value = c(1L,2L,3L,NA,NA,NA),
  group = c("A","A","A","B","B","B")
) %>% 
  group_by(group) %>%
  mutate(value = ifelse(is.na(value),0,value))

Which fails.

Versus:

data_frame(
  id = c(1,2,3,4,5,6),
  value = c(1L,2L,3L,4,NA,NA),
  group = c("A","A","A","B","B","B")
) %>% 
  mutate(value = ifelse(is.na(value),0,value)) %>%
  group_by(group)

Which does not fail.

As a matter of linguistic parsing one might not expect different outcomes based on ordering. Is there a reason dplyr can't coerce the returns from the group_by operation to share in the most general common data type (as done elsewhere in R) so that both functional orders yield successful results?

bug

All 14 comments

Thanks, confirmed. Works for me if I replace 0 by 0L in the failing example. Would you like to contribute a testthat test?

@krlmlr The test would be added to test-group-by-r.r, correct?

Is there a reason dplyr can't coerce the returns from the group_by operation to share in the most general common data type (as done elsewhere in R) so that both functional orders yield successful results?

Yes, if a grouped mutate uses different return types it's safer to throw an error; if this is really desired behavior the user can coerce. The only exception we'd like to allow is integer vs. double, and logical NAs, see also hadley/vctrs#7 and dplyr::if_else().

Test location looks good to me.

Also note that grouped mutate does not current coerce factor + character -> character (WARN) as expected. See PR #2249 for a demonstrative test.

Here's a simpler reprex, along with a suggestion that the code path is different for summarise():

library(dplyr, warn.conflicts = FALSE)

df <- tibble(
  value = c(1L, NA),
  group = c("A", "B")
) 

df %>% 
  group_by(group) %>%
  mutate(value = if (is.na(value)) 0 else 1L)
#> Error in mutate_impl(.data, dots): incompatible types, expecting a integer vector

df %>% 
  group_by(group) %>%
  summarise(value = if (is.na(value)) 0 else 1L)
#> # A tibble: 2 脳 2
#>   group value
#>   <chr> <dbl>
#> 1     A     1
#> 2     B     0

Another reprex:

df <- tibble(
  g = c(1, 2, 2), 
  x = c(1L, 2L, 3L)
)

df %>%
  group_by(g) %>%
  mutate(median(x))

@zeehio any ideas what's going on here? Despite my previous assertion, I'd really like to fix this bug for the next release.

This issue goes through a different code path than the Collecter.h case I was working at.

The error is raised here https://github.com/hadley/dplyr/blob/61f77bc1d2dbf496cf5b72a750a24f134d506a45/inst/include/dplyr/Gatherer.h#L81

The wrong underlying assumption in Gatherer.h is that the type returned by (in this reprex case) median(x) will be always the same for all the groups. This bug has never been fully addressed in the past https://github.com/hadley/dplyr/issues/489#issuecomment-52693935 as the solution proposed there was to fix whatever was inside mutate to always return the same type, and I feel that is a just workaround.

I think the best solution would be to use the Collecter.h code path inside gatherer. I can try to commit a pull request during the following two weeks if that is fine for you (I am quite busy this week).

Yes, that would be fantastic! It would fix a whole class of bugs.

Current solution (dplyr-0.5.0 and master):

Implementation

  1. Take the first group in the data frame for which the result of the mutate operation returns non-NA.
  2. Create a vector of the type of the result filled with NA with length equal to the total number of rows. (If it is a list, create a list of that length).
  3. Store the computed result of the first group in the vector.
  4. For each of the pending data frame groups, compute their result and store it in the vector (or list).
  5. The vector (or list) is the final result of the mutate operation.

Assumptions of the current algorithm:

  • The result of applying the mutate operation to any group can be NA.
  • If the result is not NA, then it can be a list or a vector.
  • All the non-NA results must be of the exact same type.

General case assumptions:

  • Each result of applying the mutate operation to a group can be of a different type.
  • If all the result are combinable and of consistent length, the final output may be a vector column, otherwise it needs to be a list column.

Possible general case implementation

  1. We compute all the groups storing the elements in a List.
  2. We combine_all() the list if possible. If it is possible, return a vector column, otherwise return a list column.

Issue with this general case

  • We need to compute all the results in advance to know whether or not they are combinable. Therefore we would need to store all the results in a List and use combine_all to combine them. This would require storing an extra copy of the results and may be not-desirable.

Tradeoff solution proposal

Assumptions

  • Each result of applying the mutate operation to a group can be of a different type. The only restriction is that either all the result are of type list (and then we return a list-column) or all the result are not a list (and then we combine them into a vector column or raise an error).

Advantage with respect to the general case:

  • We decide after computing the result for the first group whether we follow the vector path or the list path, avoiding the extra copy.

Advantage with respect to the current implementation:

  • Coercions and promotions will be done through the Collecter implementation.

@hadley if the tradeoff solution is good for you I will go with it.

@zeehio in the tradeoff solution, could you spell out in a bit more detail what happens if the types are different in different groups?

Yes, we will use the same rules as in combine():

  • Classes unknown to dplyr are stripped (known classes: POSIXct, factor, Date, AsIs, integer64, table) (as in #2425)
  • NA values are accepted
  • integers are promoted to numerics if needed
  • factors and characters are coerced to characters
  • factors with different levels are coerced to characters

Ok, I don't quite see how this will work under the hood, but it sounds reasonable, and I'd love to review a PR 馃榿

@hadley I give you two PR instead of one. The second one includes the commits of the first one, because it depends on it, but after the first PR is merged the second PR will consist only of a single commit.

It was easier than expected to make it work, although my C++ and Rcpp still need some practice. If you find things done in a weird way, assume my Rcpp skills are not good and ask.

Was this page helpful?
0 / 5 - 0 ratings