Dplyr: Integer not automatically coerced to double

Created on 7 Jun 2016 · 14Comments · Source: tidyverse/dplyr

This might be a won't fix, but I thought I'd float the balloon anyway.

Consider:

library(dplyr)
data_frame(
  id = c(1,2,3,4,5,6),
  value = c(1L,2L,3L,NA,NA,NA),
  group = c("A","A","A","B","B","B")
) %>% 
  group_by(group) %>%
  mutate(value = ifelse(is.na(value),0,value))

Which fails.

Versus:

data_frame(
  id = c(1,2,3,4,5,6),
  value = c(1L,2L,3L,4,NA,NA),
  group = c("A","A","A","B","B","B")
) %>% 
  mutate(value = ifelse(is.na(value),0,value)) %>%
  group_by(group)

Which does not fail.

As a matter of linguistic parsing one might not expect different outcomes based on ordering. Is there a reason dplyr can't coerce the returns from the group_by operation to share in the most general common data type (as done elsewhere in R) so that both functional orders yield successful results?

bug

Source

russellpierce

All 14 comments

Thanks, confirmed. Works for me if I replace 0 by 0L in the failing example. Would you like to contribute a testthat test?

krlmlr on 8 Nov 2016

@krlmlr The test would be added to test-group-by-r.r, correct?

russellpierce on 11 Nov 2016

Is there a reason dplyr can't coerce the returns from the group_by operation to share in the most general common data type (as done elsewhere in R) so that both functional orders yield successful results?

Yes, if a grouped mutate uses different return types it's safer to throw an error; if this is really desired behavior the user can coerce. The only exception we'd like to allow is integer vs. double, and logical NAs, see also hadley/vctrs#7 and dplyr::if_else().

Test location looks good to me.

krlmlr on 11 Nov 2016

Also note that grouped mutate does not current coerce factor + character -> character (WARN) as expected. See PR #2249 for a demonstrative test.

russellpierce on 12 Nov 2016

Here's a simpler reprex, along with a suggestion that the code path is different for summarise():

library(dplyr, warn.conflicts = FALSE)

df <- tibble(
  value = c(1L, NA),
  group = c("A", "B")
) 

df %>% 
  group_by(group) %>%
  mutate(value = if (is.na(value)) 0 else 1L)
#> Error in mutate_impl(.data, dots): incompatible types, expecting a integer vector

df %>% 
  group_by(group) %>%
  summarise(value = if (is.na(value)) 0 else 1L)
#> # A tibble: 2 × 2
#>   group value
#>   <chr> <dbl>
#> 1     A     1
#> 2     B     0

hadley on 22 Feb 2017

Another reprex:

df <- tibble(
  g = c(1, 2, 2), 
  x = c(1L, 2L, 3L)
)

df %>%
  group_by(g) %>%
  mutate(median(x))

hadley on 27 Feb 2017

@zeehio any ideas what's going on here? Despite my previous assertion, I'd really like to fix this bug for the next release.

hadley on 27 Feb 2017

This issue goes through a different code path than the Collecter.h case I was working at.

The error is raised here https://github.com/hadley/dplyr/blob/61f77bc1d2dbf496cf5b72a750a24f134d506a45/inst/include/dplyr/Gatherer.h#L81

The wrong underlying assumption in Gatherer.h is that the type returned by (in this reprex case) median(x) will be always the same for all the groups. This bug has never been fully addressed in the past https://github.com/hadley/dplyr/issues/489#issuecomment-52693935 as the solution proposed there was to fix whatever was inside mutate to always return the same type, and I feel that is a just workaround.

I think the best solution would be to use the Collecter.h code path inside gatherer. I can try to commit a pull request during the following two weeks if that is fine for you (I am quite busy this week).

zeehio on 27 Feb 2017

👍1

Yes, that would be fantastic! It would fix a whole class of bugs.

hadley on 27 Feb 2017

👍1

Current solution (dplyr-0.5.0 and master):

Implementation

Take the first group in the data frame for which the result of the mutate operation returns non-NA.
Create a vector of the type of the result filled with NA with length equal to the total number of rows. (If it is a list, create a list of that length).
Store the computed result of the first group in the vector.
For each of the pending data frame groups, compute their result and store it in the vector (or list).
The vector (or list) is the final result of the mutate operation.

Assumptions of the current algorithm:

The result of applying the mutate operation to any group can be NA.
If the result is not NA, then it can be a list or a vector.
All the non-NA results must be of the exact same type.

General case assumptions:

Each result of applying the mutate operation to a group can be of a different type.
If all the result are combinable and of consistent length, the final output may be a vector column, otherwise it needs to be a list column.

Possible general case implementation

We compute all the groups storing the elements in a List.
We combine_all() the list if possible. If it is possible, return a vector column, otherwise return a list column.

Issue with this general case

We need to compute all the results in advance to know whether or not they are combinable. Therefore we would need to store all the results in a List and use combine_all to combine them. This would require storing an extra copy of the results and may be not-desirable.

Tradeoff solution proposal

Assumptions

Each result of applying the mutate operation to a group can be of a different type. The only restriction is that either all the result are of type list (and then we return a list-column) or all the result are not a list (and then we combine them into a vector column or raise an error).

Advantage with respect to the general case:

We decide after computing the result for the first group whether we follow the vector path or the list path, avoiding the extra copy.

Advantage with respect to the current implementation:

Coercions and promotions will be done through the Collecter implementation.

@hadley if the tradeoff solution is good for you I will go with it.

zeehio on 28 Feb 2017

@zeehio in the tradeoff solution, could you spell out in a bit more detail what happens if the types are different in different groups?

hadley on 28 Feb 2017

Yes, we will use the same rules as in combine():

Classes unknown to dplyr are stripped (known classes: POSIXct, factor, Date, AsIs, integer64, table) (as in #2425)
NA values are accepted
integers are promoted to numerics if needed
factors and characters are coerced to characters
factors with different levels are coerced to characters

zeehio on 28 Feb 2017

Ok, I don't quite see how this will work under the hood, but it sounds reasonable, and I'd love to review a PR 😁

hadley on 28 Feb 2017

@hadley I give you two PR instead of one. The second one includes the commits of the first one, because it depends on it, but after the first PR is merged the second PR will consist only of a single commit.

It was easier than expected to make it work, although my C++ and Rcpp still need some practice. If you find things done in a weird way, assume my Rcpp skills are not good and ask.

zeehio on 2 Mar 2017

Was this page helpful?

0 / 5 - 0 ratings