Across a large amount of code in both my PKNCA package and my data analysis projects, I assume that groups have at least one row. I understand that the new behavior is by design, but the documentation does not suggest any method to get back to the prior behavior of dropping empty groups. I typically use filter() to remove groups not of interest and group_by() to enable various analyses on the groups that have data. The only method I can see to get back to the prior behavior is to convert every factor to a character and then convert back at the end, which is fairly intensive because I have to track which columns have been modified and need to be modified back.
For most of my use cases, this makes using dplyr with factor grouping very difficult. A common use case for me has treatments and subjects in clinical trials described by ordered and unordered factors, respectively (reprex at the bottom).
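For concreteness, the character round-trip workaround described above looks something like the following sketch (using the d from the reprex at the bottom; tracking each column's levels and orderedness is the painful part):
library(dplyr)
# Record each factor column's levels and ordered status before grouping.
fct_cols <- names(d)[sapply(d, is.factor)]
fct_info <- lapply(d[fct_cols], function(x) list(levels=levels(x), ordered=is.ordered(x)))
result <-
  d %>%
  mutate_if(is.factor, as.character) %>%
  group_by(A, B) %>%
  summarize(Mean=mean(C)) %>%
  ungroup()
# Convert the modified columns back to factors with their original attributes.
for (nm in intersect(fct_cols, names(result))) {
  result[[nm]] <- factor(result[[nm]], levels=fct_info[[nm]]$levels, ordered=fct_info[[nm]]$ordered)
}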
Is there any method to revert to the original behaviour? Specifically, in #341, drop = FALSE was suggested, but if it exists, it's not documented in the help page for group_by(). The .preserve argument of filter() doesn't suggest that it does the same thing, but it looks close. The documentation of the drop argument for grouped_df() simply says "deprecated", suggesting it's not the right fix.
The compatibility vignette (https://github.com/tidyverse/dplyr/blob/master/vignettes/compatibility.Rmd) appears to be the place something like this should be documented, but I don't see it there either.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
d <-
  data.frame(
    A=factor(
      rep(c("10 mg", "20 mg", "100 mg"), 4),
      levels=c("10 mg", "20 mg", "100 mg"),
      ordered=TRUE
    ),
    B=rep(factor(1:6), 2),
    C=1:12
  )
d %>%
  group_by(A, B) %>%
  summarize(
    Mean=mean(C)
  )
#> # A tibble: 18 x 3
#> # Groups: A [3]
#> A B Mean
#> <ord> <fct> <dbl>
#> 1 10 mg 1 4
#> 2 10 mg 2 NaN
#> 3 10 mg 3 NaN
#> 4 10 mg 4 7
#> 5 10 mg 5 NaN
#> 6 10 mg 6 NaN
#> 7 20 mg 1 NaN
#> 8 20 mg 2 5
#> 9 20 mg 3 NaN
#> 10 20 mg 4 NaN
#> 11 20 mg 5 8
#> 12 20 mg 6 NaN
#> 13 100 mg 1 NaN
#> 14 100 mg 2 NaN
#> 15 100 mg 3 6
#> 16 100 mg 4 NaN
#> 17 100 mg 5 NaN
#> 18 100 mg 6 9
Created on 2018-12-31 by the reprex package (v0.2.0).
From reading the code, I think there is no drop argument... Converting to character was an option from the discussion in #341.
Is filtering after the operation not enough, or as simple as desired?
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
packageVersion("dplyr")
#> [1] '0.7.99.9000'
d <-
  data.frame(
    A=factor(
      rep(c("10 mg", "20 mg", "100 mg"), 4),
      levels=c("10 mg", "20 mg", "100 mg"),
      ordered=TRUE
    ),
    B=rep(factor(1:6), 2),
    C=1:12
  )
d %>%
  group_by(A, B) %>%
  summarize(
    Mean=mean(C)
  ) %>%
  filter(!is.nan(Mean))
#> # A tibble: 6 x 3
#> # Groups: A [3]
#> A B Mean
#> <ord> <fct> <dbl>
#> 1 10 mg 1 4
#> 2 10 mg 4 7
#> 3 20 mg 2 5
#> 4 20 mg 5 8
#> 5 100 mg 3 6
#> 6 100 mg 6 9
Created on 2018-12-31 by the reprex package (v0.2.1)
If we consider the previous behaviour incorrect because it automatically filtered out groups that were not in the data, causing missing data, then manually filtering when needed, now that groups are preserved, seems like the right move.
I believe that custom functions that assume a non-empty group could return NaN (or another value) if the group is empty, so that it is possible to filter afterwards.
What do you think?
If this is the only method for getting close to the prior behavior, my request is that it be documented (in a way that doesn't require reading the code).
Converting to character is mostly reasonable for unordered factors, but it loses information for ordered factors. Sometimes setting the level order is nontrivial. (All of these are possible, but they would need extra work.)
I have a difference of opinion about the previous behavior being incorrect; I think the two behaviors simply fill different needs. The prior method fills the need for a summary of each group where data exists, when all combinations are not expected (this is my common use case described above), and the new behavior helps elucidate missing groups when all combinations are expected. Previously, when I needed the latter, I integrated explicit missing rows, but I acknowledge that solution isn't right or best for everyone. I'm not the package author, so I'm not advocating that my opinion must be followed-- I'm trying to understand the new behavior and how to work with it or around it.
Since my most common use case is that all combinations are not expected, I would need to rewrite all R functions used for summarization and then filter based on the absence of data. Also, since NA and NaN can each legitimately exist within my data, I would need to add a number-of-observations column to filter on instead. This is possible but onerous:
min_maybe_none <- function(x, ...) {
  UseMethod("min_maybe_none")
}
min_maybe_none.numeric <- function(x, ...) {
  if (length(x) == 0) {
    NA_real_
  } else {
    min(x, ...)
  }
}
# Repeat for logical, factor, character so that you get the correct type of NA output for each class.
# Repeat all of the above for max, median, mean, sd, ...
I would need to do this because I often have >50 groups, and with standard warning reporting I would not see unexpected warnings; the standard reporting would just show the warnings about empty groups.
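As a sketch, the number-of-observations approach would look something like this (assuming n() returns 0 for the empty groups that the new behaviour preserves):
d %>%
  group_by(A, B) %>%
  summarize(
    N=n(),            # 0 for an empty group
    Mean=mean(C)
  ) %>%
  filter(N > 0)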
I had a few ideas overnight of how the prior behavior could be achieved. I will reiterate the ones above in addition to the new ideas so that the list can be (semi-)comprehensive.
1. Convert factors to characters and back (e.g., with mutate_if(is.factor, as.character)). Con: Information is lost-- especially for ordered factors or factors where not all levels are represented in the current data.
2. Add an argument to group_by() (like the originally suggested drop). Pro: Simple for users. Con: Appears to have been considered and not preferred by the dplyr authors. (I've not read the full threads around the rationale, but to be clear, I'm not advocating for this as it has already been considered. I'm just trying to be complete.)
3. Add a per-verb argument like .skip_empty. (Suggestion from @yutannihilation below.) Pro: Simple to understand; maintains the goal of one interpretation of factors as groups. Con: More complex to maintain, as every verb would require the argument, if it would apply to the verb (in and out of dplyr).
Conversion to character strings and back (option 1) could be simplified by implementation of something like tidyverse/forcats#144 and maybe a data.frame mirroring method like the following (typed directly into GitHub; there could be bugs):
fct_mirror.data.frame <- function(.f, .x) {
  # Find factor columns in the original data that are also in the current data
  matching_cols <- intersect(colnames(.f), colnames(.x)[sapply(X=.x, FUN=is.factor)])
  for (current_col in matching_cols) {
    if (!is.factor(.f[[current_col]])) {
      .f[[current_col]] <- fct_mirror(.f[[current_col]], .x[[current_col]])
    }
  }
  .f
}
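For reference, the single-vector fct_mirror() assumed above is hypothetical (tidyverse/forcats#144 was only a proposal); it might look something like:
fct_mirror <- function(.f, .x) {
  UseMethod("fct_mirror")
}
fct_mirror.default <- function(.f, .x) {
  # Rebuild .f as a factor with the levels and ordered status of the template .x
  factor(.f, levels=levels(.x), ordered=is.ordered(.x))
}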
For 2, the simplest approach would be: if grouping of factors is detected by something simple like is.factor(), then make a new class called something like hidden_factor, which inherits from factor and whose only methods are:
as.hidden_factor <- function(x) {
  UseMethod("as.hidden_factor")
}
as.hidden_factor.factor <- function(x) {
  class(x) <- unique(c("hidden_factor", class(x)))
  x
}
as.hidden_factor.default <- function(x) {
  message("No need to convert x to a hidden_factor")
  x
}
# NB: base::is.factor() is not an S3 generic, so a method like this only takes
# effect if the factor detection dispatches on class.
is.factor.hidden_factor <- function(x) FALSE
as.factor.hidden_factor <- function(x) {
  class(x) <- setdiff(class(x), "hidden_factor")
  x
}
That would break some other applications looking for factors, but it or something similar could result in the same behaviour as the prior version.
Is it possible to implement an option for summarise() to skip empty groups? After reading @cderv's comment, I now feel this issue is not about group_by() but rather about summarise(), in a sense:
I believe that custom functions that assume a non-empty group could return NaN (or another value) if the group is empty so that it is possible to filter afterwards.
The empty groups don't become a problem when mutate()d or filter()ed, because there are no rows from empty groups in the data (accordingly, the behaviours are basically the same as in 0.7.8). But summarise() embodies the empty groups and mixes them with the non-empty groups. So, after summarising, it requires a lot of heuristics to detect which rows came from empty groups.
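A quick illustration of that point with the d from the reprexes above (a sketch):
d %>%
  group_by(A, B) %>%
  mutate(Centered=C - mean(C)) %>%
  nrow()
#> [1] 12
# mutate() returns one row per input row, so the 12 empty A/B combinations contribute nothing.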
I feel something like this
d %>%
  group_by(A, B) %>%
  summarize(
    Mean=mean(C),
    .skip_empty = TRUE
  )
is safer than this (or other ways using some sentinel value to represent empty groups):
d %>%
  group_by(A, B) %>%
  summarize(
    Mean=mean(C)
  ) %>%
  filter(!is.nan(Mean))
(I also want other verbs like mutate() and filter() to skip the empty groups, but this might be a different issue)
@yutannihilation, if applied to all verbs, something like .skip_empty would work. (I'll add it to the list above.)
As it relates to the data, .skip_empty would usually apply at the group_by() level rather than the verb level. I say this because whether .skip_empty applies would typically be determined when the groups are defined; in my experience, whether the full combination of factors applies (as in the new behavior) or not (as in the prior behavior) depends on the data being grouped. Examples include: United States congressional members by party, year, and sex (full combination-- though it would depend on the year range, as the Whigs and Tories may or may not be counted, and those parties no longer exist); patients in a parallel-design clinical trial (only observed combinations); and GDP by country and year in the 20th century (intermediate: when a country exists, it should have a GDP, but countries start and stop existence).
If applied at the grouping level, .skip_empty seems like it could apply simply to all verbs, since the management of which groups exist would occur centrally; if applied at the verb level, .skip_empty would need to be managed separately for every verb, both within and outside dplyr.
I met with @romainfrancois this morning, and we're going to add group_by(x, y, z, .drop = TRUE) to handle the case where you have factors but you want to use the actual values, not the Cartesian product of the levels.
(@yutannihilation I don't think .skip_empty is enough, because with the Cartesian product it's possible to have more groups than you have memory.)
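(For reference, this is the interface that eventually shipped in dplyr 0.8; on the reprex data above, the choice looks like:)
d %>%
  group_by(A, B, .drop = TRUE) %>%   # only the observed level combinations
  summarize(Mean=mean(C))
d %>%
  group_by(A, B, .drop = FALSE) %>%  # the full Cartesian product of the levels
  summarize(Mean=mean(C))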
This is on my list for tomorrow morning.
I guess group_trim() would also have the .drop argument.
I'm not sure how this impacts the group_by() calls that are implicit, e.g. a grouping that is due to filter(.preserve = FALSE).
@hadley and @romainfrancois, Thank you!!!
@romainfrancois, for filter(.preserve = FALSE), etc., could the .drop argument be stored with the groups so that it's applied the same way whenever the groups change, without another call to group_by()? That seems like the most consistent way for group handling to occur. If someone wants to filter and then, after filtering, use the Cartesian product, it seems like another call to group_by() would be required.
Another intermediate case would be when group_by(add=TRUE) is called; I think it should default to using the same .drop flag as the prior group_by() call used.
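(A sketch of that idea; group_by_drop_default() is the accessor that dplyr 0.8 eventually provided for the stored setting:)
g <- d %>% group_by(A, B, .drop = FALSE)
group_by_drop_default(g)
#> [1] FALSE
# A later group_by(add = TRUE) call uses this stored value as its .drop default.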
@romainfrancois, if you implement the .drop argument as being stored with the object, one other consideration is what interface would exist to change just the .drop flag?
Two options that occur to me are:
1. Call group_by() with all arguments. (That seems like it would work regardless.)
2. Call group_by(add=TRUE, .drop=FALSE) or group_by(add=TRUE, .drop=TRUE), whereby add is slightly overloaded to really mean "modify".
Cool! I agree that the .drop flag is best. Please ignore my comment above; I was thinking on the assumption that drop is unacceptable.
To preserve backward compatibility, can the default be .drop=TRUE?
Nope ;-)
We still believe that what .drop = FALSE does is best; we meet you half way, but you'll have to opt out of the "correct" behaviour...
@romainfrancois Perfectly reasonable, I appreciate the accommodation!
P.S. I was wondering if I was asking too much there, but I'm a strong believer that you never get anything that you don't ask for.
Hello @hadley and @romainfrancois,
I am planning how/when to modify my package to ensure it is compatible with dplyr 0.8 (currently, many automated tests are failing against the RC). Adding group_by(.drop=) will address the main incompatibility, so I'm pleased to see this (thank you!).
May I ask if you intend to include this change in the 0.8.0 release (and in a new RC before it)?
@romainfrancois - I also emailed you directly (not sure if you prefer such questions here or via email?) - sorry for the duplication.
Actually, we reverted the default behaviour so that .drop = TRUE is the default now (at least for 0.8.0), so you should not have to change anything.
The fix is in this PR, which I hope to merge today or this week.
https://github.com/tidyverse/dplyr/pull/4091
Actually, @billdenney, we made .drop = TRUE the default.
@romainfrancois, Thanks! (And, cool new hair! :))