Across a large amount of code in both my PKNCA package and my data analysis projects, I assume that groups have at least one row. I understand that the new behavior is by design, but the documentation does not suggest any method to get back to the prior behavior of dropping empty groups. I typically use filter() to remove groups not of interest and group_by() to enable various analyses on the groups that have data. The only method I can see to get back to the prior behavior is to convert every factor to a character and then convert back at the end, which is fairly intensive because I have to track which columns have been modified and need to be modified back.
For most of my use cases, this makes using dplyr with factor grouping very difficult. A common use case for me has treatments and subjects in clinical trials described by ordered and unordered factors, respectively (reprex at the bottom).
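For concreteness, the character round-trip workaround described above looks something like the following sketch (using the d from the reprex at the bottom; tracking each column's levels and orderedness is the painful part):
library(dplyr)
# Record each factor column's levels and ordered status before grouping.
fct_cols <- names(d)[sapply(d, is.factor)]
fct_info <- lapply(d[fct_cols], function(x) list(levels=levels(x), ordered=is.ordered(x)))
result <-
  d %>%
  mutate_if(is.factor, as.character) %>%
  group_by(A, B) %>%
  summarize(Mean=mean(C)) %>%
  ungroup()
# Convert the modified columns back to factors with their original attributes.
for (nm in intersect(fct_cols, names(result))) {
  result[[nm]] <- factor(result[[nm]], levels=fct_info[[nm]]$levels, ordered=fct_info[[nm]]$ordered)
}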
Is there any method to revert to the original behaviour? Specifically, in #341, drop = FALSE was suggested, but if it exists, it's not documented in the help page for group_by(). The .preserve argument of filter() doesn't suggest that it does the same thing, but it looks close. The documentation of the drop argument for grouped_df() simply says "deprecated", suggesting it's not the right fix.
The compatibility vignette (https://github.com/tidyverse/dplyr/blob/master/vignettes/compatibility.Rmd) appears to be the place something like this should be documented, but I don't see it there either.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
d <-
  data.frame(
    A=factor(
      rep(c("10 mg", "20 mg", "100 mg"), 4),
      levels=c("10 mg", "20 mg", "100 mg"),
      ordered=TRUE
    ),
    B=rep(factor(1:6), 2),
    C=1:12
  )
d %>%
  group_by(A, B) %>%
  summarize(
    Mean=mean(C)
  )
#> # A tibble: 18 x 3
#> # Groups: A [3]
#> A B Mean
#> <ord> <fct> <dbl>
#> 1 10 mg 1 4
#> 2 10 mg 2 NaN
#> 3 10 mg 3 NaN
#> 4 10 mg 4 7
#> 5 10 mg 5 NaN
#> 6 10 mg 6 NaN
#> 7 20 mg 1 NaN
#> 8 20 mg 2 5
#> 9 20 mg 3 NaN
#> 10 20 mg 4 NaN
#> 11 20 mg 5 8
#> 12 20 mg 6 NaN
#> 13 100 mg 1 NaN
#> 14 100 mg 2 NaN
#> 15 100 mg 3 6
#> 16 100 mg 4 NaN
#> 17 100 mg 5 NaN
#> 18 100 mg 6 9
Created on 2018-12-31 by the reprex package (v0.2.0).
From reading the code, I think there is no drop argument... Converting to character was an option from the discussion in #341.
Is filtering after the operation not enough, or as simple as desired?
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
packageVersion("dplyr")
#> [1] '0.7.99.9000'
d <-
  data.frame(
    A=factor(
      rep(c("10 mg", "20 mg", "100 mg"), 4),
      levels=c("10 mg", "20 mg", "100 mg"),
      ordered=TRUE
    ),
    B=rep(factor(1:6), 2),
    C=1:12
  )
d %>%
  group_by(A, B) %>%
  summarize(
    Mean=mean(C)
  ) %>%
  filter(!is.nan(Mean))
#> # A tibble: 6 x 3
#> # Groups: A [3]
#> A B Mean
#> <ord> <fct> <dbl>
#> 1 10 mg 1 4
#> 2 10 mg 4 7
#> 3 20 mg 2 5
#> 4 20 mg 5 8
#> 5 100 mg 3 6
#> 6 100 mg 6 9
Created on 2018-12-31 by the reprex package (v0.2.1)
If we consider the previous behaviour incorrect because it automatically filtered out groups that were not in the data, causing missing data, then manually filtering when needed, now that groups are preserved, seems like the right move.
I believe that custom functions that assume a non-empty group could return NaN (or another value) if the group is empty, so that it is possible to filter afterwards.
What do you think?
If this is the only method for getting close to the prior behavior, my request is that it be documented (in a way that doesn't require reading the code).
Converting to character is mostly reasonable for unordered factors, but it loses information for ordered factors. Sometimes setting the level order is nontrivial. (All of these are possible, but they would need extra work.)
I have a difference of opinion about the previous behavior being incorrect; I think the two behaviors simply fill different needs. The prior method fills the need for a summary of each group where data exists, when all combinations are not expected (this is my common use case described above), and the new behavior helps elucidate missing groups when all combinations are expected. Previously, when I needed the latter, I integrated explicit missing rows, but I acknowledge that solution isn't right or best for everyone. I'm not the package author, so I'm not advocating that my opinion must be followed-- I'm trying to understand the new behavior and how to work with it or around it.
Since my most common use case is that all combinations are not expected, I would need to rewrite all R functions used for summarization and then filter based on the absence of data. Also, since NA and NaN can each legitimately exist within my data, I would need to add a number-of-observations column to filter on instead. This is possible but onerous:
min_maybe_none <- function(x, ...) {
  UseMethod("min_maybe_none")
}
min_maybe_none.numeric <- function(x, ...) {
  if (length(x) == 0) {
    NA_real_
  } else {
    min(x, ...)
  }
}
# Repeat for logical, factor, character so that you get the correct type of NA output for each class.
# Repeat all of the above for max, median, mean, sd, ...
I would need to do this because I often have >50 groups, and with standard warning reporting I would not see unexpected warnings; the standard reporting would just show the warnings about empty groups.
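As a sketch, the number-of-observations approach would look something like this (assuming n() returns 0 for the empty groups that the new behaviour preserves):
d %>%
  group_by(A, B) %>%
  summarize(
    N=n(),            # 0 for an empty group
    Mean=mean(C)
  ) %>%
  filter(N > 0)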
I had a few ideas overnight of how the prior behavior could be achieved. I will reiterate the ones above in addition to the new ideas so that the list can be (semi-)comprehensive.
1. Convert factors to characters and back (e.g., with mutate_if(is.factor, as.character)). Con: Information is lost-- especially for ordered factors or factors where not all levels are represented in the current data.
2. Add an argument to group_by() (like the originally suggested drop). Pro: Simple for users. Con: Appears to have been considered and not preferred by the dplyr authors. (I've not read the full threads around the rationale, but to be clear, I'm not advocating for this as it has already been considered. I'm just trying to be complete.)
3. Add a per-verb argument like .skip_empty. (Suggestion from @yutannihilation below.) Pro: Simple to understand; maintains the goal of one interpretation of factors as groups. Con: More complex to maintain, as every verb would require the argument, if it would apply to the verb (in and out of dplyr).
Conversion to character strings and back (option 1) could be simplified by implementation of something like tidyverse/forcats#144 and maybe a data.frame mirroring method like the following (typed directly into GitHub; there could be bugs):
fct_mirror.data.frame <- function(.f, .x) {
  # Find factor columns in the original data that are also in the current data
  matching_cols <- intersect(colnames(.f), colnames(.x)[sapply(X=.x, FUN=is.factor)])
  for (current_col in matching_cols) {
    if (!is.factor(.f[[current_col]])) {
      .f[[current_col]] <- fct_mirror(.f[[current_col]], .x[[current_col]])
    }
  }
  .f
}
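For reference, the single-vector fct_mirror() assumed above is hypothetical (tidyverse/forcats#144 was only a proposal); it might look something like:
fct_mirror <- function(.f, .x) {
  UseMethod("fct_mirror")
}
fct_mirror.default <- function(.f, .x) {
  # Rebuild .f as a factor with the levels and ordered status of the template .x
  factor(.f, levels=levels(.x), ordered=is.ordered(.x))
}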
For 2, the simplest approach would be: if grouping of factors is detected by something simple like is.factor(), then make a new class called something like hidden_factor, which inherits from factor and whose only methods are:
as.hidden_factor <- function(x) {
  UseMethod("as.hidden_factor")
}
as.hidden_factor.factor <- function(x) {
  class(x) <- unique(c("hidden_factor", class(x)))
  x
}
as.hidden_factor.default <- function(x) {
  message("No need to convert x to a hidden_factor")
  x
}
# NB: base::is.factor() is not an S3 generic, so a method like this only takes
# effect if the factor detection dispatches on class.
is.factor.hidden_factor <- function(x) FALSE
as.factor.hidden_factor <- function(x) {
  class(x) <- setdiff(class(x), "hidden_factor")
  x
}
That would break some other applications looking for factors, but it or something similar could result in the same behaviour as the prior version.
Is it possible to implement an option for summarise() to skip empty groups? After reading @cderv's comment, I now feel this issue is not about group_by() but rather about summarise(), in a sense:
I believe that custom functions that assume a non-empty group could return NaN (or another value) if the group is empty so that it is possible to filter afterwards.
The empty groups don't become a problem when mutate()d or filter()ed, because there are no rows from empty groups in the data (accordingly, the behaviours are basically the same as in 0.7.8). But summarise() embodies the empty groups and mixes them with the non-empty groups. So, after summarising, it requires a lot of heuristics to detect which rows came from empty groups.
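A quick illustration of that point with the d from the reprexes above (a sketch):
d %>%
  group_by(A, B) %>%
  mutate(Centered=C - mean(C)) %>%
  nrow()
#> [1] 12
# mutate() returns one row per input row, so the 12 empty A/B combinations contribute nothing.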
I feel something like this
d %>%
  group_by(A, B) %>%
  summarize(
    Mean=mean(C),
    .skip_empty = TRUE
  )
is safer than this (or other ways using some sentinel value to represent empty groups):
d %>%
  group_by(A, B) %>%
  summarize(
    Mean=mean(C)
  ) %>%
  filter(!is.nan(Mean))
(I also want other verbs like mutate() and filter() to skip the empty groups, but this might be a different issue)
@yutannihilation, if applied to all verbs, something like .skip_empty would work. (I'll add it to the list above.)
As it relates to the data, .skip_empty would usually apply at the group_by() level rather than the verb level. I say this because whether .skip_empty applies would typically be determined when the groups are defined; in my experience, whether the full combination of factors applies (as in the new behavior) or not (as in the prior behavior) depends on the data being grouped. Examples include: United States congressional members by party, year, and sex (full combination-- though it would depend on the year range, as the Whigs and Tories may or may not be counted, and those parties no longer exist); patients in a parallel-design clinical trial (only observed combinations); and GDP by country and year in the 20th century (intermediate: when a country exists, it should have a GDP, but countries start and stop existence).
If applied at the grouping level, .skip_empty seems like it could apply simply to all verbs, since the management of which groups exist would occur centrally; if applied at the verb level, .skip_empty would need to be managed separately for every verb, both within and outside dplyr.
I met with @romainfrancois this morning, and we're going to add group_by(x, y, z, .drop = TRUE) to handle the case where you have factors but you want to use the actual values, not the Cartesian product of the levels.
(@yutannihilation I don't think .skip_empty is enough, because with the Cartesian product it's possible to have more groups than you have memory.)
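(For reference, this is the interface that eventually shipped in dplyr 0.8; on the reprex data above, the choice looks like:)
d %>%
  group_by(A, B, .drop = TRUE) %>%   # only the observed level combinations
  summarize(Mean=mean(C))
d %>%
  group_by(A, B, .drop = FALSE) %>%  # the full Cartesian product of the levels
  summarize(Mean=mean(C))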
This is on my list for tomorrow morning.
I guess group_trim() would also have the .drop argument.
I'm not sure how this impacts the group_by() calls that are implicit, e.g. a grouping that is due to filter(.preserve = FALSE).
@hadley and @romainfrancois, Thank you!!!
@romainfrancois, for filter(.preserve = FALSE), etc., could the .drop argument be stored with the groups so that it's applied the same way whenever the groups change, without another call to group_by()? That seems like the most consistent way for group handling to occur. If someone wants to filter and then, after filtering, use the Cartesian product, it seems like another call to group_by() would be required.
Another intermediate case would be when group_by(add=TRUE) is called; I think it should default to using the same .drop flag as the prior group_by() call used.
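(A sketch of that idea; group_by_drop_default() is the accessor that dplyr 0.8 eventually provided for the stored setting:)
g <- d %>% group_by(A, B, .drop = FALSE)
group_by_drop_default(g)
#> [1] FALSE
# A later group_by(add = TRUE) call uses this stored value as its .drop default.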
@romainfrancois, if you implement the .drop argument as being stored with the object, one other consideration is what interface would exist to change just the .drop flag?
Two options that occur to me are:
1. Call group_by() with all arguments. (That seems like it would work regardless.)
2. Call group_by(add=TRUE, .drop=FALSE) or group_by(add=TRUE, .drop=TRUE), whereby add is slightly overloaded to really mean "modify".
Cool! I agree that the .drop flag is best. Please ignore my comment above; I was thinking on the assumption that drop is unacceptable.
To preserve backward compatibility, can the default be .drop=TRUE?
Nope ;-)
We still believe that what .drop = FALSE does is best; we meet you half way, but you'll have to opt out of the "correct" behaviour...
@romainfrancois Perfectly reasonable, I appreciate the accommodation!
P.S. I was wondering if I was asking too much there, but I'm a strong believer that you never get anything that you don't ask for.
Hello @hadley and @romainfrancois,
I am planning how/when to modify my package to ensure it is compatible with dplyr 0.8 (currently, many automated tests are failing against the RC). Adding group_by(.drop=) will address the main incompatibility, so I'm pleased to see this (thank you!).
May I ask if you intend to include this change in the 0.8.0 release (and in a new RC before it)?
@romainfrancois - I also emailed you directly (not sure if you prefer such questions here or via email?) - sorry for the duplication.
Actually, we reverted the default behaviour so that .drop = TRUE is the default now (at least for 0.8.0), so you should not have to change anything.
The fix is in this PR, which I hope to merge today or this week.
https://github.com/tidyverse/dplyr/pull/4091
Actually, @billdenney, we made .drop = TRUE the default.
@romainfrancois, Thanks! (And, cool new hair! :))