Dplyr: slice_sample() behaving differently than sample_n()

Created on 4 Jun 2020 · 6 Comments · Source: tidyverse/dplyr

A really useful feature of sample_n() is the ability to vary size by the grouping variable(s):

library(dplyr, warn.conflicts = FALSE)
packageVersion("dplyr")
#> [1] '1.0.0'

set.seed(123223)

starwars %>%
  select(name, sex, homeworld) %>%
  add_count(sex, homeworld, name = "number") %>%
  group_by(sex, homeworld) %>%
  sample_n(size = case_when(
    number %in% 1:4 ~ 1,
    number >= 5 ~ 2
  )) %>%
  ungroup() %>%
  arrange(desc(number), homeworld)
#> # A tibble: 64 x 4
#>    name                sex    homeworld number
#>    <chr>               <chr>  <chr>      <int>
#>  1 Owen Lars           male   Tatooine       6
#>  2 Darth Vader         male   Tatooine       6
#>  3 Jar Jar Binks       male   Naboo          5
#>  4 Rugor Nass          male   Naboo          5
#>  5 Finn                male   <NA>           5
#>  6 Yoda                male   <NA>           5
#>  7 Dormé               female Naboo          3
#>  8 BB8                 none   <NA>           3
#>  9 Bail Prestor Organa male   Alderaan       2
#> 10 Wedge Antilles      male   Corellia       2
#> # … with 54 more rows

Switching to sample_n()'s successor slice_sample() yields a different number of rows:

library(dplyr, warn.conflicts = FALSE)
packageVersion("dplyr")
#> [1] '1.0.0'

set.seed(123223)

starwars %>%
  select(name, sex, homeworld) %>%
  add_count(sex, homeworld, name = "number") %>%
  group_by(sex, homeworld) %>%
  slice_sample(n = case_when(
    "number" %in% 1:4 ~ 1,
    "number" >= 5 ~ 2
  )) %>%
  ungroup() %>%
  arrange(desc(number), homeworld)
#> # A tibble: 75 x 4
#>    name              sex    homeworld number
#>    <chr>             <chr>  <chr>      <int>
#>  1 Cliegg Lars       male   Tatooine       6
#>  2 Biggs Darklighter male   Tatooine       6
#>  3 Palpatine         male   Naboo          5
#>  4 Rugor Nass        male   Naboo          5
#>  5 Qui-Gon Jinn      male   <NA>           5
#>  6 Finn              male   <NA>           5
#>  7 Dormé             female Naboo          3
#>  8 Padmé Amidala     female Naboo          3
#>  9 R4-P17            none   <NA>           3
#> 10 BB8               none   <NA>           3
#> # … with 65 more rows

sample_n() is behaving as I would expect: returning one sample per combination of sex and homeworld when said combination occurs between 1 and 4 times; and two samples per combination when there are 5 or more occurrences.

Maybe I'm doing something wrong, but slice_sample() is returning two samples for all combinations of sex and homeworld, regardless of the number of times said combination occurs, except for combinations which only occur once, which can only return one sample.

feature verbs

jackhannah95

👍2

All 6 comments

Somewhat simpler reprex:

library(dplyr, warn.conflicts = FALSE)

df <- data.frame(g = c(1, 1, 2, 2, 2), x = 1:5)
df %>% group_by(g) %>% slice_sample(n = g) %>% count(g)
#> Error in check_slice_size(n, prop): object 'g' not found
df %>% group_by(g) %>% sample_n(g) %>% count(g)
#> # A tibble: 2 x 2
#> # Groups:   g [2]
#>       g     n
#>   <dbl> <int>
#> 1     1     1
#> 2     2     2

Created on 2020-06-05 by the reprex package (v0.3.0)

which reveals that part of your problem is writing "number" %in% 1:4 where you meant number %in% 1:4.
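To make the quoting point concrete, here is a minimal base R sketch (not part of the original reprex) of what the two quoted conditions evaluate to. Because "number" is a character string rather than a column reference, both conditions are single constants, so the case_when() call produces the same n for every group. The outputs shown assume a typical locale in which letters sort after digits.

```r
# A quoted name is a character string, not a reference to the `number` column.
in_small <- "number" %in% 1:4  # matches the string "number" against "1", ..., "4"
is_large <- "number" >= 5      # 5 is coerced to "5", then the strings are compared

in_small
#> [1] FALSE
is_large
#> [1] TRUE
```

With the first condition always FALSE and the second always TRUE, n is 2 for every group, which matches the behaviour described in the original post.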

I'm not sure if I want to allow per-group resampling in slice_sample() as that was one of the major sources of complexity in sample_n(). But I'll consider it again in the future.

hadley on 5 Jun 2020

👍1

I would really like this feature to be brought into a non-deprecated function in dplyr. I'm using a sampling construction from thc's answer to this question on sampling different sizes per group.

I have my required group counts in an AgeCount variable.
So in my use case this works:
sample_n(AgeCount[1])

But, as you know, this does not:
slice_sample(AgeCount[1])

Essentially, I am trying to do stratified sampling. I'm currently writing a package that is going to be used by other people who require stratified sampling. I am trying to limit right down the number of packages that my package is going to require, so I would like to restrict mine to dplyr.

In the absence of this functionality in slice_sample(), I am going to retain sample_n() in my code. It works perfectly, but relying on a deprecated function, even one that works well, makes me nervous.

You noted that the stratified sampling was a major source of complexity. Would a possible alternative way forward be to create a new function that only performs stratified sampling? There are multiple questions on Stack Overflow related to stratified sampling in R, across various disciplines, so there is clearly an ongoing need for this. It would be nice if the functionality was retained as a non-deprecated function in dplyr. As an alternative, purrr has this functionality but the package is no longer being maintained and most of the functions seem to reside in dplyr.
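For readers who want per-group sample sizes using only dplyr, one possible workaround is to combine slice() with base R's sample.int(). This is a sketch under stated assumptions, not a recommendation from the dplyr maintainers: the data frame, the column names g and size, and the idea of storing the wanted sample size in a column are all hypothetical, and the code assumes dplyr 1.0.0 or later.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical data: `size` holds the number of rows wanted from each group.
df <- data.frame(
  g    = c("a", "a", "a", "b", "b", "b", "b"),
  size = c(1, 1, 1, 2, 2, 2, 2),
  x    = 1:7
)

set.seed(1)
sampled <- df %>%
  group_by(g) %>%
  # sample.int() draws row positions within the current group;
  # min() caps the request at the group size so it cannot fail.
  slice(sample.int(n(), min(n(), size[1]))) %>%
  ungroup()

counts <- count(sampled, g)
counts
```

Here group "a" should contribute one row and group "b" two rows. Which rows are drawn depends on the seed.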

programgirl on 30 Jun 2020

👍1

BTW sample_n() is not deprecated; it's superseded. That means you don't need to worry about relying on it.

hadley on 30 Jun 2020

Thanks Hadley, I was worried that sample_n() was going to disappear.

programgirl on 1 Jul 2020

👍1

I'd like to upvote this functionality. It's a common task in clinical trials, and sample_n() was very handy for it. If possible, please consider keeping it alive (I know it's _superseded_, but I'm afraid that still means something like "_possibly deprecated in the distant future_"), or provide an alternative if it is eventually removed.