A really useful feature of sample_n() is the ability to vary size by the grouping variable(s):
library(dplyr, warn.conflicts = FALSE)
packageVersion("dplyr")
#> [1] '1.0.0'
set.seed(123223)
starwars %>%
  select(name, sex, homeworld) %>%
  add_count(sex, homeworld, name = "number") %>%
  group_by(sex, homeworld) %>%
  sample_n(size = case_when(
    number %in% 1:4 ~ 1,
    number >= 5 ~ 2
  )) %>%
  ungroup() %>%
  arrange(desc(number), homeworld)
#> # A tibble: 64 x 4
#>    name                sex    homeworld number
#>    <chr>               <chr>  <chr>      <int>
#>  1 Owen Lars           male   Tatooine       6
#>  2 Darth Vader         male   Tatooine       6
#>  3 Jar Jar Binks       male   Naboo          5
#>  4 Rugor Nass          male   Naboo          5
#>  5 Finn                male   <NA>           5
#>  6 Yoda                male   <NA>           5
#>  7 Dormé               female Naboo          3
#>  8 BB8                 none   <NA>           3
#>  9 Bail Prestor Organa male   Alderaan       2
#> 10 Wedge Antilles      male   Corellia       2
#> # … with 54 more rows
Switching to sample_n()'s successor slice_sample() yields a different number of rows:
library(dplyr, warn.conflicts = FALSE)
packageVersion("dplyr")
#> [1] '1.0.0'
set.seed(123223)
starwars %>%
  select(name, sex, homeworld) %>%
  add_count(sex, homeworld, name = "number") %>%
  group_by(sex, homeworld) %>%
  slice_sample(n = case_when(
    "number" %in% 1:4 ~ 1,
    "number" >= 5 ~ 2
  )) %>%
  ungroup() %>%
  arrange(desc(number), homeworld)
#> # A tibble: 75 x 4
#>    name              sex    homeworld number
#>    <chr>             <chr>  <chr>      <int>
#>  1 Cliegg Lars       male   Tatooine       6
#>  2 Biggs Darklighter male   Tatooine       6
#>  3 Palpatine         male   Naboo          5
#>  4 Rugor Nass        male   Naboo          5
#>  5 Qui-Gon Jinn      male   <NA>           5
#>  6 Finn              male   <NA>           5
#>  7 Dormé             female Naboo          3
#>  8 Padmé Amidala     female Naboo          3
#>  9 R4-P17            none   <NA>           3
#> 10 BB8               none   <NA>           3
#> # … with 65 more rows
sample_n() is behaving as I would expect: returning one sample per combination of sex and homeworld when said combination occurs between 1 and 4 times; and two samples per combination when there are 5 or more occurrences.
Maybe I'm doing something wrong, but slice_sample() is returning two samples for all combinations of sex and homeworld, regardless of the number of times said combination occurs, except for combinations which only occur once, which can only return one sample.
Somewhat simpler reprex:
library(dplyr, warn.conflicts = FALSE)
df <- data.frame(g = c(1, 1, 2, 2, 2), x = 1:5)
df %>% group_by(g) %>% slice_sample(n = g) %>% count(g)
#> Error in check_slice_size(n, prop): object 'g' not found
df %>% group_by(g) %>% sample_n(g) %>% count(g)
#> # A tibble: 2 x 2
#> # Groups:   g [2]
#>       g     n
#>   <dbl> <int>
#> 1     1     1
#> 2     2     2
Created on 2020-06-05 by the reprex package (v0.3.0)
which reveals that part of your problem is writing "number" %in% 1:4 where you meant number %in% 1:4.
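To see why the quoted version ends up asking for two rows everywhere, the two conditions can be evaluated on their own. This is only a minimal sketch: the names `in_range`, `at_least_5`, and `n_const` are illustrative, and the string comparison assumes a typical locale in which digits sort before letters.

```r
library(dplyr, warn.conflicts = FALSE)

# "number" in quotes is a length-one character vector, not the column.
# Here 1:4 is coerced to "1", "2", "3", "4", so nothing matches.
in_range <- "number" %in% 1:4

# Here 5 is coerced to "5" and the two values are compared as strings.
at_least_5 <- "number" >= 5

# With both conditions constant, case_when() returns a single constant too.
n_const <- case_when(
  "number" %in% 1:4 ~ 1,
  "number" >= 5 ~ 2
)

print(in_range)
print(at_least_5)
print(n_const)
```

Under those assumptions the first condition is FALSE, the second is TRUE, and `n_const` is 2 for every group. Groups with fewer rows than that simply return all of their rows, which would be consistent with the slice_sample() output above.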
I'm not sure if I want to allow per-group resampling in slice_sample() as that was one of the major sources of complexity in sample_n(). But I'll consider it again in the future.
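In the meantime, one possible workaround is to combine slice() with base R's sample.int(), since slice() evaluates its arguments once per group. This is a sketch under that assumption rather than an official replacement for sample_n(). It rebuilds the small `df` from the reprex, stores the result in an illustrative `per_group` object, and adds a min() guard (not in the original code) so the requested size never exceeds the group size.

```r
library(dplyr, warn.conflicts = FALSE)

df <- data.frame(g = c(1, 1, 2, 2, 2), x = 1:5)

set.seed(2020)
per_group <- df %>%
  group_by(g) %>%
  # n() is the current group's size; g[1] is the size wanted for this group.
  # sample.int() draws row positions without replacement within the group.
  slice(sample.int(n(), size = min(n(), g[1]))) %>%
  ungroup()

count(per_group, g)
```

sample.int() is used instead of sample() so that a group with a single row is not misread as "sample from 1:x".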
I would really like this feature to be brought into a non-deprecated function in dplyr. I'm using a sampling construction from thc's answer to this question on sampling different sizes per group.
I have my required group counts in an AgeCount variable.
So in my use case this works:
sample_n(AgeCount[1])
But, as you know, this does not:
slice_sample(AgeCount[1])
Essentially, I am trying to do stratified sampling. I'm currently writing a package that will be used by other people who need stratified sampling. I want to keep the number of packages mine depends on to a minimum, so I would like to restrict it to dplyr.
In the absence of this functionality in slice_sample(), I am going to keep sample_n() in my code. It works perfectly, but relying on a deprecated function, even one that works well, makes me nervous.
You noted that the stratified sampling was a major source of complexity. Would a possible alternative way forward be to create a new function that only performs stratified sampling? There are multiple questions on Stack Overflow related to stratified sampling in R, across various disciplines, so there is clearly an ongoing need for this. It would be nice if the functionality was retained as a non-deprecated function in dplyr. As an alternative, purrr has this functionality but the package is no longer being maintained and most of the functions seem to reside in dplyr.
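For illustration, a dedicated stratified sampler can be sketched with dplyr alone. Everything named here is hypothetical and not part of dplyr: the `stratified_sample()` helper, the `sizes` lookup table with its `n_wanted` column, and the `age_band` example data. It assumes one row in `sizes` per stratum and samples without replacement.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper: draw `n_wanted` rows from each stratum named in `by`.
stratified_sample <- function(data, sizes, by) {
  data %>%
    inner_join(sizes, by = by) %>%           # attach the wanted size to each row
    group_by(across(all_of(by))) %>%
    # Cap the size at the stratum size so small strata return all their rows.
    slice(sample.int(n(), size = min(n(), n_wanted[1]))) %>%
    ungroup() %>%
    select(-n_wanted)
}

people <- tibble(
  age_band = c("0-17", "0-17", "0-17", "18-64", "18-64", "65+"),
  id = 1:6
)
sizes <- tibble(
  age_band = c("0-17", "18-64", "65+"),
  n_wanted = c(2, 1, 1)
)

set.seed(42)
sampled <- stratified_sample(people, sizes, by = "age_band")
count(sampled, age_band)
```

Keeping the sizes in a separate table is a design choice for the sketch; if the data already carries a per-group count column (such as AgeCount above), the join step could be dropped.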
BTW sample_n() is not deprecated; it's superseded. That means you don't need to worry about relying on it.
Thanks Hadley, I was worried that sample_n() was going to disappear.
I'd like to upvote this functionality. It's a common task in clinical trials, and sample_n() has been very handy there. If possible, please consider keeping it alive (I know it's _superseded_, but I'm afraid that still means something like "_possibly deprecated in the longer term_") or providing an alternative if it is ever removed.
@Generalized that is not what superseded means.