Tidyr: Add drop to pivot_wider

Created on 3 Oct 2019 · 8Comments · Source: tidyverse/tidyr

pivot_wider is missing the drop option that was present in spread which is quite useful
to fill a matrix with empty rows.

df <- tibble(x=factor(c(1,1,2), 1:3), y=c("a", "b", "b"), z=11:13)
print(df)                                                                       
#> # A tibble: 3 x 3
#>   x     y         z
#>   <fct> <chr> <int>
#> 1 1     a        11
#> 2 1     b        12
#> 3 2     b        13                                                                                                                 
pivot_wider(df, names_from = y, values_from=z)                                      
#> # A tibble: 2 x 3
#>   x         a     b
#>   <fct> <int> <int>
#> 1 1        11    12
#> 2 2        NA    13
spread(df, key=y, value=z, drop=FALSE)     
#> # A tibble: 3 x 3
#>   x         a     b
#>   <fct> <int> <int>
#> 1 1        11    12
#> 2 2        NA    13
#> 3 3        NA    NA

feature pivoting

Source

romagnolid

👍11

Most helpful comment

Note that the spread() function with drop = FALSE also kept factor levels that were not present in the data, creating columns for them (this is not shown in the above example). This was a very useful feature, which is not present in pivot_wider(). Here is a regexp for this:

library(tidyr)
d = tibble(day_int = c(4,5,1,2),
           day_fac = factor(day_int, levels=1:5,
                            labels=c("Mon","Tue", "Wed","Thu","Fri")))
d
#> # A tibble: 4 x 2
#>   day_int day_fac
#>     <dbl> <fct>  
#> 1       4 Thu    
#> 2       5 Fri    
#> 3       1 Mon    
#> 4       2 Tue
levels(d$day_fac)
#> [1] "Mon" "Tue" "Wed" "Thu" "Fri"

# spread() automatically creates a `Wed` column
spread(d, key = "day_fac", value = "day_int", drop = FALSE)
#> # A tibble: 1 x 5
#>     Mon   Tue   Wed   Thu   Fri
#>   <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1     2    NA     4     5

# ... but pivot_wider() does not (and does not respect
# the level ordering, but that is issue #839)
pivot_wider(d, names_from = day_fac, values_from = day_int,
               values_fill = list(day_fac = NA))
#> # A tibble: 1 x 4
#>     Thu   Fri   Mon   Tue
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     4     5     1     2

For naming the argument, I like the consistency of names_fill = TRUE (cf. values_fill).

huftis on 31 Mar 2020

👍4

All 8 comments

Needs r-lib/vctrs#686.

Not sure what to call this argument; it could be names_drop = FALSE but drop is not terribly descriptive. It's more like names_use_factor_levels = TRUE but that's rather verbose.

hadley on 28 Nov 2019

What about names_keep or names_keep_all?

romagnolid on 27 Feb 2020

library(tidyr)
d = tibble(day_int = c(4,5,1,2),
           day_fac = factor(day_int, levels=1:5,
                            labels=c("Mon","Tue", "Wed","Thu","Fri")))
d
#> # A tibble: 4 x 2
#>   day_int day_fac
#>     <dbl> <fct>  
#> 1       4 Thu    
#> 2       5 Fri    
#> 3       1 Mon    
#> 4       2 Tue
levels(d$day_fac)
#> [1] "Mon" "Tue" "Wed" "Thu" "Fri"

# spread() automatically creates a `Wed` column
spread(d, key = "day_fac", value = "day_int", drop = FALSE)
#> # A tibble: 1 x 5
#>     Mon   Tue   Wed   Thu   Fri
#>   <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1     2    NA     4     5

# ... but pivot_wider() does not (and does not respect
# the level ordering, but that is issue #839)
pivot_wider(d, names_from = day_fac, values_from = day_int,
               values_fill = list(day_fac = NA))
#> # A tibble: 1 x 4
#>     Thu   Fri   Mon   Tue
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     4     5     1     2

For naming the argument, I like the consistency of names_fill = TRUE (cf. values_fill).

huftis on 31 Mar 2020

👍4

Thinking about this more, I think it's unlikely the full spectrum of needs can be resolved with just a couple of new arguments to pivot_wider(). I think the right approach is probably to document the spec data frame in much more detail, and show how you can use expand_grid() and friends to generate exactly the output that you want.

hadley on 28 Apr 2020

Hi,

I think it's a shame that spread/pivot_wider can't be told what columns to expect when given an empty input. In extensive pipes, empty data frames/tibbles do happen and missing columns can break the whole mechanism e.g. mutate after pivot_wider. I'd be happier with a result albeit an empty data frames than a broken pipe.

thanks,
Stefan

smarc on 5 May 2020

@smarc it can be. That's the purpose of the spec.

hadley on 5 May 2020

Here’s another use case, a bit different from my earlier example, but similar to data I work with. Here, each level of a factor (gender) is present, but not always for each value/label of the other names_from variable.

library(tidyverse)
# Example data
# Note: In 2019, only males responded
d = tibble(
  year = c(2018, 2018, 2019, 2020, 2020),
  gender = factor(c("female", "male", "male", "female", "male")),
  percentage = seq(30, 70, 10)
)
pivot_wider(d, names_from = c(year, gender), values_from = percentage)
#> # A tibble: 1 x 5
#>   `2018_female` `2018_male` `2019_male` `2020_female` `2020_male`
#>           <dbl>       <dbl>       <dbl>         <dbl>       <dbl>
#> 1            30          40          50            60          70

Expected syntax and results:

pivot_wider(d, names_from = c(year, gender), values_from = percentage,
               names_fill = TRUE)
#> # A tibble: 1 x 6
#>   `2018_female` `2018_male` `2019_female` `2019_male` `2020_female` `2020_male`
#>           <dbl>       <dbl>         <dbl>       <dbl>         <dbl>       <dbl>
#> 1            30          40            NA          50            60          70

(I am aware that the complete() function can be used to preprocess the data to achieve the above outcome.)

huftis on 6 May 2020

names_fill = TRUE would be a useful feature!