Tidyr: Spread does not preserve order

Created on 19 May 2018  路  6Comments  路  Source: tidyverse/tidyr

When going from a wide dataset to long and then back, there seems to be an implicit sort that occurs on non key/value variables. This prevents the ability to recover an initial data set by undoing the gather() operation with spread().

Consider the following example:

library("tidyr")
(init_data = tibble::tibble(
  Group = c("B", "C", "A"),
  `Cost Week 1` = c(1, 2, 3),
  `Cost Week 2` = c(4, 5, 6),
  `Cost Week 3` = c(7, 8, 9)
))
#> # A tibble: 3 x 4
#>   Group `Cost Week 1` `Cost Week 2` `Cost Week 3`
#>   <chr>         <dbl>         <dbl>         <dbl>
#> 1 B                1.            4.            7.
#> 2 C                2.            5.            8.
#> 3 A                3.            6.            9.

long = init_data %>% 
  gather(key = "Week", value = "Cost", -Group)

long
#> # A tibble: 9 x 3
#>   Group Week         Cost
#>   <chr> <chr>       <dbl>
#> 1 B     Cost Week 1    1.
#> 2 C     Cost Week 1    2.
#> 3 A     Cost Week 1    3.
#> 4 B     Cost Week 2    4.
#> 5 C     Cost Week 2    5.
#> 6 A     Cost Week 2    6.
#> 7 B     Cost Week 3    7.
#> 8 C     Cost Week 3    8.
#> 9 A     Cost Week 3    9.

wide = long %>%
  spread(key = "Week", value = "Cost")

wide
#> # A tibble: 3 x 4
#>   Group `Cost Week 1` `Cost Week 2` `Cost Week 3`
#>   <chr>         <dbl>         <dbl>         <dbl>
#> 1 A                3.            6.            9.
#> 2 B                1.            4.            7.
#> 3 C                2.            5.            8.

init_data
#> # A tibble: 3 x 4
#>   Group `Cost Week 1` `Cost Week 2` `Cost Week 3`
#>   <chr>         <dbl>         <dbl>         <dbl>
#> 1 B                1.            4.            7.
#> 2 C                2.            5.            8.
#> 3 A                3.            6.            9.
bug pivoting

Most helpful comment

This is handled by pivot() the successor to gather() and spread() (spread() and gather() need too many API changes, so I am introducing a new function which fixes a many issues rather than potentially breaking existing code)

library(dplyr, warn.conflicts = FALSE)
library(tidyr)

df <- tibble(
  Group = c("B", "C", "A"),
  `Cost Week 1` = c(1, 2, 3),
  `Cost Week 2` = c(4, 5, 6),
  `Cost Week 3` = c(7, 8, 9)
)

spec <- df %>% 
  pivot_spec_long(-1, measure = "cost") %>% 
  mutate(
    week = readr::parse_number(variable),
    variable = NULL
  )
spec
#> # A tibble: 3 x 3
#>   col_name    measure  week
#>   <chr>       <chr>   <dbl>
#> 1 Cost Week 1 cost        1
#> 2 Cost Week 2 cost        2
#> 3 Cost Week 3 cost        3

df %>% pivot(spec)
#> # A tibble: 9 x 3
#>   Group  week  cost
#>   <chr> <dbl> <dbl>
#> 1 B         1     1
#> 2 B         2     2
#> 3 B         3     3
#> 4 C         1     4
#> 5 C         2     5
#> 6 C         3     6
#> 7 A         1     7
#> 8 A         2     8
#> 9 A         3     9
df %>% pivot(spec) %>% pivot(spec)
#> # A tibble: 3 x 4
#>   Group `Cost Week 1` `Cost Week 2` `Cost Week 3`
#>   <chr>         <dbl>         <dbl>         <dbl>
#> 1 B                 1             2             3
#> 2 C                 4             5             6
#> 3 A                 7             8             9

Created on 2019-03-03 by the reprex package (v0.2.1.9000)

All 6 comments

Supply factor_key = TRUE to gather()

Oh maybe that's a different issue

I ran into this too, but with a further complication: if there is only a single identifier column, it seems the result is always sorted (as far as I tried, anyway), but if you have multiple identifier columns, the result switches to being unsorted at some number of rows. The more duplicate identifiers, the lower the number of rows needed for the result to be unsorted.

Very confusing.

library(dplyr) # 0.7.99.9000
library(tidyr) # 0.8.1.9000

new_df <- function(n) {
  set.seed(1)
  tibble(
    i = rep(seq_len(n / 2), each = 2),
    var = rep_len(c("a", "b"), n),
    val = runif(n)
  ) %>% sample_frac(1)
}

op <- options(tibble.print_min = 4)
df_10k <- new_df(10000)

# Single identifier: always sorted
df_10k %>%
  spread(var, val)
#> # A tibble: 5,000 x 3
#>       i     a     b
#>   <int> <dbl> <dbl>
#> 1     1 0.266 0.372
#> 2     2 0.573 0.908
#> 3     3 0.202 0.898
#> 4     4 0.945 0.661
#> # ... with 4,996 more rows

new_df(5000000) %>%
  spread(var, val)
#> # A tibble: 2,500,000 x 3
#>       i     a     b
#>   <int> <dbl> <dbl>
#> 1     1 0.266 0.372
#> 2     2 0.573 0.908
#> 3     3 0.202 0.898
#> 4     4 0.945 0.661
#> # ... with 2.5e+06 more rows

# Double identifier: switches from sorted to unsorted
df_10k %>% 
  mutate(j = i) %>% 
  spread(var, val)
#> # A tibble: 5,000 x 4
#>       i     j     a     b
#>   <int> <int> <dbl> <dbl>
#> 1     1     1 0.266 0.372
#> 2     2     2 0.573 0.908
#> 3     3     3 0.202 0.898
#> 4     4     4 0.945 0.661
#> # ... with 4,996 more rows

new_df(100000) %>% 
  mutate(j = i) %>% 
  spread(var, val)
#> # A tibble: 50,000 x 4
#>       i     j      a      b
#>   <int> <int>  <dbl>  <dbl>
#> 1 35026 35026 0.224  0.158 
#> 2 35088 35088 0.0712 0.439 
#> 3 15647 15647 0.330  0.0746
#> 4 27617 27617 0.138  0.0349
#> # ... with 5e+04 more rows

# Triple identifier: switches earlier
df_10k %>% 
  mutate(j = i, k = j) %>% 
  spread(var, val)
#> # A tibble: 5,000 x 5
#>       i     j     k       a     b
#>   <int> <int> <int>   <dbl> <dbl>
#> 1   324   324   324 0.0308  0.862
#> 2  3383  3383  3383 0.722   0.868
#> 3  3677  3677  3677 0.762   0.598
#> 4   557   557   557 0.00986 0.686
#> # ... with 4,996 more rows

new_df(1000) %>% 
  mutate(j = i, k = j) %>% 
  spread(var, val)
#> # A tibble: 500 x 5
#>       i     j     k     a     b
#>   <int> <int> <int> <dbl> <dbl>
#> 1     1     1     1 0.266 0.372
#> 2     2     2     2 0.573 0.908
#> 3     3     3     3 0.202 0.898
#> 4     4     4     4 0.945 0.661
#> # ... with 496 more rows

options(op)

Created on 2018-08-27 by the reprex package (v0.2.0).

I think this might be related to the problem that spread does not preserve the _column_ order. For me this is a somewhat annoying problem when creating summary tables.

See here:

d <- tibble::tibble(i = c("c", "a", "b"), j = 1:3)
tidyr::spread(d, i, j)
#> # A tibble: 1 x 3
#>       a     b     c
#>   <int> <int> <int>
#> 1     2     3     1

Created on 2018-09-04 by the reprex
package
(v0.2.0).

I found that the discrepancy in row order between data with many identifiers vs. few comes from the switch to using character ids (rather than integers) when the number of possible id combinations is > 2^31.

Relevant source is here: https://github.com/tidyverse/tidyr/blob/cbdd14e90b2a771e242a44d1ed5eea84d53da642/R/id.R#L16-L29

In the above if-branches, order(res) won't be the same, which results in the varying row orders.

This is handled by pivot() the successor to gather() and spread() (spread() and gather() need too many API changes, so I am introducing a new function which fixes a many issues rather than potentially breaking existing code)

library(dplyr, warn.conflicts = FALSE)
library(tidyr)

df <- tibble(
  Group = c("B", "C", "A"),
  `Cost Week 1` = c(1, 2, 3),
  `Cost Week 2` = c(4, 5, 6),
  `Cost Week 3` = c(7, 8, 9)
)

spec <- df %>% 
  pivot_spec_long(-1, measure = "cost") %>% 
  mutate(
    week = readr::parse_number(variable),
    variable = NULL
  )
spec
#> # A tibble: 3 x 3
#>   col_name    measure  week
#>   <chr>       <chr>   <dbl>
#> 1 Cost Week 1 cost        1
#> 2 Cost Week 2 cost        2
#> 3 Cost Week 3 cost        3

df %>% pivot(spec)
#> # A tibble: 9 x 3
#>   Group  week  cost
#>   <chr> <dbl> <dbl>
#> 1 B         1     1
#> 2 B         2     2
#> 3 B         3     3
#> 4 C         1     4
#> 5 C         2     5
#> 6 C         3     6
#> 7 A         1     7
#> 8 A         2     8
#> 9 A         3     9
df %>% pivot(spec) %>% pivot(spec)
#> # A tibble: 3 x 4
#>   Group `Cost Week 1` `Cost Week 2` `Cost Week 3`
#>   <chr>         <dbl>         <dbl>         <dbl>
#> 1 B                 1             2             3
#> 2 C                 4             5             6
#> 3 A                 7             8             9

Created on 2019-03-03 by the reprex package (v0.2.1.9000)

Was this page helpful?
0 / 5 - 0 ratings

Related issues

pachadotdev picture pachadotdev  路  3Comments

hadley picture hadley  路  6Comments

earowang picture earowang  路  9Comments

davidhunterwalsh picture davidhunterwalsh  路  4Comments

albertotb picture albertotb  路  7Comments