Tidyr: Spread does not preserve order

Created on 19 May 2018 · 6Comments · Source: tidyverse/tidyr

When going from a wide dataset to long and then back, there seems to be an implicit sort that occurs on non key/value variables. This prevents the ability to recover an initial data set by undoing the gather() operation with spread().

Consider the following example:

library("tidyr")
(init_data = tibble::tibble(
  Group = c("B", "C", "A"),
  `Cost Week 1` = c(1, 2, 3),
  `Cost Week 2` = c(4, 5, 6),
  `Cost Week 3` = c(7, 8, 9)
))
#> # A tibble: 3 x 4
#>   Group `Cost Week 1` `Cost Week 2` `Cost Week 3`
#>   <chr>         <dbl>         <dbl>         <dbl>
#> 1 B                1.            4.            7.
#> 2 C                2.            5.            8.
#> 3 A                3.            6.            9.

long = init_data %>% 
  gather(key = "Week", value = "Cost", -Group)

long
#> # A tibble: 9 x 3
#>   Group Week         Cost
#>   <chr> <chr>       <dbl>
#> 1 B     Cost Week 1    1.
#> 2 C     Cost Week 1    2.
#> 3 A     Cost Week 1    3.
#> 4 B     Cost Week 2    4.
#> 5 C     Cost Week 2    5.
#> 6 A     Cost Week 2    6.
#> 7 B     Cost Week 3    7.
#> 8 C     Cost Week 3    8.
#> 9 A     Cost Week 3    9.

wide = long %>%
  spread(key = "Week", value = "Cost")

wide
#> # A tibble: 3 x 4
#>   Group `Cost Week 1` `Cost Week 2` `Cost Week 3`
#>   <chr>         <dbl>         <dbl>         <dbl>
#> 1 A                3.            6.            9.
#> 2 B                1.            4.            7.
#> 3 C                2.            5.            8.

init_data
#> # A tibble: 3 x 4
#>   Group `Cost Week 1` `Cost Week 2` `Cost Week 3`
#>   <chr>         <dbl>         <dbl>         <dbl>
#> 1 B                1.            4.            7.
#> 2 C                2.            5.            8.
#> 3 A                3.            6.            9.

bug pivoting

Source

coatless

👍3

Most helpful comment

This is handled by pivot() the successor to gather() and spread() (spread() and gather() need too many API changes, so I am introducing a new function which fixes a many issues rather than potentially breaking existing code)

library(dplyr, warn.conflicts = FALSE)
library(tidyr)

df <- tibble(
  Group = c("B", "C", "A"),
  `Cost Week 1` = c(1, 2, 3),
  `Cost Week 2` = c(4, 5, 6),
  `Cost Week 3` = c(7, 8, 9)
)

spec <- df %>% 
  pivot_spec_long(-1, measure = "cost") %>% 
  mutate(
    week = readr::parse_number(variable),
    variable = NULL
  )
spec
#> # A tibble: 3 x 3
#>   col_name    measure  week
#>   <chr>       <chr>   <dbl>
#> 1 Cost Week 1 cost        1
#> 2 Cost Week 2 cost        2
#> 3 Cost Week 3 cost        3

df %>% pivot(spec)
#> # A tibble: 9 x 3
#>   Group  week  cost
#>   <chr> <dbl> <dbl>
#> 1 B         1     1
#> 2 B         2     2
#> 3 B         3     3
#> 4 C         1     4
#> 5 C         2     5
#> 6 C         3     6
#> 7 A         1     7
#> 8 A         2     8
#> 9 A         3     9
df %>% pivot(spec) %>% pivot(spec)
#> # A tibble: 3 x 4
#>   Group `Cost Week 1` `Cost Week 2` `Cost Week 3`
#>   <chr>         <dbl>         <dbl>         <dbl>
#> 1 B                 1             2             3
#> 2 C                 4             5             6
#> 3 A                 7             8             9

^{Created on 2019-03-03 by the reprex package (v0.2.1.9000)}

hadley on 3 Mar 2019

👍5

All 6 comments

Supply factor_key = TRUE to gather()

hadley on 30 May 2018

❤1 👍1

Oh maybe that's a different issue

hadley on 30 May 2018

I ran into this too, but with a further complication: if there is only a single identifier column, it seems the result is always sorted (as far as I tried, anyway), but if you have multiple identifier columns, the result switches to being unsorted at some number of rows. The more duplicate identifiers, the lower the number of rows needed for the result to be unsorted.

Very confusing.

library(dplyr) # 0.7.99.9000
library(tidyr) # 0.8.1.9000

new_df <- function(n) {
  set.seed(1)
  tibble(
    i = rep(seq_len(n / 2), each = 2),
    var = rep_len(c("a", "b"), n),
    val = runif(n)
  ) %>% sample_frac(1)
}

op <- options(tibble.print_min = 4)
df_10k <- new_df(10000)

# Single identifier: always sorted
df_10k %>%
  spread(var, val)
#> # A tibble: 5,000 x 3
#>       i     a     b
#>   <int> <dbl> <dbl>
#> 1     1 0.266 0.372
#> 2     2 0.573 0.908
#> 3     3 0.202 0.898
#> 4     4 0.945 0.661
#> # ... with 4,996 more rows

new_df(5000000) %>%
  spread(var, val)
#> # A tibble: 2,500,000 x 3
#>       i     a     b
#>   <int> <dbl> <dbl>
#> 1     1 0.266 0.372
#> 2     2 0.573 0.908
#> 3     3 0.202 0.898
#> 4     4 0.945 0.661
#> # ... with 2.5e+06 more rows

# Double identifier: switches from sorted to unsorted
df_10k %>% 
  mutate(j = i) %>% 
  spread(var, val)
#> # A tibble: 5,000 x 4
#>       i     j     a     b
#>   <int> <int> <dbl> <dbl>
#> 1     1     1 0.266 0.372
#> 2     2     2 0.573 0.908
#> 3     3     3 0.202 0.898
#> 4     4     4 0.945 0.661
#> # ... with 4,996 more rows

new_df(100000) %>% 
  mutate(j = i) %>% 
  spread(var, val)
#> # A tibble: 50,000 x 4
#>       i     j      a      b
#>   <int> <int>  <dbl>  <dbl>
#> 1 35026 35026 0.224  0.158 
#> 2 35088 35088 0.0712 0.439 
#> 3 15647 15647 0.330  0.0746
#> 4 27617 27617 0.138  0.0349
#> # ... with 5e+04 more rows

# Triple identifier: switches earlier
df_10k %>% 
  mutate(j = i, k = j) %>% 
  spread(var, val)
#> # A tibble: 5,000 x 5
#>       i     j     k       a     b
#>   <int> <int> <int>   <dbl> <dbl>
#> 1   324   324   324 0.0308  0.862
#> 2  3383  3383  3383 0.722   0.868
#> 3  3677  3677  3677 0.762   0.598
#> 4   557   557   557 0.00986 0.686
#> # ... with 4,996 more rows

new_df(1000) %>% 
  mutate(j = i, k = j) %>% 
  spread(var, val)
#> # A tibble: 500 x 5
#>       i     j     k     a     b
#>   <int> <int> <int> <dbl> <dbl>
#> 1     1     1     1 0.266 0.372
#> 2     2     2     2 0.573 0.908
#> 3     3     3     3 0.202 0.898
#> 4     4     4     4 0.945 0.661
#> # ... with 496 more rows

options(op)

Created on 2018-08-27 by the reprex package (v0.2.0).

mikmart on 27 Aug 2018

I think this might be related to the problem that spread does not preserve the _column_ order. For me this is a somewhat annoying problem when creating summary tables.

See here:

d <- tibble::tibble(i = c("c", "a", "b"), j = 1:3)
tidyr::spread(d, i, j)
#> # A tibble: 1 x 3
#>       a     b     c
#>   <int> <int> <int>
#> 1     2     3     1

Created on 2018-09-04 by the reprex
package (v0.2.0).

elbersb on 4 Sep 2018

👍2

I found that the discrepancy in row order between data with many identifiers vs. few comes from the switch to using character ids (rather than integers) when the number of possible id combinations is > 2^31.

Relevant source is here: https://github.com/tidyverse/tidyr/blob/cbdd14e90b2a771e242a44d1ed5eea84d53da642/R/id.R#L16-L29

In the above if-branches, order(res) won't be the same, which results in the varying row orders.

mikmart on 9 Sep 2018

library(dplyr, warn.conflicts = FALSE)
library(tidyr)

df <- tibble(
  Group = c("B", "C", "A"),
  `Cost Week 1` = c(1, 2, 3),
  `Cost Week 2` = c(4, 5, 6),
  `Cost Week 3` = c(7, 8, 9)
)

spec <- df %>% 
  pivot_spec_long(-1, measure = "cost") %>% 
  mutate(
    week = readr::parse_number(variable),
    variable = NULL
  )
spec
#> # A tibble: 3 x 3
#>   col_name    measure  week
#>   <chr>       <chr>   <dbl>
#> 1 Cost Week 1 cost        1
#> 2 Cost Week 2 cost        2
#> 3 Cost Week 3 cost        3

df %>% pivot(spec)
#> # A tibble: 9 x 3
#>   Group  week  cost
#>   <chr> <dbl> <dbl>
#> 1 B         1     1
#> 2 B         2     2
#> 3 B         3     3
#> 4 C         1     4
#> 5 C         2     5
#> 6 C         3     6
#> 7 A         1     7
#> 8 A         2     8
#> 9 A         3     9
df %>% pivot(spec) %>% pivot(spec)
#> # A tibble: 3 x 4
#>   Group `Cost Week 1` `Cost Week 2` `Cost Week 3`
#>   <chr>         <dbl>         <dbl>         <dbl>
#> 1 B                 1             2             3
#> 2 C                 4             5             6
#> 3 A                 7             8             9

^{Created on 2019-03-03 by the reprex package (v0.2.1.9000)}

hadley on 3 Mar 2019

👍5

Was this page helpful?

0 / 5 - 0 ratings