When going from a wide dataset to long and then back, there seems to be an implicit sort that occurs on non key/value variables. This prevents the ability to recover an initial data set by undoing the gather()
operation with spread()
.
Consider the following example:
library("tidyr")
(init_data = tibble::tibble(
Group = c("B", "C", "A"),
`Cost Week 1` = c(1, 2, 3),
`Cost Week 2` = c(4, 5, 6),
`Cost Week 3` = c(7, 8, 9)
))
#> # A tibble: 3 x 4
#> Group `Cost Week 1` `Cost Week 2` `Cost Week 3`
#> <chr> <dbl> <dbl> <dbl>
#> 1 B 1. 4. 7.
#> 2 C 2. 5. 8.
#> 3 A 3. 6. 9.
long = init_data %>%
gather(key = "Week", value = "Cost", -Group)
long
#> # A tibble: 9 x 3
#> Group Week Cost
#> <chr> <chr> <dbl>
#> 1 B Cost Week 1 1.
#> 2 C Cost Week 1 2.
#> 3 A Cost Week 1 3.
#> 4 B Cost Week 2 4.
#> 5 C Cost Week 2 5.
#> 6 A Cost Week 2 6.
#> 7 B Cost Week 3 7.
#> 8 C Cost Week 3 8.
#> 9 A Cost Week 3 9.
wide = long %>%
spread(key = "Week", value = "Cost")
wide
#> # A tibble: 3 x 4
#> Group `Cost Week 1` `Cost Week 2` `Cost Week 3`
#> <chr> <dbl> <dbl> <dbl>
#> 1 A 3. 6. 9.
#> 2 B 1. 4. 7.
#> 3 C 2. 5. 8.
init_data
#> # A tibble: 3 x 4
#> Group `Cost Week 1` `Cost Week 2` `Cost Week 3`
#> <chr> <dbl> <dbl> <dbl>
#> 1 B 1. 4. 7.
#> 2 C 2. 5. 8.
#> 3 A 3. 6. 9.
Supply factor_key = TRUE
to gather()
Oh maybe that's a different issue
I ran into this too, but with a further complication: if there is only a single identifier column, it seems the result is always sorted (as far as I tried, anyway), but if you have multiple identifier columns, the result switches to being unsorted at some number of rows. The more duplicate identifiers, the lower the number of rows needed for the result to be unsorted.
Very confusing.
library(dplyr) # 0.7.99.9000
library(tidyr) # 0.8.1.9000
new_df <- function(n) {
set.seed(1)
tibble(
i = rep(seq_len(n / 2), each = 2),
var = rep_len(c("a", "b"), n),
val = runif(n)
) %>% sample_frac(1)
}
op <- options(tibble.print_min = 4)
df_10k <- new_df(10000)
# Single identifier: always sorted
df_10k %>%
spread(var, val)
#> # A tibble: 5,000 x 3
#> i a b
#> <int> <dbl> <dbl>
#> 1 1 0.266 0.372
#> 2 2 0.573 0.908
#> 3 3 0.202 0.898
#> 4 4 0.945 0.661
#> # ... with 4,996 more rows
new_df(5000000) %>%
spread(var, val)
#> # A tibble: 2,500,000 x 3
#> i a b
#> <int> <dbl> <dbl>
#> 1 1 0.266 0.372
#> 2 2 0.573 0.908
#> 3 3 0.202 0.898
#> 4 4 0.945 0.661
#> # ... with 2.5e+06 more rows
# Double identifier: switches from sorted to unsorted
df_10k %>%
mutate(j = i) %>%
spread(var, val)
#> # A tibble: 5,000 x 4
#> i j a b
#> <int> <int> <dbl> <dbl>
#> 1 1 1 0.266 0.372
#> 2 2 2 0.573 0.908
#> 3 3 3 0.202 0.898
#> 4 4 4 0.945 0.661
#> # ... with 4,996 more rows
new_df(100000) %>%
mutate(j = i) %>%
spread(var, val)
#> # A tibble: 50,000 x 4
#> i j a b
#> <int> <int> <dbl> <dbl>
#> 1 35026 35026 0.224 0.158
#> 2 35088 35088 0.0712 0.439
#> 3 15647 15647 0.330 0.0746
#> 4 27617 27617 0.138 0.0349
#> # ... with 5e+04 more rows
# Triple identifier: switches earlier
df_10k %>%
mutate(j = i, k = j) %>%
spread(var, val)
#> # A tibble: 5,000 x 5
#> i j k a b
#> <int> <int> <int> <dbl> <dbl>
#> 1 324 324 324 0.0308 0.862
#> 2 3383 3383 3383 0.722 0.868
#> 3 3677 3677 3677 0.762 0.598
#> 4 557 557 557 0.00986 0.686
#> # ... with 4,996 more rows
new_df(1000) %>%
mutate(j = i, k = j) %>%
spread(var, val)
#> # A tibble: 500 x 5
#> i j k a b
#> <int> <int> <int> <dbl> <dbl>
#> 1 1 1 1 0.266 0.372
#> 2 2 2 2 0.573 0.908
#> 3 3 3 3 0.202 0.898
#> 4 4 4 4 0.945 0.661
#> # ... with 496 more rows
options(op)
Created on 2018-08-27 by the reprex package (v0.2.0).
I think this might be related to the problem that spread
does not preserve the _column_ order. For me this is a somewhat annoying problem when creating summary tables.
See here:
d <- tibble::tibble(i = c("c", "a", "b"), j = 1:3)
tidyr::spread(d, i, j)
#> # A tibble: 1 x 3
#> a b c
#> <int> <int> <int>
#> 1 2 3 1
Created on 2018-09-04 by the reprex
package (v0.2.0).
I found that the discrepancy in row order between data with many identifiers vs. few comes from the switch to using character ids (rather than integers) when the number of possible id combinations is > 2^31.
Relevant source is here: https://github.com/tidyverse/tidyr/blob/cbdd14e90b2a771e242a44d1ed5eea84d53da642/R/id.R#L16-L29
In the above if
-branches, order(res)
won't be the same, which results in the varying row orders.
This is handled by pivot()
the successor to gather()
and spread()
(spread()
and gather()
need too many API changes, so I am introducing a new function which fixes a many issues rather than potentially breaking existing code)
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
df <- tibble(
Group = c("B", "C", "A"),
`Cost Week 1` = c(1, 2, 3),
`Cost Week 2` = c(4, 5, 6),
`Cost Week 3` = c(7, 8, 9)
)
spec <- df %>%
pivot_spec_long(-1, measure = "cost") %>%
mutate(
week = readr::parse_number(variable),
variable = NULL
)
spec
#> # A tibble: 3 x 3
#> col_name measure week
#> <chr> <chr> <dbl>
#> 1 Cost Week 1 cost 1
#> 2 Cost Week 2 cost 2
#> 3 Cost Week 3 cost 3
df %>% pivot(spec)
#> # A tibble: 9 x 3
#> Group week cost
#> <chr> <dbl> <dbl>
#> 1 B 1 1
#> 2 B 2 2
#> 3 B 3 3
#> 4 C 1 4
#> 5 C 2 5
#> 6 C 3 6
#> 7 A 1 7
#> 8 A 2 8
#> 9 A 3 9
df %>% pivot(spec) %>% pivot(spec)
#> # A tibble: 3 x 4
#> Group `Cost Week 1` `Cost Week 2` `Cost Week 3`
#> <chr> <dbl> <dbl> <dbl>
#> 1 B 1 2 3
#> 2 C 4 5 6
#> 3 A 7 8 9
Created on 2019-03-03 by the reprex package (v0.2.1.9000)
Most helpful comment
This is handled by
pivot()
the successor togather()
andspread()
(spread()
andgather()
need too many API changes, so I am introducing a new function which fixes a many issues rather than potentially breaking existing code)Created on 2019-03-03 by the reprex package (v0.2.1.9000)