Tidyr: Pivot with duplicate column names

Created on 25 Jun 2018 · 12Comments · Source: tidyverse/tidyr

I'm more and more often confronted with messy spreadsheets with duplicate column names that need to be tidied up as for example in this SO question.

tidyr::gather simply refuse to do it (error message) and reshape::melt pr reshape2::melt return the wrong numbers without any warning. The data.table version of melt works as intended.

Here is a minimal reprex :

# Reprex
d <- data.frame(Group = c("A", "B"), 
                rbind(c(0, 0, 5, 5), 
                      c(0, 0, 10, 10)))
colnames(d) <- c("Group", "Var1", "Var2", "Var1", "Var2")

# Dataframe with duplicate column names --> quite frequent situation in messy spreadsheets...
d
#>   Group Var1 Var2 Var1 Var2
#> 1     A    0    0    5    5
#> 2     B    0    0   10   10

# With tidyr we have an error message : definitively better than to have the 
# wrong numbers...
tidyr::gather(d,,,-1)
#> Error: Can't bind data because some arguments have the same name

# with reshape and reshape2 : wrong results (0 everywhere, the 5 and 10 values have disapeared)
reshape::melt(d, id.vars = 1)
#>   Group variable value
#> 1     A     Var1     0
#> 2     B     Var1     0
#> 3     A     Var2     0
#> 4     B     Var2     0
#> 5     A     Var1     0
#> 6     B     Var1     0
#> 7     A     Var2     0
#> 8     B     Var2     0
reshape2::melt(d, id.vars = 1)
#>   Group variable value
#> 1     A     Var1     0
#> 2     B     Var1     0
#> 3     A     Var2     0
#> 4     B     Var2     0

# data.table::melt fails similarly when we work on a data.frame 
# but provides exactly the intended result if we work on a data.table
data.table::melt(d, id.vars = 1)
#>   Group variable value
#> 1     A     Var1     0
#> 2     B     Var1     0
#> 3     A     Var2     0
#> 4     B     Var2     0
data.table::melt(data.table::as.data.table(d), id.vars = 1)
#>    Group variable value
#> 1:     A     Var1     0
#> 2:     B     Var1     0
#> 3:     A     Var2     0
#> 4:     B     Var2     0
#> 5:     A     Var1     5
#> 6:     B     Var1    10
#> 7:     A     Var2     5
#> 8:     B     Var2    10

# base::stack provides the right values but good luck for the other columns ...
stack(d[,-1])
#>   values    ind
#> 1      0   Var1
#> 2      0   Var1
#> 3      0   Var2
#> 4      0   Var2
#> 5      5 Var1.1
#> 6     10 Var1.1
#> 7      5 Var2.1
#> 8     10 Var2.1

Created on 2018-06-25 by the reprex package (v0.2.0).

feature pivoting

Source

GillesSanMartin

Most helpful comment

After thinking about it for a while, I think the best approach is to make the copies explicit. This makes it straightforward to implement, makes it obvious from the outside, and gives you all the needed data.

# A tibble: 8 x 4
  Group name  .copy value
  <fct> <chr> <int> <dbl>
1 A     Var1      1     0
2 A     Var2      1     0
3 A     Var1      2     5
4 A     Var2      2     5
5 B     Var1      1     0
6 B     Var2      1     0
7 B     Var1      2    10
8 B     Var2      2    10
Warning message:
Duplicate column names detected, adding .copy variable

hadley on 7 Mar 2019

👍3

All 12 comments

It is possible that name repair should get done by gather(). Regardless, you could also implement name repair yourself right now, prior to gathering. If the names are OK, it will be a no op.

tibble::set_names() repairs names on an existing tibble and is an emerging standard across multiple tidyverse packages. The dev version of readxl is already using this, for example, before it returns a tibble read out of xls[x]. The design of this function is discussed here: https://github.com/tidyverse/tibble/issues/217

d <- data.frame(Group = c("A", "B"), 
                rbind(c(0, 0, 5, 5), 
                      c(0, 0, 10, 10)))
colnames(d) <- c("Group", "Var1", "Var2", "Var1", "Var2")

## tibble::set_tidy_names() is the emerging standard for making sure
## a tibble has non-duplicate and non-empty names
tidyr::gather(tibble::set_tidy_names(d),,,-1)
#> New names:
#> Var1 -> Var1..2
#> Var2 -> Var2..3
#> Var1 -> Var1..4
#> Var2 -> Var2..5
#>   Group     key value
#> 1     A Var1..2     0
#> 2     B Var1..2     0
#> 3     A Var2..3     0
#> 4     B Var2..3     0
#> 5     A Var1..4     5
#> 6     B Var1..4    10
#> 7     A Var2..5     5
#> 8     B Var2..5    10

Created on 2018-06-24 by the reprex package (v0.2.0.9000).

jennybc on 25 Jun 2018

Thanks for the hint!
You still need to back transform the variables names to their original one as these are truly the same values.

Maybe with something like :

d <- tidyr::gather(tibble::set_tidy_names(d),,,-1)
d$key <- factor(gsub("(.*)\\.\\.\\d+$", "\\1", d$key))

I still feel very uncomfortable with the behavior of reshape::melt because I have used it a lot and I'm afraid that I could already have been trapped into its strange behavior (in this admittedly strange situation).

GillesSanMartin on 25 Jun 2018

You still need to back transform the variables names to their original one as these are truly the same values.

Indeed! The need to do this explicitly (somewhere) seems like a good idea. Because your knowledge that the same-named VarX columns are indeed the same thing is external and probably shouldn't be silent default behaviour. Perhaps the tibble::set_tidy_names() / tibble::tidy_names() family should offer a helper or example code in the docs for post-processing.

jennybc on 25 Jun 2018

I think it's worth allowing gather() to work non-unique column names, since you are trying to get it into a tidy form. (Similarly it should probably allow missing column names)

hadley on 4 Jan 2019

👍2

Same issue here. I have duplicated column names, which once gatherer I will calculate an average.

PMassicotte on 28 Jan 2019

Pivot currently silently returns the wrong number of rows:

library(dplyr, warn.conflicts = FALSE)
library(tidyr)

d <- data.frame(c("A", "B"), rbind(c(0, 0, 5, 5), c(0, 0, 10, 10)))
colnames(d) <- c("Group", "Var1", "Var2", "Var1", "Var2")
d
#>   Group Var1 Var2 Var1 Var2
#> 1     A    0    0    5    5
#> 2     B    0    0   10   10

spec <- d %>% pivot_spec_long(-Group)
spec
#> # A tibble: 2 x 3
#>   col_name measure variable
#>   <chr>    <chr>   <chr>   
#> 1 Var1     value   Var1    
#> 2 Var2     value   Var2

d %>% pivot(spec)
#> # A tibble: 4 x 3
#>   Group variable value
#>   <fct> <chr>    <dbl>
#> 1 A     Var1         0
#> 2 A     Var2         0
#> 3 B     Var1         0
#> 4 B     Var2         0

^{Created on 2019-03-03 by the reprex package (v0.2.1.9000)}

hadley on 3 Mar 2019

👍1

Also need to test with NA column names.

hadley on 3 Mar 2019

Thank you!

PMassicotte on 4 Mar 2019

Possibly related:

test_that("minimal spec works", {
  df <- tibble(x = "x", y = 1)
  sp <- tibble::tribble(
    ~.name, ~.value,
    "x", "X",
    "y", "X",
  )
  pv <- pivot_long(df, spec = sp)
  expect_named(pv, c("X"))
  expect_equal(pv$X, c("x", "y"))
})

hadley on 5 Mar 2019

Reprex update for latest interface:

library(dplyr, warn.conflicts = FALSE)
library(tidyr)

d <- data.frame(c("A", "B"), rbind(c(0, 0, 5, 5), c(0, 0, 10, 10)))
colnames(d) <- c("Group", "Var1", "Var2", "Var1", "Var2")
d <- as_tibble(d, .name_repair = "minimal")
d
#> # A tibble: 2 x 5
#>   Group  Var1  Var2  Var1  Var2
#>   <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 A         0     0     5     5
#> 2 B         0     0    10    10

d %>% pivot_long(-Group)
#> # A tibble: 4 x 3
#>   Group name  value
#>   <fct> <chr> <dbl>
#> 1 A     Var1      0
#> 2 A     Var2      0
#> 3 B     Var1      0
#> 4 B     Var2      0

^{Created on 2019-03-06 by the reprex package (v0.2.1.9000)}

hadley on 7 Mar 2019

# A tibble: 8 x 4
  Group name  .copy value
  <fct> <chr> <int> <dbl>
1 A     Var1      1     0
2 A     Var2      1     0
3 A     Var1      2     5
4 A     Var2      2     5
5 B     Var1      1     0
6 B     Var2      1     0
7 B     Var1      2    10
8 B     Var2      2    10
Warning message:
Duplicate column names detected, adding .copy variable

hadley on 7 Mar 2019

👍3

That would be a very nice solution indeed! Thanks !!!
This configuration typically happens when people have measured the same variables several times (eg different seasons) or on different subsample units in nested designs (eg trees within sites).
Typically encoded in spreadsheets as merged cells above the column titles (in my experience)...
The new .copy variable would ease the "salvaging" of this information.

GillesSanMartin on 7 Mar 2019

Was this page helpful?

0 / 5 - 0 ratings