I'm more and more often confronted with messy spreadsheets with duplicate column names that need to be tidied up as for example in this SO question.
tidyr::gather simply refuse to do it (error message) and reshape::melt pr reshape2::melt return the wrong numbers without any warning. The data.table version of melt works as intended.
Here is a minimal reprex :
# Reprex
d <- data.frame(Group = c("A", "B"),
rbind(c(0, 0, 5, 5),
c(0, 0, 10, 10)))
colnames(d) <- c("Group", "Var1", "Var2", "Var1", "Var2")
# Dataframe with duplicate column names --> quite frequent situation in messy spreadsheets...
d
#> Group Var1 Var2 Var1 Var2
#> 1 A 0 0 5 5
#> 2 B 0 0 10 10
# With tidyr we have an error message : definitively better than to have the
# wrong numbers...
tidyr::gather(d,,,-1)
#> Error: Can't bind data because some arguments have the same name
# with reshape and reshape2 : wrong results (0 everywhere, the 5 and 10 values have disapeared)
reshape::melt(d, id.vars = 1)
#> Group variable value
#> 1 A Var1 0
#> 2 B Var1 0
#> 3 A Var2 0
#> 4 B Var2 0
#> 5 A Var1 0
#> 6 B Var1 0
#> 7 A Var2 0
#> 8 B Var2 0
reshape2::melt(d, id.vars = 1)
#> Group variable value
#> 1 A Var1 0
#> 2 B Var1 0
#> 3 A Var2 0
#> 4 B Var2 0
# data.table::melt fails similarly when we work on a data.frame
# but provides exactly the intended result if we work on a data.table
data.table::melt(d, id.vars = 1)
#> Group variable value
#> 1 A Var1 0
#> 2 B Var1 0
#> 3 A Var2 0
#> 4 B Var2 0
data.table::melt(data.table::as.data.table(d), id.vars = 1)
#> Group variable value
#> 1: A Var1 0
#> 2: B Var1 0
#> 3: A Var2 0
#> 4: B Var2 0
#> 5: A Var1 5
#> 6: B Var1 10
#> 7: A Var2 5
#> 8: B Var2 10
# base::stack provides the right values but good luck for the other columns ...
stack(d[,-1])
#> values ind
#> 1 0 Var1
#> 2 0 Var1
#> 3 0 Var2
#> 4 0 Var2
#> 5 5 Var1.1
#> 6 10 Var1.1
#> 7 5 Var2.1
#> 8 10 Var2.1
Created on 2018-06-25 by the reprex package (v0.2.0).
It is possible that name repair should get done by gather(). Regardless, you could also implement name repair yourself right now, prior to gathering. If the names are OK, it will be a no op.
tibble::set_names() repairs names on an existing tibble and is an emerging standard across multiple tidyverse packages. The dev version of readxl is already using this, for example, before it returns a tibble read out of xls[x]. The design of this function is discussed here: https://github.com/tidyverse/tibble/issues/217
d <- data.frame(Group = c("A", "B"),
rbind(c(0, 0, 5, 5),
c(0, 0, 10, 10)))
colnames(d) <- c("Group", "Var1", "Var2", "Var1", "Var2")
## tibble::set_tidy_names() is the emerging standard for making sure
## a tibble has non-duplicate and non-empty names
tidyr::gather(tibble::set_tidy_names(d),,,-1)
#> New names:
#> Var1 -> Var1..2
#> Var2 -> Var2..3
#> Var1 -> Var1..4
#> Var2 -> Var2..5
#> Group key value
#> 1 A Var1..2 0
#> 2 B Var1..2 0
#> 3 A Var2..3 0
#> 4 B Var2..3 0
#> 5 A Var1..4 5
#> 6 B Var1..4 10
#> 7 A Var2..5 5
#> 8 B Var2..5 10
Created on 2018-06-24 by the reprex package (v0.2.0.9000).
Thanks for the hint!
You still need to back transform the variables names to their original one as these are truly the same values.
Maybe with something like :
d <- tidyr::gather(tibble::set_tidy_names(d),,,-1)
d$key <- factor(gsub("(.*)\\.\\.\\d+$", "\\1", d$key))
I still feel very uncomfortable with the behavior of reshape::melt because I have used it a lot and I'm afraid that I could already have been trapped into its strange behavior (in this admittedly strange situation).
You still need to back transform the variables names to their original one as these are truly the same values.
Indeed! The need to do this explicitly (somewhere) seems like a good idea. Because your knowledge that the same-named VarX columns are indeed the same thing is external and probably shouldn't be silent default behaviour. Perhaps the tibble::set_tidy_names() / tibble::tidy_names() family should offer a helper or example code in the docs for post-processing.
I think it's worth allowing gather() to work non-unique column names, since you are trying to get it into a tidy form. (Similarly it should probably allow missing column names)
Same issue here. I have duplicated column names, which once gatherer I will calculate an average.
Pivot currently silently returns the wrong number of rows:
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
d <- data.frame(c("A", "B"), rbind(c(0, 0, 5, 5), c(0, 0, 10, 10)))
colnames(d) <- c("Group", "Var1", "Var2", "Var1", "Var2")
d
#> Group Var1 Var2 Var1 Var2
#> 1 A 0 0 5 5
#> 2 B 0 0 10 10
spec <- d %>% pivot_spec_long(-Group)
spec
#> # A tibble: 2 x 3
#> col_name measure variable
#> <chr> <chr> <chr>
#> 1 Var1 value Var1
#> 2 Var2 value Var2
d %>% pivot(spec)
#> # A tibble: 4 x 3
#> Group variable value
#> <fct> <chr> <dbl>
#> 1 A Var1 0
#> 2 A Var2 0
#> 3 B Var1 0
#> 4 B Var2 0
Created on 2019-03-03 by the reprex package (v0.2.1.9000)
Also need to test with NA column names.
Thank you!
Possibly related:
test_that("minimal spec works", {
df <- tibble(x = "x", y = 1)
sp <- tibble::tribble(
~.name, ~.value,
"x", "X",
"y", "X",
)
pv <- pivot_long(df, spec = sp)
expect_named(pv, c("X"))
expect_equal(pv$X, c("x", "y"))
})
Reprex update for latest interface:
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
d <- data.frame(c("A", "B"), rbind(c(0, 0, 5, 5), c(0, 0, 10, 10)))
colnames(d) <- c("Group", "Var1", "Var2", "Var1", "Var2")
d <- as_tibble(d, .name_repair = "minimal")
d
#> # A tibble: 2 x 5
#> Group Var1 Var2 Var1 Var2
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 A 0 0 5 5
#> 2 B 0 0 10 10
d %>% pivot_long(-Group)
#> # A tibble: 4 x 3
#> Group name value
#> <fct> <chr> <dbl>
#> 1 A Var1 0
#> 2 A Var2 0
#> 3 B Var1 0
#> 4 B Var2 0
Created on 2019-03-06 by the reprex package (v0.2.1.9000)
After thinking about it for a while, I think the best approach is to make the copies explicit. This makes it straightforward to implement, makes it obvious from the outside, and gives you all the needed data.
# A tibble: 8 x 4
Group name .copy value
<fct> <chr> <int> <dbl>
1 A Var1 1 0
2 A Var2 1 0
3 A Var1 2 5
4 A Var2 2 5
5 B Var1 1 0
6 B Var2 1 0
7 B Var1 2 10
8 B Var2 2 10
Warning message:
Duplicate column names detected, adding .copy variable
That would be a very nice solution indeed! Thanks !!!
This configuration typically happens when people have measured the same variables several times (eg different seasons) or on different subsample units in nested designs (eg trees within sites).
Typically encoded in spreadsheets as merged cells above the column titles (in my experience)...
The new .copy variable would ease the "salvaging" of this information.
Most helpful comment
After thinking about it for a while, I think the best approach is to make the copies explicit. This makes it straightforward to implement, makes it obvious from the outside, and gives you all the needed data.