Consider a simple data table:
dt <- data.table(i = 1:2, na = rnorm(2), nb=rnorm(2),
ua=runif(2), ub=runif(2))
dt
## i na nb ua ub
## 1: 1 0.8675148 1.1900491 0.09394934 0.2717421
## 2: 2 -0.1700282 0.9188715 0.58017687 0.5443863
when melting it into multiple colums we get:
melt(dt, measure.vars=list(c("na", "nb"), c("ua", "ub")))
## i variable value1 value2
## 1: 1 1 0.8675148 0.09394934
## 2: 2 1 -0.1700282 0.58017687
## 3: 1 2 1.1900491 0.27174213
## 4: 2 2 0.9188715 0.54438634
In particular variable is a factor with levels "1" and "2". This behavior seems to be undocumented. ?melt tells
‘list’ is a generalization of the vector version - each
element of the list (which should be ‘integer’ or
‘character’ as above) will become a ‘melt’ed column.
and
From version ‘1.9.6’, ‘melt’ gains a feature with ‘measure.vars’
accepting a list of ‘character’ or ‘integer’ vectors as well to
melt into multiple columns in a single function call efficiently.
The function ‘patterns’ can be used to provide regular expression
patterns. When used along with ‘melt’, if ‘cols’ argument is not
provided, the patterns will be matched against ‘names(data)’, for
convenience.
However, I cannot find anything about
I know there are related feature requests (#2551and #3396). I am also aware of related solutions (e.g. on SO) that revolve around renaming the corresponding factor levels. However, for such solutions to be considered safe, the behavior of numeric levels should be documented and considered part of the API.
data.table 1.12.6; R 3.4, 3.6.
I have the same question. It would be nice for variable column to have factors "a" and "b" instead of 1 and 2.
Look at this example, which has data.table, tidyr and base reshape versions of the same problem for comparison. The data.table version is also shown below.
variable.name is the name of an index into the times. We need to perform that indexing in post processing where pivot_longer does it automatically. Even base reshape is slightly better here as there is a times= argument that provides the times which are automatically indexed into so you don't have to do it yourself. Maybe melt could have such an argument or perhaps some other method of doing this. This would slightly simplify the second last line of code below.
Also it would be nice if melt had a drop_na argument to drop rows having NAs like pivot_longer does and have a key= argument to sort the result to reduce the amount of post processing. This would eliminate the last line in the code below.
library(data.table)
input <-
structure(list(id = c(1, 3, 6), school = structure(1:3, .Label = c("A",
"B", "C"), class = "factor"), read_1 = c(20, 22, 24), read_1_sp = c(TRUE,
FALSE, TRUE), read_2 = c(45, 47, 49), read_2_sp = c(FALSE, FALSE,
FALSE), math_1 = c(20, 22, NA), math_1_sp = c(TRUE, FALSE, NA
), math_2 = c(NA, 35, 37), math_2_sp = c(NA, FALSE, FALSE)), class = "data.frame",
row.names = c(NA, -3L))
input2 <- as.data.table(input, key = c("id", "school"))
times <- grep("\\d$", names(input2), value = TRUE) # score col names
melt(input2, measure = patterns(sp = "sp", score = "\\d$"), variable.name = "subject")[,
c("subject", "no"):= fread(text = times[subject], sep = "_")][,
na.omit(.SD), key = key(input2)]
hi this is solved by a new function in nc package which uses melt.data.table internally,
library(data.table)
dt <- data.table(
i = 1:2, na = rnorm(2), nb=rnorm(2),
ua=runif(2), ub=runif(2))
nc::capture_melt_multiple(dt, column="[un]", letter="[ab]")
#> i letter n u
#> 1: 1 a 0.5509765 0.7095506
#> 2: 2 a 0.7278650 0.2971809
#> 3: 1 b -0.4690630 0.9605627
#> 4: 2 b -1.4568312 0.3414062
nc::capture_melt_multiple(dt, letter="[un]", column="[ab]")
#> i letter a b
#> 1: 1 n 0.5509765 -0.4690630
#> 2: 2 n 0.7278650 -1.4568312
#> 3: 1 u 0.7095506 0.9605627
#> 4: 2 u 0.2971809 0.3414062
Again, we should first discuss if that functionality is going to be in scope of DT before closing.
the original question: "what is the relationship between the numeric label and the original column name?" it is true that we should add documentation about what values go into the variable column. This is related to #4455 in which the problem is that the numeric label in the variable column is not consistent between na.rm=TRUE and FALSE.
AFAICT the closest documentation is that variable.name is name for the measured variable names column which is somewhat ambiguous. I would recommend revising to something like:
variable.name: name of output column containing information about which input columns the data came from. If measure.vars is an integer/character vector, then this column contains names of melted columns from the input data table. If measure.vars is a list of integer/character vectors, then this column contains integers indicating the index/position in each of those vectors.
even better just avoid the variable column altogether, use new functionality in #4731
remotes::install_github("Rdatatable/data.table@melt-custom-variable")
#> Skipping install of 'data.table' from a github remote, the SHA1 (c02fa9e8) has not changed since last install.
#> Use `force = TRUE` to force installation
library(data.table)
dt <- data.table(
i = 1:2, na = rnorm(2), nb=rnorm(2),
ua=runif(2), ub=runif(2))
melt(dt, measure.vars=measure(value.name, letter, pattern="([un])([ab])"))
#> i letter n u
#> 1: 1 a -0.6042333 0.3756086
#> 2: 2 a 0.4125218 0.2719224
#> 3: 1 b 0.2163859 0.5793461
#> 4: 2 b -0.6725394 0.1945757
melt(dt, measure.vars=measure(letter, value.name, pattern="([un])([ab])"))
#> i letter a b
#> 1: 1 n -0.6042333 0.2163859
#> 2: 2 n 0.4125218 -0.6725394
#> 3: 1 u 0.3756086 0.5793461
#> 4: 2 u 0.2719224 0.1945757
Just a reminder here: the issue is about missing documentation of the current behavior.
Myself, I would prefer the new functionality, but I can work around with the existing one. But without details laid out in documentation, it feels somewhat unsafe to assume how the current approach works. In particular, to assume that the numeric values correspond to the original values in alphabetic order.
I have proposed a doc fix in #4723
\item{variable.name}{name (default \code{'variable'}) of output column containing information about which input column(s) were melted. If \code{measure.vars} is an integer/character vector, then each entry of this column contains the name of a melted column from \code{data}. If \code{measure.vars} is a list of integer/character vectors, then each entry of this column contains an integer indicating an index/position in each of those vectors.}