Data.table: Melt on multiple columns: undocumented behavior

Created on 16 Nov 2019  Â·  8Comments  Â·  Source: Rdatatable/data.table

Consider a simple data table:

dt <- data.table(i = 1:2, na = rnorm(2), nb=rnorm(2), 
                          ua=runif(2), ub=runif(2))
dt
##    i         na        nb         ua        ub
## 1: 1  0.8675148 1.1900491 0.09394934 0.2717421
## 2: 2 -0.1700282 0.9188715 0.58017687 0.5443863

when melting it into multiple colums we get:

melt(dt, measure.vars=list(c("na", "nb"), c("ua", "ub")))
##    i variable     value1     value2
## 1: 1        1  0.8675148 0.09394934
## 2: 2        1 -0.1700282 0.58017687
## 3: 1        2  1.1900491 0.27174213
## 4: 2        2  0.9188715 0.54438634

In particular variable is a factor with levels "1" and "2". This behavior seems to be undocumented. ?melt tells

‘list’ is a generalization of the vector version - each
element of the list (which should be ‘integer’ or
‘character’ as above) will become a ‘melt’ed column.

and

From version ‘1.9.6’, ‘melt’ gains a feature with ‘measure.vars’
accepting a list of ‘character’ or ‘integer’ vectors as well to
melt into multiple columns in a single function call efficiently.
The function ‘patterns’ can be used to provide regular expression
patterns. When used along with ‘melt’, if ‘cols’ argument is not
provided, the patterns will be matched against ‘names(data)’, for
convenience.

However, I cannot find anything about

  1. the fact that the variable will use numbered labels to denote the original columns, and
  2. what is the relationship between the numeric label and the original column name.

I know there are related feature requests (#2551and #3396). I am also aware of related solutions (e.g. on SO) that revolve around renaming the corresponding factor levels. However, for such solutions to be considered safe, the behavior of numeric levels should be documented and considered part of the API.

data.table 1.12.6; R 3.4, 3.6.

reshape

All 8 comments

I have the same question. It would be nice for variable column to have factors "a" and "b" instead of 1 and 2.

Look at this example, which has data.table, tidyr and base reshape versions of the same problem for comparison. The data.table version is also shown below.

https://stackoverflow.com/questions/61259478/manipulating-data-frame-format-in-base-r/61260008#61260008

variable.name is the name of an index into the times. We need to perform that indexing in post processing where pivot_longer does it automatically. Even base reshape is slightly better here as there is a times= argument that provides the times which are automatically indexed into so you don't have to do it yourself. Maybe melt could have such an argument or perhaps some other method of doing this. This would slightly simplify the second last line of code below.

Also it would be nice if melt had a drop_na argument to drop rows having NAs like pivot_longer does and have a key= argument to sort the result to reduce the amount of post processing. This would eliminate the last line in the code below.

library(data.table)

input <-
structure(list(id = c(1, 3, 6), school = structure(1:3, .Label = c("A", 
"B", "C"), class = "factor"), read_1 = c(20, 22, 24), read_1_sp = c(TRUE, 
FALSE, TRUE), read_2 = c(45, 47, 49), read_2_sp = c(FALSE, FALSE, 
FALSE), math_1 = c(20, 22, NA), math_1_sp = c(TRUE, FALSE, NA
), math_2 = c(NA, 35, 37), math_2_sp = c(NA, FALSE, FALSE)), class = "data.frame", 
row.names = c(NA, -3L))

input2 <- as.data.table(input, key = c("id", "school"))
times <- grep("\\d$", names(input2), value = TRUE)  # score col names

melt(input2, measure = patterns(sp = "sp", score = "\\d$"), variable.name = "subject")[, 
  c("subject", "no"):= fread(text = times[subject], sep = "_")][,   
  na.omit(.SD), key = key(input2)]  

hi this is solved by a new function in nc package which uses melt.data.table internally,

library(data.table)
dt <- data.table(
  i = 1:2, na = rnorm(2), nb=rnorm(2), 
  ua=runif(2), ub=runif(2))  
nc::capture_melt_multiple(dt, column="[un]", letter="[ab]")
#>    i letter          n         u
#> 1: 1      a  0.5509765 0.7095506
#> 2: 2      a  0.7278650 0.2971809
#> 3: 1      b -0.4690630 0.9605627
#> 4: 2      b -1.4568312 0.3414062
nc::capture_melt_multiple(dt, letter="[un]", column="[ab]")
#>    i letter         a          b
#> 1: 1      n 0.5509765 -0.4690630
#> 2: 2      n 0.7278650 -1.4568312
#> 3: 1      u 0.7095506  0.9605627
#> 4: 2      u 0.2971809  0.3414062

Again, we should first discuss if that functionality is going to be in scope of DT before closing.

the original question: "what is the relationship between the numeric label and the original column name?" it is true that we should add documentation about what values go into the variable column. This is related to #4455 in which the problem is that the numeric label in the variable column is not consistent between na.rm=TRUE and FALSE.

AFAICT the closest documentation is that variable.name is name for the measured variable names column which is somewhat ambiguous. I would recommend revising to something like:
variable.name: name of output column containing information about which input columns the data came from. If measure.vars is an integer/character vector, then this column contains names of melted columns from the input data table. If measure.vars is a list of integer/character vectors, then this column contains integers indicating the index/position in each of those vectors.

even better just avoid the variable column altogether, use new functionality in #4731

remotes::install_github("Rdatatable/data.table@melt-custom-variable")
#> Skipping install of 'data.table' from a github remote, the SHA1 (c02fa9e8) has not changed since last install.
#>   Use `force = TRUE` to force installation
library(data.table)
dt <- data.table(
  i = 1:2, na = rnorm(2), nb=rnorm(2), 
  ua=runif(2), ub=runif(2))  
melt(dt, measure.vars=measure(value.name, letter, pattern="([un])([ab])"))
#>    i letter          n         u
#> 1: 1      a -0.6042333 0.3756086
#> 2: 2      a  0.4125218 0.2719224
#> 3: 1      b  0.2163859 0.5793461
#> 4: 2      b -0.6725394 0.1945757
melt(dt, measure.vars=measure(letter, value.name, pattern="([un])([ab])"))
#>    i letter          a          b
#> 1: 1      n -0.6042333  0.2163859
#> 2: 2      n  0.4125218 -0.6725394
#> 3: 1      u  0.3756086  0.5793461
#> 4: 2      u  0.2719224  0.1945757

Just a reminder here: the issue is about missing documentation of the current behavior.

Myself, I would prefer the new functionality, but I can work around with the existing one. But without details laid out in documentation, it feels somewhat unsafe to assume how the current approach works. In particular, to assume that the numeric values correspond to the original values in alphabetic order.

I have proposed a doc fix in #4723
\item{variable.name}{name (default \code{'variable'}) of output column containing information about which input column(s) were melted. If \code{measure.vars} is an integer/character vector, then each entry of this column contains the name of a melted column from \code{data}. If \code{measure.vars} is a list of integer/character vectors, then each entry of this column contains an integer indicating an index/position in each of those vectors.}

Was this page helpful?
0 / 5 - 0 ratings