Data.table: melt needs a variable.value parameter.

Created on 19 Mar 2017  路  9Comments  路  Source: Rdatatable/data.table

In order to reshape data.tables such as this one:

mydata <- data.table(ID=1:5, ZA_2001=1:5, ZA_2002=5:1, BB_2001=rep(3,5), BB_2002=rep(6,5), CC_2007=6:2)
ID ZA_2001 ZA_2002 BB_2001 BB_2002 CC_2007
1       1       5       3       6       6
2       2       4       3       6       5
3       3       3       3       6       4
4       4       2       3       6       3
5       5       1       3       6       2

to this one

ID  measure ZA BB CC
1    2001  1  3 NA
1    2002  5  6 NA
1    2007 NA NA  6
2    2001  2  3 NA
2    2002  4  6 NA
2    2007 NA NA  5
3    2001  3  3 NA
3    2002  3  6 NA
3    2007 NA NA  4
4    2001  4  3 NA
4    2002  2  6 NA
4    2007 NA NA  3
5    2001  5  3 NA
5    2002  1  6 NA
5    2007 NA NA  2

and keep the original variables' order I need to do:

idvars =  grep("_20[0-9][0-9]$",names(mydata) , invert = TRUE)
temp <- melt(mydata, id.vars = idvars)  
temp[, `:=`(var = sub("_20[0-9][0-9]$", '', variable), 
measure = sub('.*_', '', variable), variable = NULL)]  
temp[,var:=factor(var, levels=unique(var))]
dcast( temp,   ... ~ var, value.var='value' )

But it needs too many lines of code, a lot of memory, and it forces type conversions.

It would be great if we could just do it using melt:

melt(mydata, measure.vars=patterns("ZA","BB","CC"), value.name =c("ZA","BB","CC"), id.vars="ID", variable.name="year")   

(or using regex instead).

But it doesn't work, it only produces 10 rows, and it would need an option to specify where to get the year from.

something like variable.values=sub(".*_","",names(mydata) )[-1]

or even better a syntax such as the splitstackshape package:
Reshape(mydata,id.vars="ID",var.stubs =... ,sep="_")
it would be great.

Low enhancement reshape

Most helpful comment

here is another solution using only data.table::melt,

remotes::install_github("Rdatatable/data.table@fix4027")
#> Skipping install of 'data.table' from a github remote, the SHA1 (4c5810c2) has not changed since last install.
#>   Use `force = TRUE` to force installation
library(data.table)
mydata <- data.table(ID=1:5, ZA_2001=1:5, ZA_2002=5:1, BB_2001=rep(3,5), BB_2002=rep(6,5), CC_2007=6:2)
(mydata.tall <- melt(mydata, measure.vars=list(
  ZA=c("ZA_2001", "ZA_2002", NA),
  BB=c("BB_2001", "BB_2002", NA),
  CC=c(NA, NA, "CC_2007"))))
#>     ID variable ZA BB CC
#>  1:  1        1  1  3 NA
#>  2:  2        1  2  3 NA
#>  3:  3        1  3  3 NA
#>  4:  4        1  4  3 NA
#>  5:  5        1  5  3 NA
#>  6:  1        2  5  6 NA
#>  7:  2        2  4  6 NA
#>  8:  3        2  3  6 NA
#>  9:  4        2  2  6 NA
#> 10:  5        2  1  6 NA
#> 11:  1        3 NA NA  6
#> 12:  2        3 NA NA  5
#> 13:  3        3 NA NA  4
#> 14:  4        3 NA NA  3
#> 15:  5        3 NA NA  2
mydata.tall[, measure := c(2001, 2002, 2007)[variable] ]
mydata.tall[order(ID, measure), .(
  ID, measure, ZA, BB, CC)]
#>     ID measure ZA BB CC
#>  1:  1    2001  1  3 NA
#>  2:  1    2002  5  6 NA
#>  3:  1    2007 NA NA  6
#>  4:  2    2001  2  3 NA
#>  5:  2    2002  4  6 NA
#>  6:  2    2007 NA NA  5
#>  7:  3    2001  3  3 NA
#>  8:  3    2002  3  6 NA
#>  9:  3    2007 NA NA  4
#> 10:  4    2001  4  3 NA
#> 11:  4    2002  2  6 NA
#> 12:  4    2007 NA NA  3
#> 13:  5    2001  5  3 NA
#> 14:  5    2002  1  6 NA
#> 15:  5    2007 NA NA  2

All 9 comments

Complete your set of columns and it'll work:

cols = setDT(tstrsplit(names(mydata)[-1], "_"))[, 
  CJ(var = V1, year = V2, unique = TRUE)][, 
  paste(var, year, sep="_")]

mydata[, setdiff(cols, names(mydata)) := NA ]
setcolorder(mydata, c("ID", cols))
melt(mydata, measure.vars=patterns("ZA","BB","CC"), value.name =c("ZA","BB","CC"), id.vars="ID", variable.name="year")
# copied your code for the last line

Sure, you still need to fill in the year values, but that's another issue that's already been filed, I guess.

I've found that the melt step in my solution could be a big problem when working with medium-sized datatables.
If you have a data.table with just 100000 rows x 1000 columns and use half of the columns as id.vars the output is 50000000 x 500, just too much to continue.

I guess you can get the years' column with

temp <- melt(mydata, measure.vars=patterns("ZA","BB","CC"),
   value.name =c("ZA","BB","CC"), id.vars="ID", variable.name="year")

temp[,year := rep(unique(sub(".*_", "", names(mydata)[-1])),
   length.out=nrow(temp))]

or also using data from your CJ.
And adding the options sorted=F to CJ.

@franknarf1 , any direct way to get this?

ID measure ZA BB CC
1 2001 1 3 NA
1 2002 5 6 NA
1 2007 NA NA 6
2 2001 2 3 NA
2 2002 4 6 NA
2 2007 NA NA 5
3 2001 3 3 NA
3 2002 3 6 NA
3 2007 NA NA 4
4 2001 4 3 NA
4 2002 2 6 NA
4 2007 NA NA 3
5 2001 5 3 NA
5 2002 1 6 NA
5 2007 NA NA 2

instead of this: (your method)

ID year ZA BB CC
1 2001 1 3 NA
2 2002 2 3 NA
3 2007 3 3 NA
4 2001 4 3 NA
5 2002 5 3 NA
1 2007 5 6 NA
2 2001 4 6 NA
3 2002 3 6 NA
4 2007 2 6 NA
5 2001 1 6 NA
1 2002 NA NA 6
2 2007 NA NA 5
3 2001 NA NA 4
4 2002 NA NA 3
5 2007 NA NA 2

without reordeing the result explicitily?

any direct way to get this?

Well, there's

mydata[, melt(.SD, measure.vars=patterns("ZA","BB","CC"),
 value.name =c("ZA","BB","CC"), variable.name="year"), by=ID]

However, doing this by group is a lot less efficient, probably.

There are no plans to add new arguments to melt / dcast. However there could be clever ways of extracting levels to variable column automatically from the input pattern.. although I don't see it too often a case.

hi the missing input columns in this melt is similar to #4027 and is fixed by #4720 if you use the new function in nc package (which uses data.table internally), not merged into master yet but you can use it via:

remotes::install_github(c("Rdatatable/data.table@fix4027", "tdhock/nc@multiple-fill"))
#> Skipping install of 'data.table' from a github remote, the SHA1 (4c5810c2) has not changed since last install.
#>   Use `force = TRUE` to force installation
#> Skipping install of 'nc' from a github remote, the SHA1 (11b61f8e) has not changed since last install.
#>   Use `force = TRUE` to force installation
library(data.table)
mydata <- data.table(ID=1:5, ZA_2001=1:5, ZA_2002=5:1, BB_2001=rep(3,5), BB_2002=rep(6,5), CC_2007=6:2)
(mydata.tall <- nc::capture_melt_multiple(
  mydata,
  column=".*",
  "_",
  year="[0-9]+", as.integer,
  fill=TRUE))
#>     ID year BB CC ZA
#>  1:  1 2001  3 NA  1
#>  2:  2 2001  3 NA  2
#>  3:  3 2001  3 NA  3
#>  4:  4 2001  3 NA  4
#>  5:  5 2001  3 NA  5
#>  6:  1 2002  6 NA  5
#>  7:  2 2002  6 NA  4
#>  8:  3 2002  6 NA  3
#>  9:  4 2002  6 NA  2
#> 10:  5 2002  6 NA  1
#> 11:  1 2007 NA  6 NA
#> 12:  2 2007 NA  5 NA
#> 13:  3 2007 NA  4 NA
#> 14:  4 2007 NA  3 NA
#> 15:  5 2007 NA  2 NA
mydata.tall[order(ID, year)]
#>     ID year BB CC ZA
#>  1:  1 2001  3 NA  1
#>  2:  1 2002  6 NA  5
#>  3:  1 2007 NA  6 NA
#>  4:  2 2001  3 NA  2
#>  5:  2 2002  6 NA  4
#>  6:  2 2007 NA  5 NA
#>  7:  3 2001  3 NA  3
#>  8:  3 2002  6 NA  3
#>  9:  3 2007 NA  4 NA
#> 10:  4 2001  3 NA  4
#> 11:  4 2002  6 NA  2
#> 12:  4 2007 NA  3 NA
#> 13:  5 2001  3 NA  5
#> 14:  5 2002  6 NA  1
#> 15:  5 2007 NA  2 NA

here is another solution using only data.table::melt,

remotes::install_github("Rdatatable/data.table@fix4027")
#> Skipping install of 'data.table' from a github remote, the SHA1 (4c5810c2) has not changed since last install.
#>   Use `force = TRUE` to force installation
library(data.table)
mydata <- data.table(ID=1:5, ZA_2001=1:5, ZA_2002=5:1, BB_2001=rep(3,5), BB_2002=rep(6,5), CC_2007=6:2)
(mydata.tall <- melt(mydata, measure.vars=list(
  ZA=c("ZA_2001", "ZA_2002", NA),
  BB=c("BB_2001", "BB_2002", NA),
  CC=c(NA, NA, "CC_2007"))))
#>     ID variable ZA BB CC
#>  1:  1        1  1  3 NA
#>  2:  2        1  2  3 NA
#>  3:  3        1  3  3 NA
#>  4:  4        1  4  3 NA
#>  5:  5        1  5  3 NA
#>  6:  1        2  5  6 NA
#>  7:  2        2  4  6 NA
#>  8:  3        2  3  6 NA
#>  9:  4        2  2  6 NA
#> 10:  5        2  1  6 NA
#> 11:  1        3 NA NA  6
#> 12:  2        3 NA NA  5
#> 13:  3        3 NA NA  4
#> 14:  4        3 NA NA  3
#> 15:  5        3 NA NA  2
mydata.tall[, measure := c(2001, 2002, 2007)[variable] ]
mydata.tall[order(ID, measure), .(
  ID, measure, ZA, BB, CC)]
#>     ID measure ZA BB CC
#>  1:  1    2001  1  3 NA
#>  2:  1    2002  5  6 NA
#>  3:  1    2007 NA NA  6
#>  4:  2    2001  2  3 NA
#>  5:  2    2002  4  6 NA
#>  6:  2    2007 NA NA  5
#>  7:  3    2001  3  3 NA
#>  8:  3    2002  3  6 NA
#>  9:  3    2007 NA NA  4
#> 10:  4    2001  4  3 NA
#> 11:  4    2002  2  6 NA
#> 12:  4    2007 NA NA  3
#> 13:  5    2001  5  3 NA
#> 14:  5    2002  1  6 NA
#> 15:  5    2007 NA NA  2

pure data.table solution using #4731

remotes::install_github("Rdatatable/data.table@melt-custom-variable")
#> Skipping install of 'data.table' from a github remote, the SHA1 (c02fa9e8) has not changed since last install.
#>   Use `force = TRUE` to force installation
library(data.table)
mydata <- data.table(ID=1:5, ZA_2001=1:5, ZA_2002=5:1, BB_2001=rep(3,5), BB_2002=rep(6,5), CC_2007=6:2)
melt(mydata, measure.vars=measure(value.name, year))
#>     ID year ZA BB CC
#>  1:  1 2001  1  3 NA
#>  2:  2 2001  2  3 NA
#>  3:  3 2001  3  3 NA
#>  4:  4 2001  4  3 NA
#>  5:  5 2001  5  3 NA
#>  6:  1 2002  5  6 NA
#>  7:  2 2002  4  6 NA
#>  8:  3 2002  3  6 NA
#>  9:  4 2002  2  6 NA
#> 10:  5 2002  1  6 NA
#> 11:  1 2007 NA NA  6
#> 12:  2 2007 NA NA  5
#> 13:  3 2007 NA NA  4
#> 14:  4 2007 NA NA  3
#> 15:  5 2007 NA NA  2
Was this page helpful?
0 / 5 - 0 ratings