In order to reshape data.tables such as this one:
mydata <- data.table(ID=1:5, ZA_2001=1:5, ZA_2002=5:1, BB_2001=rep(3,5), BB_2002=rep(6,5), CC_2007=6:2)
ID ZA_2001 ZA_2002 BB_2001 BB_2002 CC_2007
1 1 5 3 6 6
2 2 4 3 6 5
3 3 3 3 6 4
4 4 2 3 6 3
5 5 1 3 6 2
to this one
ID measure ZA BB CC
1 2001 1 3 NA
1 2002 5 6 NA
1 2007 NA NA 6
2 2001 2 3 NA
2 2002 4 6 NA
2 2007 NA NA 5
3 2001 3 3 NA
3 2002 3 6 NA
3 2007 NA NA 4
4 2001 4 3 NA
4 2002 2 6 NA
4 2007 NA NA 3
5 2001 5 3 NA
5 2002 1 6 NA
5 2007 NA NA 2
and keep the original variables' order I need to do:
idvars = grep("_20[0-9][0-9]$",names(mydata) , invert = TRUE)
temp <- melt(mydata, id.vars = idvars)
temp[, `:=`(var = sub("_20[0-9][0-9]$", '', variable),
measure = sub('.*_', '', variable), variable = NULL)]
temp[,var:=factor(var, levels=unique(var))]
dcast( temp, ... ~ var, value.var='value' )
But it needs too many lines of code, a lot of memory, and it forces type conversions.
It would be great if we could just do it using melt:
melt(mydata, measure.vars=patterns("ZA","BB","CC"), value.name =c("ZA","BB","CC"), id.vars="ID", variable.name="year")
(or using regex instead).
But it doesn't work, it only produces 10 rows, and it would need an option to specify where to get the year from.
something like variable.values=sub(".*_","",names(mydata) )[-1]
or even better a syntax such as the splitstackshape package:
Reshape(mydata,id.vars="ID",var.stubs =... ,sep="_")
it would be great.
Complete your set of columns and it'll work:
cols = setDT(tstrsplit(names(mydata)[-1], "_"))[,
CJ(var = V1, year = V2, unique = TRUE)][,
paste(var, year, sep="_")]
mydata[, setdiff(cols, names(mydata)) := NA ]
setcolorder(mydata, c("ID", cols))
melt(mydata, measure.vars=patterns("ZA","BB","CC"), value.name =c("ZA","BB","CC"), id.vars="ID", variable.name="year")
# copied your code for the last line
Sure, you still need to fill in the year values, but that's another issue that's already been filed, I guess.
I've found that the melt step in my solution could be a big problem when working with medium-sized datatables.
If you have a data.table with just 100000 rows x 1000 columns and use half of the columns as id.vars the output is 50000000 x 500, just too much to continue.
I guess you can get the years' column with
temp <- melt(mydata, measure.vars=patterns("ZA","BB","CC"),
value.name =c("ZA","BB","CC"), id.vars="ID", variable.name="year")
temp[,year := rep(unique(sub(".*_", "", names(mydata)[-1])),
length.out=nrow(temp))]
or also using data from your CJ.
And adding the options sorted=F to CJ.
@franknarf1 , any direct way to get this?
ID measure ZA BB CC
1 2001 1 3 NA
1 2002 5 6 NA
1 2007 NA NA 6
2 2001 2 3 NA
2 2002 4 6 NA
2 2007 NA NA 5
3 2001 3 3 NA
3 2002 3 6 NA
3 2007 NA NA 4
4 2001 4 3 NA
4 2002 2 6 NA
4 2007 NA NA 3
5 2001 5 3 NA
5 2002 1 6 NA
5 2007 NA NA 2
instead of this: (your method)
ID year ZA BB CC
1 2001 1 3 NA
2 2002 2 3 NA
3 2007 3 3 NA
4 2001 4 3 NA
5 2002 5 3 NA
1 2007 5 6 NA
2 2001 4 6 NA
3 2002 3 6 NA
4 2007 2 6 NA
5 2001 1 6 NA
1 2002 NA NA 6
2 2007 NA NA 5
3 2001 NA NA 4
4 2002 NA NA 3
5 2007 NA NA 2
without reordeing the result explicitily?
any direct way to get this?
Well, there's
mydata[, melt(.SD, measure.vars=patterns("ZA","BB","CC"),
value.name =c("ZA","BB","CC"), variable.name="year"), by=ID]
However, doing this by group is a lot less efficient, probably.
There are no plans to add new arguments to melt / dcast. However there could be clever ways of extracting levels to variable column automatically from the input pattern.. although I don't see it too often a case.
hi the missing input columns in this melt is similar to #4027 and is fixed by #4720 if you use the new function in nc package (which uses data.table internally), not merged into master yet but you can use it via:
remotes::install_github(c("Rdatatable/data.table@fix4027", "tdhock/nc@multiple-fill"))
#> Skipping install of 'data.table' from a github remote, the SHA1 (4c5810c2) has not changed since last install.
#> Use `force = TRUE` to force installation
#> Skipping install of 'nc' from a github remote, the SHA1 (11b61f8e) has not changed since last install.
#> Use `force = TRUE` to force installation
library(data.table)
mydata <- data.table(ID=1:5, ZA_2001=1:5, ZA_2002=5:1, BB_2001=rep(3,5), BB_2002=rep(6,5), CC_2007=6:2)
(mydata.tall <- nc::capture_melt_multiple(
mydata,
column=".*",
"_",
year="[0-9]+", as.integer,
fill=TRUE))
#> ID year BB CC ZA
#> 1: 1 2001 3 NA 1
#> 2: 2 2001 3 NA 2
#> 3: 3 2001 3 NA 3
#> 4: 4 2001 3 NA 4
#> 5: 5 2001 3 NA 5
#> 6: 1 2002 6 NA 5
#> 7: 2 2002 6 NA 4
#> 8: 3 2002 6 NA 3
#> 9: 4 2002 6 NA 2
#> 10: 5 2002 6 NA 1
#> 11: 1 2007 NA 6 NA
#> 12: 2 2007 NA 5 NA
#> 13: 3 2007 NA 4 NA
#> 14: 4 2007 NA 3 NA
#> 15: 5 2007 NA 2 NA
mydata.tall[order(ID, year)]
#> ID year BB CC ZA
#> 1: 1 2001 3 NA 1
#> 2: 1 2002 6 NA 5
#> 3: 1 2007 NA 6 NA
#> 4: 2 2001 3 NA 2
#> 5: 2 2002 6 NA 4
#> 6: 2 2007 NA 5 NA
#> 7: 3 2001 3 NA 3
#> 8: 3 2002 6 NA 3
#> 9: 3 2007 NA 4 NA
#> 10: 4 2001 3 NA 4
#> 11: 4 2002 6 NA 2
#> 12: 4 2007 NA 3 NA
#> 13: 5 2001 3 NA 5
#> 14: 5 2002 6 NA 1
#> 15: 5 2007 NA 2 NA
here is another solution using only data.table::melt,
remotes::install_github("Rdatatable/data.table@fix4027")
#> Skipping install of 'data.table' from a github remote, the SHA1 (4c5810c2) has not changed since last install.
#> Use `force = TRUE` to force installation
library(data.table)
mydata <- data.table(ID=1:5, ZA_2001=1:5, ZA_2002=5:1, BB_2001=rep(3,5), BB_2002=rep(6,5), CC_2007=6:2)
(mydata.tall <- melt(mydata, measure.vars=list(
ZA=c("ZA_2001", "ZA_2002", NA),
BB=c("BB_2001", "BB_2002", NA),
CC=c(NA, NA, "CC_2007"))))
#> ID variable ZA BB CC
#> 1: 1 1 1 3 NA
#> 2: 2 1 2 3 NA
#> 3: 3 1 3 3 NA
#> 4: 4 1 4 3 NA
#> 5: 5 1 5 3 NA
#> 6: 1 2 5 6 NA
#> 7: 2 2 4 6 NA
#> 8: 3 2 3 6 NA
#> 9: 4 2 2 6 NA
#> 10: 5 2 1 6 NA
#> 11: 1 3 NA NA 6
#> 12: 2 3 NA NA 5
#> 13: 3 3 NA NA 4
#> 14: 4 3 NA NA 3
#> 15: 5 3 NA NA 2
mydata.tall[, measure := c(2001, 2002, 2007)[variable] ]
mydata.tall[order(ID, measure), .(
ID, measure, ZA, BB, CC)]
#> ID measure ZA BB CC
#> 1: 1 2001 1 3 NA
#> 2: 1 2002 5 6 NA
#> 3: 1 2007 NA NA 6
#> 4: 2 2001 2 3 NA
#> 5: 2 2002 4 6 NA
#> 6: 2 2007 NA NA 5
#> 7: 3 2001 3 3 NA
#> 8: 3 2002 3 6 NA
#> 9: 3 2007 NA NA 4
#> 10: 4 2001 4 3 NA
#> 11: 4 2002 2 6 NA
#> 12: 4 2007 NA NA 3
#> 13: 5 2001 5 3 NA
#> 14: 5 2002 1 6 NA
#> 15: 5 2007 NA NA 2
pure data.table solution using #4731
remotes::install_github("Rdatatable/data.table@melt-custom-variable")
#> Skipping install of 'data.table' from a github remote, the SHA1 (c02fa9e8) has not changed since last install.
#> Use `force = TRUE` to force installation
library(data.table)
mydata <- data.table(ID=1:5, ZA_2001=1:5, ZA_2002=5:1, BB_2001=rep(3,5), BB_2002=rep(6,5), CC_2007=6:2)
melt(mydata, measure.vars=measure(value.name, year))
#> ID year ZA BB CC
#> 1: 1 2001 1 3 NA
#> 2: 2 2001 2 3 NA
#> 3: 3 2001 3 3 NA
#> 4: 4 2001 4 3 NA
#> 5: 5 2001 5 3 NA
#> 6: 1 2002 5 6 NA
#> 7: 2 2002 4 6 NA
#> 8: 3 2002 3 6 NA
#> 9: 4 2002 2 6 NA
#> 10: 5 2002 1 6 NA
#> 11: 1 2007 NA NA 6
#> 12: 2 2007 NA NA 5
#> 13: 3 2007 NA NA 4
#> 14: 4 2007 NA NA 3
#> 15: 5 2007 NA NA 2
Most helpful comment
here is another solution using only data.table::melt,