I'm trying to summarize some data which I have stored (in wide form) as repeated cross-sections; the way to go would appear to be to summarize and then melt, but some crucial information is lost in the process -- variable names are tossed out, and it's not clear we can predict the resulting order.
An example:
set.seed(10239)
DT <- setnames(as.data.table(replicate(8, runif(10))),
paste0("meas", rep(1:3, c(3, 3, 2)),
rep(c("_jan", "_feb", "_mar"))))
DT
# meas1_jan meas1_feb meas1_mar meas2_jan
# 1: 0.6600937557 0.3694227 0.04429776 0.02691871
# 2: 0.5562533571 0.2260477 0.99935621 0.62940965
# 3: 0.7800425424 0.9500860 0.05847563 0.90400149
# 4: 0.3372507696 0.1853700 0.96434815 0.23819722
# 5: 0.0008951684 0.2743278 0.33468493 0.55100025
# 6: 0.4064611420 0.6773607 0.76984110 0.03732329
# 7: 0.5057564366 0.6481860 0.46362786 0.10610388
# 8: 0.2232209216 0.3236526 0.94694898 0.32251033
# 9: 0.9359388519 0.1219411 0.79383274 0.52573046
# 10: 0.1367496471 0.0121765 0.80466101 0.86377631
# meas2_feb meas2_mar meas3_jan meas3_feb
# 1: 0.438328012 0.48073709 0.15475894 0.6031853
# 2: 0.951975449 0.38903743 0.41983518 0.1806789
# 3: 0.006685105 0.53661894 0.01102345 0.6490597
# 4: 0.719372826 0.49205557 0.24464973 0.9969787
# 5: 0.318251195 0.98845165 0.75659034 0.3054178
# 6: 0.117685706 0.10667707 0.60810269 0.1472457
# 7: 0.210446290 0.76481919 0.70049390 0.4345142
# 8: 0.900549510 0.68583183 0.93117619 0.3676016
# 9: 0.628168668 0.02526911 0.55742879 0.7729294
# 10: 0.498467443 0.17884387 0.97037215 0.1924867
We want to melt, grouping each month's data together, so patterns is quite helpful:
melt(DT[ , lapply(.SD, mean)],
measure.vars = patterns("_jan", "_feb", "_mar"))
Looks beautiful, but the output less so:
# variable value1 value2 value3
#1: 1 0.4542663 0.3788571 0.6180074
#2: 2 0.4204972 0.4789930 0.4648342
#3: 3 0.5354431 0.4650098 NA
We've lost a lot of info. Does value1 correspond to jan, feb, or mar? Does variable == 1 mean meas1, meas2, or meas3?
I hope that value1 means jan, value2 means feb, and value3 means mar, but this is unclear; ditto that valuek corresponds to variable == k. This is especially true if the column order of the input is potentially unknown.
The value.name and variable.name fields are pretty useless to help us here.
It seems the robust way to deal with this as of now is to specify the measure.vars in a list, like so:
melt(DT[ , lapply(.SD, mean)],
measure.vars = list(paste0("meas", 1:3, "_jan"),
paste0("meas", 1:3, "_feb"),
paste0("meas", 1:2, "_mar")))
With this, confidence is restored in the order of the output, so we can at least rename the output without worry of mis-naming something. But this seems verbose and ugly, and it makes patterns seem a lot less helpful as a function.
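To be concrete, here is a sketch of the manual renaming that the list form makes safe (the value1/value2/value3 names and the integer variable codes are just melt's defaults; the meas labels are filled in by hand, which is exactly the step we'd like to avoid):
molten <- melt(DT[ , lapply(.SD, mean)],
               measure.vars = list(paste0("meas", 1:3, "_jan"),
                                   paste0("meas", 1:3, "_feb"),
                                   paste0("meas", 1:2, "_mar")))
setnames(molten, c("value1", "value2", "value3"), c("jan", "feb", "mar"))
molten[ , variable := paste0("meas", variable)]  # levels are "1", "2", "3"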
Good point. It'd be nice to have the value columns and variable levels filled in by guessing, but I won't be working on melt/cast anytime soon. This stuff is on the R side, IIRC. PRs are of course welcome.
@jangorecki do you think you can take a look at this? We'll need a way to extract the names and levels for value/variable column..
@MichaelChirico, It seems like we're thinking along the same lines for these things.
For what it's worth, I always find that "stubs" at the end of a variable name are more cumbersome to deal with. This is the case even with base R's reshape function.
Here's something that I had put together. Something in there slows it down to ~ half the speed of just using melt (but it's still very fast).
Even with that function, I would propose changing the name format to "jan_mean1" and so on.
Here's an example:
set.seed(10239)
DT <- setnames(as.data.table(replicate(8, runif(10))),
paste0("meas", rep(1:3, c(3, 3, 2)),
rep(c("_jan", "_feb", "_mar"))))
ReshapeLong_(setnames(DT[, lapply(.SD, mean)],
gsub("(.*)_(.*)", "\\2_\\1", names(DT))),
stubs = c("jan", "feb", "mar"))
# variable jan feb mar
# 1: _meas1 0.4542663 0.3788571 0.6180074
# 2: _meas2 0.4204972 0.4789930 0.4648342
# 3: _meas3 0.5354431 0.4650098 NA
My rudimentary version of this was merged.stack in "splitstackshape", which has an atStart argument. However, it needs an "id" column the way it was written:
merged.stack(DT[, lapply(.SD, mean)][, id := sequence(.N)],
var.stubs = c("_jan", "_feb", "_mar"),
sep = "var.stubs", atStart = FALSE)
# id .time_1 _jan _feb _mar
# 1: 1 meas1 0.4542663 0.3788571 0.6180074
# 2: 1 meas2 0.4204972 0.4789930 0.4648342
# 3: 1 meas3 0.5354431 0.4650098 NA
@arun, I think we were discussing using a character vector in patterns (which could then be reused in the value.name argument). (Probably in a chat on SO.)
@arunsrinivasan sure
@mrdwab be aware "@arun" won't notify our Arun, it's different person :)
@jangorecki this is not a priority just so you know :-).
@MichaelChirico I don't see why value.name would be useless. If you pass a vector of names corresponding to the patterns (or to the list elements, if a list is used), e.g. value.name = c("jan","feb","mar"), you get the expected column names. This could still be extended to extract the names from a named measure.vars list / patterns(...), but it wouldn't unlock any new capabilities.
I do agree on the variable column values: currently, for multiple measure vars, it is just an integer sequence, so there is no straightforward way to put the expected values there. This seems to be in the Cfmelt code; what I could do is remap the integers after Cfmelt processing.
library(data.table)
set.seed(10239)
DT <- setnames(as.data.table(replicate(8, runif(10))),
paste0("meas", rep(1:3, c(3, 3, 2)),
rep(c("_jan", "_feb", "_mar"))))
melt(DT[ , lapply(.SD, mean)],
measure.vars = list(paste0("meas", 1:3, "_jan"),
paste0("meas", 1:3, "_feb"),
paste0("meas", 1:2, "_mar")),
variable.name = "var.label",
value.name = c("jan","feb","mar"))
# var.label jan feb mar
#1: meas1 0.4542663 0.3788571 0.6180074
#2: meas2 0.4204972 0.4789930 0.4648342
#3: meas3 0.5354431 0.4650098 NA
melt(DT[ , lapply(.SD, mean)],
measure.vars = patterns("_jan", "_feb", "_mar"),
variable.name = "var.label",
value.name = c("jan","feb","mar"))
# var.label jan feb mar
#1: meas1 0.4542663 0.3788571 0.6180074
#2: meas2 0.4204972 0.4789930 0.4648342
#3: meas3 0.5354431 0.4650098 NA
It trims the value.name values (and also the characters "," and "_") from the column names that go into the variable column values.
In general it should be handled with vectorized longest common substring, which in base R seems to be quite a big block of code.
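Just to illustrate the simpler end of that spectrum, here is a rough sketch of plain suffix trimming (not the code in the branch; it assumes the value.name strings appear as literal "_"-separated suffixes, and a real solution would also need to escape any regex metacharacters in them):
trim_suffixes <- function(cols, suffixes) {
  # strip each trailing "_<suffix>" from the measure column names
  for (s in suffixes) cols <- sub(paste0("_", s, "$"), "", cols)
  cols
}
unique(trim_suffixes(names(DT), c("jan", "feb", "mar")))
# [1] "meas1" "meas2" "meas3"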
For the exact changes see: https://github.com/Rdatatable/data.table/compare/master...jangorecki:named-melt -- it is still very error prone (it fails one example at the moment), and it passes value.name to gsub's pattern argument, so any special characters like a dot will currently produce unexpected output.
Before I go any further with that, I would prefer to be sure that we want such a solution, i.e. to post-process Cfmelt by remapping the variable column values. Feel free to pick it up further.
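For concreteness, the kind of remapping being discussed can already be done by hand at the R level, though the labels still have to be supplied (or extracted) manually, which is the missing piece:
molten <- melt(DT[ , lapply(.SD, mean)],
               measure.vars = patterns("_jan", "_feb", "_mar"),
               value.name = c("jan", "feb", "mar"))
# variable comes back as a factor with levels "1", "2", "3"; relabel it in place
setattr(molten$variable, "levels", paste0("meas", 1:3))
molten
#    variable       jan       feb       mar
# 1:    meas1 0.4542663 0.3788571 0.6180074
# 2:    meas2 0.4204972 0.4789930 0.4648342
# 3:    meas3 0.5354431 0.4650098        NA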
@jangorecki hmm, I thought I had tried that for value.name, must have been too late at night to have been thinking straight.
Certainly this capability is not clear from the man page:
value.name: name for the molten data values column. The default name is 'value'
It's not clear that it accepts names (plural) -- it reads as if value.name must be passed a length-one character vector.
While we're on the man page, this phrasing is misleading:
measure.vars: ... Use the function patterns to provide multiple patterns conveniently
patterns is _not_ a function, as evidenced by its implementation:
measure.sub = substitute(measure.vars)
if (is.call(measure.sub) && measure.sub[[1L]] == "patterns") {
measure.vars = lapply(eval(measure.sub), grep, names(data))
}
If it were a function, this workaround/approach would work:
mos <- c("jan","feb","mar")
melt(DT[ , lapply(.SD, mean)],
measure.vars = do.call("patterns",as.list(paste0("_", mos))),
value.name = mos)
For programmatic use you need to have that call in place. You can do it with substitute; just wrap the call below with eval.
substitute(
melt(DT[ , lapply(.SD, mean)],
measure.vars = .measure.vars,
value.name = mos),
list(.measure.vars = as.call(c(as.name("patterns"),paste0("_", mos))))
)
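Spelled out, the programmatic version is then just the same construction wrapped in eval (a sketch; it relies on the substitute interception shown above):
mos <- c("jan", "feb", "mar")
eval(substitute(
  melt(DT[ , lapply(.SD, mean)],
       measure.vars = .measure.vars,
       value.name = mos),
  list(.measure.vars = as.call(c(as.name("patterns"), paste0("_", mos))))
))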
Anyway, I'm not quite convinced about post-processing Cfmelt to remap the variable values. Any better solution from me would probably need to wait until I pick up some C skills.
@jangorecki certainly the docs can be improved; the value.name entry should be improved:
name for the molten data values column. The default name is 'value'
From that, it is almost impossible to guess that value.name will correctly handle an input of length > 1.
This may have already been covered/obvious (I see Jan mentioned "this could be extended to extract from names of measure.vars list / patterns(...)"), but it would be nice to write value.name into patterns:
melt(crm, meas=patterns(N = "^N", a = "^a", b = "^b", d = "^d", g = "^g"))
(Borrowed from an SO question.)
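For comparison, given the value.name behaviour discussed above, the unnamed equivalent has to repeat the names (crm here is just the data from that SO question, not defined in this thread):
melt(crm, measure.vars = patterns("^N", "^a", "^b", "^d", "^g"),
     value.name = c("N", "a", "b", "d", "g"))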
Oh I like the named patterns function.
(I think this still falls under the same general issue:)
library(data.table)
df <- data.table(
a1 = c(1,2,3),
a2 = c(2,3,4),
b1 = c(4,5,6),
b2 = c(5,6,7)
)
aux <- data.table(
id = c("a1", "a2", "b1", "b2"),
group = c("a", "a", "b", "b")
)
Suppose the goal is to have (a1, b1) rbinded over (a2, b2):
# current syntax
aux_sp = split(aux$id, aux$group)
melt(df, meas = aux_sp, value.name = names(aux_sp))
# desired syntax, with names taken from the named list aux_sp
melt(df, meas = aux_sp)
Taken from SO: http://stackoverflow.com/q/42375113/
@franknarf1 I was just coming back here to suggest the same thing about naming the "arguments" to patterns. I think that's a very natural implementation for this.
@MichaelChirico , @mattdowle , I like the addition of named patterns. However, I think there were two embedded questions in this issue, but only one of them is closed. Isn't part of the question also how to make the "variable" column retain more descriptive details when melting, and not just return integer values?
@mrdwab indeed, I had missed that; however, #2551 is dedicated specifically to that (and is on my to-do list), so rest assured the feature is still forthcoming 😄