I'm trying to summarize some data which I have stored (in wide form) as repeated cross-sections; the way to go would appear to be to summarize and then melt, but some crucial information is lost in the process -- variable names are tossed out, and it's not clear we can predict the resulting order.
An example:
set.seed(10239)
DT <- setnames(as.data.table(replicate(8, runif(10))),
paste0("meas", rep(1:3, c(3, 3, 2)),
rep(c("_jan", "_feb", "_mar"))))
DT
# meas1_jan meas1_feb meas1_mar meas2_jan
# 1: 0.6600937557 0.3694227 0.04429776 0.02691871
# 2: 0.5562533571 0.2260477 0.99935621 0.62940965
# 3: 0.7800425424 0.9500860 0.05847563 0.90400149
# 4: 0.3372507696 0.1853700 0.96434815 0.23819722
# 5: 0.0008951684 0.2743278 0.33468493 0.55100025
# 6: 0.4064611420 0.6773607 0.76984110 0.03732329
# 7: 0.5057564366 0.6481860 0.46362786 0.10610388
# 8: 0.2232209216 0.3236526 0.94694898 0.32251033
# 9: 0.9359388519 0.1219411 0.79383274 0.52573046
# 10: 0.1367496471 0.0121765 0.80466101 0.86377631
# meas2_feb meas2_mar meas3_jan meas3_feb
# 1: 0.438328012 0.48073709 0.15475894 0.6031853
# 2: 0.951975449 0.38903743 0.41983518 0.1806789
# 3: 0.006685105 0.53661894 0.01102345 0.6490597
# 4: 0.719372826 0.49205557 0.24464973 0.9969787
# 5: 0.318251195 0.98845165 0.75659034 0.3054178
# 6: 0.117685706 0.10667707 0.60810269 0.1472457
# 7: 0.210446290 0.76481919 0.70049390 0.4345142
# 8: 0.900549510 0.68583183 0.93117619 0.3676016
# 9: 0.628168668 0.02526911 0.55742879 0.7729294
# 10: 0.498467443 0.17884387 0.97037215 0.1924867
We want to melt, grouping each month's data together, so patterns is quite helpful:
melt(DT[ , lapply(.SD, mean)],
measure.vars = patterns("_jan", "_feb", "_mar"))
Looks beautiful, but the output less so:
# variable value1 value2 value3
#1: 1 0.4542663 0.3788571 0.6180074
#2: 2 0.4204972 0.4789930 0.4648342
#3: 3 0.5354431 0.4650098 NA
We've lost a lot of info. Does value1 correspond to jan, feb, or mar? Does variable == 1 mean meas1, meas2, or meas3?
I hope that value1 means jan, value2 means feb, and value3 means mar, but this is unclear; ditto that valuek corresponds to variable == k. This is especially true if the column order of the input is potentially unknown.
The value.name and variable.name fields are pretty useless to help us here.
It seems the robust way to deal with this as of now is to specify the measure.vars in a list, like so:
melt(DT[ , lapply(.SD, mean)],
measure.vars = list(paste0("meas", 1:3, "_jan"),
paste0("meas", 1:3, "_feb"),
paste0("meas", 1:2, "_mar")))
With this, confidence is restored in the order of the output, so we can at least rename the output without worry of mis-naming something. But this seems verbose and ugly, and it makes patterns seem a lot less helpful as a function.
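To be concrete, here is a sketch of the manual renaming that the list form makes safe (the value1/value2/value3 names and the integer variable codes are just melt's defaults; the meas labels are filled in by hand, which is exactly the step we'd like to avoid):
molten <- melt(DT[ , lapply(.SD, mean)],
               measure.vars = list(paste0("meas", 1:3, "_jan"),
                                   paste0("meas", 1:3, "_feb"),
                                   paste0("meas", 1:2, "_mar")))
setnames(molten, c("value1", "value2", "value3"), c("jan", "feb", "mar"))
molten[ , variable := paste0("meas", variable)]  # levels are "1", "2", "3"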
Good point. It'd be nice to have the value columns and variable levels filled in by guessing, but I won't be working on melt/cast anytime soon. This stuff is on the R side, IIRC. PRs are of course welcome.
@jangorecki do you think you can take a look at this? We'll need a way to extract the names and levels for value/variable column..
@MichaelChirico, It seems like we're thinking along the same lines for these things.
For what it's worth, I always find that "stubs" at the end of a variable name are more cumbersome to deal with. This is the case even with base R's reshape function.
Here's something that I had put together. Something in there slows it down to ~ half the speed of just using melt (but it's still very fast).
Even with that function, I would propose changing the name format to "jan_mean1" and so on.
Here's an example:
set.seed(10239)
DT <- setnames(as.data.table(replicate(8, runif(10))),
paste0("meas", rep(1:3, c(3, 3, 2)),
rep(c("_jan", "_feb", "_mar"))))
ReshapeLong_(setnames(DT[, lapply(.SD, mean)],
gsub("(.*)_(.*)", "\\2_\\1", names(DT))),
stubs = c("jan", "feb", "mar"))
# variable jan feb mar
# 1: _meas1 0.4542663 0.3788571 0.6180074
# 2: _meas2 0.4204972 0.4789930 0.4648342
# 3: _meas3 0.5354431 0.4650098 NA
My rudimentary version of this was merged.stack in "splitstackshape", which has an atStart argument. However, it needs an "id" column the way it was written:
merged.stack(DT[, lapply(.SD, mean)][, id := sequence(.N)],
var.stubs = c("_jan", "_feb", "_mar"),
sep = "var.stubs", atStart = FALSE)
# id .time_1 _jan _feb _mar
# 1: 1 meas1 0.4542663 0.3788571 0.6180074
# 2: 1 meas2 0.4204972 0.4789930 0.4648342
# 3: 1 meas3 0.5354431 0.4650098 NA
@arun, I think we were discussing using a character vector in patterns (which could then be reused in the value.name argument). (Probably in a chat on SO.)
@arunsrinivasan sure
@mrdwab be aware "@arun" won't notify our Arun, it's different person :)
@jangorecki this is not a priority just so you know :-).
@MichaelChirico I don't see why value.name would be useless. If you pass a vector of names corresponding to the patterns (or to the list elements, if a list is used), e.g. value.name = c("jan","feb","mar"), you get the expected column names. This could still be extended to extract the names from a named measure.vars list / patterns(...), but it wouldn't unlock any new capabilities.
I do agree on the variable column values: currently, for multiple measure vars, it is just an integer sequence, so there is no straightforward way to put the expected values there. This seems to be in the Cfmelt code; what I could do is remap the integers after Cfmelt processing.
library(data.table)
set.seed(10239)
DT <- setnames(as.data.table(replicate(8, runif(10))),
paste0("meas", rep(1:3, c(3, 3, 2)),
rep(c("_jan", "_feb", "_mar"))))
melt(DT[ , lapply(.SD, mean)],
measure.vars = list(paste0("meas", 1:3, "_jan"),
paste0("meas", 1:3, "_feb"),
paste0("meas", 1:2, "_mar")),
variable.name = "var.label",
value.name = c("jan","feb","mar"))
# var.label jan feb mar
#1: meas1 0.4542663 0.3788571 0.6180074
#2: meas2 0.4204972 0.4789930 0.4648342
#3: meas3 0.5354431 0.4650098 NA
melt(DT[ , lapply(.SD, mean)],
measure.vars = patterns("_jan", "_feb", "_mar"),
variable.name = "var.label",
value.name = c("jan","feb","mar"))
# var.label jan feb mar
#1: meas1 0.4542663 0.3788571 0.6180074
#2: meas2 0.4204972 0.4789930 0.4648342
#3: meas3 0.5354431 0.4650098 NA
It trims the value.name values (and also the characters "," and "_") from the column names that go into the variable column values.
In general it should be handled with vectorized longest common substring, which in base R seems to be quite a big block of code.
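Just to illustrate the simpler end of that spectrum, here is a rough sketch of plain suffix trimming (not the code in the branch; it assumes the value.name strings appear as literal "_"-separated suffixes, and a real solution would also need to escape any regex metacharacters in them):
trim_suffixes <- function(cols, suffixes) {
  # strip each trailing "_<suffix>" from the measure column names
  for (s in suffixes) cols <- sub(paste0("_", s, "$"), "", cols)
  cols
}
unique(trim_suffixes(names(DT), c("jan", "feb", "mar")))
# [1] "meas1" "meas2" "meas3"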
For the exact changes see: https://github.com/Rdatatable/data.table/compare/master...jangorecki:named-melt -- it is still very error prone (it fails one example at the moment), and it passes value.name to gsub's pattern argument, so any special characters like a dot will currently produce unexpected output.
Before I go any further with that, I would prefer to be sure that we want such a solution, i.e. to post-process Cfmelt by remapping the variable column values. Feel free to pick it up further.
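For concreteness, the kind of remapping being discussed can already be done by hand at the R level, though the labels still have to be supplied (or extracted) manually, which is the missing piece:
molten <- melt(DT[ , lapply(.SD, mean)],
               measure.vars = patterns("_jan", "_feb", "_mar"),
               value.name = c("jan", "feb", "mar"))
# variable comes back as a factor with levels "1", "2", "3"; relabel it in place
setattr(molten$variable, "levels", paste0("meas", 1:3))
molten
#    variable       jan       feb       mar
# 1:    meas1 0.4542663 0.3788571 0.6180074
# 2:    meas2 0.4204972 0.4789930 0.4648342
# 3:    meas3 0.5354431 0.4650098        NA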
@jangorecki hmm, I thought I had tried that for value.name, must have been too late at night to have been thinking straight.
Certainly this capability is not clear from the man page:
value.name: name for the molten data values column. The default name is 'value'
It's not clear that it accepts names (plural) -- it reads as if value.name must be passed a length-one character vector.
While we're on the man page, this phrasing is misleading:
measure.vars: ... Use the function patterns to provide multiple patterns conveniently
patterns is _not_ a function, as evidenced by its implementation:
measure.sub = substitute(measure.vars)
if (is.call(measure.sub) && measure.sub[[1L]] == "patterns") {
measure.vars = lapply(eval(measure.sub), grep, names(data))
}
If it were a function, this workaround/approach would work:
mos <- c("jan","feb","mar")
melt(DT[ , lapply(.SD, mean)],
measure.vars = do.call("patterns",as.list(paste0("_", mos))),
value.name = mos)
For programmatic use you need to have that call in place. You can do it with substitute; just wrap the call below with eval.
substitute(
melt(DT[ , lapply(.SD, mean)],
measure.vars = .measure.vars,
value.name = mos),
list(.measure.vars = as.call(c(as.name("patterns"),paste0("_", mos))))
)
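Spelled out, the programmatic version is then just the same construction wrapped in eval (a sketch; it relies on the substitute interception shown above):
mos <- c("jan", "feb", "mar")
eval(substitute(
  melt(DT[ , lapply(.SD, mean)],
       measure.vars = .measure.vars,
       value.name = mos),
  list(.measure.vars = as.call(c(as.name("patterns"), paste0("_", mos))))
))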
Anyway, I'm not quite convinced about post-processing Cfmelt to remap the variable values. Any better solution from me would probably need to wait until I pick up some C skills.
@jangorecki certainly the docs can be improved; the value.name entry should be improved:
name for the molten data values column. The default name is 'value'
From that, it is almost impossible to guess that value.name will correctly handle an input of length > 1.
This may have already been covered/obvious (I see Jan mentioned "this could be extended to extract from names of measure.vars list / patterns(...)"), but it would be nice to write value.name into patterns:
melt(crm, meas=patterns(N = "^N", a = "^a", b = "^b", d = "^d", g = "^g"))
(Borrowed from an SO question.)
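For comparison, given the value.name behaviour discussed above, the unnamed equivalent has to repeat the names (crm here is just the data from that SO question, not defined in this thread):
melt(crm, measure.vars = patterns("^N", "^a", "^b", "^d", "^g"),
     value.name = c("N", "a", "b", "d", "g"))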
Oh I like the named patterns function.
(I think this still falls under the same general issue:)
library(data.table)
df <- data.table(
a1 = c(1,2,3),
a2 = c(2,3,4),
b1 = c(4,5,6),
b2 = c(5,6,7)
)
aux <- data.table(
id = c("a1", "a2", "b1", "b2"),
group = c("a", "a", "b", "b")
)
Suppose the goal is to have (a1, b1) rbinded over (a2, b2):
# current syntax
aux_sp = split(aux$id, aux$group)
melt(df, meas = aux_sp, value.name = names(aux_sp))
# desired syntax, with names taken from the named list aux_sp
melt(df, meas = aux_sp)
Taken from SO: http://stackoverflow.com/q/42375113/
@franknarf1 I was just coming back here to suggest the same thing about naming the "arguments" to patterns. I think that's a very natural implementation for this.
@MichaelChirico , @mattdowle , I like the addition of named patterns. However, I think there were two embedded questions in this issue, but only one of them is closed. Isn't part of the question also how to make the "variable" column retain more descriptive details when melting, and not just return integer values?
@mrdwab indeed, I had missed that; however, #2551 is dedicated specifically to that (and is on my to-do list), so rest assured the feature is still forthcoming 😄