xref https://github.com/pandas-dev/pandas/issues/2567
In [27]: data = pd.DataFrame({'hr1': [514, 573], 'hr2': [545, 526],
    ...:                      'team': ['Red Sox', 'Yankees'],
    ...:                      'year1': [2007, 2008], 'year2': [2008, 2008]})
    ...:

In [28]: data
Out[28]:
   hr1  hr2     team  year1  year2
0  514  545  Red Sox   2007   2008
1  573  526  Yankees   2008   2008

In [29]: pd.lreshape(data, {'year': ['year1', 'year2'], 'hr': ['hr1', 'hr2']})
Out[29]:
      team  year   hr
0  Red Sox  2007  514
1  Yankees  2008  573
2  Red Sox  2008  545
3  Yankees  2008  526

In [30]: pd.wide_to_long(data, ['hr', 'year'], 'team', 'index')
Out[30]:
                hr  year
team    index
Red Sox 1      514  2007
Yankees 1      573  2008
Red Sox 2      545  2008
Yankees 2      526  2008
So we should drop one of these.
cc @nuffe
Yes having both is redundant, but I think wide_to_long is more flexible?
lreshape does not handle group variables of different length
wide_to_long produces the correct result for lreshape's test case (but that is with dropna=False, which is also the output Stata gives)
I could not make lreshape produce the intended output for all of wide_to_long's test cases. This one, or this, for example.
Is lreshape getting deprecated? There are some SO answers getting a decent amount of upvotes.
@jreback I really like wide_to_long as it's the easiest way to 'simultaneously melt' different sets of columns. It would be nice if the identification variables, i, were optional, as lreshape is slightly easier to use when there are no identification variables. Also, it would be good if i were renamed to id_vars and j to var_name. Maybe this can all be solved if melt were to take a list of lists of columns.
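To illustrate the point about i: with the current signature, a frame that has no identification variables needs one manufactured before wide_to_long can be called. A minimal sketch of that workaround, assuming the same numbers as the example at the top of the thread; the names `row` and `num` are arbitrary choices for this illustration, not anything pandas requires:

```python
import pandas as pd

# Wide frame with no identification column at all.
wide = pd.DataFrame({'hr1': [514, 573], 'hr2': [545, 526],
                     'year1': [2007, 2008], 'year2': [2008, 2008]})

# wide_to_long needs an `i` that uniquely identifies each row,
# so turn the positional index into a column named 'row'.
with_id = wide.rename_axis('row').reset_index()

# Stubs 'hr' and 'year' are matched against hr1/hr2 and year1/year2;
# the numeric suffix ends up in the 'num' index level.
long = pd.wide_to_long(with_id, ['hr', 'year'], i='row', j='num')

# Drop the manufactured id again if it is not wanted.
flat = long.reset_index().drop(columns='row')
print(flat)
```

With lreshape the extra id column is not needed, which is the convenience being described.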
@tdpetrou well reducing the API surface area is good. not averse to modifying .melt() to do this. if you have a proposal pls put it up.
the issue is we have 3 functions to do somewhat similar things. happy to consolidate the API. (aside from which documentation on lreshape is nil and wide_to_long not much better)
The simplest addition to melt would be to add functionality to do the simultaneous melting of different sets of columns. I think this would be achievable with the value_vars parameter accepting a list of lists or even a dictionary of lists (like lreshape). I think this would eliminate any use of lreshape.
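As a rough sketch of what "a dictionary of lists" could mean in practice, here is the behaviour built from existing pandas calls only. `melt_groups` is a hypothetical helper written for this comment, not a pandas function and not the proposed melt signature; it assumes every group has the same number of columns:

```python
import pandas as pd


def melt_groups(df, groups, id_vars=None):
    """Hypothetical helper: stack several column groups at once.

    `groups` maps each output column name to the wide columns that
    feed it, the same shape as the dict passed to lreshape.
    """
    lengths = {len(cols) for cols in groups.values()}
    if len(lengths) != 1:
        raise ValueError("all column groups must have the same length")
    n = lengths.pop()
    id_vars = list(id_vars or [])

    pieces = []
    for k in range(n):
        # Take the k-th column of every group and give it the group name.
        piece = df[id_vars].copy()
        for name, cols in groups.items():
            piece[name] = df[cols[k]].to_numpy()
        pieces.append(piece)
    # Stack the per-position pieces on top of each other.
    return pd.concat(pieces, ignore_index=True)


data = pd.DataFrame({'hr1': [514, 573], 'hr2': [545, 526],
                     'team': ['Red Sox', 'Yankees'],
                     'year1': [2007, 2008], 'year2': [2008, 2008]})

out = melt_groups(data, {'year': ['year1', 'year2'], 'hr': ['hr1', 'hr2']},
                  id_vars=['team'])
print(out)
```

On the example frame this gives the same four rows as the lreshape call at the top of the thread.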
To add the functionality of pd.wide_to_long, you might have to add three parameters, stubs, sep and suffix, where stubs would be a boolean whether or not the value_vars are stubnames or not.
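For reference, this is roughly how sep and suffix behave in wide_to_long today. A small sketch, assuming column names with an underscore separator (the `_1`/`_2` naming and the level name `num` are made up for this example):

```python
import pandas as pd

df = pd.DataFrame({'team': ['Red Sox', 'Yankees'],
                   'hr_1': [514, 573], 'hr_2': [545, 526],
                   'year_1': [2007, 2008], 'year_2': [2008, 2008]})

# stubnames are the column prefixes, sep is the text between prefix
# and suffix, and suffix is a regex describing the varying part.
long = pd.wide_to_long(df, stubnames=['hr', 'year'], i='team', j='num',
                       sep='_', suffix=r'\d+')

flat = long.reset_index()
print(flat)
```

A melt that accepted stubs would need to carry the same two pieces of information to locate the columns.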
I agree the current configuration is not elegant. I made an earlier PR to wide_to_long to fix some edge cases that were wrong (which I discovered while cleaning a data set), but I don't think it fits nicely into a consistent "calculus of data manipulations".
Looking to R and the "tidyverse" they now and then change their API and introduce new "verbs" for existing concepts: before, long was melt, and wide was done with dcast. Now it's gather and spread. In econometrics and statistics, long and wide is the common nomenclature, and is what Stata adheres to. Stata may be a dinosaur, but they are extremely consistent in their API and naming scheme.
Pandas' melt is a copy of Hadley Wickham's melt, which is a modification of base R's reshape (same command name as Stata, by the way) under a new name. This gives the API the impression of being bits and pieces taken from here and there.
I don't really have a good and general proposal for a solution here, more than that IMHO a nomenclature should perhaps be chosen and stuck with.
@erikcs I made a major enhancement to melt in #17677. With that, it can simultaneously melt any number of columns, and supports any kind of multiindex (it had very poor support before that) and handles duplicate column names as well. It also has wide_to_long functionality and with a little more tweaking it will exactly replicate it.