Pandas: CLN/API: wide_to_long or lreshape

Created on 28 Dec 2016 · 9 Comments · Source: pandas-dev/pandas

xref https://github.com/pandas-dev/pandas/issues/2567

In [27]: data = pd.DataFrame({'hr1': [514, 573], 'hr2': [545, 526],
    ...:                       'team': ['Red Sox', 'Yankees'],
    ...:                       'year1': [2007, 2008], 'year2': [2008, 2008]})
    ...: 

In [28]: data
Out[28]: 
   hr1  hr2     team  year1  year2
0  514  545  Red Sox   2007   2008
1  573  526  Yankees   2008   2008

In [29]: pd.lreshape(data, {'year': ['year1', 'year2'], 'hr': ['hr1', 'hr2']})
Out[29]: 
      team  year   hr
0  Red Sox  2007  514
1  Yankees  2008  573
2  Red Sox  2008  545
3  Yankees  2008  526

In [30]: pd.wide_to_long(data, ['hr', 'year'], 'team', 'index')
Out[30]: 
                hr  year
team    index           
Red Sox 1      514  2007
Yankees 1      573  2008
Red Sox 2      545  2008
Yankees 2      526  2008

So we should drop one of these.

API Design Reshaping

jreback

All 9 comments

cc @nuffe

jreback on 28 Dec 2016

Yes, having both is redundant, but I think wide_to_long is more flexible?

lreshape does not handle group variables of different length

wide_to_long produces the correct result for lreshape's test case (but that is with dropna=False, which is also the output Stata gives)
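
A minimal sketch of the dropna difference mentioned above. The missing value is an assumption added for illustration and is not part of lreshape's actual test case. By default lreshape drops rows that are incomplete after reshaping, while wide_to_long keeps them:

```python
import numpy as np
import pandas as pd

# Same frame as in the issue, but with one home-run value missing
# (an assumption made purely for illustration).
data = pd.DataFrame({'hr1': [514, 573], 'hr2': [545, np.nan],
                     'team': ['Red Sox', 'Yankees'],
                     'year1': [2007, 2008], 'year2': [2008, 2008]})
groups = {'year': ['year1', 'year2'], 'hr': ['hr1', 'hr2']}

dropped = pd.lreshape(data, groups)              # dropna=True is the default
kept = pd.lreshape(data, groups, dropna=False)   # keep the incomplete row
w2l = pd.wide_to_long(data, stubnames=['hr', 'year'], i='team', j='index')

# Compare how many rows each call returns.
print(len(dropped), len(kept), len(w2l))
```

Passing dropna=False to lreshape should bring its row count in line with wide_to_long here.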

I could not make lreshape produce the intended output for all of wide_to_long's test cases. This one, or this, for example.

erikcs on 28 Dec 2016

Is lreshape getting deprecated? There are some SO answers getting a decent amount of upvotes.

tdpetrou on 18 Aug 2017

@jreback I really like wide_to_long as it's the easiest way to 'simultaneously melt' different sets of columns. It would be nice if the identification variables, i, were optional, as lreshape is slightly easier when there are no identification variables. Also, it would be good if i were renamed to id_vars and j to var_name. Maybe this can all be solved if melt were to take a list of lists of columns.
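
To make the "simultaneous melt" idea concrete, here is a rough sketch built on the existing melt. The helper name melt_groups and its signature are hypothetical; this is not an existing or proposed pandas API, only an illustration of what melting several column groups in parallel amounts to:

```python
import pandas as pd

data = pd.DataFrame({'hr1': [514, 573], 'hr2': [545, 526],
                     'team': ['Red Sox', 'Yankees'],
                     'year1': [2007, 2008], 'year2': [2008, 2008]})


def melt_groups(df, id_vars, groups):
    """Hypothetical helper: melt several column groups in parallel.

    `groups` maps each output column name to the wide columns feeding it,
    the same dict shape that lreshape accepts.
    """
    lengths = {len(cols) for cols in groups.values()}
    if len(lengths) != 1:
        raise ValueError("all groups must contain the same number of columns")
    n = lengths.pop()
    # Repeat the identifier columns once per position within the groups.
    out = pd.concat([df[id_vars]] * n, ignore_index=True)
    for new_name, cols in groups.items():
        # melt stacks the listed columns one after another, in the order given,
        # so every group lines up with the repeated identifiers.
        melted = df.melt(value_vars=cols, value_name=new_name)
        out[new_name] = melted[new_name].to_numpy()
    return out


long_df = melt_groups(data, ['team'],
                      {'year': ['year1', 'year2'], 'hr': ['hr1', 'hr2']})
print(long_df)
```

On the example frame this should give the same rows as the lreshape call at the top of the issue.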

tdpetrou on 23 Aug 2017

@tdpetrou well, reducing the API surface area is good. Not averse to modifying .melt() to do this. If you have a proposal, please put it up.

jreback on 24 Aug 2017

The issue is we have 3 functions to do somewhat similar things. Happy to consolidate the API (aside from which, documentation on lreshape is nil and wide_to_long's is not much better).

jreback on 24 Aug 2017

The simplest addition to melt would be to add functionality to do the simultaneous melting of different sets of columns. I think this would be achievable with the value_vars parameter accepting a list of lists or even a dictionary of lists (like lreshape). I think this would eliminate any use of lreshape.

To add the functionality of pd.wide_to_long, you might have to add three parameters, stubs, sep and suffix, where stubs would be a boolean indicating whether the value_vars are stubnames.
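
For reference, sep and suffix already exist as parameters of pd.wide_to_long. A small sketch of how they behave, using made-up column names with an underscore separator and word suffixes (the frame and the level name 'season' are assumptions for illustration):

```python
import pandas as pd

# Hypothetical wide frame: stub, underscore separator, non-numeric suffix.
wide = pd.DataFrame({'team': ['Red Sox', 'Yankees'],
                     'hr_first': [514, 573], 'hr_second': [545, 526],
                     'year_first': [2007, 2008], 'year_second': [2008, 2008]})

long_df = pd.wide_to_long(
    wide,
    stubnames=['hr', 'year'],  # column prefixes to gather
    i='team',                  # identifier column(s)
    j='season',                # name for the level holding the suffixes
    sep='_',                   # separator between stub and suffix
    suffix=r'\w+',             # the default r'\d+' would only match digits
)
print(long_df)
```

A stubs flag on melt, as suggested above, would have to switch value_vars to this kind of prefix matching.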

tdpetrou on 24 Aug 2017

I agree the current configuration is not elegant. I made an earlier PR to wide_to_long to fix some edge cases that were wrong (which I discovered while cleaning a data set), but I don't think it fits nicely into a consistent "calculus of data manipulations".

Looking to R and the "tidyverse" they now and then change their API and introduce new "verbs" for existing concepts: before, long was melt, and wide was done with dcast. Now it's gather and spread. In econometrics and statistics, long and wide is the common nomenclature, and is what Stata adheres to. Stata may be a dinosaur, but they are extremely consistent in their API and naming scheme.

Pandas' melt is a copy of Hadley Wickham's melt, which is a modification of base R's reshape (same command name as Stata, by the way) with a new name, giving the API an impression of bits and pieces taken from here and there.

I don't really have a good and general proposal for a solution here, other than that, IMHO, a nomenclature should perhaps be chosen and stuck with.

erikcs on 13 Oct 2017

@erikcs I made a major enhancement to melt in #17677. With that, it can simultaneously melt any number of columns, supports any kind of MultiIndex (it had very poor support before that), and handles duplicate column names as well. It also has wide_to_long functionality, and with a little more tweaking it will replicate it exactly.

tdpetrou on 13 Oct 2017
