Pandas: Index gets lost when DataFrame melt method is used

Created on 5 Sep 2017  Â·  5Comments  Â·  Source: pandas-dev/pandas

Index gets lost when DataFrame melt method is used

import pandas as pd
import numpy as np
df = pd.DataFrame({"Numbers_1":range(0,3),
                   "Numbers_2":range(3,6),
                   "Letters":["A","B","C"]})
df.set_index("Letters",inplace=True)
print(df)

Letters | Numbers_1 | Numbers_2
| ------| ------| ------|
A | 0 | 3
B | 1 | 4
C | 2 | 5

df_melted = df.melt()
print(df_melted)

. |variable | value
| ------| ------| ------|
0 | Numbers_1 | 0
1 | Numbers_1 | 1
2 | Numbers_1 | 2
3 | Numbers_2 | 3
4 | Numbers_2 | 4
5 | Numbers_2 | 5

Problem description

When melting a dataframe, I expected the original index to be reused. However, the original index is lost in the melt method. This is probably meant by wesm's comment (# TODO: what about the existing index?)
https://github.com/pandas-dev/pandas/blob/133a2087d038da035a57ab90aad557a328b3d60b/pandas/core/reshape/reshape.py#L715

Expected Output

I would expect something like

n_row,n_col = df.shape
index_melted = list(df.index.get_values())*n_col
melt_id = list(np.arange(n_col).repeat(n_row))
temp = list(zip(*[index_melted,melt_id]))

index_melted_uniq = pd.MultiIndex.from_tuples(temp,names=[df.index.names[0], 'melt_id'])
index_numbers = list(range(df.shape[1]))*n_row

data = {'variable':df.columns.repeat(n_row),
        "value":df.values.ravel('F')}

df_expected = pd.DataFrame(data,columns = ["variable","value"], index=index_melted_uniq)
print(df_expected)

Letters | melt_id | variable | value
|------|------|------|------|
A | 0 | Numbers_1 | 0
B | 0 | Numbers_1 | 1
C | 0 | Numbers_1 | 2
A | 1 | Numbers_2 | 3
B | 1 | Numbers_2 | 4
C | 1 | Numbers_2 | 5

Where Letters and melt_id are two multiindex levels and variable and value are actual columns.

Output of pd.show_versions()

[paste the output of pd.show_versions() here below this line]

INSTALLED VERSIONS

commit: d0f62c2816ada96a991f5a624a52c9a4f09617f7
python: 3.6.2.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

pandas: 0.21.0.dev+420.gd0f62c2
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.2.2.post20170724
Cython: 0.26
numpy: 1.13.1
scipy: None
pyarrow: None
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

API Design Reshaping

Most helpful comment

Thanks @NiklasKeck. I thought about proposing this a while back, but never wrote up an issue. melt is already quite complex as is, but this seems worthwhile to avoid an awkward .reset_index() / .set_index() dance.

I never worked out the correct way to handle the interaction between the existing index and the id_vars. Your melt_id is an option but I'll need to think about it more. In an ideal world, I think that df.index + id_vars would always be unique, and we'd use that as the MI:

In [34]: df.reset_index().melt(id_vars=['Letters']).set_index(['Letters', 'variable'])
Out[34]:
                   value
Letters variable
A       Numbers_1      0
B       Numbers_1      1
C       Numbers_1      2
A       Numbers_2      3
B       Numbers_2      4
C       Numbers_2      5

but that may not be true in general.

Anyway, I think this would be a useful addition (as an option keyword, to preserve backwards compatibility)

All 5 comments

Thanks @NiklasKeck. I thought about proposing this a while back, but never wrote up an issue. melt is already quite complex as is, but this seems worthwhile to avoid an awkward .reset_index() / .set_index() dance.

I never worked out the correct way to handle the interaction between the existing index and the id_vars. Your melt_id is an option but I'll need to think about it more. In an ideal world, I think that df.index + id_vars would always be unique, and we'd use that as the MI:

In [34]: df.reset_index().melt(id_vars=['Letters']).set_index(['Letters', 'variable'])
Out[34]:
                   value
Letters variable
A       Numbers_1      0
B       Numbers_1      1
C       Numbers_1      2
A       Numbers_2      3
B       Numbers_2      4
C       Numbers_2      5

but that may not be true in general.

Anyway, I think this would be a useful addition (as an option keyword, to preserve backwards compatibility)

Thanks @TomAugspurger for the rapid and nice response. Your idea to use the original column names (variable) for the additional level is more intuitive than the arbitrary melt_id. I think the result should be unique if the original index and column names were unique. I will think about how to implement that.
A boolean optional keyword argument like keep_index=False might be an idea to ensure backwards compatibility.

I just met this issue too and felt it's such a pity for losing valuable index information. What happened to this feature addition and what can i do to make it happen?

Looks like https://github.com/pandas-dev/pandas/pull/17459. Feel free to
revive it if you're interested in this.

On Tue, Dec 11, 2018 at 11:39 AM gitgithan notifications@github.com wrote:

I just met this issue too and felt it's such a pity for losing valuable
index information. What happened to this feature addition and what can i do
to make it happen?

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/pandas-dev/pandas/issues/17440#issuecomment-446292593,
or mute the thread
https://github.com/notifications/unsubscribe-auth/ABQHIl5u5oNC65Kq9JI8Q15M1yV122-Uks5u3-3DgaJpZM4PM6m-
.

I'm very interested in this!

Was this page helpful?
0 / 5 - 0 ratings