Pandas: QST: Any support for df["x"] = y where y.columns are MultiIndexed?

Created on 14 Aug 2020 · 9Comments · Source: pandas-dev/pandas

We're working with DataFrames where the columns are MultiIndexed, so e.g. ones that look like this:

       pca1      pca2
0  0.754675  1.868685
1 -1.861651 -0.048236
2 -0.797750  0.388400

which one can get through pd.DataFrame(np.random.normal(size=(6,)).reshape((3,2)), columns=pd.MultiIndex.from_product([['pca'], ["pca1", "pca2"]])).

We now want to combine several of those to e.g. get this:

        pca                 nmf          
       pca1      pca2      nmf1      nmf2
0  1.671707  0.452155  1.671707  0.452155
1  0.861315 -0.100849  0.861315 -0.100849
2  1.056616 -0.852532  1.056616 -0.852532

We know that we can do this through e.g. pd.concat([df_pca, df_nmf], axis=1). Is there any support for doing the same like this: df["pca"] = df_pca for some df? We get ValueError: Wrong number of items passed 4, placement implies 1.

It's really important for us to allow usage like this: df["pca"] = df_pca and not just through concat.

Enhancement Indexing MultiIndex

Source

henrifroese

Most helpful comment

Sure, see here (probably should have linked this in the first place :octocat: ). You can already see our plans to change some pandas code, we were hoping for a nicer solution. From a user perspective working with these "Subcolumns" (i.e. MultiIndex in the columns), it _feels like df["pca"] = df_pca should work_.

EDIT: We just found a better solution; it still requires changing pandas code but it's much cleaner:

import pandas as pd
import numpy as np

_pd_original__setitem__ = pd.DataFrame.__setitem__
pd.DataFrame._original__setitem__ = _pd_original__setitem__

def _hero__setitem__(self, key, value):

    # called when doing self["key"] = value
    # e.g. df["pca"] = hero.pca(df["texts"]) is
    # doing pd.DataFrame.__setitem__(self=df, key="pca", value=hero.pca(df["texts"])
    #
    # what we want to do:
    # 1. user calls df["pca"] = hero.pca(df["texts"]), so __setitem__(self=df, key="pca", value=hero.pca(df["texts"])
    # 2. we make self multiindexed if it isn't already
    # 3. we change value columns so the first level is named `key`
    # 4. we do self[value.columns] = value

    # 1.
    if isinstance(value, pd.DataFrame) and isinstance(value.columns, pd.MultiIndex) and isinstance(key, str):

        # 2.
        if not isinstance(self.columns, pd.MultiIndex):
            self.columns = pd.MultiIndex.from_tuples([(col_name, "") for col_name in self.columns.values])

        # 3.
        value.columns = pd.MultiIndex.from_tuples([(key, subcol_name) for _, subcol_name in value.columns.values])

        # 4.
        self[value.columns] = value

    else:

        self._original__setitem__(key, value)


pd.DataFrame.__setitem__ = _hero__setitem__

Example:

df = pd.DataFrame(np.random.normal(size=(6,)).reshape(
    (3, 2)), columns=pd.MultiIndex.from_product([['pca'], ["pca1", "pca2"]]))
df2 = pd.DataFrame(np.random.normal(size=(6,)).reshape(
    (3, 2)), columns=pd.MultiIndex.from_product([['pipapo'], ["nmf1", "nmf2"]]))


df["nmf"] = df2
print(df)

henrifroese on 14 Aug 2020

👍2

All 9 comments

Thanks for asking this. Can you explain why concat is not sufficient here?

rhshadrach on 14 Aug 2020

EDIT: We just found a better solution; it still requires changing pandas code but it's much cleaner:

import pandas as pd
import numpy as np

_pd_original__setitem__ = pd.DataFrame.__setitem__
pd.DataFrame._original__setitem__ = _pd_original__setitem__

def _hero__setitem__(self, key, value):

    # called when doing self["key"] = value
    # e.g. df["pca"] = hero.pca(df["texts"]) is
    # doing pd.DataFrame.__setitem__(self=df, key="pca", value=hero.pca(df["texts"])
    #
    # what we want to do:
    # 1. user calls df["pca"] = hero.pca(df["texts"]), so __setitem__(self=df, key="pca", value=hero.pca(df["texts"])
    # 2. we make self multiindexed if it isn't already
    # 3. we change value columns so the first level is named `key`
    # 4. we do self[value.columns] = value

    # 1.
    if isinstance(value, pd.DataFrame) and isinstance(value.columns, pd.MultiIndex) and isinstance(key, str):

        # 2.
        if not isinstance(self.columns, pd.MultiIndex):
            self.columns = pd.MultiIndex.from_tuples([(col_name, "") for col_name in self.columns.values])

        # 3.
        value.columns = pd.MultiIndex.from_tuples([(key, subcol_name) for _, subcol_name in value.columns.values])

        # 4.
        self[value.columns] = value

    else:

        self._original__setitem__(key, value)


pd.DataFrame.__setitem__ = _hero__setitem__

Example:

df = pd.DataFrame(np.random.normal(size=(6,)).reshape(
    (3, 2)), columns=pd.MultiIndex.from_product([['pca'], ["pca1", "pca2"]]))
df2 = pd.DataFrame(np.random.normal(size=(6,)).reshape(
    (3, 2)), columns=pd.MultiIndex.from_product([['pipapo'], ["nmf1", "nmf2"]]))


df["nmf"] = df2
print(df)

henrifroese on 14 Aug 2020

👍2

Is something of the form:

df[df_pca.columns] = df_pca

acceptable? Here is a small example:

````
df = pd.DataFrame(
{
('a', '1'): [1, 2, 3],
}
)

new_vals = pd.DataFrame(
{
('b', '1'): [4, 5, 6],
('b', '2'): [7, 8, 9],
}
)
df[new_val.columns] = new_vals
print(df)
````

Output:

a b 1 1 2 0 1 4 7 1 2 5 8 2 3 6 9

Note: in the link, you mention combining a DataFrame with an Index of columns with one that has a MultiIndex. Even with pd.concat, you will not get a MultiIndex:

````
df = pd.DataFrame(
{
'a': [1, 2, 3],
}
)

new_val = pd.DataFrame(
{
('b', '1'): [4, 5, 6],
('b', '2'): [7, 8, 9],
}
)
df[new_val.columns] = new_val
print(df)
````

Output:

a (b, 1) (b, 2) 0 1 4 7 1 2 5 8 2 3 6 9

rhshadrach on 14 Aug 2020

👍1

Thanks, see the edit above :robot: :grimacing: . It makes our implementation way easier so that's great, still we have to change some Pandas just for our library so that's suboptimal.

henrifroese on 14 Aug 2020

Glad you found a workable solution. As for implementing something like this in pandas itself, I think the added complexity may not be worth it. Would like to hear others thoughts on this though.

rhshadrach on 16 Aug 2020

Glad you found a workable solution

Right, we have sadly now noticed this (so it isn't a viable solution after all :confused:):

So our main issue is that we want to

store a matrix in a DataFrame that looks nice, so not just one row per cell but rather one entry per cell (which we can achieve through the approach above with "Subcolumns")
and allow users to place this in their DataFrame with df["pca"] = ....

The problem we're now facing with our implementation:

Internally, pandas will at some point loop over all "subcolumns" in pd.DataFrame._setitem_array and do

for k1, k2 in zip(key, value.columns):
    self[k1] = value[k2]

which is of course extremely slow when working with a few hundred dimensions / subcolumns.

So seems like we're actually back at square one to find a performant and good-looking implementation of matrices in DataFrames.

henrifroese on 16 Aug 2020

Speaking mostly as a pretty heavy pandas user here, this has always seemed like the biggest piece of missing functionality to me. I feel like a lot of what I do (as an economist and otherwise) is looking at various data series over a panel (say US counties over time). If you have GDP and population, it would be amazing to just be able to do the intuitive thing to calculate and assign GDP per capita. I know how to use concat or stack/unstack, but I've seen less experienced users get tripped up by this.

I'm not super well-versed on MultiIndex internals, but I am curious, what are the major hurdles preventing this? Is it issues with determining whether the self and other indices are compatible or more a matter of possible unintended consequences on the user side?