Pandas cannot rename a label in a pandas.Index of tuples to a new tuple value. Providing a tuple as new_name in pandas.DataFrame.rename({old_name: new_name}, axis="index") converts the index to a pandas.MultiIndex, and wrapping the new name in a singleton tuple also gives an undesirable result. See the code below (workaround at the bottom):
import pandas as pd
import numpy as np
df = pd.DataFrame(data = np.arange(5), index=[(x, x) for x in range(5)], columns=["Value"])
print(df) # Note that df.index is a pd.Index object of 2-length tuples
# Wish to rename axis label, but keep the same style
df2 = df.rename({(1,1):(1,5)}, axis="index")
print(df2) # Woah! - df2.index is of MultiIndex type
print(df2.index) # ... and here's proof
# Maybe I can get around this by passing it as a singleton tuple...
df3 = df.rename({(1,1):((1,5),)}, axis="index")
print(df3) # ... apparently not
Will produce the output:
Value
(0, 0) 0
(1, 1) 1
(2, 2) 2
(3, 3) 3
(4, 4) 4
Value
0 0 0
1 5 1
2 2 2
3 3 3
4 4 4
MultiIndex(levels=[[0, 1, 2, 3, 4], [0, 2, 3, 4, 5]],
labels=[[0, 1, 2, 3, 4], [0, 4, 1, 2, 3]])
Value
(0, 0) 0
((1, 5),) 1
(2, 2) 2
(3, 3) 3
(4, 4) 4
Desired/Expected output:
Value
(0, 0) 0
(1, 5) 1
(2, 2) 2
(3, 3) 3
(4, 4) 4
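For context, a sketch (not part of the reported behaviour above) showing that a flat Index of tuples can at least be constructed explicitly; the tupleize_cols=False argument, discussed later in this thread, suppresses the MultiIndex inference:

```python
import numpy as np
import pandas as pd

# Building the index explicitly with tupleize_cols=False keeps it a flat
# Index of tuples instead of letting pandas infer a MultiIndex from them.
idx = pd.Index([(x, x) for x in range(5)], tupleize_cols=False)
df = pd.DataFrame({"Value": np.arange(5)}, index=idx)
print(isinstance(df.index, pd.MultiIndex))  # False: still a flat Index
```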
The current behaviour is a problem for two reasons:
I have checked for similar issues by searching for the word rename; at the time of writing, pandas 0.22.0 is the latest released version.
pd.show_versions():
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-112-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.22.0
pytest: 3.0.3
pip: 9.0.1
setuptools: 28.8.0
Cython: 0.25.1
numpy: 1.11.2
scipy: 0.18.1
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: 1.4.8
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2016.7
blosc: None
bottleneck: 1.1.0
tables: 3.3.0
numexpr: 2.6.1
feather: None
matplotlib: 1.5.3
openpyxl: 2.4.9
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.8.0
bs4: 4.5.1
html5lib: 1.0b10
sqlalchemy: 1.1.3
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
The workaround below uses the set_value function, which the documentation advises the user to avoid using (unless you really know what you're doing):
df.index.set_value(df.index.get_values(), (1,1), (1, 5))
df.reset_index(inplace=True)
df.set_index("index", inplace=True)
df.index.name = None # Arguably not necessary...
print(df)
Produces the output:
Value
(0, 0) 0
(1, 5) 1
(2, 2) 2
(3, 3) 3
(4, 4) 4
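A hypothetical alternative workaround (a sketch, not from the thread): avoid .rename entirely and rebuild the index by hand, mapping old labels to new ones and passing tupleize_cols=False so the result stays a flat Index:

```python
import numpy as np
import pandas as pd

# Reproduce the example frame with a flat Index of tuples.
idx = pd.Index([(x, x) for x in range(5)], tupleize_cols=False)
df = pd.DataFrame({"Value": np.arange(5)}, index=idx)

# Rebuild the index manually: map (1, 1) -> (1, 5), leave other labels alone.
mapping = {(1, 1): (1, 5)}
df.index = pd.Index([mapping.get(label, label) for label in df.index],
                    tupleize_cols=False)
print(df.index[1])  # (1, 5)
```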
You're fighting against pandas by using tuples as keys in your Index, instead of using a MultiIndex.
cc @toobaz is this worth attempting to support?
I agree with @TomAugspurger that tuples as keys are weird. That said, I _might_ have fixed this somewhere... maybe https://github.com/pandas-dev/pandas/pull/18600 ... in any case I guess it will be supported eventually.
Consider this circumstance:
In this circumstance, it would seem to me that a simple Index with tuples is the most obvious and easy solution, and may even be the only option.
I might have fixed this somewhere... maybe #18600 .
Uhm, no, that PR is unrelated. And I was probably just confused.
I still think this is going to be fixed... sooner or later.
As @TomAugspurger says above, this is simply not supported and you are fighting pandas like crazy here. The only way I could see doing this going forward would be to have an actual TupleIndex (subclassing EA) that is pretty explicitly created here.
Closing this as won't fix.
FWIW, I think when https://github.com/pandas-dev/pandas/issues/17246 is fixed, this will happen to be fixed as well.
I don't see this as "supporting tuples", but as "supporting anything which we don't state is not supported" (and can be supported). The bug must lie in a tupleize_cols somewhere - that is, the code is "actively" doing something wrong, it's not just "missing a feature".
This said, I totally agree this is low priority.
I gather from your statement, @toobaz, that you have not surveyed the fix that has been provided - indeed the crux of the problem is that Index by default returns a MultiIndex if provided tuples, as above. This can be prevented by supplying the tupleize_cols=False argument. It follows that I don't think the bug lies in tupleize_cols - it is currently the default behaviour of Index to return a MultiIndex if given tuples (because tupleize_cols, by default, is True). One could argue that the default should be False, but I assume this approach would be avoided because it would have a large impact on the API. This surprising change of type is discussed in #17246, and will hopefully be included in the fix.
@jreback argues that the fix is inappropriate, and that using tuples is unsupported. If that is the case - assuming #17246 is not going to be fixed soon, or that even if it is fixed, it doesn't fix this bug - then I think it should be clearly documented that tuples are not supported. Not supporting tuples would be a little disappointing, simply because I can't see a more obvious way to support the circumstances I have outlined above.
I think this thread might be benefited by an example of why supporting tuples is a good idea. Consider for example, the country Australia and the states within it: NSW, QLD, VIC, TAS, WA, SA, NT, ACT. Also consider the region "Murray Darling Basin", which also has a natural hierarchical relationship to Australia, but specifies an area within NSW, VIC and SA (but does not completely include all of those states - it specifies the water catchment area). With reference to my earlier comments about the circumstances in which tuples in the index are useful:
You have segments of data which are indexed by a natural hierarchical relationship (i.e. each segment of data is suitable for a multi-index).
There exists a natural hierarchical relationship between 'Australia' and these states - i.e. each state lies within Australia. There is also a relationship between 'Murray Darling Basin' and 'Australia'.
However, different segments of the data do not have the same hierarchical relationship (i.e. not the same levels, labels, or dimensions), so concatenating is not an option (or is at least messy and/or difficult to generalise).
Consider that you wish to include in your dataframe, data series with names:
('Australia', 'NSW')
('Australia', 'Murray Darling Basin')
It would be inappropriate to call 'Murray Darling Basin' a state, and the data that it refers to will have no obvious mathematical connection to the data regarding the other states.
It is necessary to merge/concatenate the data.
Because I should be able to.
It is necessary to select any and all rows by index.
If it's a multiindex, and there are None or * fields, my recollection is that this doesn't play nice (hence a 3-level multi-index may not be a simple workaround).
The hierarchical relationship has to be preserved in some form.
Because I want to export to a file that is interpreted by another program that understands the hierarchical relationship.
I hope it is now clear why an Index of tuples becomes the most obvious, if not the best, option for solving some problems.
I gather from your statement, @toobaz, that you have not surveyed the fix that has been provided - indeed the crux of the problem is that Index by default returns a MultiIndex if provided tuples, as above. This can be prevented by supplying the tupleize_cols=False argument. It follows that I don't think the bug lies in tupleize_cols - it is currently the default behaviour of Index to return a MultiIndex if given tuples (because tupleize_cols, by default, is True). One could argue that the default should be False,
... or that one can pass tupleize_cols=False even when the default is tupleize_cols=True ;-)
Then probably this bug is fixed _also_ if the default changes (as mentioned by @TomAugspurger ), but that was not my point. When I wrote that the code is "actively" doing something wrong, I just meant that tupleize_cols=True means "infer", while tupleize_cols=False means "avoid inferring", regardless of which is the default.
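To illustrate the two meanings (a minimal sketch):

```python
import pandas as pd

# tupleize_cols=True (the default) means "infer a MultiIndex from tuples";
# tupleize_cols=False means "avoid inferring" and keep a flat Index of tuples.
inferred = pd.Index([("a", 1), ("b", 2)])
flat = pd.Index([("a", 1), ("b", 2)], tupleize_cols=False)
print(isinstance(inferred, pd.MultiIndex))  # True
print(isinstance(flat, pd.MultiIndex))      # False
```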
This is hard, since it isn't really clear from the name that .rename can change the index type.
w.r.t. your example @charlie0389, it's hard to say anything without actual code / data. I suspect that a MI is able to handle your problem.
Ok, for an example, please consider the following code:
import pandas as pd

print("Consider the given data:")
given_data = [0.8, 0.002, 1.7, 1.3, 1.0, 2.5, 0.06, 0.2, 1.0]
print(given_data)
print()
print("With the given identifiers:")
given_labels = [("Australia", "NSW"), ("Australia", "ACT"),
("Australia", "QLD"),
("Australia", "NT"), ("Australia", "SA"),
("Australia", "WA"),
("Australia", "TAS"), ("Australia", "VIC"),
("Australia", "Murray-Darling Basin")]
print(given_labels)
df = pd.DataFrame(data=given_data,
index=given_labels,
columns=["Millions of Sq. kms"])
print()
print("Which can be stored appropriately in the dataframe:")
print(df)
print("""
Because data in the same column is interpreted to be of the same \
type, this form implies that all the labels \
are conceptual equals (which is True - they all identify land areas in Australia). \
Furthermore, this \
allows the user to keep the hierarchical relationship \
between the first and second fields of each tuple (and is therefore the desired form).
""")
df.index = pd.Index(df.index.tolist())
print(df)
print("""This structure implies that all items in the second index column are conceptual \
equals (which is False). (The Murray-Darling basin is not a state of Australia).
""")
# Note that restructuring doesn't really make sense either - for example:
df = pd.DataFrame(data=[0.8, 0.002, 1.7, 1.3, 1.0, 2.5, 0.06, 0.2, 1.0],
index=[("Australia", None, "NSW"), ("Australia", None, "ACT"),
("Australia", None, "QLD"),
("Australia", None, "NT"), ("Australia", None, "SA"),
("Australia", None, "WA"),
("Australia", None, "TAS"), ("Australia", None, "VIC"),
("Australia", "Murray-Darling Basin", None)],
columns=["Millions of Sq. kms"])
df.index = pd.MultiIndex.from_tuples(df.index)
print(df)
print("""
I'd argue this structure is unacceptable because it requires knowledge/logic to mutate \
given_index and to select any (or all) rows of the table. For example:
""")
print("Selecting all items:")
print(df.loc["Australia", :, :, :])
print()
print("Selecting a single item:")
print(df.loc["Australia", "Murray-Darling Basin", :, :])
print("""Both the selections above require knowledge that there are 3 fields which:
(a) does not correspond with the given data, and
(b) the selection method is prone to breakage (i.e. what if data that has more than 3 fields is \
appended to the frame?)""")
Which has the following output:
Consider the given data:
[0.8, 0.002, 1.7, 1.3, 1.0, 2.5, 0.06, 0.2, 1.0]
With the given identifiers:
[('Australia', 'NSW'), ('Australia', 'ACT'), ('Australia', 'QLD'),
('Australia', 'NT'), ('Australia', 'SA'), ('Australia', 'WA'), ('Australia', 'TAS'),
('Australia', 'VIC'), ('Australia', 'Murray-Darling Basin')]
Which can be stored appropriately in the dataframe:
Millions of Sq. kms
(Australia, NSW) 0.800
(Australia, ACT) 0.002
(Australia, QLD) 1.700
(Australia, NT) 1.300
(Australia, SA) 1.000
(Australia, WA) 2.500
(Australia, TAS) 0.060
(Australia, VIC) 0.200
(Australia, Murray-Darling Basin) 1.000
Because data in the same column is interpreted to be of the same type,
this form implies that all the labels are conceptual equals (which is True -
they all identify land areas in Australia). Furthermore, this allows the
user to keep the hierarchical relationship between the first and second
fields of each tuple (and is therefore the desired form).
Millions of Sq. kms
Australia NSW 0.800
ACT 0.002
QLD 1.700
NT 1.300
SA 1.000
WA 2.500
TAS 0.060
VIC 0.200
Murray-Darling Basin 1.000
This structure implies that all items in the second index column are
conceptual equals (which is False). (The Murray-Darling basin is
not a state of Australia).
Millions of Sq. kms
Australia NaN NSW 0.800
ACT 0.002
QLD 1.700
NT 1.300
SA 1.000
WA 2.500
TAS 0.060
VIC 0.200
Murray-Darling Basin NaN 1.000
I'd argue this structure is unacceptable because it requires knowledge/logic
to mutate given_index and to select any (or all) rows of the table. For example:
Selecting all items:
Millions of Sq. kms
NaN NSW 0.800
ACT 0.002
QLD 1.700
NT 1.300
SA 1.000
WA 2.500
TAS 0.060
VIC 0.200
Murray-Darling Basin NaN 1.000
Selecting a single item:
Millions of Sq. kms
NaN 1.0
Both the selections above require knowledge that there are 3 fields which:
(a) does not correspond with the given data, and
(b) the selection method is prone to breakage (i.e. what if data
that has more than 3 fields is appended to the frame?)
Apologies for the wordiness, but I think it illustrates the conceptual point I'm trying to make.
this form implies that all the labels are conceptual equals
This structure implies that all items in the second index column are conceptual equals (which is False).
I don't think it's relevant, but what do those two sentences mean?
The reason I say it's not relevant, is because the meaning you attach to a MultiIndex is up to you. Typically they're used to represent hierarchical data, but that's not necessary. It really is just a multi-part label, just like a tuple.
Attempting to interpret the "conceptual equals" bit, it seems like you're implicitly putting data in the index. You have some kind of is_state property in your head. That property is a piece of data not a label.
I don't understand the 3-level example. Again, though, it looks like you're putting some data in the index when it should go in the columns. Assuming the new level is something like is_water.
midx = pd.MultiIndex.from_tuples(given_labels)
df = pd.DataFrame({
"sq. kms": given_data,
"is_water": [False] * 8 + [True]
}, index=midx)
df
results in
|  |  | sq. kms | is_water |
|---|---|---|---|
| Australia | NSW | 0.800 | False |
|  | ACT | 0.002 | False |
|  | QLD | 1.700 | False |
|  | NT | 1.300 | False |
|  | SA | 1.000 | False |
|  | WA | 2.500 | False |
|  | TAS | 0.060 | False |
|  | VIC | 0.200 | False |
|  | Murray-Darling Basin | 1.000 | True |
Which (IIUC) is a much better way to represent the data.
As much as I love indexing stuff with MultiIndex, Python is a flexible language, people are used to that flexibility, and I think this makes it hard to argue that tuples as keys don't _make sense_. MultiIndexes are great if there is some hierarchical structure (i.e., levels have a meaning, i.e., "_all items in the second index column are conceptual equals_"), but this is not necessarily the case.
Consider keys which represent paths:
In [3]: megabytes = pd.Series([103, 30, 5],
index=pd.Index([('usr', 'share'), ('usr', 'bin'), ('usr', 'local', 'bin')], tupleize_cols=False))
In [4]: megabytes
Out[4]:
(usr, share) 103
(usr, bin) 30
(usr, local, bin) 5
dtype: int64
This is not an index which it makes sense to store as a MultiIndex - you don't even know ex ante the number of levels it would need. Sure, we could transform tuples into strings relatively easily... but you will apply this transformation only if you have to.
So: we can always say pandas does not support tuples _because it's just too messy_ (in terms of API, not necessarily just implementation). I just _don't think_ it is the case. But I might be wrong. In any case, I don't think that investigating the intentions of anybody who wants to use tuples as keys (also see #20597) is a viable long term solution :-)
Is it possible to leave this bug open please?
Going by the discussion so far, no one disputes that this is a bug. The only question is whether tuples should be supported, and even there the disagreement seems small - supporting them in one form or another appears acceptable, with most of the discussion revolving around implementation.
For those that stumble upon this at a later date and are similarly frustrated by this bug, the following code fixes it:
@staticmethod
def _transform_index(index, func, level=None, tupleize_cols=False):
    """
    Apply function to all values found in index.

    This includes transforming multiindex entries separately.
    Only apply function to one level of the MultiIndex if level is specified.
    """
    # Copied from pandas.core.internals._transform_index() with minor modification
    # in response to pandas bug #19497
    if isinstance(index, pd.MultiIndex):
        if level is not None:
            items = [tuple(func(y) if i == level else y
                           for i, y in enumerate(x)) for x in index]
        else:
            items = [tuple(func(y) for y in x) for x in index]
        return pd.MultiIndex.from_tuples(items, names=index.names)
    else:
        items = [func(x) for x in index]
        return pd.Index(items, name=index.name, tupleize_cols=tupleize_cols)
The only differences are the function signature, and the last return line.
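As a rough standalone sketch of how the patched helper could be exercised (transform_index here is a hypothetical free-function adaptation of the code above, with the level handling omitted for brevity):

```python
import pandas as pd

def transform_index(index, func, tupleize_cols=False):
    # Hypothetical free-function version of the patched helper: apply func to
    # every label; tupleize_cols=False keeps a flat Index of tuples flat.
    if isinstance(index, pd.MultiIndex):
        items = [tuple(func(y) for y in x) for x in index]
        return pd.MultiIndex.from_tuples(items, names=index.names)
    items = [func(x) for x in index]
    return pd.Index(items, name=index.name, tupleize_cols=tupleize_cols)

# Renaming (1, 1) -> (1, 5) no longer produces a MultiIndex.
idx = pd.Index([(0, 0), (1, 1), (2, 2)], tupleize_cols=False)
mapping = {(1, 1): (1, 5)}
renamed = transform_index(idx, lambda x: mapping.get(x, x))
print(list(renamed))  # [(0, 0), (1, 5), (2, 2)]
print(isinstance(renamed, pd.MultiIndex))  # False
```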
@charlie0389 if you can open a PR where you change _transform_index to tupleize_cols=False... then I think it would be a good candidate for inclusion.
If in order to pass tests you do need to change the signature, I would suggest, rather than tupleize_cols, a more general parameter such as keep_type=True (@TomAugspurger @jreback better ideas?) which, when set to False, re-interprets the index content (so potentially changing not just an Index to MultiIndex, but also the other way round if e.g. keys in a MultiIndex are replaced with non-tuples). You might then want to split the process of creating the items list and the actual creation of the index.
Reopening this at least temporarily as a fix seems feasible and simple.
Would we be ok with a rule that rename doesn’t chase the type of the index between MI and other? If you have tuples going in and out, you get an Index. If you have a MI going in and tuples coming out, you get a MI?
Would we be ok with a rule that rename doesn’t chase the type of the index between MI and other?
Yes, that was my idea (with keep_type=True, that is, tupleize_cols=False), and that's how I would design things from scratch. I just ignore whether it breaks code relying on "chasing the type".
@toobaz do you think we need a new parameter (keep_type=True) to .rename? I'm trying to think of situations where keep_type=False would be useful.
do you think we need a new parameter (keep_type=True) to .rename?
I don't think we do _in principle_: my only concern was about backwards compatibility (and if we don't, then the fix to this is really a matter of passing tupleize_cols=False).
I'm trying to think of situations where keep_type=False would be useful.
For all examples I can think of, explicitly recasting is a better solution.
What's the backwards compatibility concern?
I misunderstood rename with a MI. I assumed the mapping got tuples, instead it gets the scalar elements.
In [22]: s = pd.Series(1, index=pd.MultiIndex.from_product([["A", "B"], ['a', 'b']]))
In [23]: s
Out[23]:
A a 1
b 1
B a 1
b 1
dtype: int64
In [24]: s.rename({"A": 'a'})
Out[24]:
a a 1
b 1
B a 1
b 1
dtype: int64
In that case, I think that passing tupleize_cols=False internally is just fine.
What's the backwards compatibility concern?
Just that somebody (I would already be happy if it doesn't happen in some tests) assumed the following is a reasonable way to create a MultiIndex:
In [2]: pd.Series(range(3)).rename({0 : (0,1), 1 : (1, 2), 2 : (2, 3)})
Out[2]:
0 1 0
1 2 1
2 3 2
dtype: int64
(but mine might be pure paranoia: if tests pass, I would proceed)
For completeness: in principle code out there could also be relying on the fact that
In [2]: pd.Series(range(3), index=['1', '2', '3']).rename({'1' : 1, '2' : 2, '3' : 3.}).index
Out[2]: Float64Index([1.0, 2.0, 3.0], dtype='float64')
although, implementation-wise, this can be decoupled from the issue of multi vs. flat, documentation-wise we probably just want to say "the resulting index will have the same type", and disable this automatic conversion.
assumed the following is a reasonable way to create a MultiIndex
Understood. That is a valid concern...
although implementation wise this can be decoupled from the issue of multi vs. flat
I think converting between types (numeric vs. Index, etc.) is fine. It's the conversion between multi vs. flat that we (maybe) want to disallow via .rename.