Pandas cannot rename a label in a pandas.Index of tuples to a new tuple value. Providing a tuple as new_name in pandas.DataFrame.rename({old_name: new_name}, axis="index") converts the index to a pandas.MultiIndex, and wrapping the new name in a singleton tuple also gives an undesirable result. See the code below (workaround at the bottom):
import pandas as pd
import numpy as np
df = pd.DataFrame(data = np.arange(5), index=[(x, x) for x in range(5)], columns=["Value"])
print(df) # Note that df.index is a pd.Index object of 2-length tuples
# Wish to rename axis label, but keep the same style
df2 = df.rename({(1,1):(1,5)}, axis="index")
print(df2) # Woah! - df2.index is of MultiIndex type
print(df2.index) # ... and here's proof
# Maybe I can get around this by passing it as a singleton tuple...
df3 = df.rename({(1,1):((1,5),)}, axis="index")
print(df3) # ... apparently not
Will produce the output:
Value
(0, 0) 0
(1, 1) 1
(2, 2) 2
(3, 3) 3
(4, 4) 4
Value
0 0 0
1 5 1
2 2 2
3 3 3
4 4 4
MultiIndex(levels=[[0, 1, 2, 3, 4], [0, 2, 3, 4, 5]],
labels=[[0, 1, 2, 3, 4], [0, 4, 1, 2, 3]])
Value
(0, 0) 0
((1, 5),) 1
(2, 2) 2
(3, 3) 3
(4, 4) 4
Desired/Expected output:
Value
(0, 0) 0
(1, 5) 1
(2, 2) 2
(3, 3) 3
(4, 4) 4
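For context, a sketch (not part of the reported behaviour above) showing that a flat Index of tuples can at least be constructed explicitly; the tupleize_cols=False argument, discussed later in this thread, suppresses the MultiIndex inference:

```python
import numpy as np
import pandas as pd

# Building the index explicitly with tupleize_cols=False keeps it a flat
# Index of tuples instead of letting pandas infer a MultiIndex from them.
idx = pd.Index([(x, x) for x in range(5)], tupleize_cols=False)
df = pd.DataFrame({"Value": np.arange(5)}, index=idx)
print(isinstance(df.index, pd.MultiIndex))  # False: still a flat Index
```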
The current behaviour is a problem for two reasons:
I have checked for similar issues by searching for the word rename; at the time of writing, pandas 0.22.0 is the latest released version.
pd.show_versions():
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-112-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.22.0
pytest: 3.0.3
pip: 9.0.1
setuptools: 28.8.0
Cython: 0.25.1
numpy: 1.11.2
scipy: 0.18.1
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: 1.4.8
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2016.7
blosc: None
bottleneck: 1.1.0
tables: 3.3.0
numexpr: 2.6.1
feather: None
matplotlib: 1.5.3
openpyxl: 2.4.9
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.8.0
bs4: 4.5.1
html5lib: 1.0b10
sqlalchemy: 1.1.3
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
The workaround below uses the set_value function, which the documentation advises the user to avoid using (unless you really know what you're doing):
df.index.set_value(df.index.get_values(), (1,1), (1, 5))
df.reset_index(inplace=True)
df.set_index("index", inplace=True)
df.index.name = None # Arguably not necessary...
print(df)
Produces the output:
Value
(0, 0) 0
(1, 5) 1
(2, 2) 2
(3, 3) 3
(4, 4) 4
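A hypothetical alternative workaround (a sketch, not from the thread): avoid .rename entirely and rebuild the index by hand, mapping old labels to new ones and passing tupleize_cols=False so the result stays a flat Index:

```python
import numpy as np
import pandas as pd

# Reproduce the example frame with a flat Index of tuples.
idx = pd.Index([(x, x) for x in range(5)], tupleize_cols=False)
df = pd.DataFrame({"Value": np.arange(5)}, index=idx)

# Rebuild the index manually: map (1, 1) -> (1, 5), leave other labels alone.
mapping = {(1, 1): (1, 5)}
df.index = pd.Index([mapping.get(label, label) for label in df.index],
                    tupleize_cols=False)
print(df.index[1])  # (1, 5)
```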
You're fighting against pandas by using tuples as keys in your Index, instead of using a MultiIndex.
cc @toobaz is this worth attempting to support?
I agree with @TomAugspurger that tuples as keys are weird. That said, I _might_ have fixed this somewhere... maybe https://github.com/pandas-dev/pandas/pull/18600 ... in any case I guess it will be supported eventually.
Consider this circumstance:
In this circumstance, it would seem to me that a simple Index with tuples is the most obvious and easy solution, and may even be the only option.
I might have fixed this somewhere... maybe #18600 .
Uhm, no, that PR is unrelated. And I was probably just confused.
I still think this is going to be fixed... sooner or later.
As @TomAugspurger says above, this is simply not supported and you are fighting pandas like crazy here. The only way I could see doing this going forward would be to have an actual TupleIndex (subclassing EA) that is pretty explicitly created here.
Closing this as won't fix.
FWIW, I think when https://github.com/pandas-dev/pandas/issues/17246 is fixed, this will happen to be fixed as well.
I don't see this as "supporting tuples", but as "supporting anything which we don't state is not supported" (and can be supported). The bug must lie in a tupleize_cols somewhere - that is, the code is "actively" doing something wrong, it's not just "missing a feature".
This said, I totally agree this is low priority.
I gather from your statement, @toobaz, that you have not surveyed the fix that has been provided - indeed the crux of the problem is that Index by default returns a MultiIndex if provided tuples, as above. This can be prevented by supplying the tupleize_cols=False argument. It follows that I don't think the bug lies in tupleize_cols - it is currently the default behaviour of Index to return a MultiIndex if given tuples (because tupleize_cols, by default, is True). One could argue that the default should be False, but I assume this approach would be avoided because it would have a large impact on the API. This surprising change of type is discussed in #17246, and will hopefully be included in the fix.
@jreback argues that the fix is inappropriate, and that using tuples is unsupported. If that is the case - assuming #17246 is not going to be fixed soon, or that even if it is fixed, it doesn't fix this bug - then I think it should be clearly documented that tuples are not supported. Not supporting tuples would be a little disappointing, simply because I can't see a more obvious way to support the circumstances I have outlined above.
I think this thread might be benefited by an example of why supporting tuples is a good idea. Consider for example, the country Australia and the states within it: NSW, QLD, VIC, TAS, WA, SA, NT, ACT. Also consider the region "Murray Darling Basin", which also has a natural hierarchical relationship to Australia, but specifies an area within NSW, VIC and SA (but does not completely include all of those states - it specifies the water catchment area). With reference to my earlier comments about the circumstances in which tuples in the index are useful:
You have segments of data which are indexed by a natural hierarchical relationship (i.e. each segment of data is suitable for a multi-index).
There exists a natural hierarchical relationship between 'Australia' and these states - i.e. each state lies within Australia. There is also a relationship between 'Murray Darling Basin' and 'Australia'.
However, different segments of the data do not have the same hierarchical relationship (i.e. not the same levels, labels, or dimensions), so concatenating is not an option (or is at least messy and/or difficult to generalise).
Consider that you wish to include in your dataframe, data series with names:
('Australia', 'NSW')
('Australia', 'Murray Darling Basin')
It would be inappropriate to call 'Murray Darling Basin' a state, and the data that it refers to will have no obvious mathematical connection to the data regarding the other states.
It is necessary to merge/concatenate the data.
Because I should be able to.
It is necessary to select any and all rows by index.
If it's a multiindex, and there are None or * fields, my recollection is that this doesn't play nice (hence a 3-level multi-index may not be a simple workaround).
The hierarchical relationship has to be preserved in some form.
Because I want to export to a file that is interpreted by another program that understands the hierarchical relationship.
I hope it is now clear why an Index of tuples becomes the most obvious, if not the best, option for solving some problems.
I gather from your statement, @toobaz, that you have not surveyed the fix that has been provided - indeed the crux of the problem is that Index by default returns a MultiIndex if provided tuples, as above. This can be prevented by supplying the tupleize_cols=False argument. It follows that I don't think the bug lies in tupleize_cols - it is currently the default behaviour of Index to return a MultiIndex if given tuples (because tupleize_cols, by default, is True). One could argue that the default should be False,
... or that one can pass tupleize_cols=False even when the default is tupleize_cols=True ;-)
Then probably this bug is fixed _also_ if the default changes (as mentioned by @TomAugspurger ), but that was not my point. When I wrote that the code is "actively" doing something wrong, I just meant that tupleize_cols=True means "infer", while tupleize_cols=False means "avoid inferring", regardless of which is the default.
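To illustrate the two meanings (a minimal sketch):

```python
import pandas as pd

# tupleize_cols=True (the default) means "infer a MultiIndex from tuples";
# tupleize_cols=False means "avoid inferring" and keep a flat Index of tuples.
inferred = pd.Index([("a", 1), ("b", 2)])
flat = pd.Index([("a", 1), ("b", 2)], tupleize_cols=False)
print(isinstance(inferred, pd.MultiIndex))  # True
print(isinstance(flat, pd.MultiIndex))      # False
```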
This is hard, since it isn't really clear from the name that .rename can change the index type.
w.r.t. your example @charlie0389, it's hard to say anything without actual code / data. I suspect that a MI is able to handle your problem.
Ok, for an example, please consider the following code:
import pandas as pd

print("Consider the given data:")
given_data = [0.8, 0.002, 1.7, 1.3, 1.0, 2.5, 0.06, 0.2, 1.0]
print(given_data)
print()
print("With the given identifiers:")
given_labels = [("Australia", "NSW"), ("Australia", "ACT"),
("Australia", "QLD"),
("Australia", "NT"), ("Australia", "SA"),
("Australia", "WA"),
("Australia", "TAS"), ("Australia", "VIC"),
("Australia", "Murray-Darling Basin")]
print(given_labels)
df = pd.DataFrame(data=given_data,
index=given_labels,
columns=["Millions of Sq. kms"])
print()
print("Which can be stored appropriately in the dataframe:")
print(df)
print("""
Because data in the same column is interpreted to be of the same \
type, this form implies that all the labels \
are conceptual equals (which is True - they all identify land areas in Australia). \
Furthermore, this \
allows the user to keep the hierarchical relationship \
between the first and second fields of each tuple (and is therefore the desired form).
""")
df.index = pd.Index(df.index.tolist())
print(df)
print("""This structure implies that all items in the second index column are conceptual \
equals (which is False). (The Murray-Darling basin is not a state of Australia).
""")
# Note that restructuring doesn't really make sense either - for example:
df = pd.DataFrame(data=[0.8, 0.002, 1.7, 1.3, 1.0, 2.5, 0.06, 0.2, 1.0],
index=[("Australia", None, "NSW"), ("Australia", None, "ACT"),
("Australia", None, "QLD"),
("Australia", None, "NT"), ("Australia", None, "SA"),
("Australia", None, "WA"),
("Australia", None, "TAS"), ("Australia", None, "VIC"),
("Australia", "Murray-Darling Basin", None)],
columns=["Millions of Sq. kms"])
df.index = pd.MultiIndex.from_tuples(df.index)
print(df)
print("""
I'd argue this structure is unacceptable because it requires knowledge/logic to mutate \
given_index and to select any (or all) rows of the table. For example:
""")
print("Selecting all items:")
print(df.loc["Australia", :, :, :])
print()
print("Selecting a single item:")
print(df.loc["Australia", "Murray-Darling Basin", :, :])
print("""Both the selections above require knowledge that there are 3 fields which:
(a) does not correspond with the given data, and
(b) the selection method is prone to breakage (i.e. what if data that has more than 3 fields is \
appended to the frame?)""")
Which has the following output:
Consider the given data:
[0.8, 0.002, 1.7, 1.3, 1.0, 2.5, 0.06, 0.2, 1.0]
With the given identifiers:
[('Australia', 'NSW'), ('Australia', 'ACT'), ('Australia', 'QLD'),
('Australia', 'NT'), ('Australia', 'SA'), ('Australia', 'WA'), ('Australia', 'TAS'),
('Australia', 'VIC'), ('Australia', 'Murray-Darling Basin')]
Which can be stored appropriately in the dataframe:
Millions of Sq. kms
(Australia, NSW) 0.800
(Australia, ACT) 0.002
(Australia, QLD) 1.700
(Australia, NT) 1.300
(Australia, SA) 1.000
(Australia, WA) 2.500
(Australia, TAS) 0.060
(Australia, VIC) 0.200
(Australia, Murray-Darling Basin) 1.000
Because data in the same column is interpreted to be of the same type,
this form implies that all the labels are conceptual equals (which is True -
they all identify land areas in Australia). Furthermore, this allows the
user to keep the hierarchical relationship between the first and second
fields of each tuple (and is therefore the desired form).
Millions of Sq. kms
Australia NSW 0.800
ACT 0.002
QLD 1.700
NT 1.300
SA 1.000
WA 2.500
TAS 0.060
VIC 0.200
Murray-Darling Basin 1.000
This structure implies that all items in the second index column are
conceptual equals (which is False). (The Murray-Darling basin is
not a state of Australia).
Millions of Sq. kms
Australia NaN NSW 0.800
ACT 0.002
QLD 1.700
NT 1.300
SA 1.000
WA 2.500
TAS 0.060
VIC 0.200
Murray-Darling Basin NaN 1.000
I'd argue this structure is unacceptable because it requires knowledge/logic
to mutate given_index and to select any (or all) rows of the table. For example:
Selecting all items:
Millions of Sq. kms
NaN NSW 0.800
ACT 0.002
QLD 1.700
NT 1.300
SA 1.000
WA 2.500
TAS 0.060
VIC 0.200
Murray-Darling Basin NaN 1.000
Selecting a single item:
Millions of Sq. kms
NaN 1.0
Both the selections above require knowledge that there are 3 fields which:
(a) does not correspond with the given data, and
(b) the selection method is prone to breakage (i.e. what if data
that has more than 3 fields is appended to the frame?)
Apologies for the wordiness, but I think it illustrates the conceptual point I'm trying to make.
this form implies that all the labels are conceptual equals
This structure implies that all items in the second index column are conceptual equals (which is False).
I don't think it's relevant, but what do those two sentences mean?
The reason I say it's not relevant, is because the meaning you attach to a MultiIndex is up to you. Typically they're used to represent hierarchical data, but that's not necessary. It really is just a multi-part label, just like a tuple.
Attempting to interpret the "conceptual equals" bit, it seems like you're implicitly putting data in the index. You have some kind of is_state property in your head. That property is a piece of data not a label.
I don't understand the 3-level example. Again, though, it looks like you're putting some data in the index when it should go in the columns. Assuming the new level is something like is_water.
midx = pd.MultiIndex.from_tuples(given_labels)
df = pd.DataFrame({
"sq. kms": given_data,
"is_water": [False] * 8 + [True]
}, index=midx)
df
results in
|  |  | sq. kms | is_water |
|---|---|---|---|
| Australia | NSW | 0.800 | False |
|  | ACT | 0.002 | False |
|  | QLD | 1.700 | False |
|  | NT | 1.300 | False |
|  | SA | 1.000 | False |
|  | WA | 2.500 | False |
|  | TAS | 0.060 | False |
|  | VIC | 0.200 | False |
|  | Murray-Darling Basin | 1.000 | True |
Which (IIUC) is a much better way to represent the data.
As much as I love indexing stuff with MultiIndex, Python is a flexible language, people are used to that flexibility, and I think this makes it hard to argue that tuples as keys don't _make sense_. MultiIndexes are great if there is some hierarchical structure (i.e., levels have a meaning, i.e., "_all items in the second index column are conceptual equals_"), but this is not necessarily the case.
Consider keys which represent paths:
In [3]: megabytes = pd.Series([103, 30, 5],
index=pd.Index([('usr', 'share'), ('usr', 'bin'), ('usr', 'local', 'bin')], tupleize_cols=False))
In [4]: megabytes
Out[4]:
(usr, share) 103
(usr, bin) 30
(usr, local, bin) 5
dtype: int64
This is not an index which it makes sense to store as a MultiIndex - you don't even know ex ante the number of levels it would need. Sure, we could transform tuples into strings relatively easily... but you will apply this transformation only if you have to.
So: we can always say pandas does not support tuples _because it's just too messy_ (in terms of API, not necessarily just implementation). I just _don't think_ it is the case. But I might be wrong. In any case, I don't think that investigating the intentions of anybody who wants to use tuples as keys (also see #20597) is a viable long term solution :-)
Is it possible to leave this bug open please?
Going by the discussion so far, no one disputes that this is a bug. The only question is whether tuples should be supported, and even there the disagreement seems small - supporting them in one form or another appears acceptable, with most of the discussion revolving around implementation.
For those that stumble upon this at a later date and are similarly frustrated by this bug, the following code fixes it:
@staticmethod
def _transform_index(index, func, level=None, tupleize_cols=False):
    """
    Apply function to all values found in index.

    This includes transforming multiindex entries separately.
    Only apply function to one level of the MultiIndex if level is specified.
    """
    # Copied from pandas.core.internals._transform_index() with minor modification
    # in response to pandas bug #19497
    if isinstance(index, pd.MultiIndex):
        if level is not None:
            items = [tuple(func(y) if i == level else y
                           for i, y in enumerate(x)) for x in index]
        else:
            items = [tuple(func(y) for y in x) for x in index]
        return pd.MultiIndex.from_tuples(items, names=index.names)
    else:
        items = [func(x) for x in index]
        return pd.Index(items, name=index.name, tupleize_cols=tupleize_cols)
The only differences are the function signature, and the last return line.
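As a rough standalone sketch of how the patched helper could be exercised (transform_index here is a hypothetical free-function adaptation of the code above, with the level handling omitted for brevity):

```python
import pandas as pd

def transform_index(index, func, tupleize_cols=False):
    # Hypothetical free-function version of the patched helper: apply func to
    # every label; tupleize_cols=False keeps a flat Index of tuples flat.
    if isinstance(index, pd.MultiIndex):
        items = [tuple(func(y) for y in x) for x in index]
        return pd.MultiIndex.from_tuples(items, names=index.names)
    items = [func(x) for x in index]
    return pd.Index(items, name=index.name, tupleize_cols=tupleize_cols)

# Renaming (1, 1) -> (1, 5) no longer produces a MultiIndex.
idx = pd.Index([(0, 0), (1, 1), (2, 2)], tupleize_cols=False)
mapping = {(1, 1): (1, 5)}
renamed = transform_index(idx, lambda x: mapping.get(x, x))
print(list(renamed))  # [(0, 0), (1, 5), (2, 2)]
print(isinstance(renamed, pd.MultiIndex))  # False
```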
@charlie0389 if you can open a PR where you change _transform_index to tupleize_cols=False... then I think it would be a good candidate for inclusion.
If in order to pass tests you do need to change the signature, I would suggest, rather than tupleize_cols, a more general parameter such as keep_type=True (@TomAugspurger @jreback better ideas?) which, when set to False, re-interprets the index content (so potentially changing not just an Index to MultiIndex, but also the other way round if e.g. keys in a MultiIndex are replaced with non-tuples). You might then want to split the process of creating the items list and the actual creation of the index.
Reopening this at least temporarily as a fix seems feasible and simple.
Would we be ok with a rule that rename doesn’t chase the type of the index between MI and other? If you have tuples going in and out, you get an Index. If you have a MI going in and tuples coming out, you get a MI?
Would we be ok with a rule that rename doesn’t chase the type of the index between MI and other?
Yes, that was my idea (with keep_type=True, that is, tupleize_cols=False), and that's how I would design things from scratch. I just ignore whether it breaks code relying on "chasing the type".
@toobaz do you think we need a new parameter (keep_type=True) to .rename? I'm trying to think of situations where keep_type=False would be useful.
do you think we need a new parameter (keep_type=True) to .rename?
I don't think we do _in principle_: my only concern was about backwards compatibility (and if we don't, then the fix to this is really a matter of passing tupleize_cols=False).
I'm trying to think of situations where keep_type=False would be useful.
For all examples I can think of, explicitly recasting is a better solution.
What's the backwards compatibility concern?
I misunderstood rename with a MI. I assumed the mapping got tuples, instead it gets the scalar elements.
In [22]: s = pd.Series(1, index=pd.MultiIndex.from_product([["A", "B"], ['a', 'b']]))
In [23]: s
Out[23]:
A a 1
b 1
B a 1
b 1
dtype: int64
In [24]: s.rename({"A": 'a'})
Out[24]:
a a 1
b 1
B a 1
b 1
dtype: int64
In that case, I think that passing tupleize_cols=False internally is just fine.
What's the backwards compatibility concern?
Just that somebody (I would already be happy if it doesn't happen in some tests) assumed the following is a reasonable way to create a MultiIndex:
In [2]: pd.Series(range(3)).rename({0 : (0,1), 1 : (1, 2), 2 : (2, 3)})
Out[2]:
0 1 0
1 2 1
2 3 2
dtype: int64
(but mine might be pure paranoia: if tests pass, I would proceed)
For completeness: in principle code out there could also be relying on the fact that
In [2]: pd.Series(range(3), index=['1', '2', '3']).rename({'1' : 1, '2' : 2, '3' : 3.}).index
Out[2]: Float64Index([1.0, 2.0, 3.0], dtype='float64')
although, implementation-wise, this can be decoupled from the issue of multi vs. flat, documentation-wise we probably just want to say "the resulting index will have the same type", and disable this automatic conversion.
assumed the following is a reasonable way to create a MultiIndex
Understood. That is a valid concern...
although implementation wise this can be decoupled from the issue of multi vs. flat
I think converting between types (numeric vs. Index, etc.) is fine. It's the conversion between multi vs. flat that we (maybe) want to disallow via .rename.