xref https://github.com/pandas-dev/pandas/pull/17361
In [24]: df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=pd.Index([1, 2], name='a'))
In [25]: df.sort_values(['a', 'b'])
/Users/taugspurger/.virtualenvs/pandas-dev/bin/ipython:1: FutureWarning: 'a' is both an index level and a column label.
Defaulting to column, but this will raise an ambiguity error in a future version
#!/Users/taugspurger/Envs/pandas-dev/bin/python3
Out[25]:
   a  b
a
1  1  3
2  2  4
What should the user do in this situation? Should we provide a keyword to disambiguate? A literal like pd.ColumnLabel('a') or pd.IndexName('a')? Or do we require that they rename an index or column?
Right now, they're essentially stuck with the last one. If we want to discourage that, then I suppose that's OK. But it's somewhat common to end up with overlapping names, from e.g. a groupby.
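For reference, a minimal sketch of the "rename an index or column" route, using only existing pandas calls. The new level name "a_idx" is an arbitrary choice for illustration, and the column values are swapped relative to the example above so the sort visibly reorders rows.

```python
import pandas as pd

# Same shape as the example above: 'a' is both the index name and a column.
df = pd.DataFrame({"a": [2, 1], "b": [3, 4]}, index=pd.Index([1, 2], name="a"))

# Rename the index level so that 'a' can only mean the column.
# "a_idx" is a hypothetical name chosen for this sketch.
by_column = df.rename_axis("a_idx").sort_values(["a", "b"])

# Sorting on the index never needs the name, so it stays unambiguous.
by_index = df.sort_index()

print(by_column)
print(by_index)
```

Whether this is acceptable in practice depends on how often the overlapping names show up in the first place.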
cc @jmmease
I think the original idea was that this would indeed no longer be allowed, so that you cannot sort using a name that is both an index level name and a column name.
This is also consistent with the idea of seeing an index as a "special column", and thus you have two "columns" with the same name, which currently also raises an error if you try to sort on that.
(But of course, that "idea" is currently not fully in practice, so it is not necessarily a good argument)
My recollection aligns with @jorisvandenbossche's, that we were working towards treating index levels as 'special columns'. Perhaps we should start collecting some common use cases where the ambiguity arises. @TomAugspurger could you post an example of the groupby situation you referred to?
In [5]: def func(df):
   ...:     return df.assign(b=df.b - df.b.mean())
   ...:
   ...:
In [6]: df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
In [7]: df.groupby("a").apply(func)
Out[7]:
     a    b
a
1 0  1  0.0
2 1  2  0.0
To be clear, I'm OK with going down this road of "indexes are a special type of column, so don't have overlapping names". Perhaps we'll use this issue to collect use-cases where they arise, and document how to avoid them?
e.g. the last one would be something like "use groupby('a', as_index=False)".
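As one possible way to sidestep the overlap in the groupby case above, here is a sketch that swaps in transform instead of apply (this is a different technique from the as_index=False suggestion, not a replacement for it). The data is made up for illustration; the point is that the result keeps the original index, so no index level named 'a' is created.

```python
import pandas as pd

# Hypothetical data: two rows in group 1, one row in group 2.
df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 5, 4]})

# transform returns a Series aligned with df's original index,
# so the group key never becomes an index level.
group_mean = df.groupby("a")["b"].transform("mean")
demeaned = df.assign(b=df["b"] - group_mean)

print(demeaned)
```

Here 'a' only exists as a column, so a later sort_values('a') has nothing to be ambiguous about.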
Closing this, and we can improve docs as we discover problems.
It is not ambiguous when I specify axis=0.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=pd.Index([1, 2], name='a'))
df.sort_values(['a', 'b'], axis=0)
But the warning still appears.
@xieyuheng I agree with you that there should not be a warning if you specify axis=0, but the pandas owners often have a special way of treating bugs and warnings.
@VelizarVESSELINOV what do you mean?
@xieyuheng I think that is ambiguous. It's not clear whether the 'a' is referring to the index or the column.
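A small sketch of why axis=0 does not settle the question: with axis=0, sort_values sorts rows, and the keys passed in `by` may name either index levels or columns. The frame below is hypothetical and uses distinct names ("idx" and "b") so that both kinds of key resolve cleanly.

```python
import pandas as pd

# Hypothetical frame: index level 'idx' and column 'b' have different names.
df = pd.DataFrame({"b": [3, 4, 1]}, index=pd.Index([2, 1, 1], name="idx"))

# axis=0 means "reorder the rows"; 'idx' resolves to the index level
# and 'b' to the column because each name matches only one of them.
out = df.sort_values(["idx", "b"], axis=0)

print(out)
```

If the index level were also called 'b', both lookups would match the same name, which is exactly the case the warning is about.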
Another example of the error:
pd.DataFrame(
[1,2,3,3,3,4,5,5,6,7,7,7,8,9], columns = ['a']
).groupby(
'a',
# as_index=False
).agg({
'a':'count'
}).sort_values(
'a'
)
Traceback:
ValueError: 'a' is both an index level and a column label, which is ambiguous.
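Two possible ways around this particular error, sketched under the assumption that the goal is to sort the groups by their counts. The names "a_value" and "count" are arbitrary choices for this sketch, and the intermediate frame is built with a column selection plus to_frame() rather than the dict-style agg above, only to keep the example simple.

```python
import pandas as pd

df = pd.DataFrame([1, 2, 3, 3, 3, 4, 5, 5, 6, 7, 7, 7, 8, 9], columns=["a"])

# Reproduce the overlapping shape: index level 'a' and a column 'a' of counts.
counts = df.groupby("a")["a"].count().to_frame()

# Option 1: rename the index level, then sort by the count column.
sorted_counts = counts.rename_axis("a_value").sort_values("a")

# Option 2: give the counts their own column name from the start.
tidy = df.groupby("a").size().reset_index(name="count").sort_values("count")

print(sorted_counts)
print(tidy)
```

The second option keeps the group key as an ordinary column, which is close in spirit to the as_index=False suggestion.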
To be clear, I'm OK with going down this road of "indexes are a special type of column, so don't have overlapping names". Perhaps we'll use this issue to collect use-cases where they arise, and document how to avoid them?
e.g. the last one would be something like "use groupby('a', as_index=False)".
This negates the usefulness of the entire groupby. I would prefer if this resulted in the same groupby values being shown without the ambiguous name (maybe with the word 'values' for the index column?).
Renaming the index to a different name and using the new name as the index resolves a similar issue that arises with the use of pivot_table.
For example, for a DataFrame df that contains 'A', 'age', and 'gender', if you first group by 'A' and then want to create a pivot_table using 'A' as the index, pandas raises TypeError: 'A' used as index and column.
The following worked for me.
df["a"] = df.A
df.index.name = "a"
my_sum = df.pivot_table('age', index="a", columns='gender',
aggfunc="sum")