Pandas: What to do with future Ambiguity Error about overlapping names when sorting?

Created on 16 May 2018 · 12 Comments · Source: pandas-dev/pandas

xref https://github.com/pandas-dev/pandas/pull/17361

In [24]: df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=pd.Index([1, 2], name='a'))

In [25]: df.sort_values(['a', 'b'])
/Users/taugspurger/.virtualenvs/pandas-dev/bin/ipython:1: FutureWarning: 'a' is both an index level and a column label.
Defaulting to column, but this will raise an ambiguity error in a future version
  #!/Users/taugspurger/Envs/pandas-dev/bin/python3
Out[25]:
   a  b
a
1  1  3
2  2  4

What should the user do in this situation? Should we provide a keyword to disambiguate? A literal like pd.ColumnLabel('a') or pd.IndexName('a')? Or do we require that they rename an index or column?
Right now, they're essentially stuck with the last one. If we want to discourage that, then I suppose that's OK. But it's somewhat common to end up with overlapping names, from e.g. a groupby.
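
A minimal sketch of the "rename an index or column" route, assuming a frame shaped like the one above. The values and the names a_idx and a_col are made up for illustration, chosen so that the two sort orders differ. After either rename the label 'a' refers to exactly one thing, so sort_values has nothing to disambiguate.

```python
import pandas as pd

# Hypothetical data: the index level and the column are both called "a".
df = pd.DataFrame(
    {"a": [1, 2], "b": [3, 4]},
    index=pd.Index([20, 10], name="a"),
)

# Option 1: rename the index level, so "a" can only mean the column.
by_column = df.rename_axis("a_idx").sort_values(["a", "b"])

# Option 2: rename the column, so "a" can only mean the index level.
by_index = df.rename(columns={"a": "a_col"}).sort_values(["a", "b"])

print(by_column)
print(by_index)
```

Which of the two to rename depends on which one the sort is meant to use; the sketch only shows that either rename removes the overlap.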

cc @jmmease

API Design

TomAugspurger

Most helpful comment

To be clear, I'm OK with going down this road of "indexes are a special type of column, so don't have overlapping names". Perhaps we'll use this issue to collect use-cases where they arise, and document how to avoid them?

e.g. the last one would be something like "use groupby('a', as_index=False)".

TomAugspurger on 16 May 2018

👍6

All 12 comments

I think the original idea was that this would indeed not be allowed any more, so that you cannot sort using a name that is both an index level name and a column name.

jorisvandenbossche on 16 May 2018

This is also consistent with the idea of seeing an index as a "special column", and thus you have two "columns" with the same name, which currently also raises an error if you try to sort on that.
(But of course, that "idea" is currently not fully put into practice, so it is not necessarily a good argument.)

jorisvandenbossche on 16 May 2018

My recollection aligns with @jorisvandenbossche's, that we were working towards treating index levels as 'special columns'. Perhaps we should start collecting some common use cases where the ambiguity arises. @TomAugspurger could you post an example of the groupby situation you referred to?

jonmmease on 16 May 2018

In [5]: def func(df):
   ...:     return df.assign(b=df.b - df.b.mean())
   ...:
   ...:

In [6]: df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

In [7]: df.groupby("a").apply(func)
Out[7]:
     a    b
a
1 0  1  0.0
2 1  2  0.0

TomAugspurger on 16 May 2018
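
For this particular per-group demeaning, one way to sidestep the overlapping name is to avoid apply altogether. The sketch below is a swapped-in technique, not the func from the example above: it uses groupby(...).transform, which returns values aligned to the original rows, so 'a' never becomes an index level. The data is made up for illustration.

```python
import pandas as pd

# Hypothetical data with two groups in column "a".
df = pd.DataFrame({"a": [1, 1, 2], "b": [1.0, 3.0, 5.0]})

# transform("mean") returns the group mean for every original row,
# so the result keeps the original index and "a" stays a plain column.
group_mean = df.groupby("a")["b"].transform("mean")
demeaned = df.assign(b=df["b"] - group_mean)

print(demeaned)
```

This only covers cases where the function can be written as a row-aligned transform; a general apply may still need a rename afterwards.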

e.g. the last one would be something like "use groupby('a', as_index=False)".

TomAugspurger on 16 May 2018

👍6

Closing this, and we can improve docs as we discover problems.

TomAugspurger on 16 May 2018

It is not ambiguous when I specify axis=0.

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=pd.Index([1, 2], name='a'))

df.sort_values(['a', 'b'], axis=0)

But the warning still appears.

xieyuheng on 18 Jan 2019

👍3

@xieyuheng I agree that there should not be a warning if you specify axis=0, but the pandas owners often have a special way of treating bugs and warnings.

VelizarVESSELINOV on 13 Feb 2019

@VelizarVESSELINOV what do you mean?

@xieyuheng I think that is ambiguous. It's not clear whether 'a' is referring to the index or the column.

TomAugspurger on 13 Feb 2019
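
To see why axis=0 does not settle it: axis says which axis gets reordered, not whether a label is looked up among the columns or among the index level names. A small sketch that lists the labels living in both places; ambiguous_labels is a hypothetical helper written for this illustration, not a pandas function.

```python
import pandas as pd


def ambiguous_labels(df):
    """Return names that are both an index level name and a column label.

    Hypothetical helper for illustration; not part of pandas.
    """
    level_names = [name for name in df.index.names if name is not None]
    return [name for name in level_names if name in df.columns]


df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=pd.Index([1, 2], name="a"))

print(ambiguous_labels(df))  # ['a']
```

An empty list would mean every label passed to sort_values can be resolved without guessing.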

Another example of the error:

pd.DataFrame(
    [1,2,3,3,3,4,5,5,6,7,7,7,8,9], columns = ['a']
).groupby(
    'a',
#   as_index=False
).agg({
    'a':'count'
}).sort_values(
    'a'
)

Traceback:

ValueError: 'a' is both an index level and a column label, which is ambiguous.

Vetrintsev on 28 Mar 2019
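
A possible way around the error in the counting example above, sketched under the assumption that the goal is a count per value of 'a' sorted by that count. It uses size() with reset_index(name=...) instead of agg, and the column name count is an arbitrary choice for illustration. The group key ends up as an ordinary column and the counts get their own name, so nothing overlaps.

```python
import pandas as pd

df = pd.DataFrame([1, 2, 3, 3, 3, 4, 5, 5, 6, 7, 7, 7, 8, 9], columns=["a"])

# size() gives one row count per group; reset_index(name=...) turns the
# group key back into a column and names the counts "count".
counts = df.groupby("a").size().reset_index(name="count")

# "count" is now an unambiguous column label.
sorted_counts = counts.sort_values("count")

print(sorted_counts)
```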

To be clear, I'm OK with going down this road of "indexes are a special type of column, so don't have overlapping names". Perhaps we'll use this issue to collect use-cases where they arise, and document how to avoid them?

e.g. the last one would be something like "use groupby('a', as_index=False)".

This negates the usefulness of the entire groupby. I would prefer if this resulted in the same groupby values being shown without the ambiguous name (maybe with the word 'values' for the index column?).

ivoska on 5 Feb 2020

Renaming the index to a different name and using the new name as the index resolves a similar issue that arises with the use of pivot_table.

For example, for a DataFrame df that contains 'A', 'age', and 'gender', if you first group by 'A' and then want to create a pivot_table using 'A' as the index, Python raises TypeError: 'A' used as index and column.

The following worked for me.