Pandas: What to do with future Ambiguity Error about overlapping names when sorting?

Created on 16 May 2018 · 12 Comments · Source: pandas-dev/pandas

xref https://github.com/pandas-dev/pandas/pull/17361

In [24]: df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=pd.Index([1, 2], name='a'))

In [25]: df.sort_values(['a', 'b'])
/Users/taugspurger/.virtualenvs/pandas-dev/bin/ipython:1: FutureWarning: 'a' is both an index level and a column label.
Defaulting to column, but this will raise an ambiguity error in a future version
  #!/Users/taugspurger/Envs/pandas-dev/bin/python3
Out[25]:
   a  b
a
1  1  3
2  2  4

What should the user do in this situation? Should we provide a keyword to disambiguate? A literal like pd.ColumnLabel('a') or pd.IndexName('a')? Or do we require that they rename an index or column?
Right now, they're essentially stuck with the last one. If we want to discourage that, then I suppose that's OK. But it's somewhat common to end up with overlapping names, from e.g. a groupby.
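For reference, the "rename an index or column" route mentioned above can be sketched as follows. This is only an illustration, not a proposed API: the name "a_idx" is a hypothetical replacement label, and the column values are reordered here so that the sort visibly does something.

```python
import pandas as pd

# Same shape as the example above: the index level and a column are both named "a".
df = pd.DataFrame({"a": [2, 1], "b": [3, 4]}, index=pd.Index([1, 2], name="a"))

# Give the index level a distinct (hypothetical) name so "a" can only mean the column.
renamed = df.rename_axis("a_idx")

# Sorting by "a" is now unambiguous and uses the column values.
by_column = renamed.sort_values(["a", "b"])
print(by_column)
```

After the rename, no warning or error about overlapping names is expected, since only one label "a" remains.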

cc @jmmease

API Design

All 12 comments

I think the original idea was that this would indeed not be allowed any more, so that you cannot sort using a name that is both an index level name and a column name.

This is also consistent with the idea of seeing an index as a "special column", and thus you have two "columns" with the same name, which currently also raises an error if you try to sort on that.
(But of course, that "idea" is currently not fully in practice, so it is not necessarily a good argument)

My recollection aligns with @jorisvandenbossche's, that we were working towards treating index levels as 'special columns'. Perhaps we should start collecting some common use cases where the ambiguity arises. @TomAugspurger could you post an example of the groupby situation you referred to?

In [5]: def func(df):
   ...:     return df.assign(b=df.b - df.b.mean())
   ...:
   ...:

In [6]: df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

In [7]: df.groupby("a").apply(func)
Out[7]:
     a    b
a
1 0  1  0.0
2 1  2  0.0

To be clear, I'm OK with going down this road of "indexes are a special type of column, so don't have overlapping names". Perhaps we'll use this issue to collect use-cases where they arise, and document how to avoid them?

e.g. the last one would be something like "use groupby('a', as_index=False)".
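Another way to keep the overlap from arising in the groupby example is to avoid apply altogether. The sketch below swaps in groupby(...).transform, which is a different technique from the as_index=False suggestion above; the data is made up so that the group means are not trivially zero.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3.0, 5.0, 4.0]})

# transform returns a Series aligned with the original rows,
# so "a" never becomes an index level.
group_mean = df.groupby("a")["b"].transform("mean")
centered = df.assign(b=df["b"] - group_mean)
print(centered)
```

The result keeps the original unnamed index, so sorting by "a" afterwards refers only to the column.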

Closing this, and we can improve docs as we discover problems.

It is not ambiguous when I specify axis=0.

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=pd.Index([1, 2], name='a'))

df.sort_values(['a', 'b'], axis=0)

But the warning still appears.

@xieyuheng I agree with you that there should not be a warning if you specify axis=0, but pandas owners often have a special way of treating bugs and warnings.

@VelizarVESSELINOV what do you mean?

@xieyuheng I think that is ambiguous. It's not clear whether the a is referring to the index or the column.
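To make the intent explicit instead of relying on axis, one option is to pick the method that matches what "a" is supposed to mean. This is a minimal sketch under the assumption that the index labels can be discarded when sorting by the column; the index values 20 and 10 are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=pd.Index([20, 10], name="a"))

# Sort by the index: sort_index never looks at column labels.
by_index = df.sort_index()

# Sort by the column: drop the index first so only the column "a" remains.
by_column = df.reset_index(drop=True).sort_values("a")

print(by_index)
print(by_column)
```

If the index labels need to be kept, renaming the index level instead of dropping it serves the same purpose.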

Another example of the error:

pd.DataFrame(
    [1,2,3,3,3,4,5,5,6,7,7,7,8,9], columns = ['a']
).groupby(
    'a',
#   as_index=False
).agg({
    'a':'count'
}).sort_values(
    'a'
)

Traceback:

ValueError: 'a' is both an index level and a column label, which is ambiguous.
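One way to get sorted counts without the name clash is sketched below. It uses size() plus reset_index() in place of the agg call above, so it is a swapped-in variant and not the original code; the column name "count" is an arbitrary choice, and size() counts rows including missing values, which makes no difference for this data.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 3, 3, 4, 5, 5, 6, 7, 7, 7, 8, 9]})

counts = (
    df.groupby("a")
    .size()              # rows per group, as a Series indexed by "a"
    .rename("count")     # distinct name for the counts
    .reset_index()       # move "a" back to an ordinary column
    .sort_values("count", kind="mergesort")  # stable sort keeps ties in key order
)
print(counts)
```

Here "a" holds the group keys and "count" holds the counts, so neither label is ambiguous when sorting.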

To be clear, I'm OK with going down this road of "indexes are a special type of column, so don't have overlapping names". Perhaps we'll use this issue to collect use-cases where they arise, and document how to avoid them?

e.g. the last one would be something like "use groupby('a', as_index=False)".

This negates the usefulness of the entire groupby. I would prefer if this resulted in the same groupby values being shown without the ambiguous name (maybe with the word 'values' for the index column?).

Renaming the index to a different name and using the new name as the index resolves a similar issue that arises with the use of pivot_table.

For example, for a df DataFrame that contains 'A', 'age', and 'gender', if you first group by 'A' and then want to create a pivot_table using 'A' as the index, Python raises TypeError: 'A' used as index and column.

The following worked for me.

Create a new variable and use it as index.

df["a"] = df.A
df.index.name = "a"
my_sum = df.pivot_table('age', index="a", columns='gender', aggfunc="sum")
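A self-contained variant of the same idea is sketched below. The data and the name "A_idx" are made up for illustration, and set_index(..., drop=False) is used only to reproduce a frame where the index level and a column share the name 'A'. Unlike the snippet above, this version renames the index level and pivots on the original column.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": ["x", "x", "y"], "age": [30, 40, 50], "gender": ["F", "M", "F"]}
).set_index("A", drop=False)  # index level and column are now both named "A"

# Rename the index level (hypothetical name) so "A" refers only to the column.
table = df.rename_axis("A_idx").pivot_table(
    values="age", index="A", columns="gender", aggfunc="sum"
)
print(table)
```

Combinations with no rows, such as 'y' and 'M' here, come out as missing values in the pivoted table.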
