Pandas: Aliases for column names

Created on 30 Nov 2015 · 10 Comments · Source: pandas-dev/pandas

When I work with Pandas DataFrames, I prefer to keep the full column names for clarity. So when I print out the head, or use describe, I get a meaningful table. However, this also means I have column names like "Time of Sale" that become annoying to type out.

A nice compromise would be to have short "aliases" for column names. For instance, I could define the alias tos for the column above, perhaps like so:

df = pd.read_csv(...)
df.set_alias({'Time of Sale' : 'tos'})

Then the __getattr__ method could look up aliases in addition to column names, so I could refer to that column simply as df.tos. For all other purposes, the column's name would still be the descriptive full name.

Would this make sense?

Labels: API Design, Enhancement, Indexing

All 10 comments

related to #10349

I suppose this is possible. It would be fairly easy to implement, but would require a good number of test cases to ensure it propagates correctly (e.g. this is analogous to the name attribute for Indexes in that it propagates when appropriate).

Further, it would require an audit of the indexing code so that the alias is truly synonymous (i.e. you can use the alias wherever you could use the actual label).

So while this is interesting, it would require a pull-request from the community to jump start it.

I'll have a go at this when I get a chance. It also occurred to me that these aliases may be useful when dealing with the DataFrame.query() method. Based on my trials, this method does not work when there are spaces in the column names (please correct me if I'm wrong, I couldn't get it to work).

No. .query parses its argument as an expression string, so you cannot use column names that contain spaces; this is noted in the documentation.
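For readers on newer pandas: since version 0.25, DataFrame.query accepts column names wrapped in backticks, which covers the spaces case raised here. A minimal sketch with made-up data (the frame and the threshold are illustrative only):

```python
import pandas as pd

# Made-up example data; "Time of Sale" is the column name used in this thread.
df = pd.DataFrame({
    "Time of Sale": [9.5, 13.0, 17.25],
    "price": [10, 20, 30],
})

# Backticks let the expression parser treat the spaced name as one identifier.
# This requires pandas 0.25 or newer.
afternoon = df.query("`Time of Sale` > 12")
print(afternoon)
```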

I'm not a big fan of including this feature in pandas itself, because it would make the pandas data model significantly more complex. Maybe this could be implemented in some sort of add-on package that wraps pandas DataFrames? Another option would be a DataFrame subclass.
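As a rough illustration of the add-on route, pandas exposes a public hook for registering custom accessors, so aliases can live outside the core data model. The sketch below is only one possible shape: the accessor name "aliased", the class AliasAccessor, its set method, and the "aliases" key in DataFrame.attrs are all hypothetical choices, not an existing pandas feature.

```python
import pandas as pd


@pd.api.extensions.register_dataframe_accessor("aliased")  # hypothetical name
class AliasAccessor:
    """Hypothetical add-on: resolve short aliases to full column names."""

    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    def set(self, mapping):
        # mapping is {full column name: alias}, as in the original proposal.
        # Aliases are kept in DataFrame.attrs under an assumed "aliases" key.
        aliases = self._obj.attrs.setdefault("aliases", {})
        aliases.update({alias: full for full, alias in mapping.items()})

    def __getattr__(self, name):
        # Only called when normal attribute lookup on the accessor fails.
        if name.startswith("_"):
            raise AttributeError(name)
        aliases = self._obj.attrs.get("aliases", {})
        if name in aliases:
            return self._obj[aliases[name]]
        raise AttributeError(f"no alias named {name!r}")


df = pd.DataFrame({"Time of Sale": [9.5, 13.0, 17.25]})
df.aliased.set({"Time of Sale": "tos"})
print(df.aliased.tos)       # the "Time of Sale" column
print(list(df.columns))     # the full name is unchanged
```

The trade-off is an extra hop (df.aliased.tos rather than df.tos), in exchange for leaving DataFrame indexing untouched.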

There are certainly risks in adding aliasing, but wouldn't a straightforward strategy be to augment the logic in __getattr__, which presumably already does some form of this? If an alias dictionary existed on the DataFrame, the lookup would try again whenever the requested attribute (not found using "the usual mechanism") had a key entry in the alias dictionary. E.g.

# 1. works today:
df['Time of Sale']

# 2. fails today:
df.time_of_sale

# 3. could work in the future:
df.alias = dict(time_of_sale='Time of Sale')
df.time_of_sale

Or maybe I misunderstand and 2. is already possible today. If so, could someone point me in the right direction toward documentation? I too would find this quite useful.
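The fallback described in this comment can be tried out today in a DataFrame subclass, without changing pandas. This is a sketch under assumptions: AliasedFrame, set_alias, and _aliases are hypothetical names, and the mapping direction follows the original proposal ({full name: alias}).

```python
import pandas as pd


class AliasedFrame(pd.DataFrame):
    """Hypothetical subclass: attribute access falls back to an alias table."""

    # Listing the attribute in _metadata asks pandas to carry it over to
    # derived frames where it can; that propagation is not verified here.
    _metadata = ["_aliases"]
    _aliases = None  # class-level default, replaced per instance by set_alias

    @property
    def _constructor(self):
        return AliasedFrame

    def set_alias(self, mapping):
        # mapping is {full column name: alias}; store it inverted for lookup.
        self._aliases = {alias: full for full, alias in mapping.items()}
        return self

    def __getattr__(self, name):
        try:
            # "The usual mechanism": columns that are valid identifiers, etc.
            return super().__getattr__(name)
        except AttributeError:
            aliases = self._aliases or {}
            if name in aliases:
                return self[aliases[name]]
            raise


df = AliasedFrame({"Time of Sale": [1, 2, 3]})
df.set_alias({"Time of Sale": "tos"})
print(df.tos)             # resolved through the alias table
print(list(df.columns))   # still the descriptive full name
```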

> Or maybe I misunderstand and 2. is already possible today. If so, could someone point me in the right direction toward documentation? I too would find this quite useful.

In order to do 2., you would have to rename the column, possibly using http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html

And then, when you'd like to print or plot it, you'd rename it back to the original version.
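The rename round trip described above might look like this. The data and the variable names are made up for illustration; only DataFrame.rename is taken from the comment.

```python
import pandas as pd

df = pd.DataFrame({"Time of Sale": [9.5, 13.0, 17.25]})

# Keep the mapping in one place so it can be inverted later.
to_short = {"Time of Sale": "tos"}
to_full = {short: full for full, short in to_short.items()}

# Work with the short name...
short = df.rename(columns=to_short)
late = short[short.tos > 12]

# ...and restore the descriptive name before printing or plotting.
report = late.rename(columns=to_full)
print(report)
```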

I too think this would still be a good addition for interactive work. To make things even more interesting, I would alias "Time of Sale" to "tos", so I can work with the data as df.tos, but then see the full name when plotted.

I'd also like to see such a feature. For me, the favourite use case would be to have nice, legible axis labels (with units) in seaborn plots. I know one can manually set the axis labels, but I find this error-prone and too verbose, and it leads to code duplication.

If you ask me, the easier way would be to keep the _current_ name in the role of an alias as @bbirand proposes, and to add some other field for a longer name, which can default to the "normal" name if none is explicitly given.

Any update on this feature?

We need an equivalent for "SELECT max(column1)*0.25 + 0.44*sum(column2) as 'calculated_column' from TABLE group by column3, column5"

@luisfelipe18 - Actually, for aggregation you already have aliasing in Pandas, see here (I'd recommend reading through the entire post).

The current issue refers to aliasing existing columns, regardless of aggregation.
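For the SQL query quoted above, one possible pandas translation uses named aggregation (available since pandas 0.25) and then combines the two aggregates. This is a sketch with made-up data; whether this is the exact technique the linked post describes is an assumption, and the intermediate names max_column1 and sum_column2 are illustrative.

```python
import pandas as pd

# Made-up table with the column names from the SQL query.
table = pd.DataFrame({
    "column1": [1.0, 4.0, 2.0, 8.0],
    "column2": [10.0, 20.0, 30.0, 40.0],
    "column3": ["a", "a", "b", "b"],
    "column5": ["x", "x", "y", "y"],
})

# Named aggregation: each keyword is the output name ("alias"),
# each value is a (source column, aggregation function) pair.
parts = table.groupby(["column3", "column5"]).agg(
    max_column1=("column1", "max"),
    sum_column2=("column2", "sum"),
)

# Combine the aggregates and give the result the SQL alias.
result = (
    (0.25 * parts["max_column1"] + 0.44 * parts["sum_column2"])
    .rename("calculated_column")
    .reset_index()
)
print(result)
```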

IMO, we shouldn't use this in pandas itself. Indexing is complicated enough without aliases.

We'd be better served by adopting / defining a convention (similar to how xarray uses CF conventions) for mapping column names to descriptive names. These could be stored in the DataFrame.attrs dict which (should) propagate through operations. Then downstream libraries (e.g. plotting libraries, libraries for generating tables for presentation) can use the descriptive names.
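A convention along these lines could be as small as a dictionary stored in DataFrame.attrs (available since pandas 1.0 and documented as experimental). In the sketch below, the key "descriptive_names", the helper descriptive_name, the units, and the data are all hypothetical; no such convention is defined by pandas.

```python
import pandas as pd

# Short names are the real column labels; descriptive names live in attrs.
df = pd.DataFrame({"tos": [9.5, 13.0], "price": [10, 20]})
df.attrs["descriptive_names"] = {  # assumed convention key
    "tos": "Time of Sale (h)",
    "price": "Price (USD)",
}


def descriptive_name(frame, column):
    """Return the descriptive name for a column, or the column label itself."""
    return frame.attrs.get("descriptive_names", {}).get(column, column)


# A downstream plotting or table library could look labels up like this.
labels = [descriptive_name(df, col) for col in df.columns]
print(labels)
```

Because the lookup falls back to the column label, frames that carry no mapping keep working unchanged.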
