Pandas: DataFrames should have a name attribute.

Created on 5 Dec 2011  ·  37 comments  ·  Source: pandas-dev/pandas

was: should DataFrames have a name attribute?

@y-p

API Design Enhancement Ideas

All 37 comments

IMO it would only make sense if one were exporting to Excel worksheets; in that case it would be nice to have.

It could also be used to set a default _path_ in DataFrame.save(), e.g. path = DataFrame.name.

It could also be integrated into DataFrame.to_html and the like. I don't think it's too hard to add -- it will just be a bit of a slog to make sure the name is passed on in the right places (it took quite a bit of hacking to add name to Series). Shoot for January or February sometime.
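
For illustration, a minimal sketch of the export idea, assuming the name has been attached manually (DataFrame has no built-in name attribute):

import pandas as pd

# Use a manually attached name as the Excel sheet name; the same string could
# just as well serve as a default save path.
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df.name = 'results'

with pd.ExcelWriter('report.xlsx') as writer:   # needs an engine such as openpyxl
    df.to_excel(writer, sheet_name=getattr(df, 'name', 'Sheet1'))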

I'd upvote this one. I'm using it to auto-title plots and think it would certainly be a nice feature.
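
To make that concrete, a small sketch of the auto-titling use case, again assuming the name is attached by hand:

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [2, 4, 6]})
df.name = 'calibration run'            # hand-rolled; not a pandas feature

ax = df.plot()                         # requires matplotlib
ax.set_title(getattr(df, 'name', ''))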

I've found uses for it too; however, the name (as of v0.9.0) doesn't survive pickling, and it would be useful if it did (my workaround is a bit of a fudge). To see the problem, try the following:

import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.ones([6, 6]))
df.name = 'Ones'
df.save('ones.df')        # the pandas 0.9-era pickling API
df2 = pd.load('ones.df')
print df2.name            # AttributeError: the name did not survive the round trip

I'd love to be able to dive in and contribute a fix, but I'm still not so well-versed in the library and many aspects of Python.
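
For what it's worth, one crude interim workaround (not the pyuvvis helpers mentioned below) is to pickle the name alongside the frame and reattach it on load:

import pickle
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((6, 6)))
df.name = 'Ones'

with open('ones.pkl', 'wb') as fh:
    pickle.dump((df.name, df), fh)     # carry the name separately

with open('ones.pkl', 'rb') as fh:
    name, df2 = pickle.load(fh)
df2.name = name                        # reattach after loading
print(df2.name)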

It's not a simple addition (you have to worry about preserving metadata through computations), but it would be nice. We'll probably look into it in the somewhat near future

Hi Paul,

I have written a couple of functions that will let you transfer all the custom
attributes from one dataframe to another. Check out the function
transfer_attributes():

https://github.com/hugadams/pyuvvis/tree/master/pyuvvis/pandas_utils

In particular, if you save with dataframeserial.py, it will save and load your
dataframes while preserving any custom attributes. In the file
df_attrhandler, you can use the function "transfer_attr" to do
something like:

from pandas import DataFrame

df = DataFrame()
df.name = 'test'
df2 = DataFrame()        # note: the original snippet was missing the parentheses here

transfer_attr(df, df2)   # from pyuvvis' df_attrhandler
print df2.name
'test'

I agree that persistent custom attributes would be a key development in the
future, and there is already a GitHub issue open for it. In fact, a
package that I'll be posting to the list soon really does depend on these
custom attributes.

Now that I have some time, I wanted to followup with this.

In my opinion, DataFrame should have a .name attribute, and so should df.columns and df.index; at least, I've found this useful in my work.

In any case, I think persistent attributes, and to a lesser extent, instance methods, would be an extremely important addition to pandas. Here's my reasoning:

Everybody who uses pandas for analysis outside the scope of time series will eventually benefit from customizing/subclassing a DataFrame at some point. Usually, the dataframe is the ideal object for storing the numerical data, but there is also pertinent information that could go along with it to really customize the object. For example, a dataframe becomes the ideal choice for a spectroscopy experiment if one can store an extra array, the spectral baseline, outside of the dataframe's tabular data. Additionally, experimental metadata ought to be stored. This is so easily done by adding attributes to the dataframe that it almost begs to be the canonical way to handle spectral data.

The functions I wrote in the above link use a crude method to transfer arbitrary attributes between dataframes. In short, it first examines an empty DataFrame's attributes and compares these with a list of attributes from the user's dataframe. Any differences are then transferred to the new dataframe. As a hack until a better solution presents itself, DataFrame-returning methods could call my transfer_attr() function before returning a new DataFrame; I wouldn't know how to integrate this fully into pandas otherwise.
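
To make the mechanism concrete, here is a rough sketch of that diff-based transfer (names are illustrative; the real implementation is the transfer_attr function in pyuvvis' df_attrhandler):

import pandas as pd

def transfer_attrs(source, target):
    # Attributes a stock DataFrame carries in its instance __dict__.
    stock = set(vars(pd.DataFrame()))
    # Anything extra on `source` is treated as user-added metadata and copied over.
    for attr in set(vars(source)) - stock:
        setattr(target, attr, getattr(source, attr))
    return target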

I know this is low on the priority list, but I really do think that persistent custom attributes would be a big step forward, and not just an appeasement for corner case users.

Interesting. I'd like to add a couple of notes:

  • You're suggesting there needs to be a mechanism for attaching arbitrary metadata to a
    DataFrame. Good idea. I don't see, though, why custom attributes must be implemented as
    "attributes" in the Python sense. A metadata dict with a simple API, serialized along with
    the dataframe, would take care of most use cases and shouldn't be hard to implement (a
    rough sketch follows below). Is there some requirement you have for which this is not adequate?
  • The .name issue is separate. The .name(s) attributes are not "custom"; they are "baked in",
    relied on by internal code, and can affect other parts of the package in hard-to-predict ways
    (see the ongoing Excel save/read issue).
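
For what it's worth, a minimal sketch of what such a metadata dict could look like; later pandas versions did grow an experimental DataFrame.attrs dict in this spirit (how it survives serialization has varied by version):

import pandas as pd

df = pd.DataFrame({'wavelength': [450, 550, 650]})
df.attrs['name'] = 'spectral_run_01'          # arbitrary key/value metadata
df.attrs['measurement_date'] = '2012-12-07'
print(df.attrs.get('name'))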

y-p,

I would be fine with a metadata dict, or whatever is the most elegant solution to the problem. The reason I like adding attributes is access: something like df.name is easier for people to keep up with than df.metadata['name']; however, if you gave the metadata dict attribute access, then df.metadict.name is also pretty simple. Am I understanding you correctly? Whatever solution ends up being the simplest to implement would be useful.

I agree that the name issue is separate. If .name is too baked in, as you say, then sure, don't include it. But if the pandas Index object also had a way to persist attributes, or a persistent metadata dict, then one could just slap names or whatever attributes they want onto those objects as well.

I like the idea of providing attribute access under a predefined attribute rather than
directly on the object (i.e. df.tags.measurement_date), as the latter pollutes the namespace
and hurts backwards compatibility when new methods or instance variables are added in
the future.
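
A rough sketch of that kind of single predefined accessor (purely illustrative; 'tags' is not a real pandas attribute):

from types import SimpleNamespace
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
df.tags = SimpleNamespace(name='run_01', measurement_date='2012-12-07')
print(df.tags.measurement_date)
# Note: like any ad-hoc attribute, this would not survive operations that
# return a new DataFrame without extra plumbing.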

Whatever is best for pandas would be fine with me. If the access gets too
tedious, it is easy enough to make some properties or basic getters/setters
for the convenience of the user. Something like "get_baseline()" may be
easier to present to users than df.tags.baselinedics.baseline1. In any
case, as long as the functionality is there, it will be very useful.

new custom metadata issue at #2485.
@hugadams - your thoughts (and PR, pending discussion) are welcome.

Anyone working on adding the name attribute?

And where to look for more information on a possible tags property?

You can try using the metadataframe class if you want. Let me know and I'll
update my repo. You can monkey-patch a name attribute in, but it will return to a
default value every time a new dataframe is created.

Alternatively you can create a composite class that stores the name
attribute. Unfortunately you have to specify which dataframe methods can be
called and still return this class instead of returning a dataframe. If
you only need, say, two methods of df, then this is worth doing. Ultimately
metadataframe is an effort to do this generically for all methods and
dataframe operators, so it's probably easier to start with it. For just a single
attribute addition, this is a heavy-handed solution, but it's all that I know of.
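
A bare-bones sketch of that composite approach (the class name and the delegated methods here are just examples):

import pandas as pd

class NamedFrame:
    def __init__(self, df, name=''):
        self.df = df
        self.name = name

    def head(self, n=5):
        # Wrap delegated calls so the name travels with the result.
        return NamedFrame(self.df.head(n), self.name)

    def to_csv(self, path=None, **kwargs):
        # Default the output path to the stored name, as suggested earlier.
        return self.df.to_csv(path or '{}.csv'.format(self.name), **kwargs)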

A df name attribute would be useful when slicing panels down to dataframes, parallel to the case where a df column name becomes a series name when sliced. In theory, this should generalize to any number of dimensions.
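
For comparison, the existing 1-D behaviour referred to here, plus the wished-for analogue:

import pandas as pd

df = pd.DataFrame({'temp': [20.1, 21.3], 'rh': [0.4, 0.5]})
s = df['temp']
print(s.name)   # 'temp' -- the column label becomes the Series name
# The request: slicing a Panel (or any higher-dimensional container) down to a
# DataFrame would analogously hand the item label to a DataFrame name attribute.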

+1 on this. I was wanting it just a few days ago.

Updated title. We'll see when the rest can follow.

@hugadams btw - columns and index now get name attributes (if you have a hierarchical index, it's called names...) not sure if that covers what you were looking for in terms of columns and index.
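
A quick illustration of those axis-level names, which do exist (unlike a DataFrame-level name):

import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [2]})
df.index.name = 'obs'
df.columns.name = 'vars'

# With a hierarchical (Multi)Index the attribute is plural:
mi = pd.MultiIndex.from_tuples([('x', 1), ('x', 2)], names=['group', 'rep'])
print(mi.names)   # ['group', 'rep']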

@jtratner I added support for this by just defining _prop_attributes (see Series).

It's not complete, though, as Series mostly uses the original old method,

and we need a better method for resolving name conflicts and such (e.g. what happens when you add frames with different names); the same issue exists in Series, though.

So it's a bit of a project, but all the support is there for this.

@jreback, does doing this fit naturally into the NDFrame unification deal?

Technically this is easy (just add to _metadata), but it still needs proper propagation... I think it's worthwhile, but it will take some time to get right.

Series names aren't implemented as metadata, and Series now derives from NDFrame.
Metadata is fine (well... you know how I feel), but names are a special case.

@y-p ahh, but they are! (Well, they are in the _metadata list.) Actually, combined with __finalize__, this makes it possible for subclasses to implement their own metadata! (e.g. geopandas does this)
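
A minimal sketch of that _metadata/__finalize__ route for a name-carrying subclass; exact propagation behaviour varies across pandas versions, so treat it as illustrative:

import pandas as pd

class NamedDataFrame(pd.DataFrame):
    _metadata = ['name']        # attributes __finalize__ should propagate

    @property
    def _constructor(self):
        # Ensure operations return NamedDataFrame so the metadata hook runs.
        return NamedDataFrame

ndf = NamedDataFrame({'a': [1, 2, 3]})
ndf.name = 'Ones'
print(ndf.head(2).name)         # the name follows the result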

Well... they didn't use to be. Never mind. Let it sit for another 18 months or so.

Another use of a name attribute would be for GUIs dealing with dataframes. I have one such program that allows a user to load many CSV files and plot columns from them. The backend uses dataframes to load and store the CSV data.
The only useful way I can think of to let the user select which dataframe to plot from is a human-readable attribute (i.e. a name) describing the dataframe, which can be displayed on the GUI and then used to grab the correct dataframe.
I did a workaround by simply inheriting from DataFrame and adding a name property. I then reimplemented class methods (like read_csv) to return an instance of 'NamedDataFrame' rather than DataFrame.
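
A hypothetical sketch of that loader pattern (the function and attribute names here are made up, not taken from the original program):

import pathlib
import pandas as pd

def load_named_csv(path):
    # Read a CSV and tag the frame with a human-readable label for the GUI list,
    # e.g. via a subclass as sketched above, or a plain attribute as a stopgap.
    df = pd.read_csv(path)
    df.display_name = pathlib.Path(path).stem
    return df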

In the IPython notebook and similar REPLs, it would make sense to display the dataframe name or, more generally, custom metadata, like the Excel toolbar does (count, min, max, average, sum, NaNs, numerical count).

When saving the results of an analysis that produces several different outputs, it would be so nice to be able to automatically save and name your output:

def save_results(df, df_name):
    # Write non-empty results to "<df_name>.csv" in the current working directory.
    if len(df) > 0:
        print("Saving {} {} variants to current working directory".format(len(df), df_name))
        df.to_csv('{}.csv'.format(df_name), header=True, encoding='utf-8', index=False)
    else:
        print("No {} variants to save.".format(df_name))

Ohhh. I see now from StackExchange that I can do something like this:

import pandas as pd
df = pd.DataFrame([])
df.df_name = 'Binky'
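
One caveat worth noting, consistent with the earlier comments in this thread: such ad-hoc attributes generally do not propagate to new objects returned by operations.

import pandas as pd

df = pd.DataFrame([])
df.df_name = 'Binky'
print(getattr(df.copy(), 'df_name', None))   # typically None -- the attribute is not carried over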

@Summer Rae, great find! Thanks for sharing.

Also, computed summary statistics like those are available via .describe().
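
A quick illustration of the kind of summary .describe() already provides:

import pandas as pd

df = pd.DataFrame({'signal': [1.0, 2.5, 3.0, float('nan')]})
print(df.describe())   # count, mean, std, min, quartiles and max per numeric column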


@summerela wonderful find!

Adding a name attribute to DataFrame would add a lot of complexity. The benefits (compared with the benefits of named Series) are less clear to me. Closing for now.

I realize that this is a year out of date, but I'd like to pitch in a use case where having a name for a data frame can be really useful.

When performing multi-block analysis (i.e. multi-block partial least squares) in another package (like statsmodels), it would be awesome if we could specify R-style formulas via patsy and run this sort of analysis with something like the following:

result = pls_multiblock(formula="Z ~ X + Y + U + V", blocks=(Z, X, Y, U, V) )

where Z, X, Y, U, V are all matrices (represented as pandas data frames).
Representing these objects as pandas data frames is advantageous, since we can keep track of the ordering of the index/column names. But more importantly, having information about the naming gives us flexibility concerning what sort of models we want to construct on the fly. The implications of making DataFrame.name consistent with Series.name are not clear to me, but having consistent unique identifiers for the DataFrames themselves would easily enable these sorts of analyses.
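
One hypothetical stopgap, given that frames carry no name of their own, is to pass the block names explicitly so formula terms can be resolved (pls_multiblock here is the wished-for API from the comment, not an existing function):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
blocks = {key: pd.DataFrame(rng.normal(size=(10, 3)))
          for key in ('Z', 'X', 'Y', 'U', 'V')}
# result = pls_multiblock('Z ~ X + Y + U + V', blocks=blocks)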

@mortonjt For this sort of multi-dimensional data analysis, I would consider using xarray, which does already support a name attribute on DataArray objects. Note that we are deprecating pandas.Panel.
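
For reference, a minimal example of the name attribute on an xarray DataArray:

import numpy as np
import xarray as xr

arr = xr.DataArray(np.ones((6, 6)), dims=('row', 'col'), name='Ones')
print(arr.name)   # 'Ones'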

I totally forgot about xarray ... Thanks @shoyer!
