related:
https://github.com/pydata/pandas/issues/39 (column descriptions)
https://github.com/pydata/pandas/issues/686 (serialization concerns)
https://github.com/pydata/pandas/issues/447#issuecomment-11152782 (Feature request, implementation variant)
Ideas and issues:
storage of this data is pretty easy to implement in HDFStore.
(not pushing HDFStore as a general mechanism!)
general thoughts on meta data:
specific to HDFStore:
pytables is a very good fit in terms of features, but:
oh - was not suggesting we use this as a backend for specific storage of meta data in general
(the above points were my comments on meta data in general - reading it again it DOES look like I am pushing HDFStore)
was just pointing out that HDFStore can support meta data if pandas structures do
to answer your questions:
- not a hard dependency - nor should pandas make it one
- not yet py3 (being worked on now I believe)
- not in memory capable
+1 for all meta data living under a single attribute
I'm against allowing non-serializable objects as metadata, at all. But not sure
if that should be a constraint on the objects or the serialization format.
in any case, a hook+type tag mechanism would allow users to plant ids of external
objects and reconstruct things at load-time.
I've been thinking of suggesting a hooking mechanism elsewhere (for custom representations
of dataframes - viz, html and so on).
what do you mean by not in memory capable?
HDF5 has an in-memory + stdout writer, and pytables support has been added recently.
(https://github.com/PyTables/PyTables/pull/173)
oh. I wasn't aware of that and didn't find anything in the online docs.
This seems to have been added after the latest 2.4.0 pytables release and so is not
yet available off pypi or the distros.
Thanks for including me on this request y-p.
IMO, we should not try to prohibit objects as metadata based on whether they can be serialized, if only because one could never account for every possible object. For example, Chaco plots from the Enthought Tool Suite don't serialize easily, but who would know that unless they tried? I think it's best to let users put anything as metadata; if it can't serialize, they'll find out when an error is thrown. It is also possible to have the program serialize everything but the metadata and then alert the user that this aspect has been lost.
Does anyone here know the pandas source code well enough to understand how to implement something like this? I really don't have a clue, but hope this isn't asking too much of the developers.
Also, I think this addition will be a nice way to appease people who are always looking to subclass a dataframe.
*up vote for the new attribute being called 'meta'
*up vote for putting it on Index classes as well as Series, DataFrame and Panel
Last time I checked, HDF5 has a limit on the size of the AttributeSet. I had to get around it by having my store object encapsulate a directory, with .h5 and pickled meta objects.
I think that adding metadata to the DataFrame object requires that it serialize and work with all backends (pickle, hdf5, etc). Which probably means restricting the type of metadata that can be added. There are corner cases to pickling custom classes that would become pandas problems.
Hi guys. I'm a bit curious about something. This fix is currently addressing adding custom attributes to a dataframe. The values of these attributes can be Python functions, no? If so, this might be a workaround for adding custom instance methods to a dataframe. I know some people way back when were interested in this possibility.
I think the way this could work is the dataframe should have a new method, call it... I dunno, add_custom_method(). This would take in a function, then add the function to the 'meta' attribute dictionary, with some sort of traceback to let the program know it is special.
When the proposed new machinery assigns custom attributes to the new dataframe, it also may be neat to automatically promote such a function to an instance method. If it could do that, then we would have a way to effectively subclass a DataFrame without actually doing so.
This is likely overkill for the first go around, but maybe something to think about down the road.
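A rough sketch of what an add_custom_method helper along those lines could look like; the helper name and the use of a plain dict for 'meta' are assumptions for illustration, not an existing pandas API:

import types
import pandas as pd

def add_custom_method(df, func, name=None):
    # Record the function under the proposed 'meta' attribute so any future
    # meta-propagating machinery could carry it over to derived frames...
    if not hasattr(df, 'meta'):
        df.meta = {}
    name = name or func.__name__
    df.meta[name] = func
    # ...and bind it to this instance right away so it is callable as a method.
    setattr(df, name, types.MethodType(func, df))
    return df

def doubled(self):
    return self * 2

df = add_custom_method(pd.DataFrame({'a': [1, 2, 3]}), doubled)
df.doubled()  # works here, but would be lost on df * 2, df.copy(), etc.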
@dalejung do you have a link to the AttributeSet limit?
@hugadams you can simply monkey-patch if you want custom instance methods
import pandas

def my_func(self, **kwargs):
    return self * 2

pandas.DataFrame.my_func = my_func
@jreback: Thanks for pointing this out man. I've heard of monkeypatching instance methods, but always thought it was more of a colloquialism for something more difficult.
Thanks for showing me this.
@jreback http://www.hdfgroup.org/HDF5/doc/UG/13_Attributes.html#SpecIssues maybe? It's been awhile and it could be that pytables hasn't implemented new HDF5 features.
Personally, I had a dataset with ~40k items of metadata. Nothing complicated, just large. It was much easier to just pickle that stuff separately and use HDF for the actual data.
@dalejung thanks for the link....I am not sure of use-cases for meta data beyond simple structures anyhow....if you have regular data you can always store as separate structures or pickle or whatever....
@hugadams np....good luck
@jreback sure, but that's kind of the state now. You can use DataFrames as attributes of custom classes. You can keep track of your metadata separately.
My point is that there would be an expectation for the DataFrame metadata serialization to work. The HDF5 limit is worse because it's based on size and not type, which means it can work until it suddenly does not.
There are always going to be use-cases we don't think of. Adding a metadata attribute that sometimes saves will be asking for trouble.
@dalejung ok...once this PR GH #2497 is merged in you can try this out in a limited way (limited because data frames don't 'yet' pass this around). could catch errors if you try to store too much (not much to do in this case EXCEPT fail)
Looks like the for and against of the thorny serialization issue are clear.
Here is another thorny issue - what's the semantics of propagating meta through operations?
df1.meta.observation_date = "1/1/1981"
df1.meta.origin = "tower1"
df2.meta.observation_date = "1/1/1982"
df2.meta.origin = "tower2"
df3 = pd.concat([df1, df2])
# or merge, addition, ix, apply, etc.
Now, what's the "correct" meta for df3?
I'd be interested to hear specific examples of the problems you hope this will solve for you,
what are the kinds of meta tags you wish you had for your work?
@y-p I agree that propagation logic gets wonky. From experience, whether to propagate meta1/meta2/nothing is specific to the situation and doesn't follow any rule.
Maybe the need for metadata would be fulfilled by easier composition tools? For example, I tend to delegate attribute calls to the child dataframe and also connect the repr/str. There are certain conveniences that pandas provides that you lose with a simple composition.
Thinking about it, an api like the numpy array might be useful to allow composition classes to substitute for DataFrames.
Hi y-p. You bring up very good points in regard to merging. My thought would be that merged quantities that share keys should store results in a tuple, instead of overwriting; however, this is still an unfavorable situation.
You know, once the monkey patching was made clear to me by jreback, I realized that I could most likely get all the functionality I was looking for in custom attributes. Perhaps what would be more helpful at this point, rather than custom attributes, would be a small tutorial on the main page about how to monkey patch and customize pandas DataStructures.
In my personal situation, I no longer feel that custom metadata would really make or break my projects if monkey patching is adequate; however, you guys seem to have a better overview of pandas, so I think that it really is your judgement call if the new pros of metadata would outweigh the cons.
Thanks for all the ideas, here is my summary:
Dropping the milestone for now but will leave open if someone has more to add.
if you need (1), please open an issue and explain your use-case.
Hey y-p. Thanks for leaving this open. It turns out that monkey patching has not solved my problem as I originally thought it would.
Yes, monkey patching does allow one to add custom instance methods and attributes to a dataframe; however, any function that results in a new dataframe will not retain the values of these custom attributes.
From an email currently on the mailing list:
import pandas
pandas.DataFrame.name = None
df = pandas.DataFrame()
df.name = 'Bill'
df.name
>>> 'Bill'
df2 = df.mul(50)
df2.name
>>>
I've put together a custom dataframe for spectroscopy that I'm very excited about putting at the center of a new spectroscopy package; however, realized that every operation that returns a new dataframe resets all of my custom attributes. The instance methods and slots for the attributes are retained, so this is better than nothing, but still is going to hamper my program.
The only workaround I can find is to add some sort of attribute transfer function to every single dataframe method that I want to work with my custom dataframe. Thus, the whole point of making my object a custom dataframe is lost.
With this in mind, I think monkey patching is not adequate unless there's a workaround that I'm not aware of. Will see if anyone replies on the mailing list.
@hugadams you are probably much better off to create a class to hold both the frame and the meta and then forward methods as needed to handle manipulations...something like
class MyObject(object):
    def __init__(self, df, meta):
        self.df = df
        self.meta = meta

    @property
    def ix(self):
        return self.df.ix
depending on what exactly you need to do, the following will work
o = MyObject(df, meta)
o.ix[:,'foo'] = 'bar'
o.name = 'myobj'
and then you can customize serialization, object creation, etc.
you could even allow __getattr__ to automatically forward methods to df/meta as needed
only gets tricky when you do mutations
o.df = o.df * 5
you can even handle this by defining __mul__ in MyObject
you prob have a limited set of operations that you really want to support, power users can just
reach in and grab o.df if they need to...
hth
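To make that concrete, here is a minimal sketch of the container with __getattr__ forwarding and a __mul__ that rewraps the result (the class from above is repeated so the snippet is self-contained; this is just one way to do it):

import pandas as pd

class MyObject(object):
    def __init__(self, df, meta):
        self.df = df
        self.meta = meta

    def __getattr__(self, attr):
        # anything not defined on MyObject falls through to the frame
        return getattr(self.df, attr)

    def __mul__(self, other):
        # rewrap so the metadata travels with the result
        return MyObject(self.df * other, self.meta)

o = MyObject(pd.DataFrame({'foo': [1, 2, 3]}), {'origin': 'tower1'})
o2 = o * 5
o2.meta  # {'origin': 'tower1'}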
@jreback
Thanks for the input. I will certainly keep this in mind if the metadata idea of this thread never reaches fruition, as it seems to be the best way forward. Do you know offhand how I can implement direct slicing eg:
o['col1'] instead of o.df['col1']
I wasn't sure how to transfer that functionality to my custom object without a direct call to the underlying dataframe.
Thanks for pointing out the mul redefinition. This will help me going forward.
This really does feel like a roundabout solution to the Dataframe's inability to be subclassed. Especially if my custom object were to evolve with pandas, this would require maintenance to keep it synced up with changes to the Dataframe API.
What if we do this: using jreback's example, we create a generic class with the specific intention of it being subclassed for custom use? We can include the most common Dataframe methods and update all the operators accordingly. Then, hopeless fools like me who come along with the intent to customize have a really strong starting point.
I think that pandas' full potential has yet to be recognized by the research community, and I anticipate it will diffuse into many more scientific fields. As such, if we could present them with a generic class for customizing dataframes, then researchers may be more inclined to build packages around pandas, rather than coming up with their own ad-hoc data structures.
There are only a handful of methods you prob need to worry about, you can always access df anyhow
e.g. arithmetic, getitem, setitem, ix, maybe boolean
depends on what you want the user to be able to do with your object
python is all about least surprise. an object should do what you expect; in this case you are having your object quack like a DataFrame with extra attributes, or are you really doing more complex stuff like redefining the way operators work?
for example you could redefine * to mean "call my cool multiplier function", and in some fields this makes sense (e.g. in frequency domain analysis you want * to mean convolution)
can you provide an example of what you are trying to do?
# to provide: o['col1'] access
def __getitem__(self, key):
    # you could intercept calls to metadata here, for example
    if key in self.meta:
        return self.meta[key]
    return self.df.__getitem__(key)
All I'm doing is creating a dataframe for spectral data. As such, it has a special index type that I've written called "SpecIndex" and several methods for transforming itself to various representations of data. It also has special methods for extending how temporal data is managed. In any case, these operations are well-contained in my monkey patched version, and also would be easily implemented in a new class formalism as you've shown.
After this, it really should just quack. Besides these spectroscopic functions and attributes, it should behave like a dataframe. Therefore, I would prefer the most common operations on the dataframe to be seamless and promoted to instance methods. I want to encourage users to learn pandas and use this tool for exploratory spectroscopy. As such, I'm trying to intercept any inconsistencies ahead of time, like the one you pointed out about o.df = o.df * 5. Will I have to change the behavior of all the basic operators (e.g. * / + -) or just *? Any caveat like this, I'd like to correct in advance. In the end, I want the class layer itself to be as invisible as possible.
Do any more of these gotchas come to mind?
It's best to think of Pandas objects like you do integers. If you had a hypothetical Person object, its height would just be a number. The number would have no idea it was a height or what unit it was in. It's just there for numerical operations. height / height_avg doesn't care about the person's sex, weight, or race.
I think when the DataFrame is the primary data object this seems weird. But imagine that the Person object had a weight_history attribute. It wouldn't make sense to subclass a DataFrame to hold that attribute. Especially if other Pandas objects existed in Person data.
subclassing/metadata will always run into issues when doing exploratory analysis. Does SubDataFrame.tail() return a SubDataFrame? If it does, will it keep the same attributes? Do we want to make a copy of the dict for all ops like + - * /?
After a certain point it becomes obvious that you're not working with a Person or SpectralSeries. You're working on an int or a DataFrame. In the same way that convert_height(Person person) isn't more convenient than convert_height(int height), getting your users into the mindset that a DataFrame is just a data type will be simpler in the long run. Especially if your class gets more complicated and needs to hold more than one Pandas object.
@hugadams I would suggest looking at the various tests in pandas/tests/test_frame.py, and creating your own test suite. You can start by using your 'DataFrame' like object and see what breaks (obviously most things will break at the beginning). Then skip tests and/or fix things as you go.
You will probably want to change most of the arithmetic operations (e.g. * / + - ), e.g. anything you want a user to be able to do.
@hugadams If you want to see an old funky attempt at subclassing a df: http://nbviewer.ipython.org/4238540/
It quasi works because pretty much every DataFrame magic method calls another method; this gets intercepted in __getattribute__ and redirected to SubDF.df.
Thanks for all the help guys. I think I agree that maybe subclassing will get me into trouble, and thanks for sharing the example.
I will attempt jreback's implementation, but may I first ask one final thing?
The only reason I want special behavior/subclassing is that I want my custom attributes to persist after operations on the dataframe. Looking at this subclass example, it leads me to believe that it may not be so difficult to change the right methods so that these few new fixed attributes are transferred to any new dataframe created from an instance method or general operation. I mean, dataframes already preserve their attribute values upon mutation. How hard would it be to simply add my handful of new attributes into this machinery? This seems like it might be less work than building an entirely new class just to store attributes. (Instance methods can be monkey patched after all.)
@dalejung, in the simplest case, if all I wanted to do was add a "name" attribute to a dataframe, such that its value will persist after doing:
df = df * 5
or
df = df.apply(somefunc)
Would this be an easy hack to the source you provided?
@hugadams 'has-a' allows your custom attributes to persist regardless of the changes on the df
or more to the point, change when you want them to change
(and, as an aside, say you implemented disk-based persistence, then this would be easy)
generally i use 'is-a' only when I really need specialized types
when you need a bit of both, you can do a mix-in (not advocating that here as this gets even more complicated!)
@jreback
Well, my current monkey patched dataframe already works and sets attributes in a self consistent manner. Therefore, the only problem I have is that the values of these attributes are lost under operations that return a new dataframe. From what I've learned then, my best solution would be simply to use this object as is, then implement the 'has-a' behavior you mentioned.
I apologize for being so helpless, but can you elaborate or provide an example of what you mean by 'has-a'? I don't see where in the DataFrame source code this is used.
@hugadams 'has-a' is just another name for the container class I described above; 'is-a' is subclassing
@jreback Ok, thanks man. I will go for it, I appreciate your help.
@hugadams It could persist the .name if you transferred it in the wrap. But there will be corner cases where this will break down. Honestly, it's not a worthwhile path to go down. It's much better to use composition and keep the DataFrame a separate entity.
@dalejung
Thanks. I will try using composition.
Here's a wrong way to do it:
In [1]: import pandas as pd
...: def hook_class_init(cls):
...: def f(self,*args,**kwds):
...: from inspect import currentframe
...: f.__orig_init__(self,*args,**kwds)
...: obj = currentframe().f_back.f_locals.get('self')
...: if isinstance(obj,self.__class__) and hasattr(obj,"meta"):
...: setattr(self,"meta",getattr(obj,'meta'))
...: if not hasattr(cls.__init__,"__orig_init__"):
...: f.__orig_init__ = cls.__init__
...: cls.__init__=f
...:
...: def unhook_class_init(cls):
...: if hasattr(cls.__init__,"__orig_init__"):
...: cls.__init__=cls.__init__.__orig_init__
In [2]: hook_class_init(pd.DataFrame)
...: df1=mkdf(10,4)
...: df1.meta=dict(name="foo",weight="buzz")
...: x=df1.copy()
...: print x.meta
{'name': 'foo', 'weight': 'buzz'}
In [3]: unhook_class_init(pd.DataFrame)
...: x=df1.copy()
...: print x.meta
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-3-ed39e9901bfc> in <module>()
1 unhook_class_init(pd.DataFrame)
2 x=df1.copy()
----> 3 print x.meta
/home/user1/src/pandas/pandas/core/frame.pyc in __getattr__(self, name)
2020 return self[name]
2021 raise AttributeError("'%s' object has no attribute '%s'" %
-> 2022 (type(self).__name__, name))
2023
2024 def __setattr__(self, name, value):
AttributeError: 'DataFrame' object has no attribute 'meta'
Thanks y-p.
Ya, I'm beginning to see the amount of hacking required to do something like this. I'll stick with the composition method.
So I've been playing with the subclassing/composition stuff.
https://github.com/dalejung/trtools/blob/master/trtools/tools/composition.py
The simple use-case I've been using is a return dataset. I have ReturnFrame/ReturnSeries with attributes/methods to work specifically on returns. So far, it seems to be useful if only to save me from typing (1+returns) so often. As expected, I ran into the metadata issue where something like type(returns > .01) == ReturnSeries # true occurs, which makes no sense.
I also ran into the issue where a DataFrame will lose series class/metadata when added. I had to create a subclass that dumbly stores the class/metadata in dicts and rewrap on attr/item access.
https://github.com/dalejung/trtools/blob/master/trtools/core/dataset.py
It's been a couple days and I haven't run into any real issues outside of not initially understanding the numpy api. However, subclassing pd.DataFrame, while necessary to trick pandas into accepting the class, makes things messy and requires using __getattribute__ to ensure I'm wrapping the return data.
I'm starting to think that having a __dataframe__ api would be a good choice. It would allow composition classes to gain a lot of the re-use and simplicity of subclassing while avoiding the complications. Supporting subclassing seems like an open-ended commitment for pandas. However, having a dataframe api would allow custom classes to easily hook into pandas while maintaining the contract that pandas only knows about and deals with pandas.DataFrames.
Hi dalejung. This looks pretty cool. On vacation so I can't play with it, but can you explain exactly what you mean by a dataframe api? Do you mean an api for customizing a dataframe, or am I misunderstanding?
Hi guys,
I noticed that one way to get a bunch of the dataframe behavior is to override __getattr__. E.g.:
from pandas import DataFrame

class Foo(object):
    def __init__(self):
        self.df = DataFrame([1, 2, 3], [3, 4, 5])

    def __getattr__(self, attr):
        return getattr(self.df, attr)

    def __getitem__(self, key):
        '''Item lookup'''
        return self.df.__getitem__(key)
This way, many important methods and attributes from a dataframe (like ix, apply, shape) seem to just work out of the box on my Foo object. This saves me a lot of manual effort in promoting the attributes and methods that I want to work directly on Foo. Do you guys anticipate any big errors or problems that this could introduce?
I just wanted to add to this thread that I put in a pull request for a composition class that attempts to mimic a dataframe in the most general way as possible.
https://github.com/pydata/pandas/pull/2695
I noticed DaleJung's implementation may work better, but I can't test it because I'm getting an error with a long traceback:
AttributeError: 'module' object has no attribute 'BufferedIOBase'
File "/home/hugadams/Desktop/trttools/trtools-master/trtools/tools/composition.py", line 4, in
import pandas as pd
File "/usr/local/EPD/lib/python2.7/site-packages/pandas/init.py", line 27, in
from pandas.core.api import *
File "/usr/local/EPD/lib/python2.7/site-packages/pandas/core/api.py", line 13, in
from pandas.core.series import Series, TimeSeries
File "/usr/local/EPD/lib/python2.7/site-packages/pandas/core/series.py", line 3120, in
import pandas.tools.plotting as _gfx
File "/usr/local/EPD/lib/python2.7/site-packages/pandas/tools/plotting.py", line 21, in
import pandas.tseries.converter as conv
File "/usr/local/EPD/lib/python2.7/site-packages/pandas/tseries/converter.py", line 7, in
import matplotlib.units as units
File "/usr/local/EPD/lib/python2.7/site-packages/matplotlib/__init__.py", line 151, in
from matplotlib.rcsetup import (defaultParams,
File "/usr/local/EPD/lib/python2.7/site-packages/matplotlib/rcsetup.py", line 20, in
from matplotlib.colors import is_color_like
File "/usr/local/EPD/lib/python2.7/site-packages/matplotlib/colors.py", line 54, in
import matplotlib.cbook as cbook
File "/usr/local/EPD/lib/python2.7/site-packages/matplotlib/cbook.py", line 11, in
import gzip
File "/usr/local/EPD/lib/python2.7/gzip.py", line 36, in
class GzipFile(io.BufferedIOBase):
DaleJung, if you are going to continue to work on this at some point in the future, can you let me know ([email protected])? I don't want to submit my pull request if your solution turns out to make more sense.
Thanks.
That error looks like a working directory issue. Likely my io directory is pre-empting the base io module. Try importing from a different dir?
The composition class is something I actively used and will update as I run into issues. The reality is that it's a complete hack. I'd be wary of promoting its use without knowing internally what it's doing to masquerade around as a DataFrame.
Thanks dalejung. I will try to get it working.
Do you have a file in here that runs a series of tests on your object, such as the ones mentioned? If so, do you mind if I borrow it and try to run some tests on my implementation? You said that yours was a complete hack. I don't think my crack at it is necessarily a hack, but probably is not optimal, if that makes sense.
I've started making more use of the subclassing lately and so I moved it to a separate project.
https://github.com/dalejung/pandas-composition
I'm writing more test coverage. I usually subclass for very specific types of data so my day to day testing is constrained.
http://nbviewer.ipython.org/5864433 is kind of the stuff I use it for.
Thanks for letting me know Dale.
I'm at scipy, and at one of the symposia, some people were really
interested in the ability to subclass dataframes. Do you think this will
be merged into future releases, or will be kept apart from the main repo?
It's probably better implemented than my working solution, so looking
forward to it.
Seems implausible to be honest. While I tried to make it as robust as possible, pandas-composition works the way I think it should. Decisions like
- whether to propagate meta-data
- whether df['name'] = s should overwrite s.name
- not supporting HDF5 directly
were easy to make for my use case. These are decisions that would have to be better thought out if included in pandas. There's a reason it hasn't been included thus far.
Take the hdf5 for example. I chose to store the numerical data in hdf5 and the metadata separately. So it works more like a bundle on MacOSX. You wouldn't get a subclass from the hdf5, you'd just get a DataFrame.
Plus, I still think it'd be better to create a dataframe api where a class can expose a DataFrame representation for pandas operations. So instead of checking isinstance(obj, DataFrame) and operating on obj, we'd check for __dataframe__, call it, and operate on its return.
That would make composition simple.
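A sketch of what such a convention might look like on the consuming side; the as_dataframe helper is hypothetical (pandas does not dispatch on __dataframe__ in this way) and is only meant to illustrate the idea:

import pandas as pd

def as_dataframe(obj):
    # accept a real DataFrame, or anything that can expose one
    if isinstance(obj, pd.DataFrame):
        return obj
    if hasattr(obj, '__dataframe__'):
        return obj.__dataframe__()
    raise TypeError("expected a DataFrame or an object exposing __dataframe__")

class ReturnFrame(object):
    """Composition class: a frame plus return-specific metadata."""
    def __init__(self, df, meta=None):
        self.df = df
        self.meta = meta or {}

    def __dataframe__(self):
        return self.df

rf = ReturnFrame(pd.DataFrame({'r': [0.01, -0.02, 0.03]}), {'asset': 'SPY'})
totals = as_dataframe(rf).sum()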
I see what you're saying. Here's a pretty hacked up solution that I ended up using for my spectroscopy package:
https://github.com/hugadams/pyuvvis/blob/master/pyuvvis/pandas_utils/metadframe.py
It basically overloads as many operators as possible, and defers attribute calls to the underlying df object that is stored in the composite class. It is semi-hacky in that I just tried to overload and redirect until every operation that I used in my research worked. If you can provide any feedback, especially constructive, on its implementation or obvious design flaws, I'd be very appreciative. (Also, it's called MetaDataFrame, but I realize this is a poorly chosen name.)
At SciPy this year, pandas was quite popular. I can feel it spilling over to other domains, and people are running up against the same problems we are. More folks have already begun implementing their own solutions. I feel that this is the time when at least giving these people a starting point would be really well-received. Even if we acknowledge that they will need to customize, and it won't be part of pandas, it would be nice to have something to start with. At the very minimum, having a small package and a few docs that say "here's how we implemented it and here's all the caveats" would be a nice starting point for others looking to customize pandas data structures.
well, a start could be defining a set of functions like those that @cpcloud abstracted out while cleaning up data.py (is_dataframe, is_series, is_panel, etc.) which could then be overridden/altered more easily than a ton of isinstance checks all over the place.
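Something along these lines, for illustration; the helpers are just isinstance wrappers that a downstream project could later override in one place (exact names are assumed, not quoted from data.py):

import pandas as pd

def is_series(obj):
    return isinstance(obj, pd.Series)

def is_dataframe(obj):
    return isinstance(obj, pd.DataFrame)

def is_panel(obj):
    # Panel no longer exists in recent pandas, so guard the lookup
    return isinstance(obj, getattr(pd, 'Panel', ()))

# call sites would then use is_dataframe(obj) etc. instead of scattering
# isinstance(obj, DataFrame) checks, so the check can be relaxed centrally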
@hugadams have you considered trying to refactor the DataFrame code to replace calls to DataFrame with _constructor, and possibly adding other class calls (like Series to _series and Panel to _panel) which would (internally) return the objects to use to create elements (so, in many methods, instead of Series(), one could use self._series()), etc.? In particular, this might work in pandas core itself and be generally useful.
I sub-classed DataFrame in order to provide meta-data (such as a name attribute). In order to get around all the methods returning new dataframe objects, I created a decorator to grab the returned df and make it an instance of my sub class. This is rather painful though as it means re-implementing every such method and adding the decorator. e.g:
class NamedDataFrame(DataFrame):
    @named_dataframe
    def from_csv(...):
        return super(NamedDataFrame, self).from_csv(...)
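The named_dataframe decorator itself isn't shown above; a guess at what it might look like (purely illustrative, relying on the NamedDataFrame class defined here):

from functools import wraps

def named_dataframe(method):
    """Rewrap the plain DataFrame returned by `method` as a NamedDataFrame,
    carrying over the name attribute from the instance it was called on."""
    @wraps(method)
    def wrapper(self, *args, **kwargs):
        result = NamedDataFrame(method(self, *args, **kwargs))
        result.name = getattr(self, 'name', None)
        return result
    return wrapper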
you can see what you can do w.r.t. #6923, #6927; this is a much harder problem than it looks at first glance.
you don't need to sub-class: just override _metadata and __finalize__ and you can provide support for the name attribute.
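A minimal sketch of that _metadata/__finalize__ route, shown on a small subclass for illustration (how many operations actually route through __finalize__ varies between pandas versions):

import pandas as pd

class NamedFrame(pd.DataFrame):
    # attributes listed here are copied by __finalize__ where it is called
    _metadata = ['name']

    @property
    def _constructor(self):
        return NamedFrame

df = NamedFrame({'A': [1, 2, 3]})
df.name = 'tower1'
df.head().name  # 'tower1' wherever the operation calls __finalize__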
@jreback: your comment from #6923:
The entire problem arises from how to combine them.
Imagine we supported this:
s1.filename='a' s2.filename='b'
what is (s1+s2).filename?
Pandas has already chosen an approach for handling the semantics of metadata in Series: it's how the library handles the name attribute. Personally, I don't see why the basic behavior for _any_ metadata attribute should be any different:
>>> t = np.array([0,0.1,0.2])
>>> s1 = pd.Series(t*t,t,name='Tweedledee')
>>> s2 = pd.Series(t*t,t,name='Tweedledum')
>>> s1
0.0 0.00
0.1 0.01
0.2 0.04
Name: Tweedledee, dtype: float64
>>> s1*2
0.0 0.00
0.1 0.02
0.2 0.08
Name: Tweedledee, dtype: float64
>>> s1+2
0.0 2.00
0.1 2.01
0.2 2.04
Name: Tweedledee, dtype: float64
>>> s1+s2
0.0 0.00
0.1 0.02
0.2 0.08
dtype: float64
>>> s3 = pd.Series(t*t,t,name='Tweedledum')
>>> s1+s3
0.0 0.00
0.1 0.02
0.2 0.08
dtype: float64
>>> s2+s3
0.0 0.00
0.1 0.02
0.2 0.08
Name: Tweedledum, dtype: float64
>>> s1.iloc[:2]
0.0 0.00
0.1 0.01
Name: Tweedledee, dtype: float64
This shows that indexing and operations using a constant preserve the name. It also shows that binary operations between Series preserve the name if both operands share the same name, and remove the name if the operands have different names.
This is a baseline behavior that at least does something reasonable, and if extended to metadata in general, would be consistent with Pandas' existing behavior of the name attribute.
Yeah, in an ideal world we could write a units addon class and attach them to Series and have it do the right thing in handling math operations (require the same units for addition/subtraction, compute new units for multiplication/division/powers, require unitless numbers for most other functions). But right now it would be helpful just to have something basic.
I've checked out the _metadata functionality and it seems like it persists only when using a Series with indexing; addition/multiplication by a constant drops the metadata value. Combination of series into a DataFrame doesn't seem to work properly, but I'm not as familiar with the semantics of DataFrame as I am with the Series objects.
@jason-s
ok, so are you proposing something?
Yes, but I'm not sure how to translate it from a concept to working Python code.
There is code in pandas.Series that seems to preserve the name attribute in a meaningful way under indexing, binary operation with numeric constants, and binary operation with other Series objects.
Is there any reason why other entries in the _metadata list could not be handled the same way, at least as a baseline behavior?
Jason,
While I don't have any opinions on what should be in pandas and what shouldn't, I can bring to your attention some workarounds.
First, Stephan Hoyer has put a lot of work into the xray library (http://www.slideshare.net/PyData/xray-extended-arrays-for-scientific-datasets-by-stephan-hoyer), which intrinsically supports metadata on labeled arrays. Based on what I've seen from the tutorials, it's the most robust solution to the problem.
Secondly, the geopandas library has a subclassed dataframe which stores metadata. You can probably engineer your own by copying some of their approaches:
https://www.google.com/search?q=geopandas&aq=f&oq=geopandas&aqs=chrome.0.57j60l3j0l2.1305j1&sourceid=chrome&ie=UTF-8
Finally, I have a "MetaDataframe" object that's pretty much a hack, but it will work in the way you desire. All you have to do is subclass from it, and metadata should persist. E.g.:

class MyClass(MetaDataFrame):
    ...

You don't need the library it's in, just the class itself:
https://github.com/hugadams/pyuvvis/blob/master/pyuvvis/pandas_utils/metadframe.py
While I can't promise it will work correctly for all dataframe functionality, you can implement it in only a few lines. Check out the "SubFoo" class in the metadframe.py file for an example.
Sorry, and just to be clear, the GeoPandas object is a subclassed dataframe. The MetaDataframe class is not; it's a composite class that passes calls down to the dataframe. Therefore, while you can subclass it very easily, I can't promise it's going to work perfectly in all use cases. The GeoPandas/XRay solutions are more robust.
Thanks. I'll take a look, maybe even get my fingers dirty with pandas internals. I do think this should be done right + not rushed into, but I also think that it's important to get at least some useful basic functionality implemented + separate that from a more general solution that may or may not exist.
Something like:
s1attr = getattr(series1, attrname)
s2attr = getattr(series2, attrname)
try:
    sresultattr = s1attr._combine(s2attr, op)
    # if the attributes know how to combine themselves, let them
except:
    # otherwise, if they're equal, propagate to output
    # user must beware of mutable values with equivalence
    if s1attr == s2attr:
        sresultattr = s1attr
    else:
        sresultattr = None
@jason-s can you show an example of what you are wanting to do? pseudo-code is fine.
you _can_ simply add to the _metadata class-level attribute and then it will propagate that attribute.
Here is a longer discussion of the issue:
https://github.com/pydata/pandas/issues/2485
OK. I'll try to put some time in tonight. As the issue feature in github is a little clunky, what I'll probably do is create a sample IPython notebook + publish as a gist.
The _metadata attribute works fine with Series but seems to behave oddly in DataFrame objects.
@jason-s Based on my experience with xray, the biggest complexity is how you handle metadata arguments that you can't (or don't want to) check for equality, e.g., if the metadata could be a numpy array, for which equality checks are elementwise, or worse, with some missing values (note np.nan != np.nan). Of course, there are workarounds for this sort of stuff, but it's pretty awkward.
I'll add more in #8572.
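A tiny illustration of that equality pitfall (plain numpy behaviour, nothing pandas-specific):

import numpy as np

a = np.array([1.0, np.nan])
b = np.array([1.0, np.nan])

print(a == b)                # [ True False] -- elementwise, and nan != nan
try:
    if a == b:               # truth value of an array is ambiguous
        pass
except ValueError as err:
    print(err)
print(np.array_equal(a, b))  # False, again because of the nan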
@hugadams Thanks for the xray plug. Next time use my GH handle and github will ping me automatically :).
Got it, sorry
Any news on this issue? I just found myself wishing for the possibility to attach metadata (probably in a dict) to a dataframe.
its certainly possible to add a default propagated attribute like .attrs via the _metadata/__finalize__ machinery. IIRC geopandas does this.
But it would need quite a bit of auditing and testing. You are welcome to have a go. Can you show your non-trivial use case?
My use case would be similar to what I imagine @hugadams meant when talking about working with spectroscopy results - data that are constant for the whole dataframe. At the moment I use dataframe.columns.name for this - it doesn't feel clean or idiomatic, but it is sufficient for this one case since I only wanted to attach _one_ string.
I have the same use case as @bilderbuchi (recording scientific experimental metadata):
- subject information - genotype, gender, age
- experiment information - version hashes, config hashes
It's now much easier to subclass a dataframe and add your own attributes
and methods. This wasn't the case when I started the issue
yeah, but something that round-trips through a vanilla pickled dataframe would be preferable
Would the features offered by xarray be something that can be adopted here? See Data Structures: http://xarray.pydata.org/en/stable/data-structures.html
They have data attributes. If pandas could get the same features, this would be great.
Unit conversion, unit propagation, etc.
I think xarray is what you want.
You may also try this metadataframe class I wrote a few years ago. It may no longer work with newer pandas versions, but I haven't tried.
You should be able to download that file, then just make a class that has attributes like you want. I.e.:

df = MetaDataframe()
df.a = a
df.b = b

I thought that after 0.16, it was possible to simply subclass a dataframe, right? I.e.:

class MyDF(DataFrame):
    def __init__(self, *args, **kwargs):
        super(MyDF, self).__init__(*args, **kwargs)
        self.a = 50
        self.b = 20

Or is this not the case?
Here's what I was talking about:
http://pandas.pydata.org/pandas-docs/stable/internals.html#override-constructor-properties
I think xarray is what you want.
So did you want to express that everyone aiming to use metadata would be better off using xarray?
They have data attributes. If pandas could get the same features, this would be great.
Unit conversion, unit propagation, etc.
Just to be clear, xarray does support adding arbitrary metadata, but not automatic unit conversion. We could hook up a library like pint to handle this, but it's difficult to get all the edge cases working until numpy has better dtype support.
I think 'automatic unit conversion based on metadata attached to series' is a significantly different and more involved feature request than this issue. I hope a simpler upstream-supported solution allowing attaching simple text-only metadata can be found before increasing the scope too much.
This is quite simple in current versions of pandas.
I am using a sub-class here for illustration purposes.
Really all that would be needed would be adding the __finalize__ call to most of the construction methods (this already exists now for Series, but not really for DataFrame).
Unambiguous propagation would be quite easy, and users could add in their own __finalize__ to handle more complicated cases (e.g. what would you do when you have df + df2?).
In [1]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:from pandas import DataFrame
:
:class MyDataFrame(DataFrame):
: _metadata = ['attrs']
:
: @property
: def _constructor(self):
: return MyDataFrame
:
: def _combine_const(self, other, *args, **kwargs):
: return super(MyDataFrame, self)._combine_const(other, *args, **kwargs).__finalize__(self)
:--
In [2]: df = MyDataFrame({'A' : [1,2,3]})
In [3]: df.attrs = {'foo' : 'bar'}
In [4]: df.attrs
Out[4]: {'foo': 'bar'}
In [5]: (df+1).attrs
Out[5]: {'foo': 'bar'}
Would take a patch for this; the modifications are pretty straightforward, it's the testing that is the key here.
@jreback is there a generic way to persist metadata amongst all transforms applied to a dataframe, including groupbys? Or would one have to go through and override a lot of methods to call __finalize__?
@postelrich for most/all things, __finalize__ should already be defined (and so in theory you can make it persist attributes). It's not tested really well though.
For Series I think this is quite robust; DataFrame pretty good. I doubt this works at all for groupby / merge / most reductions. Those are really dependent on __finalize__ (it may or may not be called); that is the simple part. The hard part is deciding what to do.
I've been working on an implementation of this that handles the propagation problem by making the Metadata object itself subclass Series. Then patch Series to relay methods to Metadata. Roughly:
class MSeries(pd.Series):
    def __init__(self, *args, **kwargs):
        pd.Series.__init__(self, *args, **kwargs)
        self.metadata = SMeta(self)

    def __add__(self, other):
        res = pd.Series.__add__(self, other)
        res.metadata = self.metadata.__add__(other)
        return res

class SMeta(pd.Series):
    def __init__(self, parent):
        super(...)
        self.parent = parent

    def __add__(self, other):
        new_meta = SMeta(index=self.index)
        other_meta = [... other or other.metadata or None depending ...]
        for key in self.index:
            new_meta[key] = self[key].__add__(other)
So it is up to the individual MetaDatum classes to figure out how to propagate.
I've generally got this working. The part that I have not gotten working is the desired MFrame behavior: df.metadata['A'] is df['A'].metadata. Any ideas on how to make that happen?
Propagation of attributes (defined in _metadata) gives me some headaches...
Based on the code of jreback, I've tried the following:
from pandas import DataFrame

class MyDataFrame(DataFrame):
    _metadata = ['attrs']

    @property
    def _constructor(self):
        return MyDataFrame

    def _combine_frame(self, other, *args, **kwargs):
        return super(MyDataFrame, self)._combine_frame(other, *args, **kwargs).__finalize__(self)

dfA = MyDataFrame({'A': [1, 2, 3]})
dfA.attrs = {'foo': 'bar'}
dfB = MyDataFrame({'B': [6, 7, 8]})
dfB.attrs = {'fuzzy': 'busy'}

dfC = dfA.append(dfB)
dfC.attrs  # raises: 'MyDataFrame' object has no attribute 'attrs'
           # I would like this to be {'foo': 'bar'}
As jreback mentioned, choices have to be made about what to do with the appended attributes.
However, I would be really helped if the attributes of just dfA simply propagated to dfC.
EDIT: more headache is better, it pushes me to think harder :). Solved it by stealing the __finalize__ solution that GeoPandas provides. __finalize__ works pretty well indeed. However, I'm not experienced enough to perform the testing.
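For reference, a sketch of the kind of __finalize__ override that handles the append/concat case above, loosely modelled on the geopandas approach; how concat passes its inputs to __finalize__ differs between pandas versions, so treat this as illustrative:

from pandas import DataFrame

class MyDataFrame(DataFrame):
    _metadata = ['attrs']

    @property
    def _constructor(self):
        return MyDataFrame

    def __finalize__(self, other, method=None, **kwargs):
        # for concat/append, `other` wraps the list of input frames;
        # here we simply take the metadata of the first input
        if method == 'concat':
            other = other.objs[0]
        for name in self._metadata:
            object.__setattr__(self, name, getattr(other, name, None))
        return self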
Can't we just put metadata in the column name and change how columns are accessed? E.g. ["id"] would internally translate to {"name": "id"}.
Don't know the internals of pandas, so sorry if this might be a little naive. To me it just seems that the column name is really consistent across operations.
My use case would be adding a description to "indicator variables" (just 0/1) which otherwise look like var#1, var#2, etc., and I do not want to pollute those names with the potentially long values they actually stand for.
I think we have _metadata (https://pandas.pydata.org/pandas-docs/stable/development/extending.html#define-original-properties) and .attrs defined for these metadata use cases. If these don't sufficiently cover the necessary use cases, new issues can be created about those 2 methods. Closing.
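For readers arriving here now, a quick illustration of the .attrs route (available in recent pandas releases; propagation through operations is still documented as experimental):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]})
df.attrs['instrument'] = 'spectrometer-1'
df.attrs['units'] = 'counts'

df.head().attrs  # carried through many (though not all) operations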