Multi-dimensional grouped operations should be relatively straightforward -- the main complexity will be writing an N-dimensional concat that doesn't involve repeatedly copying data.
The idea with `group_over` would be to support groupby operations that act on a single element from each of the given groups, rather than on the unique values. For example, `ds.group_over(['lat', 'lon'])` would let you iterate over or apply to 2D slices of `ds`, no matter how many dimensions it has.
Roughly speaking (it's a little more complex for the case of non-dimension variables), `ds.group_over(dims)` would get translated into `ds.groupby([d for d in ds.dims if d not in dims])`.
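The dimension-complement translation above can be sketched as a tiny helper (the name `group_over_dims` is hypothetical; `group_over` itself does not exist in xarray):

```python
def group_over_dims(all_dims, over):
    """Return the dimensions the equivalent groupby call would receive.

    Sketch of the proposed translation: grouping *over* some dimensions
    means grouping *by* all the remaining ones.
    """
    return [d for d in all_dims if d not in over]

# e.g. grouping over ('lat', 'lon') on a ('time', 'lat', 'lon') dataset
# means the equivalent groupby runs along 'time'.
```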
Related: #266
@shoyer -
I want to look into putting a PR together for this. I'm looking for the same functionality that you get with a pandas Series or DataFrame:
```python
data.groupby([lambda x: x.hour, lambda x: x.timetuple().tm_yday]).mean()
```
The motivation comes in making a Hovmoller diagram. What we need is this functionality:
```python
da.groupby(['time.hour', 'time.dayofyear']).mean().plot()
```
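That API does not exist yet, but the reduction it would compute can be sketched today by dropping to pandas, where grouping by a list of keys along one axis already works (toy hourly data; all names here are illustrative):

```python
import numpy as np
import pandas as pd

# Synthetic hourly series standing in for `da` (3 days of data).
times = pd.date_range("2000-01-01", periods=72, freq="h")
s = pd.Series(np.arange(72.0), index=times)

# Two groupers along the single time dimension, as pandas already allows.
hov = s.groupby([s.index.hour, s.index.dayofyear]).mean()

# Unstacking yields the 2-D hour x dayofyear table behind a Hovmoller plot.
table = hov.unstack()
```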
If you can point me in the right direction, I'll see if I can put something together.
@jhamman For your use case, both hour and dayofyear are along the time dimension, so arguably the result should be 1D with a MultiIndex instead of 2D. So it might make more sense to start with that, and then layer on stack/unstack or pivot functionality.
I guess there are two related use cases here:

1. Multiple groupby arguments along a single dimension (e.g., hour and dayofyear along time).
2. Multiple groupby arguments along different dimensions (e.g., lat and lon).
Agreed, we have two use cases here.
For (1), can we just use the pandas grouping infrastructure? We just need to allow `xray.DataArray.groupby` to support an iterable and `pandas.Grouper` objects. I personally don't like the MultiIndex format and prefer to unstack the result of grouper operations when possible. In xray, I think we can justify going that route, since we support N-D labeled dimensions much better than pandas does.
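For reference, this is the pandas machinery being pointed at: `groupby` already accepts a `pandas.Grouper` (minimal sketch with toy data):

```python
import numpy as np
import pandas as pd

# Toy hourly series (illustrative data, two days long).
times = pd.date_range("2000-01-01", periods=48, freq="h")
s = pd.Series(np.arange(48.0), index=times)

# pandas.Grouper does time-based binning inside the groupby machinery.
daily = s.groupby(pd.Grouper(freq="D")).mean()
```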
For (2), I'll need to think a bit more about how this would work. Do we add a `groupby` method to `DataArrayGroupBy`? That sounds messy. Maybe we need to write an N-D grouper object?
For (2), I think it makes sense to extend the existing groupby to deal with multiple dimensions, i.e., let it take an iterable of dimension names:
```python
>>> darray.groupby(['lat', 'lon'])
```
Then we'd have something similar to the SQL groupby, which is a good thing.
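For comparison, SQL-style multi-key grouping in pandas looks like this (toy data; the proposed `darray.groupby(['lat', 'lon'])` would presumably behave analogously for dimension coordinates):

```python
import pandas as pd

# "SELECT lat, lon, AVG(temp) ... GROUP BY lat, lon" in pandas.
df = pd.DataFrame({
    "lat": [10, 10, 20, 20],
    "lon": [30, 40, 30, 40],
    "temp": [1.0, 2.0, 3.0, 4.0],
})
out = df.groupby(["lat", "lon"])["temp"].mean()
```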
By the way, in #527 we were considering using this approach to make the faceted plots on both rows and columns.
In case it is of interest to anyone, the snippet below is a temporary and quite dirty solution I've used to do a multi-dimensional groupby...
It runs nested groupby-apply operations over each given dimension until no further grouping needs to be done, then applies the given function `apply_fn`:
```python
def nested_groupby_apply(dataarray, groupby, apply_fn):
    if len(groupby) == 1:
        return dataarray.groupby(groupby[0]).apply(apply_fn)
    else:
        return dataarray.groupby(groupby[0]).apply(
            nested_groupby_apply, groupby=groupby[1:], apply_fn=apply_fn
        )
```
Obviously performance can potentially be quite poor. Passing the dimensions to group over in order of increasing length will reduce your cost a little.
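To illustrate the recursion without xarray itself, here is a pandas analogue with toy data (the name `nested_groupby_apply_pd` and the example frame are hypothetical, not from the thread):

```python
import pandas as pd

def nested_groupby_apply_pd(df, groupby, apply_fn):
    # Same recursion as the xarray snippet above, for a pandas DataFrame:
    # group by the first key, then recurse on the rest within each group.
    if len(groupby) == 1:
        return df.groupby(groupby[0]).apply(apply_fn)
    return df.groupby(groupby[0]).apply(
        lambda g: nested_groupby_apply_pd(g, groupby[1:], apply_fn)
    )

df = pd.DataFrame({
    "a": [0, 0, 1, 1],
    "b": ["x", "y", "x", "y"],
    "v": [1.0, 2.0, 3.0, 4.0],
})

# Mean of 'v' within each (a, b) group, computed one key at a time.
nested = nested_groupby_apply_pd(df, ["a", "b"], lambda g: g["v"].mean())
```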
Is use case 1 (Multiple groupby arguments along a single dimension) being held back for use case 2 (Multiple groupby arguments along different dimensions)? Use case 1 would be very useful by itself.
No, I think the biggest issue is that grouping variables into a MultiIndex on the result sort of works (with the current PR https://github.com/pydata/xarray/pull/924), but it's very easy to end up with weird conflicts between coordinates / MultiIndex levels that are hard to resolve right now within the xarray data model. Probably it would be best to resolve https://github.com/pydata/xarray/issues/1603 first, which will make this much easier.
In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.
If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically.
Still relevant.