Multi-dimensional grouped operations should be relatively straightforward -- the main complexity will be writing an N-dimensional concat that doesn't involve repeatedly copying data.
The idea with `group_over` would be to support groupby operations that act on a single element from each of the given groups, rather than on the unique values. For example, `ds.group_over(['lat', 'lon'])` would let you iterate over or apply to 2D slices of `ds`, no matter how many dimensions it has.
Roughly speaking (it's a little more complex for the case of non-dimension variables), `ds.group_over(dims)` would get translated into `ds.groupby([d for d in ds.dims if d not in dims])`.
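The dimension-complement translation above can be sketched as a tiny helper (the name `group_over_dims` is hypothetical; `group_over` itself does not exist in xarray):

```python
def group_over_dims(all_dims, over):
    """Return the dimensions the equivalent groupby call would receive.

    Sketch of the proposed translation: grouping *over* some dimensions
    means grouping *by* all the remaining ones.
    """
    return [d for d in all_dims if d not in over]

# e.g. grouping over ('lat', 'lon') on a ('time', 'lat', 'lon') dataset
# means the equivalent groupby runs along 'time'.
```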
Related: #266
@shoyer -
I want to look into putting a PR together for this. I'm looking for the same functionality that you get with a pandas Series or DataFrame:
```python
data.groupby([lambda x: x.hour, lambda x: x.timetuple().tm_yday]).mean()
```
The motivation comes in making a Hovmoller diagram. What we need is this functionality:
```python
da.groupby(['time.hour', 'time.dayofyear']).mean().plot()
```
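That API does not exist yet, but the reduction it would compute can be sketched today by dropping to pandas, where grouping by a list of keys along one axis already works (toy hourly data; all names here are illustrative):

```python
import numpy as np
import pandas as pd

# Synthetic hourly series standing in for `da` (3 days of data).
times = pd.date_range("2000-01-01", periods=72, freq="h")
s = pd.Series(np.arange(72.0), index=times)

# Two groupers along the single time dimension, as pandas already allows.
hov = s.groupby([s.index.hour, s.index.dayofyear]).mean()

# Unstacking yields the 2-D hour x dayofyear table behind a Hovmoller plot.
table = hov.unstack()
```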
If you can point me in the right direction, I'll see if I can put something together.
@jhamman For your use case, both hour and dayofyear are along the time dimension, so arguably the result should be 1D with a MultiIndex instead of 2D. So it might make more sense to start with that, and then layer on stack/unstack or pivot functionality.
I guess there are two related use cases here:

1. Multiple groupby arguments along a single dimension (e.g., hour and dayofyear along time).
2. Multiple groupby arguments along different dimensions (e.g., lat and lon).
Agreed, we have two use cases here.
For (1), can we just use the pandas grouping infrastructure? We just need to allow `xray.DataArray.groupby` to support an iterable and `pandas.Grouper` objects. I personally don't like the MultiIndex format and prefer to unstack the result of grouper operations when possible. In xray, I think we can justify going that route, since we support N-D labeled dimensions much better than pandas does.
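For reference, this is the pandas machinery being pointed at: `groupby` already accepts a `pandas.Grouper` (minimal sketch with toy data):

```python
import numpy as np
import pandas as pd

# Toy hourly series (illustrative data, two days long).
times = pd.date_range("2000-01-01", periods=48, freq="h")
s = pd.Series(np.arange(48.0), index=times)

# pandas.Grouper does time-based binning inside the groupby machinery.
daily = s.groupby(pd.Grouper(freq="D")).mean()
```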
For (2), I'll need to think a bit more about how this would work. Do we add a `groupby` method to `DataArrayGroupBy`? That sounds messy. Maybe we need to write an N-D grouper object?
For (2), I think it makes sense to extend the existing groupby to deal with multiple dimensions, i.e., let it take an iterable of dimension names:
```python
>>> darray.groupby(['lat', 'lon'])
```
Then we'd have something similar to the SQL groupby, which is a good thing.
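For comparison, SQL-style multi-key grouping in pandas looks like this (toy data; the proposed `darray.groupby(['lat', 'lon'])` would presumably behave analogously for dimension coordinates):

```python
import pandas as pd

# "SELECT lat, lon, AVG(temp) ... GROUP BY lat, lon" in pandas.
df = pd.DataFrame({
    "lat": [10, 10, 20, 20],
    "lon": [30, 40, 30, 40],
    "temp": [1.0, 2.0, 3.0, 4.0],
})
out = df.groupby(["lat", "lon"])["temp"].mean()
```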
By the way, in #527 we were considering using this approach to make the faceted plots on both rows and columns.
In case it is of interest to anyone, the snippet below is a temporary and quite dirty solution I've used to do a multi-dimensional groupby...
It runs nested groupby-apply operations over each given dimension until no further grouping needs to be done, then applies the given function `apply_fn`:
```python
def nested_groupby_apply(dataarray, groupby, apply_fn):
    if len(groupby) == 1:
        return dataarray.groupby(groupby[0]).apply(apply_fn)
    else:
        return dataarray.groupby(groupby[0]).apply(
            nested_groupby_apply, groupby=groupby[1:], apply_fn=apply_fn
        )
```
Obviously performance can potentially be quite poor. Passing the dimensions to group over in order of increasing length will reduce your cost a little.
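To illustrate the recursion without xarray itself, here is a pandas analogue with toy data (the name `nested_groupby_apply_pd` and the example frame are hypothetical, not from the thread):

```python
import pandas as pd

def nested_groupby_apply_pd(df, groupby, apply_fn):
    # Same recursion as the xarray snippet above, for a pandas DataFrame:
    # group by the first key, then recurse on the rest within each group.
    if len(groupby) == 1:
        return df.groupby(groupby[0]).apply(apply_fn)
    return df.groupby(groupby[0]).apply(
        lambda g: nested_groupby_apply_pd(g, groupby[1:], apply_fn)
    )

df = pd.DataFrame({
    "a": [0, 0, 1, 1],
    "b": ["x", "y", "x", "y"],
    "v": [1.0, 2.0, 3.0, 4.0],
})

# Mean of 'v' within each (a, b) group, computed one key at a time.
nested = nested_groupby_apply_pd(df, ["a", "b"], lambda g: g["v"].mean())
```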
Is use case 1 (Multiple groupby arguments along a single dimension) being held back for use case 2 (Multiple groupby arguments along different dimensions)? Use case 1 would be very useful by itself.
No, I think the biggest issue is that grouping variables into a MultiIndex on the result sort of works (with the current PR https://github.com/pydata/xarray/pull/924), but it's very easy to end up with weird conflicts between coordinates / MultiIndex levels that are hard to resolve right now within the xarray data model. Probably it would be best to resolve https://github.com/pydata/xarray/issues/1603 first, which will make this much easier.
In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.
If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically.
Still relevant.