I use this stack / groupby / unstack pattern quite frequently, e.g. here.
An issue I have is that after groupby('allpoints').apply(), the coordinate names do not get carried through: the coordinate names become allpoints_level_0 and allpoints_level_1. After unstacking I then rename them back to lat/lon etc. Do you ever encounter this?
Is there a way to carry them through, and is this an issue for others?
import xarray as xr
import numpy as np
ds = xr.DataArray(np.random.randn(180, 360, 2000), coords={'lat': np.arange(90, -90, -1), 'lon': np.arange(-180, 180), 'time': range(2000)}, dims=['lat', 'lon', 'time'])
ds
<xarray.DataArray (lat: 180, lon: 360, time: 2000)>
array([[[ 0.623891, -0.044304, ..., 1.015785, 0.009088],
[-0.7375 , 0.380369, ..., 0.788351, -0.69295 ],
...,
[ 0.171894, 0.517164, ..., -0.946908, -0.597802],
[ 0.353743, 0.005539, ..., -1.436965, -0.190099]],
....
Coordinates:
* lat (lat) int32 90 89 88 87 86 85 84 83 82 81 80 79 78 77 76 75 74 ...
* lon (lon) int32 -180 -179 -178 -177 -176 -175 -174 -173 -172 -171 ...
* time (time) int32 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ..
Now we stack the data by allpoints. Note that the info about original coordinates (lat / lon) is still there...
dst = ds.stack(allpoints=['lat', 'lon'])
dst
<xarray.DataArray (time: 2000, allpoints: 64800)>
array([[ 0.623891, -0.7375 , 0.053525, ..., 0.379701, 0.130618, 0.11094 ],
[-0.044304, 0.380369, -0.410632, ..., -0.739881, 0.203219, -0.506303],
[-1.762024, -1.019424, 2.580218, ..., 1.491677, 1.189149, -0.072223],
...,
[-0.896298, 0.333163, -1.751641, ..., 1.90315 , 2.642813, -0.913787],
[ 1.015785, 0.788351, 0.379997, ..., 0.864934, 0.889001, -1.363458],
[ 0.009088, -0.69295 , -1.276184, ..., 1.220656, 0.895599, 0.848757]])
Coordinates:
* time (time) int32 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 ...
* allpoints (allpoints) MultiIndex
- lat (allpoints) int64 90 90 90 90 90 90 90 90 90 90 90 90 90 90 ...
- lon (allpoints) int64 -180 -179 -178 -177 -176 -175 -174 -173 ...
Now apply groupby().apply()
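(my_custom_function is not defined in the snippet; as a stand-in for reproducing the behaviour, any per-group reduction will do, e.g.:)

```python
import numpy as np

def my_custom_function(group):
    # stand-in for the undefined function in this snippet:
    # groupby('allpoints') passes each grid point's 1-D time series,
    # and we reduce it to a single scalar per point
    return group.mean()
```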
dsg = dst.groupby('allpoints').apply(my_custom_function)
dsg
<xarray.DataArray (allpoints: 64800)>
array([ 0.013697, 0.006272, 0.009744, ..., -0.016265, -0.002108, -0.014733])
Coordinates:
* allpoints (allpoints) MultiIndex
- allpoints_level_0 (allpoints) int64 -89 -89 -89 -89 -89 -89 -89 -89 -89 ...
- allpoints_level_1 (allpoints) int64 -180 -179 -178 -177 -176 -175 -174 ...
So now we have lost 'lat' and 'lon'. However, if we skip the groupby step and go straight to unstack, the names are carried through.
dst.unstack('allpoints')
<xarray.DataArray (time: 2000, lat: 180, lon: 360)>
array([[[ 0.623891, -0.7375 , ..., 0.171894, 0.353743],
[ 1.780691, -0.747431, ..., 0.038754, 0.615228],
...,
Coordinates:
* time (time) int32 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
* lat (lat) int64 90 89 88 87 86 85 84 83 82 81 80 79 78 77 76 75 74 ...
* lon (lon) int64 -180 -179 -178 -177 -176 -175 -174 -173 -172 -171 ...
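The renaming workaround mentioned at the top can be sketched with a small helper (hypothetical, not part of xarray; it just builds the mapping from the auto-generated level names back to the originals):

```python
# hypothetical helper: map the auto-generated level names
# (allpoints_level_0, allpoints_level_1, ...) back to the original
# coordinate names, so they can be passed to .rename() after unstacking
def level_rename_map(dim, names):
    return {'{}_level_{}'.format(dim, i): name for i, name in enumerate(names)}

# usage sketch, with dsg as in the snippet above:
# dsg = dsg.unstack('allpoints').rename(level_rename_map('allpoints', ['lat', 'lon']))
```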
Maybe this is not an issue for others, or I am missing something...
Or perhaps this is intended behaviour?
Thanks for clarification!
Instead of computing the mean over your non-stacked dimension by
dsg = dst.groupby('allpoints').mean()
why not just instead call
dsg = dst.mean('time', keep_attrs=True)
so that you just collapse the time dimension and preserve the attributes on your data? Then you can unstack() and everything should still be there. The stack/apply/unstack idiom is really useful for fitting your data to the interface of a numpy or scipy function that will do all the heavy lifting with a vectorized routine for you. Isn't using groupby in this way really slow?
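The vectorized idiom described here can be sketched in plain NumPy (the array shapes are illustrative stand-ins for the lat/lon/time data above):

```python
import numpy as np

# the vectorized idiom: reshape so the reduced axis is last, make one
# NumPy call over all points at once, then reshape back (no per-point
# Python loop, unlike groupby().apply())
data = np.random.randn(4, 6, 50)            # small stand-in for (lat, lon, time)
flat = data.reshape(-1, data.shape[-1])     # "stack" lat/lon into points
result = flat.mean(axis=-1)                 # one vectorized reduction
restored = result.reshape(data.shape[:-1])  # "unstack" back to (lat, lon)
```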
@darothen
Yes, you are right: this is definitely not a good way to apply mean. I was just using mean as a (poor) example, trying not to over-complicate or distract from the issue.
But, as you suggest, this is what I do when I need to apply customised functions, e.g. from scipy, which can end up being slow.
This wasn't intentional. If we can fix it in a straightforward fashion, we definitely should.