I use this stack / groupby / unstack pattern quite frequently, e.g. here.
An issue I have is that after groupby('allpoints').apply(), the coordinate names do not get carried through: the coordinate names become allpoints_level_0 and allpoints_level_1. After unstacking I then rename them back to lat/lon etc. Do you ever encounter this?
Is there a way to carry them through, and is this an issue for others?
import xarray as xr
import numpy as np
ds = xr.DataArray(np.random.randn(180, 360, 2000), coords={'lat': np.arange(90, -90, -1), 'lon': np.arange(-180, 180), 'time': range(2000)}, dims=['lat', 'lon', 'time'])
ds
<xarray.DataArray (lat: 180, lon: 360, time: 2000)>
array([[[ 0.623891, -0.044304, ..., 1.015785, 0.009088],
[-0.7375 , 0.380369, ..., 0.788351, -0.69295 ],
...,
[ 0.171894, 0.517164, ..., -0.946908, -0.597802],
[ 0.353743, 0.005539, ..., -1.436965, -0.190099]],
....
Coordinates:
* lat (lat) int32 90 89 88 87 86 85 84 83 82 81 80 79 78 77 76 75 74 ...
* lon (lon) int32 -180 -179 -178 -177 -176 -175 -174 -173 -172 -171 ...
* time (time) int32 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ..
Now we stack the data by allpoints. Note that the info about original coordinates (lat / lon) is still there...
dst = ds.stack(allpoints=['lat', 'lon'])
dst
<xarray.DataArray (time: 2000, allpoints: 64800)>
array([[ 0.623891, -0.7375 , 0.053525, ..., 0.379701, 0.130618, 0.11094 ],
[-0.044304, 0.380369, -0.410632, ..., -0.739881, 0.203219, -0.506303],
[-1.762024, -1.019424, 2.580218, ..., 1.491677, 1.189149, -0.072223],
...,
[-0.896298, 0.333163, -1.751641, ..., 1.90315 , 2.642813, -0.913787],
[ 1.015785, 0.788351, 0.379997, ..., 0.864934, 0.889001, -1.363458],
[ 0.009088, -0.69295 , -1.276184, ..., 1.220656, 0.895599, 0.848757]])
Coordinates:
* time (time) int32 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 ...
* allpoints (allpoints) MultiIndex
- lat (allpoints) int64 90 90 90 90 90 90 90 90 90 90 90 90 90 90 ...
- lon (allpoints) int64 -180 -179 -178 -177 -176 -175 -174 -173 ...
Now apply groupby().apply()
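(my_custom_function is not defined in the snippet; as a stand-in for reproducing the behaviour, any per-group reduction will do, e.g.:)

```python
import numpy as np

def my_custom_function(group):
    # stand-in for the undefined function in this snippet:
    # groupby('allpoints') passes each grid point's 1-D time series,
    # and we reduce it to a single scalar per point
    return group.mean()
```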
dsg = dst.groupby('allpoints').apply(my_custom_function)
dsg
<xarray.DataArray (allpoints: 64800)>
array([ 0.013697, 0.006272, 0.009744, ..., -0.016265, -0.002108, -0.014733])
Coordinates:
* allpoints (allpoints) MultiIndex
- allpoints_level_0 (allpoints) int64 -89 -89 -89 -89 -89 -89 -89 -89 -89 ...
- allpoints_level_1 (allpoints) int64 -180 -179 -178 -177 -176 -175 -174 ...
So now we have lost 'lat' and 'lon'. However, if we skip the groupby step and go straight to unstack, the names are carried through.
dst.unstack('allpoints')
<xarray.DataArray (time: 2000, lat: 180, lon: 360)>
array([[[ 0.623891, -0.7375 , ..., 0.171894, 0.353743],
[ 1.780691, -0.747431, ..., 0.038754, 0.615228],
...,
Coordinates:
* time (time) int32 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
* lat (lat) int64 90 89 88 87 86 85 84 83 82 81 80 79 78 77 76 75 74 ...
* lon (lon) int64 -180 -179 -178 -177 -176 -175 -174 -173 -172 -171 ...
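The renaming workaround mentioned at the top can be sketched with a small helper (hypothetical, not part of xarray; it just builds the mapping from the auto-generated level names back to the originals):

```python
# hypothetical helper: map the auto-generated level names
# (allpoints_level_0, allpoints_level_1, ...) back to the original
# coordinate names, so they can be passed to .rename() after unstacking
def level_rename_map(dim, names):
    return {'{}_level_{}'.format(dim, i): name for i, name in enumerate(names)}

# usage sketch, with dsg as in the snippet above:
# dsg = dsg.unstack('allpoints').rename(level_rename_map('allpoints', ['lat', 'lon']))
```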
Maybe this is not an issue for others, or I am missing something...
Or perhaps this is intended behaviour?
Thanks for clarification!
Instead of computing the mean over your non-stacked dimension by
dsg = dst.groupby('allpoints').mean()
why not just instead call
dsg = dst.mean('time', keep_attrs=True)
so that you just collapse the time dimension and preserve the attributes on your data? Then you can unstack() and everything should still be there. The stack/apply/unstack idiom is really useful for fitting your data to the interface of a numpy or scipy function that will do all the heavy lifting with a vectorized routine for you. Isn't using groupby in this way really slow?
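The vectorized idiom described here can be sketched in plain NumPy (the array shapes are illustrative stand-ins for the lat/lon/time data above):

```python
import numpy as np

# the vectorized idiom: reshape so the reduced axis is last, make one
# NumPy call over all points at once, then reshape back (no per-point
# Python loop, unlike groupby().apply())
data = np.random.randn(4, 6, 50)            # small stand-in for (lat, lon, time)
flat = data.reshape(-1, data.shape[-1])     # "stack" lat/lon into points
result = flat.mean(axis=-1)                 # one vectorized reduction
restored = result.reshape(data.shape[:-1])  # "unstack" back to (lat, lon)
```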
@darothen
Yes, you are right: this is definitely not a good way to apply mean. I was just using mean as a (poor) example, trying not to over-complicate or distract from the issue.
But, as you suggest, this is what I do when I need to apply customised functions, e.g. from scipy, which can end up being slow.
This wasn't intentional. If we can fix it in a straightforward fashion, we definitely should.