For an unknown reason, the Dataset coordinates don't appear to be in the same order as the data variables' dimension when the Dataset is created from a multi-level DataFrame generated by concatenating two Series.
In this case, the Dataset coordinates are not sorted in ascending order when the Dataset is created (using the DataFrame.to_xarray method). Interestingly, this problem doesn't occur if the original multi-level DataFrame is generated using the groupby() method.
A notebook presenting the issue can be downloaded [here](https://github.com/lhoupert/xarraytest_lh).
da1 = dfs1.to_xarray()
print(da1)
The expected output is:
<xarray.Dataset>
Dimensions:  (Staname: 60, Year: 15)
Coordinates:
  * Staname  (Staname) object '10G' '13G' '14G' '15G' '8G' ... 'Q1' 'R' 'S' 'T'
  * Year     (Year) int64 1996 1997 1998 1999 2000 ... 2013 2014 2015 2016 2017
Data variables:
    U        (Staname, Year) float64 nan nan nan nan ... 6.592e+04 6.592e+04 nan
    V        (Staname, Year) float64 nan nan nan ... -6.592e+04 -6.592e+04 nan
The current output is:
<xarray.Dataset>
Dimensions:  (Staname: 60, Year: 15)
Coordinates:
  * Staname  (Staname) object 'IB23S' 'IB22S' 'IB21S' ... '10G' '9G' '8G'
  * Year     (Year) object 1996 1997 1998 1999 2000 ... 2013 2014 2015 2016 2017
Data variables:
    U        (Staname, Year) float64 nan nan nan nan ... 6.592e+04 6.592e+04 nan
    V        (Staname, Year) float64 nan nan nan ... -6.592e+04 -6.592e+04 nan
For an unknown reason, the Dataset created by converting the DataFrame dfs1 is wrong.
For example, the data indexed as station IB23S:
print(da1.V.loc['IB23S',:])
<xarray.DataArray 'V' (Year: 15)>
array([       nan,        nan,        nan,        nan, -100910.  ,
              nan,        nan,        nan, -105910.1 ,        nan,
              nan,        nan, -105910.15, -105910.16, -105910.17])
Coordinates:
    Staname  <U5 'IB23S'
  * Year     (Year) object 1996 1997 1998 1999 2000 ... 2013 2014 2015 2016 2017
does not match the data for station IB23S in the original DataFrame:
Staname  Year
IB23S    2005   -65969.05
         2006   -65969.06
         2010   -60969.10
         2011   -60969.11
         2014   -60969.14
         2015   -60969.15
         2016   -60969.16
         2017   -55969.17
Name: V, dtype: float64
Instead, it appears to be the data corresponding to station 10G in the original DataFrame:
```python
dfs1.V.loc['10G',:]
Staname  Year
10G      2000   -100910.00
         2010   -105910.10
         2015   -105910.15
         2016   -105910.16
         2017   -105910.17
Name: V, dtype: float64
```
The problem appears to be in the Dataset coordinate Staname, which has not been sorted in ascending order, while the data variables appear to have been sorted differently.
The original multi-level DataFrame was generated by concatenating two Series.
Interestingly, this problem doesn't occur if the original multi-level DataFrame is generated using the groupby() method.
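The manual comparison above (station by station, Dataset against DataFrame) can be automated. The following is only a sketch: the helper `mismatched_stations` and the small `df_demo` frame are hypothetical and not taken from the notebook, and whether this tiny frame triggers the problem on an affected xarray version has not been verified.

```python
import numpy as np
import pandas as pd
import xarray as xr  # noqa: F401  (DataFrame.to_xarray needs xarray installed)


def mismatched_stations(df, ds, var="V", station_dim="Staname"):
    """Return the stations whose values in ds[var] differ from those in df[var]."""
    bad = []
    for station in df.index.get_level_values(station_dim).unique():
        # Values for this station in the DataFrame, indexed by the remaining level.
        expected = df[var].xs(station, level=station_dim)
        # Values stored under the same station label in the Dataset.
        actual = ds[var].sel({station_dim: station}).to_series()
        # Compare only the labels that exist in the DataFrame.
        actual = actual.reindex(expected.index)
        if not np.allclose(actual.to_numpy(), expected.to_numpy(), equal_nan=True):
            bad.append(station)
    return bad


# Hypothetical data: two Series sharing a (Staname, Year) index, concatenated column-wise.
index = pd.MultiIndex.from_tuples(
    [("IB23S", 2005), ("IB23S", 2006), ("10G", 2000), ("10G", 2010)],
    names=["Staname", "Year"],
)
u = pd.Series([1.0, 2.0, 3.0, 4.0], index=index, name="U")
v = pd.Series([-1.0, -2.0, -3.0, -4.0], index=index, name="V")
df_demo = pd.concat([u, v], axis=1)
ds_demo = df_demo.to_xarray()

print(mismatched_stations(df_demo, ds_demo))
```

A non-empty list would point to stations such as IB23S above, where the values stored under a label belong to another station.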
Output of xr.show_versions()
commit: None
python: 3.8.2 | packaged by conda-forge | (default, Mar 23 2020, 17:55:48)
[Clang 9.0.1 ]
python-bits: 64
OS: Darwin
OS-release: 18.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8
libhdf5: 1.10.5
libnetcdf: 4.7.4
xarray: 0.15.1
pandas: 1.0.3
numpy: 1.18.1
scipy: 1.4.1
netCDF4: 1.5.3
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.1.1.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: 2.4.0
bottleneck: None
dask: 2.14.0
distributed: 2.14.0
matplotlib: 3.2.1
cartopy: 0.17.0
seaborn: 0.10.0
numbagg: None
setuptools: 46.1.3.post20200325
pip: 20.0.2
conda: None
pytest: 5.4.1
IPython: 7.13.0
sphinx: None
Thanks for the report @lhoupert. It seems very similar to #4019
Simpler example, if it's any use:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(4).reshape((2, 2)), index=[-1, -2], columns=[-3, -4])
df.index.name, df.columns.name = 'row', 'column'
print('Original Series')
print(df.stack())
print('After going to DataArray and back')
print(df.stack().to_xarray().to_series())
Output:
Original Series
row  column
-1   -3        0
     -4        1
-2   -3        2
     -4        3
dtype: int32
After going to DataArray and back
row  column
-1   -3        3
     -4        2
-2   -3        1
     -4        0
dtype: int32
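Comparing the two printouts by eye is error-prone, so a label-aligned check may help when testing a given xarray version. This is a sketch only: `roundtrip_keeps_labels` is a hypothetical helper, and the index below spells out unsorted levels explicitly, which is an assumption about what the stacked index looks like internally.

```python
import numpy as np
import pandas as pd
import xarray as xr  # noqa: F401  (Series.to_xarray needs xarray installed)


def roundtrip_keeps_labels(series):
    """True if Series -> DataArray -> Series keeps each value on its original label."""
    back = series.to_xarray().to_series()
    # Align on labels so that a different row order alone is not reported.
    back = back.reindex(series.index)
    return bool(np.array_equal(back.to_numpy(), series.to_numpy()))


# Same labels and values as the example above, with unsorted levels written out.
index = pd.MultiIndex(
    levels=[[-1, -2], [-3, -4]],
    codes=[[0, 0, 1, 1], [0, 1, 0, 1]],
    names=["row", "column"],
)
series = pd.Series(np.arange(4), index=index)

print(roundtrip_keeps_labels(series))
```

The check ignores row order on purpose: a sorted result is fine as long as every value stays attached to its original (row, column) pair.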
It looks like the following line causes the input data to be reordered, so it no longer matches the original dimensions:
https://github.com/pydata/xarray/blob/master/xarray/core/dataset.py#L4564
This would probably work if the dimensions were sorted before putting them into the dataset.
Indeed, if I change this line from
obj[dim] = (dim, lev)
to
obj[dim] = (dim, sorted(lev))
then this particular case works:
Original Series
row  column
-1   -3        0
     -4        1
-2   -3        2
     -4        3
dtype: int32
After going to DataArray and back
row  column
-2   -4        3
     -3        2
-1   -4        1
     -3        0
dtype: int32
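To illustrate why sorting the level before attaching it can restore the pairing, here is a toy model in plain pandas and NumPy. It is not xarray's actual code: the premise that the data ends up laid out in sorted-label order is an assumption, and the names `labels`, `values_by_label`, `mismatched` and `matched` are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Level as stored in the index: unsorted (assumed for this toy model).
labels = pd.Index([-1, -2], name="row")
values_by_label = {-1: 10.0, -2: 20.0}

# Assumption: the data array is laid out in sorted-label order, i.e. [-2, -1].
data = np.array([values_by_label[int(k)] for k in sorted(labels)])

# Attaching the unsorted labels pairs each value with the wrong label.
mismatched = pd.Series(data, index=labels)

# Sorting the labels first keeps labels and data in the same order.
matched = pd.Series(data, index=sorted(labels))

print(mismatched.loc[-1])  # 20.0, although label -1 should hold 10.0
print(matched.loc[-1])     # 10.0
```

This only shows the general failure mode (labels and data ordered by different rules); it says nothing about whether sorting is the right fix in every case.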
It looks like this is fixed by 1eedc5c
Thanks for tracking this down @ignamv.