Xarray: Bug in the conversion of a pandas DataFrame into an xarray Dataset

Created on 4 May 2020 · 6 Comments · Source: pydata/xarray


For an unknown reason, the Dataset coordinates are not in the same order as the data variable dimensions when the Dataset is created from a multi-level DataFrame generated by concatenating two Series.

In this case, the Dataset coordinates are not sorted in ascending order when the Dataset is created (using the DataFrame.to_xarray method). Interestingly, this problem doesn't occur if the original multi-level DataFrame is generated using the groupby() method.

A notebook presenting the issue can be downloaded [here](https://github.com/lhoupert/xarraytest_lh).

MCVE Code Sample

# dfs1: multi-level DataFrame (index levels Staname and Year, columns U and V)
da1 = dfs1.to_xarray()
print(da1)

Expected Output

<xarray.Dataset>
Dimensions:  (Staname: 60, Year: 15)
Coordinates:
  * Staname  (Staname) object '10G' '13G' '14G' '15G' '8G' ... 'Q1' 'R' 'S' 'T'
  * Year     (Year) int64 1996 1997 1998 1999 2000 ... 2013 2014 2015 2016 2017
Data variables:
    U        (Staname, Year) float64 nan nan nan nan ... 6.592e+04 6.592e+04 nan
    V        (Staname, Year) float64 nan nan nan ... -6.592e+04 -6.592e+04 nan

Problem Description

The current output is:

<xarray.Dataset>
Dimensions:  (Staname: 60, Year: 15)
Coordinates:
  * Staname  (Staname) object 'IB23S' 'IB22S' 'IB21S' ... '10G' '9G' '8G'
  * Year     (Year) object 1996 1997 1998 1999 2000 ... 2013 2014 2015 2016 2017
Data variables:
    U        (Staname, Year) float64 nan nan nan nan ... 6.592e+04 6.592e+04 nan
    V        (Staname, Year) float64 nan nan nan ... -6.592e+04 -6.592e+04 nan

For an unknown reason, the Dataset created from the conversion of the DataFrame dfs1 is wrong.

For example, the data indexed as station IB23S:

print(da1.V.loc['IB23S',:])
<xarray.DataArray 'V' (Year: 15)>
array([       nan,        nan,        nan,        nan, -100910.  ,
              nan,        nan,        nan, -105910.1 ,        nan,
              nan,        nan, -105910.15, -105910.16, -105910.17])
Coordinates:
    Staname  <U5 'IB23S'
  * Year     (Year) object 1996 1997 1998 1999 2000 ... 2013 2014 2015 2016 2017



The original DataFrame, however, holds different values for station IB23S:

Staname  Year
IB23S    2005   -65969.05
         2006   -65969.06
         2010   -60969.10
         2011   -60969.11
         2014   -60969.14
         2015   -60969.15
         2016   -60969.16
         2017   -55969.17
Name: V, dtype: float64



Instead, the values returned for IB23S correspond to station 10G in the original DataFrame:

```python
dfs1.V.loc['10G',:]
Staname  Year
10G      2000   -100910.00
         2010   -105910.10
         2015   -105910.15
         2016   -105910.16
         2017   -105910.17
Name: V, dtype: float64
```

Notes

The problem appears to be in the Dataset coordinate Staname, which has not been sorted in ascending order, while the data variables appear to have been sorted differently.

The original multi-level DataFrame was generated by concatenating two Series.

Interestingly, this problem doesn't occur if the original multi-level DataFrame is generated using the groupby() method.

A notebook presenting the issue can be downloaded [here](https://github.com/lhoupert/xarraytest_lh).
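
For anyone who wants to run the same comparison on their own data without the notebook, here is a minimal sketch of the check done by hand above. The station names and values are made up, `conversion_preserves_labels` is a hypothetical helper, and concatenating the two Series column-wise is only an assumption about how the original frame was built. This toy frame is not guaranteed to trigger the bug on an affected version; it only illustrates the check.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: station names, years and values are illustrative only.
index = pd.MultiIndex.from_tuples(
    [("IB23S", 2005), ("IB23S", 2006), ("10G", 2000), ("10G", 2010)],
    names=["Staname", "Year"],
)
u = pd.Series([1.0, 2.0, 3.0, 4.0], index=index, name="U")
v = pd.Series([-1.0, -2.0, -3.0, -4.0], index=index, name="V")

# Assumption: the two Series are concatenated column-wise into one frame.
frame = pd.concat([u, v], axis=1)


def conversion_preserves_labels(df: pd.DataFrame) -> bool:
    """Return True if every (Staname, Year) value survives df.to_xarray()."""
    ds = df.to_xarray()  # requires xarray to be installed
    # Convert back and keep only the rows present in the original frame.
    back = ds.to_dataframe().reindex(df.index)[list(df.columns)]
    return bool(np.allclose(back.to_numpy(), df.to_numpy(), equal_nan=True))


print(conversion_preserves_labels(frame))
```

If the helper returns False, the Dataset labels no longer point at the values they had in the DataFrame.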

Versions

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.8.2 | packaged by conda-forge | (default, Mar 23 2020, 17:55:48)
[Clang 9.0.1 ]
python-bits: 64
OS: Darwin
OS-release: 18.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8
libhdf5: 1.10.5
libnetcdf: 4.7.4

xarray: 0.15.1
pandas: 1.0.3
numpy: 1.18.1
scipy: 1.4.1
netCDF4: 1.5.3
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.1.1.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: 2.4.0
bottleneck: None
dask: 2.14.0
distributed: 2.14.0
matplotlib: 3.2.1
cartopy: 0.17.0
seaborn: 0.10.0
numbagg: None
setuptools: 46.1.3.post20200325
pip: 20.0.2
conda: None
pytest: 5.4.1
IPython: 7.13.0
sphinx: None

bug

All 6 comments

Thanks for the report @lhoupert. It seems very similar to #4019

Simpler example, if it's any use:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(4).reshape((2,2)), index=[-1,-2], columns=[-3,-4])
df.index.name, df.columns.name = 'row', 'column'
print('Original Series')
print(df.stack())
print('After going to DataArray and back')
print(df.stack().to_xarray().to_series())

Output

Original Series
row  column
-1   -3        0
     -4        1
-2   -3        2
     -4        3
dtype: int32
After going to DataArray and back
row  column
-1   -3        3
     -4        2
-2   -3        1
     -4        0
dtype: int32

It looks like the following line causes the input data to be reordered, so it no longer matches the original dimensions:

https://github.com/pydata/xarray/blob/master/xarray/core/dataset.py#L4564

This would probably work if the dimensions were sorted before putting them into the dataset.

Indeed, if I change this line from

obj[dim] = (dim, lev)

to

obj[dim] = (dim, sorted(lev))

then this particular case works:

Original Series
row  column
-1   -3        0
     -4        1
-2   -3        2
     -4        3
dtype: int32
After going to DataArray and back
row  column
-2   -4        3
     -3        2
-1   -4        1
     -3        0
dtype: int32

It looks like this is fixed by 1eedc5c
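
A quick way to confirm the fix on a given install is to round-trip a Series whose index levels are stored unsorted and compare the values label by label. This is only a sketch: `roundtrips_cleanly` is a hypothetical helper, and building the MultiIndex directly from unsorted levels is my assumption about what makes the `df.stack()` example trigger the problem.

```python
import numpy as np
import pandas as pd


def roundtrips_cleanly(series: pd.Series) -> bool:
    """Return True if Series -> DataArray -> Series keeps every label's value."""
    back = series.to_xarray().to_series()  # requires xarray to be installed
    # Compare as {label: value} dicts so that row order does not matter.
    # Note: this simple comparison is not suitable for data containing NaN.
    return back.to_dict() == series.to_dict()


# MultiIndex with unsorted levels, mirroring the row/column example above.
index = pd.MultiIndex(
    levels=[[-1, -2], [-3, -4]],
    codes=[[0, 0, 1, 1], [0, 1, 0, 1]],
    names=["row", "column"],
)
series = pd.Series(np.arange(4), index=index)

print(roundtrips_cleanly(series))
```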

Thanks for tracking this down @ignamv.
