What happened:
A project of mine suddenly broke with:
ValueError: Object has inconsistent chunks along dimension row. This can be fixed by calling unify_chunks().
where previously it had worked.
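For reference, the call suggested by the error message looks roughly like this on an xarray object (a sketch; ds stands in for the affected Dataset):

ds = ds.unify_chunks()  # rechunk so that chunk boundaries agree along shared dimensions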
What you expected to happen:
There should have been no change.
Minimal Complete Verifiable Example:
This is very difficult to reproduce. I have tried, but it clearly isn't triggered for relatively simple xarray.Datasets. In my code, the Datasets in question are the result of multiple concatenation, selection and chunking operations. What I shall do instead is attempt to demonstrate the change, in the hope that someone more knowledgeable has some intuition for what has gone wrong.
dask==2.25.0
I have a dataset, foo, with a number of different variables, most of which are indexed by row. I will focus on one variable, FLAG, to demonstrate the change in behaviour. This is what FLAG looks like prior to a foo.sortby("row") call. Note that there is only a single chunk (this is intentional).
<xarray.DataArray 'FLAG' (row: 40710, chan: 1024, corr: 4)>
dask.array<rechunk-merge, shape=(40710, 1024, 4), dtype=bool, chunksize=(40710, 1024, 4), chunktype=numpy.ndarray>
Coordinates:
* row (row) int64 462991 462993 462994 462996 ... 505074 505075 505076
Dimensions without coordinates: chan, corr
After the foo.sortby("row") call:
<xarray.DataArray 'FLAG' (row: 40710, chan: 1024, corr: 4)>
dask.array<getitem, shape=(40710, 1024, 4), dtype=bool, chunksize=(40710, 1024, 4), chunktype=numpy.ndarray>
Coordinates:
* row (row) int64 462991 462993 462994 462996 ... 505076 505077 505078
Dimensions without coordinates: chan, corr
Note that the chunksize is unchanged.
dask==2.26.0
Repeating exactly the same experiment, prior to the call:
<xarray.DataArray 'FLAG' (row: 40710, chan: 1024, corr: 4)>
dask.array<rechunk-merge, shape=(40710, 1024, 4), dtype=bool, chunksize=(40710, 1024, 4), chunktype=numpy.ndarray>
Coordinates:
* row (row) int64 462991 462993 462994 462996 ... 505074 505075 505076
Dimensions without coordinates: chan, corr
After the foo.sortby("row") call:
<xarray.DataArray 'FLAG' (row: 40710, chan: 1024, corr: 4)>
dask.array<getitem, shape=(40710, 1024, 4), dtype=bool, chunksize=(20355, 1024, 4), chunktype=numpy.ndarray>
Coordinates:
* row (row) int64 462991 462993 462994 462996 ... 505076 505077 505078
Dimensions without coordinates: chan, corr
Note the change in the chunksize.
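For completeness, the same change can be seen by inspecting the dask chunks directly (a sketch using the names above; the tuples in the comments follow from the reprs):

print(foo.FLAG.data.chunks)  # ((40710,), (1024,), (4,)) before the sortby
sorted_foo = foo.sortby("row")
print(sorted_foo.FLAG.data.chunks)  # ((20355, 20355), (1024,), (4,)) after the sortby with dask==2.26.0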
Anything else we need to know?:
I have seen similar behaviour when using xarray.Dataset.sel.
Environment:
dask==2.25.0
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.9 (default, Jul 17 2020, 12:50:27)
[GCC 8.4.0]
python-bits: 64
OS: Linux
OS-release: 5.3.0-7648-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: None
libnetcdf: None
xarray: 0.15.1
pandas: 1.1.2
numpy: 1.19.2
scipy: 1.5.2
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.4.0
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.25.0
distributed: 2.26.0
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
setuptools: 50.3.0
pip: 20.2.3
conda: None
pytest: 6.0.2
IPython: None
sphinx: None
dask==2.26.0
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.9 (default, Jul 17 2020, 12:50:27)
[GCC 8.4.0]
python-bits: 64
OS: Linux
OS-release: 5.3.0-7648-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: None
libnetcdf: None
xarray: 0.15.1
pandas: 1.1.2
numpy: 1.19.2
scipy: 1.5.2
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.4.0
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.26.0
distributed: 2.26.0
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
setuptools: 50.3.0
pip: 20.2.3
conda: None
pytest: 6.0.2
IPython: None
sphinx: None
Finally managed to reproduce. Here it is:
import xarray
import dask.array as da
import numpy as np

if __name__ == "__main__":
    data = da.random.random([10000, 16, 4], chunks=(10000, 16, 4))
    dtype = np.float32
    xds = xarray.Dataset(
        data_vars={"DATA1": (("x", "y", "z"), data.astype(dtype))})
    upsample_factor = 1024 // xds.dims["y"]
    # Create a selection which will upsample the y axis.
    selection = np.repeat(np.arange(xds.dims["y"]), upsample_factor)
    print("xarray.Dataset prior to resampling:\n", xds)
    xds = xds.sel({"y": selection})
    print("xarray.Dataset post resampling:\n", xds)
With dask==2.25.0 this gives:
xarray.Dataset prior to resampling:
<xarray.Dataset>
Dimensions: (x: 10000, y: 16, z: 4)
Dimensions without coordinates: x, y, z
Data variables:
DATA1 (x, y, z) float32 dask.array<chunksize=(10000, 16, 4), meta=np.ndarray>
xarray.Dataset post resampling:
<xarray.Dataset>
Dimensions: (x: 10000, y: 1024, z: 4)
Dimensions without coordinates: x, y, z
Data variables:
DATA1 (x, y, z) float32 dask.array<chunksize=(10000, 1024, 4), meta=np.ndarray>
With dask==2.26.0 this gives:
xarray.Dataset prior to resampling:
<xarray.Dataset>
Dimensions: (x: 10000, y: 16, z: 4)
Dimensions without coordinates: x, y, z
Data variables:
DATA1 (x, y, z) float32 dask.array<chunksize=(10000, 16, 4), meta=np.ndarray>
xarray.Dataset post resampling:
<xarray.Dataset>
Dimensions: (x: 10000, y: 1024, z: 4)
Dimensions without coordinates: x, y, z
Data variables:
DATA1 (x, y, z) float32 dask.array<chunksize=(10000, 512, 4), meta=np.ndarray>
And finally, the most distressing part: changing the dtype changes the chunking! With dtype = np.complex64, dask==2.26.0 gives:
xarray.Dataset prior to resampling:
<xarray.Dataset>
Dimensions: (x: 10000, y: 16, z: 4)
Dimensions without coordinates: x, y, z
Data variables:
DATA1 (x, y, z) complex64 dask.array<chunksize=(10000, 16, 4), meta=np.ndarray>
xarray.Dataset post resampling:
<xarray.Dataset>
Dimensions: (x: 10000, y: 1024, z: 4)
Dimensions without coordinates: x, y, z
Data variables:
DATA1 (x, y, z) complex64 dask.array<chunksize=(10000, 342, 4), meta=np.ndarray>
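As far as I can tell, the new chunk sizes are consistent with dask splitting the selected axis so that each chunk stays below a default array.chunk-size of 128 MiB, which would explain the dtype dependence. A rough sanity check (a sketch, assuming that 128 MiB default):

print(10000 * 1024 * 4 * 4 / 2**20)  # float32 chunk ~156 MiB > 128 MiB -> split in 2 -> y chunks of 512
print(10000 * 1024 * 4 * 8 / 2**20)  # complex64 chunk ~313 MiB > 128 MiB -> split in 3 -> y chunks of ~342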
This looks like a consequence of https://github.com/dask/dask/pull/6514. That change helps with cases like https://github.com/pydata/xarray/issues/4112.
sortby is basically an isel indexing operation, so dask is automatically rechunking to keep chunk sizes below the default. You could fix this by setting an appropriate value for array.chunk-size, either temporarily or permanently:
import dask

with dask.config.set({"array.chunk-size": "256MiB"}):  # or an appropriate value
    ...
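Applied to the example above, that could look something like this (a sketch; foo is the reporter's Dataset, and 256MiB is just the value suggested here):

with dask.config.set({"array.chunk-size": "256MiB"}):
    foo = foo.sortby("row")  # the indexing step may now keep the single row chunk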
Thanks! I will definitely give that a go when I am back at my work PC. My personal take is that this level of automated rechunking is dangerous. I have constructed the chunking in my code with great care and for a reason. Having it changed "invisibly" by operations which didn't have this behaviour previously seems problematic to me.
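In the meantime, something like this might let me reassert the chunking I constructed (a sketch, assuming the dimension sizes are unchanged by the sortby):

original_chunks = dict(foo.chunks)  # e.g. {"row": (40710,), "chan": (1024,), "corr": (4,)}
foo = foo.sortby("row").chunk(original_chunks)  # rechunk back to the intended layout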
Hi. This change of behaviour broke an interpolation for me. The interpolation function does a sortby along the interpolated dimension, but you then can't interpolate along a chunked dimension. I would argue that the interpolation function needs to rechunk back to the original chunking after the sortby, or stop people from interpolating with a dask array unless assume_sorted=True is passed.
Closing the loop here: with https://github.com/dask/dask/pull/6665 the behavior of dask==2.25.0 should be restored (possibly with a warning about creating large chunks).
So this can probably be closed, though there may be parts of xarray that should be updated to avoid creating large chunks, or we could rely on the user to do that through the dask config system.
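For reference, opting out via the dask config system could look like this (a sketch; the option name is the one used in the reproducer below, and its availability depends on the dask version):

import dask

# Keep the pre-2.26.0 behaviour: do not split large chunks produced by slicing
dask.config.set({"array.slicing.split_large_chunks": False})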
@TomAugspurger @jbusecke is seeing some funny behaviour in https://github.com/jbusecke/cmip6_preprocessing/issues/58
Here's a reproducer
import dask
import numpy as np
import xarray as xr
dask.config.set(
    **{
        "array.slicing.split_large_chunks": True,
        "array.chunk-size": "24 MiB",
    }
)

da = xr.DataArray(
    dask.array.random.random((10, 1000, 2000), chunks=(-1, -1, 200)),
    dims=["x", "y", "time"],
    coords={"x": [3, 4, 5, 6, 7, 9, 8, 0, 2, 1]},
)
da
da.sortby("x")


Which is basically
da.data[np.argsort(da.x.data), ...]

I don't understand why it's rechunking when we are indexing with a list along a dimension with a single chunk...
I assume that the indices [np.argsort(da.x.data)] are not going to be monotonically increasing. That induces a different slicing pattern. The docs in https://docs.dask.org/en/latest/array-slicing.html#efficiency describe the case where the indices are sorted, but don't discuss the non-sorted case (yet).
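To poke at this outside of xarray, the same take can be reproduced directly on the dask array (a sketch mirroring the reproducer above; the resulting chunking will depend on the dask version and config):

import dask
import dask.array
import numpy as np

with dask.config.set({"array.slicing.split_large_chunks": True,
                      "array.chunk-size": "24 MiB"}):
    arr = dask.array.random.random((10, 1000, 2000), chunks=(-1, -1, 200))
    idx = np.argsort([3, 4, 5, 6, 7, 9, 8, 0, 2, 1])  # not monotonically increasing
    print(arr.chunks[0])       # chunking along the first axis before the take
    print(arr[idx].chunks[0])  # after the take: may be split into smaller chunks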
Sorry, my comment in https://github.com/pydata/xarray/issues/4428#issuecomment-711034128 was incorrect in a couple of ways.