What happened:
A project of mine suddenly broke with:
ValueError: Object has inconsistent chunks along dimension row. This can be fixed by calling unify_chunks().
where previously it had worked.
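For reference, the call suggested by the error message looks roughly like this on an xarray object (a sketch; ds stands in for the affected Dataset):

ds = ds.unify_chunks()  # rechunk so that chunk boundaries agree along shared dimensions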
What you expected to happen:
There should have been no change.
Minimal Complete Verifiable Example:
This is very difficult to reproduce. I have tried, but it clearly isn't triggered for relatively simple xarray.Datasets. In my code, the Datasets in question are the result of multiple concatenation, selection and chunking operations. What I shall do instead is attempt to demonstrate the change, in the hope that someone more knowledgeable has some intuition for what has gone wrong.
dask==2.25.0
I have a dataset, foo, with a number of different variables, most of which are indexed by row. I will focus on one variable, FLAG, to demonstrate the change in behaviour. This is what FLAG looks like prior to a foo.sortby("row") call. Note that there is only a single chunk (this is intentional).
<xarray.DataArray 'FLAG' (row: 40710, chan: 1024, corr: 4)>
dask.array<rechunk-merge, shape=(40710, 1024, 4), dtype=bool, chunksize=(40710, 1024, 4), chunktype=numpy.ndarray>
Coordinates:
* row (row) int64 462991 462993 462994 462996 ... 505074 505075 505076
Dimensions without coordinates: chan, corr
After the foo.sortby("row") call:
<xarray.DataArray 'FLAG' (row: 40710, chan: 1024, corr: 4)>
dask.array<getitem, shape=(40710, 1024, 4), dtype=bool, chunksize=(40710, 1024, 4), chunktype=numpy.ndarray>
Coordinates:
* row (row) int64 462991 462993 462994 462996 ... 505076 505077 505078
Dimensions without coordinates: chan, corr
Note that the chunksize is unchanged.
dask==2.26.0
Repeating exactly the same experiment, prior to the call:
<xarray.DataArray 'FLAG' (row: 40710, chan: 1024, corr: 4)>
dask.array<rechunk-merge, shape=(40710, 1024, 4), dtype=bool, chunksize=(40710, 1024, 4), chunktype=numpy.ndarray>
Coordinates:
* row (row) int64 462991 462993 462994 462996 ... 505074 505075 505076
Dimensions without coordinates: chan, corr
After the foo.sortby("row") call:
<xarray.DataArray 'FLAG' (row: 40710, chan: 1024, corr: 4)>
dask.array<getitem, shape=(40710, 1024, 4), dtype=bool, chunksize=(20355, 1024, 4), chunktype=numpy.ndarray>
Coordinates:
* row (row) int64 462991 462993 462994 462996 ... 505076 505077 505078
Dimensions without coordinates: chan, corr
Note the change in the chunksize.
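For completeness, the same change can be seen by inspecting the dask chunks directly (a sketch using the names above; the tuples in the comments follow from the reprs):

print(foo.FLAG.data.chunks)  # ((40710,), (1024,), (4,)) before the sortby
sorted_foo = foo.sortby("row")
print(sorted_foo.FLAG.data.chunks)  # ((20355, 20355), (1024,), (4,)) after the sortby with dask==2.26.0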
Anything else we need to know?:
I have seen similar behaviour when using xarray.Dataset.sel.
Environment:
dask==2.25.0
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.9 (default, Jul 17 2020, 12:50:27)
[GCC 8.4.0]
python-bits: 64
OS: Linux
OS-release: 5.3.0-7648-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: None
libnetcdf: None
xarray: 0.15.1
pandas: 1.1.2
numpy: 1.19.2
scipy: 1.5.2
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.4.0
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.25.0
distributed: 2.26.0
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
setuptools: 50.3.0
pip: 20.2.3
conda: None
pytest: 6.0.2
IPython: None
sphinx: None
dask==2.26.0
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.9 (default, Jul 17 2020, 12:50:27)
[GCC 8.4.0]
python-bits: 64
OS: Linux
OS-release: 5.3.0-7648-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: None
libnetcdf: None
xarray: 0.15.1
pandas: 1.1.2
numpy: 1.19.2
scipy: 1.5.2
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.4.0
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.26.0
distributed: 2.26.0
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
setuptools: 50.3.0
pip: 20.2.3
conda: None
pytest: 6.0.2
IPython: None
sphinx: None
Finally managed to reproduce. Here it is:
import xarray
import dask.array as da
import numpy as np

if __name__ == "__main__":
    data = da.random.random([10000, 16, 4], chunks=(10000, 16, 4))
    dtype = np.float32
    xds = xarray.Dataset(
        data_vars={"DATA1": (("x", "y", "z"), data.astype(dtype))})
    upsample_factor = 1024 // xds.dims["y"]
    # Create a selection which will upsample the y axis.
    selection = np.repeat(np.arange(xds.dims["y"]), upsample_factor)
    print("xarray.Dataset prior to resampling:\n", xds)
    xds = xds.sel({"y": selection})
    print("xarray.Dataset post resampling:\n", xds)
With dask==2.25.0 this gives:
xarray.Dataset prior to resampling:
<xarray.Dataset>
Dimensions: (x: 10000, y: 16, z: 4)
Dimensions without coordinates: x, y, z
Data variables:
DATA1 (x, y, z) float32 dask.array<chunksize=(10000, 16, 4), meta=np.ndarray>
xarray.Dataset post resampling:
<xarray.Dataset>
Dimensions: (x: 10000, y: 1024, z: 4)
Dimensions without coordinates: x, y, z
Data variables:
DATA1 (x, y, z) float32 dask.array<chunksize=(10000, 1024, 4), meta=np.ndarray>
With dask==2.26.0 this gives:
xarray.Dataset prior to resampling:
<xarray.Dataset>
Dimensions: (x: 10000, y: 16, z: 4)
Dimensions without coordinates: x, y, z
Data variables:
DATA1 (x, y, z) float32 dask.array<chunksize=(10000, 16, 4), meta=np.ndarray>
xarray.Dataset post resampling:
<xarray.Dataset>
Dimensions: (x: 10000, y: 1024, z: 4)
Dimensions without coordinates: x, y, z
Data variables:
DATA1 (x, y, z) float32 dask.array<chunksize=(10000, 512, 4), meta=np.ndarray>
And finally, the most distressing part: changing the dtype changes the chunking! With dtype = np.complex64, dask==2.26.0 gives:
xarray.Dataset prior to resampling:
<xarray.Dataset>
Dimensions: (x: 10000, y: 16, z: 4)
Dimensions without coordinates: x, y, z
Data variables:
DATA1 (x, y, z) complex64 dask.array<chunksize=(10000, 16, 4), meta=np.ndarray>
xarray.Dataset post resampling:
<xarray.Dataset>
Dimensions: (x: 10000, y: 1024, z: 4)
Dimensions without coordinates: x, y, z
Data variables:
DATA1 (x, y, z) complex64 dask.array<chunksize=(10000, 342, 4), meta=np.ndarray>
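As far as I can tell, the new chunk sizes are consistent with dask splitting the selected axis so that each chunk stays below a default array.chunk-size of 128 MiB, which would explain the dtype dependence. A rough sanity check (a sketch, assuming that 128 MiB default):

print(10000 * 1024 * 4 * 4 / 2**20)  # float32 chunk ~156 MiB > 128 MiB -> split in 2 -> y chunks of 512
print(10000 * 1024 * 4 * 8 / 2**20)  # complex64 chunk ~313 MiB > 128 MiB -> split in 3 -> y chunks of ~342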
This looks like a consequence of https://github.com/dask/dask/pull/6514. That change helps with cases like https://github.com/pydata/xarray/issues/4112.
sortby is basically an isel indexing operation, so dask is automatically rechunking to keep chunk sizes below the default. You could fix this by setting an appropriate value for array.chunk-size, either temporarily or permanently:
import dask

with dask.config.set({"array.chunk-size": "256MiB"}):  # or an appropriate value
    ...
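Applied to the example above, that could look something like this (a sketch; foo is the reporter's Dataset, and 256MiB is just the value suggested here):

with dask.config.set({"array.chunk-size": "256MiB"}):
    foo = foo.sortby("row")  # the indexing step may now keep the single row chunk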
Thanks! I will definitely give that a go when I am back at my work PC. My personal take is that this level of automated rechunking is dangerous. I have constructed the chunking in my code with great care and for a reason. Having it changed "invisibly" by operations which didn't have this behaviour previously seems problematic to me.
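In the meantime, something like this might let me reassert the chunking I constructed (a sketch, assuming the dimension sizes are unchanged by the sortby):

original_chunks = dict(foo.chunks)  # e.g. {"row": (40710,), "chan": (1024,), "corr": (4,)}
foo = foo.sortby("row").chunk(original_chunks)  # rechunk back to the intended layout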
Hi. This change of behaviour broke an interpolation for me. The interpolation function does a sortby along the interpolated dimension, but you then can't interpolate along a chunked dimension. I would argue that the interpolation function needs to rechunk back to the original chunking after the sortby, or stop people from interpolating with a dask array unless assume_sorted=True is passed.
Closing the loop here: with https://github.com/dask/dask/pull/6665 the behavior of dask==2.25.0 should be restored (possibly with a warning about creating large chunks).
So this can probably be closed, though there may be parts of xarray that should be updated to avoid creating large chunks, or we could rely on the user to do that through the dask config system.
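For reference, opting out via the dask config system could look like this (a sketch; the option name is the one used in the reproducer below, and its availability depends on the dask version):

import dask

# Keep the pre-2.26.0 behaviour: do not split large chunks produced by slicing
dask.config.set({"array.slicing.split_large_chunks": False})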
@TomAugspurger @jbusecke is seeing some funny behaviour in https://github.com/jbusecke/cmip6_preprocessing/issues/58
Here's a reproducer
import dask
import numpy as np
import xarray as xr
dask.config.set(
    **{
        "array.slicing.split_large_chunks": True,
        "array.chunk-size": "24 MiB",
    }
)

da = xr.DataArray(
    dask.array.random.random((10, 1000, 2000), chunks=(-1, -1, 200)),
    dims=["x", "y", "time"],
    coords={"x": [3, 4, 5, 6, 7, 9, 8, 0, 2, 1]},
)
da
da.sortby("x")


Which is basically
da.data[np.argsort(da.x.data), ...]

I don't understand why it's rechunking when we are indexing with a list along a dimension with a single chunk...
I assume that the indices [np.argsort(da.x.data)] are not going to be monotonically increasing. That induces a different slicing pattern. The docs in https://docs.dask.org/en/latest/array-slicing.html#efficiency describe the case where the indices are sorted, but don't discuss the non-sorted case (yet).
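To poke at this outside of xarray, the same take can be reproduced directly on the dask array (a sketch mirroring the reproducer above; the resulting chunking will depend on the dask version and config):

import dask
import dask.array
import numpy as np

with dask.config.set({"array.slicing.split_large_chunks": True,
                      "array.chunk-size": "24 MiB"}):
    arr = dask.array.random.random((10, 1000, 2000), chunks=(-1, -1, 200))
    idx = np.argsort([3, 4, 5, 6, 7, 9, 8, 0, 2, 1])  # not monotonically increasing
    print(arr.chunks[0])       # chunking along the first axis before the take
    print(arr[idx].chunks[0])  # after the take: may be split into smaller chunks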
Sorry, my comment in https://github.com/pydata/xarray/issues/4428#issuecomment-711034128 was incorrect in a couple of ways.