Xarray: Automatic chunking of arrays?

Created on 13 May 2020 · 11 comments · Source: pydata/xarray

Hi there,

Hopefully this turns out to be a basic issue, but I was wondering why the chunks='auto' that dask seems to provide (https://docs.dask.org/en/latest/array-chunks.html#automatic-chunking) isn't an option for xarray? I'm not 100% sure of how dask decides how to automatically chunk its arrays, so maybe there's a compatibility issue?

I get the impression that the dask method automatically tries to prevent the issues of "too many chunks" or "too few chunks" which can sometimes happen when choosing chunk sizes automatically. If so, it would maybe be a useful thing to include in future versions?

Happy to be corrected if I've misunderstood something here though, still getting my head around how the dask/xarray compatibility really works...

Cheers!



So da.chunk({dim_name: "auto"}) works, but da.chunk("auto") does not. The latter is a relatively easy fix. We just need to update the condition here:
https://github.com/pydata/xarray/blob/bd84186acbd84bd386134a5b60111596cee2d8ec/xarray/core/dataset.py#L1736-L1737

A PR would be very welcome if you have the time, @AndrewWilliams3142

Oh ok, I didn't know about this. I'll take a look and read the contribution docs tomorrow! It'll be my first PR, so I may need a bit of hand-holding when it comes to tests. Willing to try though!

Awesome! Please see https://xarray.pydata.org/en/stable/contributing.html for docs on contributing

Agreed, this would be very welcome!

chunks='auto' isn't supported only because xarray support for dask predates it :)

Cheers! Just had a look; is it as simple as changing this line to the following, @dcherian?

if isinstance(chunks, Number) or chunks == 'auto':
    chunks = dict.fromkeys(self.dims, chunks)
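For what it's worth, the effect of that dict.fromkeys call is just to map every dimension to the same chunk spec, so a string like "auto" would be broadcast to all dimensions. A plain-Python sketch (the dimension names here are hypothetical, not from the issue):

```python
# Hypothetical dimension names standing in for self.dims.
dims = ("time", "lat", "lon")

# With the proposed condition, chunks = "auto" passes through unchanged
# and dict.fromkeys maps every dimension to it.
chunks = "auto"
per_dim = dict.fromkeys(dims, chunks)
print(per_dim)  # {'time': 'auto', 'lat': 'auto', 'lon': 'auto'}
```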

This seems to work fine in a lot of cases, except automatic chunking isn't implemented for object dtypes at the moment, so it fails if you pass a cftime coordinate, for example.

One option would be to automatically apply self = xr.decode_cf(self) if the input dataset contains cftime coordinates? Or we could just raise an error.

Also, the contributing docs have been super clear so far! Thanks! :)

is_scalar(chunks) might be the appropriate condition; is_scalar is already imported from .utils in dataset.py.
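Roughly, a check like that distinguishes scalar chunk specs (a number or a string such as "auto") from per-dimension mappings. The sketch below is a simplified stand-in to illustrate the idea, not xarray's actual is_scalar implementation from .utils:

```python
from numbers import Number

def is_scalar_sketch(value):
    """Simplified stand-in for an is_scalar-style check: treats strings,
    numbers, and None as scalars, and containers like dicts as non-scalars."""
    return value is None or isinstance(value, (str, bytes, Number))

print(is_scalar_sketch("auto"))         # True
print(is_scalar_sketch(5))              # True
print(is_scalar_sketch({"x": "auto"}))  # False
```

With a condition like this, both ds.chunk(5) and ds.chunk("auto") would take the dict.fromkeys branch, while an explicit per-dimension dict would not.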

This seems to work fine in a lot of cases, except automatic chunking isn't implemented for object dtypes at the moment, so it fails if you pass a cftime coordinate, for example.

Can we catch this error and re-raise specifying "automatic chunking fails for object arrays. These include cftime DataArrays" or something like that?

Nice, that's neater! Would this work in the maybe_chunk() call? Sorry about the basic questions!

def maybe_chunk(name, var, chunks):
    chunks = selkeys(chunks, var.dims)
    if not chunks:
        chunks = None
    if var.ndim > 0:
        # when rechunking by different amounts, make sure dask names change
        # by providing chunks as an input to tokenize.
        # subtle bugs result otherwise. see GH3350
        token2 = tokenize(name, token if token else var._data, chunks)
        name2 = f"{name_prefix}{name}-{token2}"
        try:
            return var.chunk(chunks, name=name2, lock=lock)
        except NotImplementedError as err:
            raise Exception(
                "Automatic chunking fails for object arrays. "
                "These include cftime DataArrays."
            )
    else:
        return var

The error message from dask is already pretty descriptive:
NotImplementedError: Can not use auto rechunking with object dtype. We are unable to estimate the size in bytes of object data

I don't think we have much to add on top of that?

I also thought that; given the dask error message, it's pretty easy to look at the dataset and check which dimension is the problem.

In general though, is that the type of layout you'd suggest for catching and re-raising errors? Using raise Exception()?

If we think we can improve an error message by adding additional context, the right solution is to use raise Exception(...) from original_error:
https://stackoverflow.com/a/16414892/809705
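As a generic illustration of that pattern (not actual xarray code; the function names here are made up), chaining with "from" preserves the original traceback and exposes it on __cause__:

```python
def failing_rechunk():
    # Stand-in for the dask call that raises on object dtypes.
    raise NotImplementedError(
        "Can not use auto rechunking with object dtype."
    )

def chunk_with_context():
    try:
        return failing_rechunk()
    except NotImplementedError as err:
        # "from err" chains the original error, so the traceback shows
        # both the dask failure and the added xarray-level context.
        raise ValueError(
            "Automatic chunking fails for object arrays, "
            "such as cftime DataArrays."
        ) from err
```

Catching the re-raised error and inspecting its __cause__ attribute then recovers the original NotImplementedError.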

On the other hand, if xarray doesn't have anything more to add on top of the original error message, it is best not to add any wrapper at all. Users will just see the original error from dask.
