Xarray: Can apply_ufunc be used on arrays with different dimension sizes

Created on 18 Oct 2019  ·  2 Comments  ·  Source: pydata/xarray

We have an application where we want to use apply_ufunc to apply a function that takes two 1-D arrays and returns a scalar value (basically a reduction over the only axis). We start with two DataArrays that share all the same dimensions, except that the dimension we'll be reducing along (t in this case) has a different length in each:

import numpy as np
import xarray as xr

def diff_mean(X, y):
    ''' a function that only works on 1d arrays that are different lengths'''
    assert X.ndim == 1, X.ndim
    assert y.ndim == 1, y.ndim
    assert len(X) != len(y), X
    return X.mean() - y.mean()

X = np.random.random((10, 4, 5))
y = np.random.random((6, 4, 5))

Xda = xr.DataArray(X, dims=('t', 'x', 'y')).chunk({'t': -1, 'x': 2, 'y': 2})
yda = xr.DataArray(y, dims=('t', 'x', 'y')).chunk({'t': -1, 'x': 2, 'y': 2})

Then, we'd like to use apply_ufunc to apply our function (e.g. diff_mean):

out = xr.apply_ufunc(
    diff_mean,
    Xda,
    yda,
    vectorize=True,
    dask="parallelized",
    output_dtypes=[np.float],
    input_core_dims=[['t'], ['t']],
)

This fails with an error when aligning the t dimensions:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-e90cf6fba482> in <module>
      9     dask="parallelized",
     10     output_dtypes=[np.float],
---> 11     input_core_dims=[['t'], ['t']],
     12 )

~/miniconda3/envs/xarray-ml/lib/python3.7/site-packages/xarray/core/computation.py in apply_ufunc(func, input_core_dims, output_core_dims, exclude_dims, vectorize, join, dataset_join, dataset_fill_value, keep_attrs, kwargs, dask, output_dtypes, output_sizes, *args)
   1042             join=join,
   1043             exclude_dims=exclude_dims,
-> 1044             keep_attrs=keep_attrs
   1045         )
   1046     elif any(isinstance(a, Variable) for a in args):

~/miniconda3/envs/xarray-ml/lib/python3.7/site-packages/xarray/core/computation.py in apply_dataarray_vfunc(func, signature, join, exclude_dims, keep_attrs, *args)
    222     if len(args) > 1:
    223         args = deep_align(
--> 224             args, join=join, copy=False, exclude=exclude_dims, raise_on_invalid=False
    225         )
    226 

~/miniconda3/envs/xarray-ml/lib/python3.7/site-packages/xarray/core/alignment.py in deep_align(objects, join, copy, indexes, exclude, raise_on_invalid, fill_value)
    403         indexes=indexes,
    404         exclude=exclude,
--> 405         fill_value=fill_value
    406     )
    407 

~/miniconda3/envs/xarray-ml/lib/python3.7/site-packages/xarray/core/alignment.py in align(join, copy, indexes, exclude, fill_value, *objects)
    321                     "arguments without labels along dimension %r cannot be "
    322                     "aligned because they have different dimension sizes: %r"
--> 323                     % (dim, sizes)
    324                 )
    325 

ValueError: arguments without labels along dimension 't' cannot be aligned because they have different dimension sizes: {10, 6}

https://nbviewer.jupyter.org/gist/jhamman/0e52d9bb29f679e26b0878c58bb813d2

I'm curious if this can be made to work with apply_ufunc or if we should pursue other options here. Advice and suggestions appreciated.

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.3 | packaged by conda-forge | (default, Jul 1 2019, 14:38:56)
[Clang 4.0.1 (tags/RELEASE_401/final)]
python-bits: 64
OS: Darwin
OS-release: 18.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: None
libnetcdf: None

xarray: 0.14.0
pandas: 0.25.1
numpy: 1.17.1
scipy: 1.3.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.3.2
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.3.0
distributed: 2.3.2
matplotlib: 3.1.1
cartopy: None
seaborn: None
numbagg: None
setuptools: 41.2.0
pip: 19.2.3
conda: None
pytest: 5.0.1
IPython: 7.8.0
sphinx: 2.2.0

usage question

All 2 comments

It's working as intended.

apply_ufunc verifies that indices are aligned. Note the optional parameter join='exact'. You have two implicit pd.RangeIndex on dimension t that have a different number of elements - which means they are not aligned. Hence, when apply_ufunc internally calls xarray.align(Xda, yda, join="exact"), it falls over.
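The alignment failure can be reproduced without apply_ufunc at all. The snippet below is only a minimal sketch: the small 1-D arrays a and b are made up for illustration, and it assumes in-memory data with no dask involved. It calls xr.align directly on two arrays whose unlabeled t dimension differs in length, then shows that the same call goes through once the dimension names differ.

```python
import numpy as np
import xarray as xr

# Two illustrative arrays (not from the issue) sharing the dimension name "t"
# but with different lengths and no coordinate labels.
a = xr.DataArray(np.zeros(10), dims="t")
b = xr.DataArray(np.zeros(6), dims="t")

try:
    xr.align(a, b, join="exact")
    align_error = None
except ValueError as err:
    # xarray refuses to align same-named dimensions of different sizes.
    align_error = err

# After renaming, the two dimensions are unrelated, so there is nothing to align.
renamed_a, renamed_b = xr.align(a, b.rename({"t": "t2"}), join="exact")
print(type(align_error).__name__, renamed_a.dims, renamed_b.dims)
```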

You have two options:

  1. add join='outer' to the apply_ufunc call, which will cause the shorter of the two variables to be padded with NaNs. You'll also need to replace mean() with nanmean() in your kernel. This, however, is horribly inefficient.
  2. rename one of the two dimensions to tell apply_ufunc that they aren't meant to be aligned:
out = xr.apply_ufunc(
    diff_mean,
    Xda,
    yda.rename({"t": "t2"}),
    dask="parallelized",
    input_core_dims=[['t'], ['t2']],
    output_core_dims=[[]],
    output_dtypes=[float],  # np.float was an alias for the builtin and was removed in NumPy 1.24
)
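A possible alternative to renaming, not discussed in this thread, is the exclude_dims argument of apply_ufunc, which excludes the listed core dimensions from alignment and broadcasting. The sketch below is hedged accordingly: it assumes in-memory (non-dask) arrays, uses a kernel that reduces over the last axis instead of relying on vectorize=True, and the seeded random data is only illustrative.

```python
import numpy as np
import xarray as xr

def diff_mean(X, y):
    # apply_ufunc moves core dimensions to the last axis, so reduce over axis=-1.
    return X.mean(axis=-1) - y.mean(axis=-1)

rng = np.random.default_rng(0)
Xda = xr.DataArray(rng.random((10, 4, 5)), dims=("t", "x", "y"))
yda = xr.DataArray(rng.random((6, 4, 5)), dims=("t", "x", "y"))

out = xr.apply_ufunc(
    diff_mean,
    Xda,
    yda,
    input_core_dims=[["t"], ["t"]],
    exclude_dims={"t"},  # must be a set; "t" is skipped during alignment
)

# Reference result computed with plain xarray reductions.
expected = Xda.mean("t") - yda.mean("t")
print(out.dims, bool(np.allclose(out.values, expected.values)))
```

Whether this or the rename is preferable is a judgment call: the rename keeps the two dimensions visibly distinct, while exclude_dims leaves the inputs untouched.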

While on the topic, note that vectorize=True asks xarray to slice the array, run a pure-Python for loop that applies your kernel once per slice, and then concatenate the outputs back together, which is horribly slow. If you can avoid it, you should, which is why the call above omits vectorize=True. Your kernel can easily be changed to process any number of leading dimensions, reducing over the last axis (where apply_ufunc places the core dimensions):

def diff_mean(X, y):
    assert X.shape[-1] != y.shape[-1]
    return X.mean(axis=-1) - y.mean(axis=-1)
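As a quick sanity check that the two kernels agree, the NumPy-only sketch below compares an explicit Python loop over the original 1-D kernel with the last-axis version. The name diff_mean_1d and the array shapes are assumptions made for illustration; the core dimension is placed last by hand to mimic what apply_ufunc does with input_core_dims.

```python
import numpy as np

def diff_mean_1d(X, y):
    # Original style of kernel: only meaningful for 1-D inputs.
    return X.mean() - y.mean()

def diff_mean(X, y):
    # Last-axis kernel: accepts any number of leading dimensions.
    return X.mean(axis=-1) - y.mean(axis=-1)

rng = np.random.default_rng(0)
X = rng.random((4, 5, 10))  # core dimension (length 10) is last
y = rng.random((4, 5, 6))   # core dimension (length 6) is last

# Roughly what vectorize=True amounts to: one Python-level call per (x, y) cell.
looped = np.empty(X.shape[:-1])
for idx in np.ndindex(*X.shape[:-1]):
    looped[idx] = diff_mean_1d(X[idx], y[idx])

# A single call on the full arrays gives the same values.
vectorized = diff_mean(X, y)
print(bool(np.allclose(looped, vectorized)))
```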

@jhamman are you happy to close the ticket?
