We have an application where we want to use apply_ufunc to apply a function that takes two 1-D arrays and returns a scalar value (basically a reduction over the only axis). We start with two DataArrays that share all the same dimensions - except for the lengths of the dimension we'll be reducing along (t in this case):
import numpy as np
import xarray as xr

def diff_mean(X, y):
    '''a function that only works on 1d arrays that are different lengths'''
    assert X.ndim == 1, X.ndim
    assert y.ndim == 1, y.ndim
    assert len(X) != len(y), X
    return X.mean() - y.mean()

# Same dims ('t', 'x', 'y'), but 't' has length 10 in X and 6 in y
X = np.random.random((10, 4, 5))
y = np.random.random((6, 4, 5))
Xda = xr.DataArray(X, dims=('t', 'x', 'y')).chunk({'t': -1, 'x': 2, 'y': 2})
yda = xr.DataArray(y, dims=('t', 'x', 'y')).chunk({'t': -1, 'x': 2, 'y': 2})
Then, we'd like to use apply_ufunc to apply our function (e.g. diff_mean):
out = xr.apply_ufunc(
    diff_mean,
    Xda,
    yda,
    vectorize=True,
    dask="parallelized",
    output_dtypes=[np.float],
    input_core_dims=[['t'], ['t']],
)
This fails with an error when aligning the t dimensions:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-4-e90cf6fba482> in <module>
9 dask="parallelized",
10 output_dtypes=[np.float],
---> 11 input_core_dims=[['t'], ['t']],
12 )
~/miniconda3/envs/xarray-ml/lib/python3.7/site-packages/xarray/core/computation.py in apply_ufunc(func, input_core_dims, output_core_dims, exclude_dims, vectorize, join, dataset_join, dataset_fill_value, keep_attrs, kwargs, dask, output_dtypes, output_sizes, *args)
1042 join=join,
1043 exclude_dims=exclude_dims,
-> 1044 keep_attrs=keep_attrs
1045 )
1046 elif any(isinstance(a, Variable) for a in args):
~/miniconda3/envs/xarray-ml/lib/python3.7/site-packages/xarray/core/computation.py in apply_dataarray_vfunc(func, signature, join, exclude_dims, keep_attrs, *args)
222 if len(args) > 1:
223 args = deep_align(
--> 224 args, join=join, copy=False, exclude=exclude_dims, raise_on_invalid=False
225 )
226
~/miniconda3/envs/xarray-ml/lib/python3.7/site-packages/xarray/core/alignment.py in deep_align(objects, join, copy, indexes, exclude, raise_on_invalid, fill_value)
403 indexes=indexes,
404 exclude=exclude,
--> 405 fill_value=fill_value
406 )
407
~/miniconda3/envs/xarray-ml/lib/python3.7/site-packages/xarray/core/alignment.py in align(join, copy, indexes, exclude, fill_value, *objects)
321 "arguments without labels along dimension %r cannot be "
322 "aligned because they have different dimension sizes: %r"
--> 323 % (dim, sizes)
324 )
325
ValueError: arguments without labels along dimension 't' cannot be aligned because they have different dimension sizes: {10, 6}
https://nbviewer.jupyter.org/gist/jhamman/0e52d9bb29f679e26b0878c58bb813d2
I'm curious if this can be made to work with apply_ufunc or if we should pursue other options here. Advice and suggestions appreciated.
Output of xr.show_versions():
commit: None
python: 3.7.3 | packaged by conda-forge | (default, Jul 1 2019, 14:38:56)
[Clang 4.0.1 (tags/RELEASE_401/final)]
python-bits: 64
OS: Darwin
OS-release: 18.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: None
libnetcdf: None
xarray: 0.14.0
pandas: 0.25.1
numpy: 1.17.1
scipy: 1.3.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.3.2
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.3.0
distributed: 2.3.2
matplotlib: 3.1.1
cartopy: None
seaborn: None
numbagg: None
setuptools: 41.2.0
pip: 19.2.3
conda: None
pytest: 5.0.1
IPython: 7.8.0
sphinx: 2.2.0
It's working as intended.
apply_ufunc verifies that indices are aligned. Note the optional parameter join, which defaults to 'exact'. You have two implicit pd.RangeIndex on dimension t that have a different number of elements, which means they are not aligned. Hence, when apply_ufunc internally calls xarray.align(Xda, yda, join="exact"), it falls over.
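The alignment failure can be reproduced without apply_ufunc at all. The following is a minimal sketch, assuming two small numpy-backed arrays (the names a and b are made up for illustration) that share the dimension name t but differ in length along it:

```python
import numpy as np
import xarray as xr

# Hypothetical stand-ins for Xda and yda: same dimension names,
# different lengths along 't', and no coordinate labels on 't'.
a = xr.DataArray(np.zeros((10, 4)), dims=("t", "x"))
b = xr.DataArray(np.zeros((6, 4)), dims=("t", "x"))

try:
    xr.align(a, b, join="exact")
    aligned = True
except ValueError as err:
    # The exact wording of the message differs between xarray versions.
    aligned = False
    print(err)
```

Under these assumptions aligned ends up False, which is the same ValueError path that the traceback above passes through.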
You have two options:
1. Add join='outer' to the apply_ufunc call, which will cause the shorter of the two variables to be padded with NaNs. You'll also need to replace mean() with nanmean() in your kernel. This, however, is horribly inefficient.
2. Rename the t dimension on one of the two arrays, so that xarray no longer tries to align them along it:

out = xr.apply_ufunc(
    diff_mean,
    Xda,
    yda.rename({"t": "t2"}),
    dask="parallelized",
    input_core_dims=[['t'], ['t2']],
    output_core_dims=[[]],
    output_dtypes=[np.float],
)
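For reference, here is a self-contained sketch of the rename approach that can be run as-is. It makes a few assumptions that are not in the thread: the arrays are numpy-backed rather than chunked (so the dask arguments are omitted), the kernel is a hypothetical diff_mean_lastaxis that reduces over the last axis (apply_ufunc moves core dimensions to the end), and the result is checked against plain NumPy.

```python
import numpy as np
import xarray as xr


def diff_mean_lastaxis(X, y):
    # Hypothetical kernel: core dims 't' and 't2' arrive as the last axis
    # of X and y respectively, so reduce over axis=-1.
    return X.mean(axis=-1) - y.mean(axis=-1)


rng = np.random.default_rng(0)
X = rng.random((10, 4, 5))
y = rng.random((6, 4, 5))
Xda = xr.DataArray(X, dims=("t", "x", "y"))
yda = xr.DataArray(y, dims=("t", "x", "y"))

out = xr.apply_ufunc(
    diff_mean_lastaxis,
    Xda,
    yda.rename({"t": "t2"}),  # 't' and 't2' are now unrelated dimensions
    input_core_dims=[["t"], ["t2"]],
)

# Reference result computed directly on the numpy arrays ('t' is axis 0 there).
expected = X.mean(axis=0) - y.mean(axis=0)
print(out.dims, np.allclose(out.values, expected))
```

Because the two core dimensions have different names, there is nothing left to align along them, and the remaining dimensions x and y match exactly.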
While on the topic, note that vectorize=True asks xarray to slice the NumPy array, run a pure-Python for loop that applies your kernel once per slice, and then concatenate the outputs back together, which is horribly slow. If you can avoid it, you should, and your kernel can definitely be changed to process any number of leading dimensions:
def diff_mean(X, y):
    assert X.shape[-1] != y.shape[-1]
    return X.mean(axis=-1) - y.mean(axis=-1)
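To illustrate why the two forms are interchangeable, here is a small NumPy-only sketch. The explicit loop is a rough stand-in for what a per-slice application of the 1-D kernel does, not xarray's actual implementation, and the names diff_mean_1d and diff_mean_nd are made up for the comparison:

```python
import numpy as np


def diff_mean_1d(X, y):
    # Original style of kernel: expects two 1-D arrays.
    return X.mean() - y.mean()


def diff_mean_nd(X, y):
    # Rewritten kernel: reduces over the last axis, any leading dims allowed.
    return X.mean(axis=-1) - y.mean(axis=-1)


rng = np.random.default_rng(0)
X = rng.random((4, 5, 10))  # reduction axis is last, length 10
y = rng.random((4, 5, 6))   # reduction axis is last, length 6

# Apply the 1-D kernel once per (x, y) position in a Python loop.
looped = np.empty(X.shape[:-1])
for idx in np.ndindex(*X.shape[:-1]):
    looped[idx] = diff_mean_1d(X[idx], y[idx])

# A single call over the whole array gives the same values.
vectorized = diff_mean_nd(X, y)
print(np.allclose(looped, vectorized))
```

The loop makes one Python-level call per position, whereas the axis-based kernel does the whole reduction in two NumPy calls.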
@jhamman are you happy to close the ticket?