I am having some trouble understanding the apply_ufunc broadcasting rules. Since the docs also left me unsure, I am not 100% certain this is a bug, but I strongly suspect it is. I will try to explain why with the following minimal example.
import xarray as xr
import numpy as np
a = xr.DataArray(data=np.random.normal(size=(7, 3)), dims=["dim1", "dim2"])
c = xr.DataArray(data=np.random.normal(size=(5, 6)), dims=["dim3", "dim4"])
def func(x, y):
    print(x.shape)
    print(y.shape)
    return
As defined, func always makes apply_ufunc raise an error (it returns None), but this is intended: by then the shapes have already been printed, and it keeps the example as simple as possible.
xr.apply_ufunc(func, a, c)
# Out
# (7, 3, 1, 1)
# (5, 6)
Here, a has been partially broadcast, but I would expect the shapes of a and c to match the output of xr.broadcast: since there are no input core dims, all dimensions should be broadcast. However:
print([ary.shape for ary in xr.broadcast(a,c)])
# [(7, 3, 5, 6), (7, 3, 5, 6)]
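For reference, calling xr.broadcast explicitly before apply_ufunc produces the behaviour I expected (a sketch; func here returns x so the call succeeds instead of erroring):

```python
import numpy as np
import xarray as xr

a = xr.DataArray(np.random.normal(size=(7, 3)), dims=["dim1", "dim2"])
c = xr.DataArray(np.random.normal(size=(5, 6)), dims=["dim3", "dim4"])

# workaround: broadcast explicitly before calling apply_ufunc
a2, c2 = xr.broadcast(a, c)

def func(x, y):
    print(x.shape)  # (7, 3, 5, 6)
    print(y.shape)  # (7, 3, 5, 6)
    return x

result = xr.apply_ufunc(func, a2, c2)
```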
Using different input core dims does not make the problem go away; if anything, I believe it exposes more issues:
xr.apply_ufunc(func, a, c, input_core_dims=[["dim1"],[]])
# (3, 1, 1, 7), expected (3, 5, 6, 7)
# (5, 6), expected (3, 5, 6)
xr.apply_ufunc(func, a, c, input_core_dims=[[],["dim3"]])
# (7, 3, 1), expected (7, 3, 6)
# (6, 5), expected (7, 3, 6, 5)
xr.apply_ufunc(func, a, c, input_core_dims=[["dim1"],["dim3"]])
# (3, 1, 7), expected (3, 6, 7)
# (6, 5), expected (3, 6, 5)
Is this current behaviour what should be expected?
Output of xr.show_versions()

commit: None
python: 3.6.8 (default, Jan 14 2019, 11:02:34)
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]]
python-bits: 64
OS: Linux
OS-release: 4.15.0-52-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.2
libnetcdf: 4.6.3
xarray: 0.12.1
pandas: 0.24.2
numpy: 1.16.4
scipy: 1.3.0
netCDF4: 1.5.1.2
pydap: None
h5netcdf: None
h5py: 2.9.0
Nio: None
zarr: None
cftime: 1.0.3.4
nc_time_axis: None
PseudonetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.1.0
cartopy: None
seaborn: None
setuptools: 41.0.0
pip: 19.1.1
conda: None
pytest: 4.5.0
IPython: 7.5.0
sphinx: 2.0.1
Thanks for the issue & code sample @OriolAbril
IIUC, func needs to do this broadcasting itself; from the apply_ufunc docstring:
func : callable
Function to call like ``func(*args, **kwargs)`` on unlabeled arrays
(``.data``) that returns an array or tuple of arrays. If multiple
arguments with non-matching dimensions are supplied, this function is
expected to vectorize (broadcast) over axes of positional arguments in
the style of NumPy universal functions [1]_ (if this is not the case,
set ``vectorize=True``). If this function returns multiple outputs, you
must set ``output_core_dims`` as well.
Then shouldn't a in the first example keep its original shape?
Because func receives unlabelled arrays, apply_ufunc aligns only the axis _order_. So if the inputs share a dimension:
In [8]: import xarray as xr
...: import numpy as np
...:
...: a = xr.DataArray(data=np.random.normal(size=(7, 3)), dims=["dim1", "dim2"])
...: c = xr.DataArray(data=np.random.normal(size=(7, 6)), dims=["dim1", "dim4"]) # <- change here
...:
...: def func(x,y):
...: print(x.shape)
...: print(y.shape)
...: return x
...:
In [9]: xr.apply_ufunc(func, a, c)
(7, 3, 1)
(7, 1, 6)
...otherwise func wouldn't know how to align.
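Those two shapes can then be combined by ordinary NumPy broadcasting inside func. A minimal sketch, with plain NumPy arrays standing in for what apply_ufunc passes:

```python
import numpy as np

# the shapes apply_ufunc passes when a and c share dim1
x = np.zeros((7, 3, 1))  # a, with a size-1 axis appended for dim4
y = np.zeros((7, 1, 6))  # c, with a size-1 axis inserted for dim2

# the size-1 axes line up, so plain NumPy broadcasting recovers the full shape
print(np.broadcast(x, y).shape)  # (7, 3, 6)
```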
Another option would be for your original example to put lengths of 1 in _all_ axes, rather than only 'forward filling', e.g.
xr.apply_ufunc(func, a, c)
# Out
# (7, 3, 1, 1)
# (1, 1, 5, 6) # <- change here
I _think_ it operates without that step because functions 'in the wild' generally handle that themselves, but that's a guess and needs someone who knows this better to weigh in.
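A quick sketch of why the current 'forward filling' is usually enough in practice: NumPy itself implicitly prepends size-1 axes, so both the current shapes and the all-size-1 variant broadcast to the same result (plain NumPy stand-ins here, not apply_ufunc internals):

```python
import numpy as np

x = np.zeros((7, 3, 1, 1))      # what apply_ufunc currently passes for a
y_now = np.zeros((5, 6))        # what it currently passes for c
y_alt = np.zeros((1, 1, 5, 6))  # the all-size-1 alternative suggested above

# NumPy prepends size-1 axes automatically, so both variants broadcast the same
print(np.broadcast(x, y_now).shape)  # (7, 3, 5, 6)
print(np.broadcast(x, y_alt).shape)  # (7, 3, 5, 6)
```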
For what it's worth, I agree that this behavior is a little surprising. We should probably make apply_ufunc() explicitly broadcast arrays first.
We should probably make apply_ufunc() explicitly broadcast arrays first.
To confirm, so that we have something like this?
xr.apply_ufunc(func, a, c)
# Out
# (7, 3, 5, 6)
# (7, 3, 5, 6)
Yes, exactly.
I'm trying to think whether there would be any performance cost there - i.e. are there any arrays where preemptive broadcasting would be both expensive and unnecessary?
I'm trying to think whether there would be any performance cost there - i.e. are there any arrays where preemptive broadcasting would be both expensive and unnecessary?
Even if there were a performance cost (compared to the current behaviour), it could easily be avoided by passing all dims as input_core_dims, couldn't it? IIUC, all dims should be broadcast unless they are input core dims, so broadcasting could still be avoided without problems.
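A sketch of that opt-out idea (assuming I understand input_core_dims correctly; func here just records the shapes it receives and returns a scalar so the call succeeds):

```python
import numpy as np
import xarray as xr

a = xr.DataArray(np.random.normal(size=(7, 3)), dims=["dim1", "dim2"])
c = xr.DataArray(np.random.normal(size=(5, 6)), dims=["dim3", "dim4"])

seen = []
def func(x, y):
    seen.extend([x.shape, y.shape])
    return np.zeros(())

# with every dim listed as a core dim, nothing is broadcast at all
xr.apply_ufunc(func, a, c,
               input_core_dims=[["dim1", "dim2"], ["dim3", "dim4"]])
print(seen)  # [(7, 3), (5, 6)] -- the raw array shapes, untouched
```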
With NumPy arrays at least, there is no cost for broadcasting, because it can always be done with views. But even for other array types, inserting size-1 dimensions in the correct locations should be basically free, and would be more helpful than what we currently do.
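For instance, a plain NumPy illustration of why broadcasting is free: np.broadcast_to returns a read-only zero-stride view rather than copying data:

```python
import numpy as np

a = np.arange(3.0).reshape(3, 1)
b = np.broadcast_to(a, (3, 1_000_000))  # a read-only view, no data copied

print(b.shape)             # (3, 1000000)
print(b.strides)           # (8, 0): stride 0 repeats each element for free
print(b.base is not None)  # True: b shares memory with a
```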
@shoyer thanks for the clarity
@OriolAbril would you mind if we changed this issue to "apply_ufunc should preemptively broadcast"
@max-sixty Not at all, whatever is best. I actually opened the issue without being 100% sure it was one.