Xarray: apply_ufunc should preemptively broadcast

Created on 19 Jun 2019  Â·  11Comments  Â·  Source: pydata/xarray

Code Sample

I am having some troubles understanding apply_ufunc broadcasting rules. As I had some trouble understanding the docs, I am not 100% sure it is a bug, but I am quite sure. I will try to explain why with the following really simple example.

import xarray as xr
import numpy as np

a = xr.DataArray(data=np.random.normal(size=(7, 3)), dims=["dim1", "dim2"])
c = xr.DataArray(data=np.random.normal(size=(5, 6)), dims=["dim3", "dim4"])

def func(x,y):
    print(x.shape)
    print(y.shape)
    return

The function defined always raises an error when trying to call apply_ufunc, but this is intended, as the shapes have already been printed by then, and this keeps the example as simple as possible.

Problem description

xr.apply_ufunc(func, a, c)
# Out
# (7, 3, 1, 1)
# (5, 6)

Here, a has been kind of broadcasted, but I would expect the shapes of a and c to be the same as when calling xr.broadcast, as there are no input core dims, so all dimensions are broadcasted. However:

print([ary.shape for ary in xr.broadcast(a,c)])
# [(7, 3, 5, 6), (7, 3, 5, 6)]

Using different input core dims does not get rid of the problem, instead I believe it shows some more issues:

xr.apply_ufunc(func, a, c, input_core_dims=[["dim1"],[]])
# (3, 1, 1, 7), expected (3, 5, 6, 7)
# (5, 6), expected (3, 5, 6)

xr.apply_ufunc(func, a, c, input_core_dims=[[],["dim3"]])
# (7, 3, 1), expected (7, 3, 6)
# (6, 5), expected (7, 3, 6, 5)

xr.apply_ufunc(func, a, c, input_core_dims=[["dim1"],["dim3"]])
# (3, 1, 7), expected (3, 6, 7)
# (6, 5), expected (3, 6, 5)

Is this current behaviour what should be expected?

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.8 (default, Jan 14 2019, 11:02:34)
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]]
python-bits: 64
OS: Linux
OS-release: 4.15.0-52-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.2
libnetcdf: 4.6.3

xarray: 0.12.1
pandas: 0.24.2
numpy: 1.16.4
scipy: 1.3.0
netCDF4: 1.5.1.2
pydap: None
h5netcdf: None
h5py: 2.9.0
Nio: None
zarr: None
cftime: 1.0.3.4
nc_time_axis: None
PseudonetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.1.0
cartopy: None
seaborn: None
setuptools: 41.0.0
pip: 19.1.1
conda: None
pytest: 4.5.0
IPython: 7.5.0
sphinx: 2.0.1

Most helpful comment

Yes, exactly.

All 11 comments

Thanks for the issue & code sample @OriolAbril

IIUC, func needs to do this broadcasting itself; from the apply_ufunc docstring:

func : callable
    Function to call like ``func(*args, **kwargs)`` on unlabeled arrays
    (``.data``) that returns an array or tuple of arrays. If multiple
    arguments with non-matching dimensions are supplied, this function is
    expected to vectorize (broadcast) over axes of positional arguments in
    the style of NumPy universal functions [1]_ (if this is not the case,
    set ``vectorize=True``). If this function returns multiple outputs, you
    must set ``output_core_dims`` as well.

Then shouldn't a in the first example keep its original shape?

Because func receives unlabelled arrays, apply_ufunc aligns the axes _order_. So if they share a dimension:

In [8]: import xarray as xr
   ...: import numpy as np
   ...:
   ...: a = xr.DataArray(data=np.random.normal(size=(7, 3)), dims=["dim1", "dim2"])
   ...: c = xr.DataArray(data=np.random.normal(size=(7, 6)), dims=["dim1", "dim4"])  # <- change here
   ...:
   ...: def func(x,y):
   ...:     print(x.shape)
   ...:     print(y.shape)
   ...:     return x
   ...:

In [9]: xr.apply_ufunc(func, a, c)
(7, 3, 1)
(7, 1, 6)

...otherwise func wouldn't know how to align.

Another option would be for your original example to put lengths of 1 in _all_ axes, rather than only 'forward filling', e.g.

xr.apply_ufunc(func, a, c)
# Out
# (7, 3, 1, 1)
# (1, 1, 5, 6) # <- change here

I _think_ it operates without that step because functions 'in the wild' generally will handle that themselves, but that's a guess and needs someone who knows this better to weight in

For what it's worth, I agree that this behavior is a little surprising. We should probably make apply_ufunc() explicitly broadcast arrays first.

We should probably make apply_ufunc() explicitly broadcast arrays first.

To confirm, so that we have something like this?

xr.apply_ufunc(func, a, c)
# Out
# (7, 3, 5, 6)
# (7, 3, 5, 6) 

Yes, exactly.

I'm trying to think whether there would be any performance cost there - i.e. are there any arrays where preemptive broadcasting would be both expensive and unnecessary?

I'm trying to think whether there would be any performance cost there - i.e. are there any arrays where preemptive broadcasting would be both expensive and unnecessary?

Even if there were a performance cost (compared to the actual behaviour), it could be easily avoided by using all dims as input_core_dims couldn't it? IIUC, all dims should be broadcasted unless they are in input core dims, so it broadcasting could still be avoided without problem.

With NumPy arrays at least, there is no cost for broadcasting, because it
can always be done with views. But even for other array types, inserting
size 1 dimensions in the correct location should be basically free, and
would be more helpful than what we currently do

On Wed, Jun 19, 2019 at 9:25 PM Oriol Abril notifications@github.com
wrote:

I'm trying to think whether there would be any performance cost there -
i.e. are there any arrays where preemptive broadcasting would be both
expensive and unnecessary?

Even if there were a performance cost (compared to the actual behaviour),
it could be easily avoided by using all dims as input_core_dims couldn't
it? IIUC, all dims should be broadcasted unless they are in input core
dims, so it broadcasting could still be avoided without problem.

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/pydata/xarray/issues/3032?email_source=notifications&email_token=AAJJFVUDJUKKVSXB5DQONF3P3J2YXA5CNFSM4HZEHCZKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODYCXX4Y#issuecomment-503675891,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAJJFVTEQBAYTYROYMQNGC3P3J2YXANCNFSM4HZEHCZA
.

@shoyer thanks for the clarity

@OriolAbril would you mind if we changed this issue to "apply_ufunc should preemptively broadcast"

@max-sixty Not at all, whatever is best. I actually opened the issue without being 100% it was one.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

mathause picture mathause  Â·  4Comments

tfurf picture tfurf  Â·  4Comments

blaylockbk picture blaylockbk  Â·  4Comments

zxdawn picture zxdawn  Â·  3Comments

equaeghe picture equaeghe  Â·  4Comments