xarray.DataArray.expand_dims() can only expand dimension for a point coordinate

Created on 25 Jan 2019 · 14 Comments · Source: pydata/xarray

Current expand_dims functionality

Apparently, expand_dims can only create a dimension for a point coordinate, i.e. it promotes a scalar coordinate to a 1D coordinate. Here is an example:

>>> coords = {"b": range(5), "c": range(3)}
>>> da = xr.DataArray(np.ones([5, 3]), coords=coords, dims=list(coords.keys()))
>>> da
<xarray.DataArray (b: 5, c: 3)>
array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])
Coordinates:
  * b        (b) int64 0 1 2 3 4
  * c        (c) int64 0 1 2
>>> da["a"] = 0  # create a point coordinate
>>> da
<xarray.DataArray (b: 5, c: 3)>
array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])
Coordinates:
  * b        (b) int64 0 1 2 3 4
  * c        (c) int64 0 1 2
    a        int64 0
>>> da.expand_dims("a")  # create a new dimension "a" for the point coordinate
<xarray.DataArray (a: 1, b: 5, c: 3)>
array([[[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]]])
Coordinates:
  * b        (b) int64 0 1 2 3 4
  * c        (c) int64 0 1 2
  * a        (a) int64 0
>>>

Problem description

I want to be able to do 2 more things with expand_dims or maybe a related/similar method:
1) broadcast the data across 1 or more new dimensions
2) expand an existing dimension to include 1 or more new coordinates

Here is the code I currently use to accomplish this:

from collections import OrderedDict

import numpy as np
import xarray as xr


def expand_dimensions(data, fill_value=np.nan, **new_coords):
    """Expand (or add if it doesn't yet exist) the data array to fill in new
    coordinates across multiple dimensions.

    If a dimension doesn't exist in the dataarray yet, then the result will be
    `data`, broadcasted across this dimension.

    >>> da = xr.DataArray([1, 2, 3], dims="a", coords=[[0, 1, 2]])
    >>> expand_dimensions(da, b=[1, 2, 3, 4, 5])
    <xarray.DataArray (a: 3, b: 5)>
    array([[ 1.,  1.,  1.,  1.,  1.],
           [ 2.,  2.,  2.,  2.,  2.],
           [ 3.,  3.,  3.,  3.,  3.]])
    Coordinates:
      * a        (a) int64 0 1 2
      * b        (b) int64 1 2 3 4 5

    Or, if `dim` is already a dimension in `data`, then any new coordinate
    values in `new_coords` that are not yet in `data[dim]` will be added,
    and the values corresponding to those new coordinates will be `fill_value`.

    >>> da = xr.DataArray([1, 2, 3], dims="a", coords=[[0, 1, 2]])
    >>> expand_dimensions(da, a=[1, 2, 3, 4, 5])
    <xarray.DataArray (a: 6)>
    array([ 1.,  2.,  3., nan, nan, nan])
    Coordinates:
      * a        (a) int64 0 1 2 3 4 5

    Args:
        data (xarray.DataArray):
            Data that needs dimensions expanded.
        fill_value (scalar, xarray.DataArray, optional):
            If expanding new coords this is the value of the new datum.
            Defaults to `np.nan`.
        **new_coords (list[int | str]):
            The keywords are arbitrary dimensions and the values are
            coordinates of those dimensions that the data will include after it
            has been expanded.
    Returns:
        xarray.DataArray:
            Data that had its dimensions expanded to include the new
            coordinates.
    """
    ordered_coord_dict = OrderedDict(new_coords)
    shape_da = xr.DataArray(
        np.zeros(list(map(len, ordered_coord_dict.values()))),
        coords=ordered_coord_dict,
        dims=ordered_coord_dict.keys())
    expanded_data = xr.broadcast(data, shape_da)[0].fillna(fill_value)
    return expanded_data

Here's an example of broadcasting data across a new dimension:

>>> coords = {"b": range(5), "c": range(3)}
>>> da = xr.DataArray(np.ones([5, 3]), coords=coords, dims=list(coords.keys()))
>>> expand_dimensions(da, a=[0, 1, 2])
<xarray.DataArray (b: 5, c: 3, a: 3)>
array([[[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]],

       [[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]],

       [[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]],

       [[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]],

       [[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]]])
Coordinates:
  * b        (b) int64 0 1 2 3 4
  * c        (c) int64 0 1 2
  * a        (a) int64 0 1 2

Here's an example of expanding an existing dimension to include new coordinates:

>>> expand_dimensions(da, b=[5, 6])
<xarray.DataArray (b: 7, c: 3)>
array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [nan, nan, nan],
       [nan, nan, nan]])
Coordinates:
  * b        (b) int64 0 1 2 3 4 5 6
  * c        (c) int64 0 1 2

Final Note

If no one else is already working on this, and if it seems like a useful addition to xarray, then I would be more than happy to work on it. Please let me know.

Thank you,
Martin

API design

All 14 comments

> broadcast the data across 1 or more new dimensions

Yes, this feels in scope for expand_dims(). But I think there are two separate features here:

  1. Support inserting/broadcasting dimensions with size > 1.
  2. Specify the size of the new dimension implicitly, by providing coordinate labels.

I think we would want both to be supported -- you should not be required to supply coordinate labels in order to expand to a dimension of size > 1. We can imagine the first being spelled like da.expand_dims({'a': 3}) or da.expand_dims(a=3).
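As a rough sketch of the first feature, assuming an xarray release that already includes the size-based form proposed here (the end of this thread notes that the corresponding PR was merged), both spellings would broadcast the data along a new dimension of length 3 without attaching coordinate labels. The variable names `by_dict` and `by_kwarg` are only illustrative.

```python
import numpy as np
import xarray as xr

coords = {"b": range(5), "c": range(3)}
da = xr.DataArray(np.ones([5, 3]), coords=coords, dims=list(coords.keys()))

# The dict form and the keyword form are meant to be interchangeable.
by_dict = da.expand_dims({"a": 3})
by_kwarg = da.expand_dims(a=3)

print(by_dict.dims)             # the new dimension is inserted first
print("a" in by_kwarg.coords)   # an integer size adds no coordinate labels
```

The original `da` is left untouched; each call returns a new object.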

> expand an existing dimension to include 1 or more new coordinates

This feels a little different from expand_dims to me. Here the fundamental operation is alignment/reindexing, not broadcasting across a new dimension. The result also looks different, because you get all the NaN values.

I would probably write this with reindex, e.g.,

In [12]: da.reindex(b=list(da.b.values)+[5, 6])
Out[12]:
<xarray.DataArray (b: 7, c: 3)>
array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [nan, nan, nan],
       [nan, nan, nan]])
Coordinates:
  * b        (b) int64 0 1 2 3 4 5 6
  * c        (c) int64 0 1 2
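The reindex approach can be wrapped in a small helper. This is only a sketch: `extend_dim` is a hypothetical name, not part of xarray, and it assumes the installed xarray version accepts the `fill_value` keyword on `reindex`.

```python
import numpy as np
import xarray as xr


def extend_dim(data, dim, new_labels, fill_value=np.nan):
    """Hypothetical helper: add labels to an existing dimension via reindex."""
    # Union of the existing index and the requested labels.
    target = data.indexes[dim].union(new_labels)
    # Positions that did not exist before are filled with `fill_value`.
    return data.reindex({dim: target}, fill_value=fill_value)


coords = {"b": range(5), "c": range(3)}
da = xr.DataArray(np.ones([5, 3]), coords=coords, dims=list(coords.keys()))

extended = extend_dim(da, "b", [5, 6])
zero_filled = extend_dim(da, "b", [5, 6], fill_value=0)
print(extended.sizes["b"])
```

Because this is alignment rather than broadcasting, labels already present in `b` keep their data and only the new labels receive the fill value.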

Hi,
Thanks for replying. I see what you mean about the 2 separate features.

Would it be alright if I opened a PR sometime soon that upgraded expand_dims to support the inserting/broadcasting dimensions with size > 1 (the first feature)?

I would use your suggested API, i.e. not requiring explicit coordinate names -- that makes sense. However, it feels like the dimension kwargs (i.e. the new dimension/dimensions), should be allowed to be given implicit or explicit coordinates, in case the user doesn't want 0-based integer coordinates for the new dimension. For example,

da.expand_dims(a=3)

is equivalent to

da.expand_dims(a=[0, 1, 2])   

but this will also work

da.expand_dims(a=['w', 'x', 'y', 'z'])

where da is

>>> coords = {"b": range(5), "c": range(3)}
>>> da = xr.DataArray(np.ones([5, 3]), coords=coords, dims=list(coords.keys()))
>>> da
<xarray.DataArray (b: 5, c: 3)>
array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])
Coordinates:
  * b        (b) int64 0 1 2 3 4
  * c        (c) int64 0 1 2

Does that make sense?

Thank you!
Martin

da.expand_dims(a=3) should not be equivalent to da.expand_dims(a=[0, 1, 2]) because the latter will also create a co-ordinate a. Am I understanding this right?

Those _would_ be equivalent, I think, assuming they're both manipulating the same da object (I meant for them to be separate calls not sequential, but even if they were sequential, expand_dims doesn't and wouldn't alter da, but instead return a new xarray object). I edited my above post to clarify what da is.

Well then I think they should be different.

Currently, da.expand_dims('a') gives

<xarray.DataArray (a: 1, b: 5, c: 3)>
array([[[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]]])
Coordinates:
  * b        (b) int64 0 1 2 3 4
  * c        (c) int64 0 1 2
Dimensions without coordinates: a

da.expand_dims(a=3) should give

<xarray.DataArray (a: 3, b: 5, c: 3)>
...
Coordinates:
  * b        (b) int64 0 1 2 3 4
  * c        (c) int64 0 1 2
Dimensions without coordinates: a

da.expand_dims(a=[9, 10, 11]) should give

<xarray.DataArray (a: 3, b: 5, c: 3)>
...
Coordinates:
  * a        (a) int64 9 10 11
  * b        (b) int64 0 1 2 3 4
  * c        (c) int64 0 1 2

i.e. in this last case, the user has specified co-ordinate labels and so the returned DataArray has a new co-ordinate a.

Oh I see what you're saying. Yeah, that makes sense.

To get the equivalent of da.expand_dims(a=[9, 10, 11]), you'd do

>>> new = da.expand_dims(a=3)
>>> new
<xarray.DataArray (a: 3, b: 5, c: 3)>
...
Coordinates:
  * b        (b) int64 0 1 2 3 4
  * c        (c) int64 0 1 2
Dimensions without coordinates: a
>>> new["a"] = [9, 10, 11]
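The equivalence described above can be checked with a short script. It is a sketch that assumes an xarray version in which expand_dims accepts both an integer size and a list of labels, and it uses `assign_coords` in place of item assignment so that the intermediate object is not modified.

```python
import numpy as np
import xarray as xr

coords = {"b": range(5), "c": range(3)}
da = xr.DataArray(np.ones([5, 3]), coords=coords, dims=list(coords.keys()))

# One step: the labels determine both the size and the coordinate.
labelled = da.expand_dims(a=[9, 10, 11])

# Two steps: broadcast by size first, then attach the labels.
sized = da.expand_dims(a=3)
relabelled = sized.assign_coords(a=[9, 10, 11])

print(labelled.equals(relabelled))
```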

> Would it be alright if I opened a PR sometime soon that upgraded expand_dims to support the inserting/broadcasting dimensions with size > 1 (the first feature)?

Yes, that sounds welcome to me!

I think much of the underlying logic should already exist on the Variable.set_dims() method. See also the either_dict_or_kwargs utility in xarray.core.utils.
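To illustrate how those two pieces might fit together, here is a minimal sketch. `dict_or_kwargs` and `expand_sketch` are hypothetical stand-ins written for this example; they are not xarray's either_dict_or_kwargs or the implementation that was eventually merged. The only xarray call relied on is `Variable.set_dims` with a mapping of dimension names to sizes.

```python
import numpy as np
import xarray as xr


def dict_or_kwargs(positional, keywords, func_name):
    """Illustrative stand-in (not xarray's code): accept a dict or kwargs, not both."""
    if positional is None:
        return dict(keywords)
    if keywords:
        raise ValueError(
            f"cannot pass both a dict and keyword arguments to {func_name}"
        )
    return dict(positional)


def expand_sketch(variable, dim=None, **dim_kwargs):
    """Hypothetical wrapper: broadcast `variable` over new sized dimensions."""
    new_sizes = dict_or_kwargs(dim, dim_kwargs, "expand_sketch")
    # set_dims expects the existing dimensions to be listed as well.
    existing = dict(zip(variable.dims, variable.shape))
    return variable.set_dims({**new_sizes, **existing})


var = xr.Variable(("b",), np.arange(3))
print(expand_sketch(var, a=2).shape)
```

Accepting either a dict or keyword arguments lets callers use dimension names that are not valid Python identifiers while keeping the short keyword spelling for the common case.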

Unfortunately this most recent change has broken my workflow. I was using expand_dims to add a named dimension back onto a DataArray, when the dimension had been previously removed with the sel method. I realize this may not be the best way of doing things, but I wanted to point out that there is a loss of functionality here.

import xarray as xr
da = xr.DataArray([0,1,2], dims=['dim1'], coords={'dim1':['a','b','c']})
print(da.dims) # returns ('dim1',)
da = da.sel({'dim1':'a'})
print(da.dims) # returns ()
da = da.expand_dims(da.coords) # fails in 0.12.1
print(da.dims) # returns ('dim1',) in 0.12.0

@barkls I think da.expand_dims(list(da.coords)) should work for this use-case.

Previously, we only used the argument to expand_dims() as a sequence, but now we distinguish between mappings and other sequences.

I don't know what the best resolution would be here, but this seems to be a hazard of duck-typing. I did not anticipate that some users would already be iterating over mappings like .coords.
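The workaround suggested above relies on the fact that iterating over a mapping yields its keys, so wrapping `.coords` in `list()` turns it back into a plain sequence of dimension names. A short sketch, with `selected` and `restored` as illustrative variable names:

```python
import xarray as xr

da = xr.DataArray([0, 1, 2], dims=["dim1"], coords={"dim1": ["a", "b", "c"]})
selected = da.sel(dim1="a")   # scalar selection drops the dimension

# list() over the coords mapping gives ["dim1"], a sequence of names.
restored = selected.expand_dims(list(selected.coords))
print(restored.dims)
```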

Another solution could be adding support for da.sel(dim1='a', squeeze=False) to avoid losing the dim1 dimension/coordinate in the first place

> Another solution could be adding support for da.sel(dim1='a', squeeze=False) to avoid losing the dim1 dimension/coordinate in the first place

Or equivalently, you could just do

da.sel(dim1=['a'])
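For comparison, a brief sketch of the difference between selecting with a scalar label and with a one-element list, using the same array as in the report above:

```python
import xarray as xr

da = xr.DataArray([0, 1, 2], dims=["dim1"], coords={"dim1": ["a", "b", "c"]})

scalar_sel = da.sel(dim1="a")    # scalar label: the dimension is dropped
list_sel = da.sel(dim1=["a"])    # one-element list: the dimension is kept

print(scalar_sel.dims, list_sel.dims)
```

With the list form there is no need to call expand_dims afterwards, because the dimension is never removed.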

@pletchm that is the solution I found as well. Thanks all for the suggestions!

@pletchm was this issue closed by #2757?

Yes, @TomNicholas. My PR got merged but I forgot to close the issue -- closing it now. Thanks for checking.
