Xarray: Support flexible DataArray shapes in Dataset

Created on 19 Apr 2020  ·  4 Comments  ·  Source: pydata/xarray

I always use pandas for my neuroscience data, which is multidimensional. It is annoying to stack and unstack all the time, and I heard that xarray is designed for multidimensional data.

In neuroscience research, we usually have multiple participants, and each of them may be tested a different number of times, which means the data may look like this:

  • participant A:

    • 2*5*100 matrix

  • participant B:

    • 2*5*101 matrix

(100 and 101 are the number of times each participant was tested)

But Dataset doesn't support having a 2*5*100 DataArray and a 2*5*101 DataArray together. Is there any way to deal with that kind of data in xarray?

documentation usage question

Most helpful comment

This ultimately depends on how the last dimensions of A and B are related (or rather, how you want to model the relationship). If they are not related at all, simply use different dimension names:

In [1]: import numpy as np
   ...: import xarray as xr

In [2]: da1 = xr.DataArray(np.empty(shape=(2, 5, 100)), dims=("x", "y", "z1"))
   ...: da2 = xr.DataArray(np.empty(shape=(2, 5, 101)), dims=("x", "y", "z2")) 
   ...: ds = xr.Dataset({"a": da1, "b": da2}) 
   ...: ds
Out[2]: 
<xarray.Dataset>
Dimensions:  (x: 2, y: 5, z1: 100, z2: 101)
Dimensions without coordinates: x, y, z1, z2
Data variables:
    a        (x, y, z1) float64 6.901e-310 6.901e-310 4.67e-310 ... 0.0 0.0 0.0
    b        (x, y, z2) float64 6.901e-310 6.901e-310 4.67e-310 ... 0.0 0.0 0.0

If they are related, assign coordinates to the dimensions:

In [3]: da1 = xr.DataArray(
   ...:     np.empty(shape=(2, 5, 100)),
   ...:     dims=("x", "y", "z"),
   ...:     coords={"z": np.arange(100)},
   ...: ) 
   ...: da2 = xr.DataArray(
   ...:     np.empty(shape=(2, 5, 101)),
   ...:     dims=("x", "y", "z"),
   ...:     coords={"z": np.arange(101)},
   ...: ) 
   ...: ds = xr.Dataset({"a": da1, "b": da2}) 
   ...: ds
Out[3]: 
<xarray.Dataset>
Dimensions:  (x: 2, y: 5, z: 101)
Coordinates:
  * z        (z) int64 0 1 2 3 4 5 6 7 8 9 10 ... 91 92 93 94 95 96 97 98 99 100
Dimensions without coordinates: x, y
Data variables:
    a        (x, y, z) float64 6.901e-310 6.901e-310 ... 6.917e-323 nan
    b        (x, y, z) float64 6.901e-310 6.901e-310 ... 6.901e-310 -6.35e+53

In this case, A does not have the label z=100, so it is treated as missing (you should be familiar with the concept of "missing values" since you know pandas).
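To make the missing-value behaviour concrete, here is a minimal sketch of the shared-coordinate case. It uses `np.ones` instead of `np.empty` so the values are predictable; the variable names and fill values are illustrative assumptions, not part of the original answer.

```python
import numpy as np
import xarray as xr

# Two arrays that share the labelled dimension "z" but differ in length.
da1 = xr.DataArray(
    np.ones((2, 5, 100)),
    dims=("x", "y", "z"),
    coords={"z": np.arange(100)},
)
da2 = xr.DataArray(
    np.ones((2, 5, 101)),
    dims=("x", "y", "z"),
    coords={"z": np.arange(101)},
)

# The Dataset aligns both variables on the union of the "z" labels.
ds = xr.Dataset({"a": da1, "b": da2})

# "a" has no data at z=100, so that slice (2 * 5 values) is NaN.
n_missing = ds["a"].isnull().sum().item()

# Dropping the all-NaN labels recovers the original length of "a".
a_original = ds["a"].dropna("z", how="all")

print(ds.sizes["z"], n_missing, a_original.sizes["z"])
```

Because the padding is only NaN, NaN-skipping reductions such as `ds["a"].mean("z")` should still reflect just the 100 real values.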

All 4 comments

I haven't tried it, but I know your problem.
What if you create a Dataset from each DataArray,
df.to_dataset(name='participant_A')
df.to_dataset(name='participant_B')
and then merge them?

xr.merge([ds1, ds2], compat='no_conflicts')

http://xarray.pydata.org/en/stable/combining.html

In the worst case you could fill in NaN values so that both have the same dimensions.

But I have never tried it. I found another solution for my own data, but that was my workaround.
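The merge idea in this comment can be sketched roughly as follows. This is an untested-by-the-commenter suggestion turned into a small example under assumptions: the arrays share a labelled `z` dimension, the names `da_a`, `da_b`, `ds1`, `ds2` are illustrative, and `join="outer"` is passed explicitly so the result does not depend on the library's default. The last step shows one way to do the NaN padding by hand with `reindex`.

```python
import numpy as np
import xarray as xr

z_a = np.arange(100)
z_b = np.arange(101)

da_a = xr.DataArray(np.zeros((2, 5, 100)), dims=("x", "y", "z"), coords={"z": z_a})
da_b = xr.DataArray(np.ones((2, 5, 101)), dims=("x", "y", "z"), coords={"z": z_b})

# One single-variable Dataset per participant.
ds1 = da_a.to_dataset(name="participant_A")
ds2 = da_b.to_dataset(name="participant_B")

# An outer join keeps every "z" label; labels missing from one
# participant are filled with NaN.
merged = xr.merge([ds1, ds2], compat="no_conflicts", join="outer")

# Manual alternative: pad participant A with NaN up to B's labels.
padded_a = da_a.reindex(z=z_b)

print(merged.sizes["z"], padded_a.shape)
```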

On Sun, 19 Apr 2020 at 20:57, (Ray) Jinbiao Yang notifications@github.com
wrote:

I always use Pandas to deal with my neuroscience data (multi-dimension).
It is annoying to stack and unstack all the time and I heard Xarray is
designed for multi-dimension data.

In neuroscience research, we usually have multiple participants and we
will test them different times, which means the data may look like this:

  • participant A:

    • 2*5*100 matrix

  • participant B:

    • 2*5*101 matrix

(100 and 101 are the testing times)

But Dataset doesn't support to have 2*5*100 DataArray and 2*5*101
DataArray together. Is there any solution to deal with that kind of data
in Xarray?

View it on GitHub: https://github.com/pydata/xarray/issues/3984

@keewis your answer (and a clarification that we can't do real "ragged" arrays) would make a useful cookbook or StackOverflow answer, since I suspect a lot of people have this question.
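As an illustration of the "no real ragged arrays" point, here is a hedged sketch of stacking both participants along a new dimension. The dimension name `participant`, the labels `"A"` and `"B"`, and the fill values are assumptions made for the example. The result is a regular rectangular array in which the shorter participant is padded with NaN, not a truly ragged structure.

```python
import numpy as np
import pandas as pd
import xarray as xr

da_a = xr.DataArray(
    np.zeros((2, 5, 100)), dims=("x", "y", "z"), coords={"z": np.arange(100)}
)
da_b = xr.DataArray(
    np.ones((2, 5, 101)), dims=("x", "y", "z"), coords={"z": np.arange(101)}
)

# Label the new dimension so participants can be selected by name.
participants = pd.Index(["A", "B"], name="participant")

# join="outer" aligns on the union of "z" labels before concatenating,
# so participant A gets a NaN-filled slot at z=100.
stacked = xr.concat([da_a, da_b], dim=participants, join="outer")

# Number of non-missing tests per participant (taken at x=0, y=0).
tests_per_participant = stacked.isel(x=0, y=0).count("z")

print(stacked.shape, tests_per_participant.values)
```

Selecting one participant and calling `dropna("z", how="all")` gives back an array of the original length.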

Both of your methods worked! Thank you!
