Xarray: How to broadcast along dayofyear

Created on 19 Jan 2018 · 14 Comments · Source: pydata/xarray

OK, so here is the problem I'm trying to solve, for which I did not find any solution:
I have a spatial dataset with the following format (temperature over time on a spatial grid):

Dimensions:    (depth: 1, latitude: 481, longitude: 781, time: 730)
Coordinates:
  * time       (time) datetime64[ns] 2016-01-01T11:59:37.961193472 ...
  * longitude  (longitude) float32 -75.0 -74.9167 -74.8333 -74.75 -74.6667 ...
  * latitude   (latitude) float32 25.0 25.0833 25.1667 25.25 25.3333 25.4167 ...
  * depth      (depth) float32 0.494025
Data variables:
    thetao     (time, depth, latitude, longitude) float64 ...
Attributes:
    title:            daily mean fields from Global Ocean Physics Analysis an...
    institution:      MERCATOR OCEAN
    references:       http://www.mercator-ocean.fr
    source:           MERCATOR PSY4QV3R1
    Conventions:      CF-1.0
    history:          Data extracted from dataset http://opendap-glo.mercator...
    time_min:         578556.0
    time_max:         596052.0
    julian_day_unit:  hours since 1950-01-01 00:00:00
    z_min:            0.494024991989
    z_max:            0.494024991989
    latitude_min:     25.0
    latitude_max:     65.0
    longitude_min:    -75.0
    longitude_max:    -10.0

These all contain temperature values. From another source I receive specially calibrated means and standard deviations of temperature for every day of the year. The mean dataset (std has the same layout) looks like this:

Dimensions:    (dayofyear: 366, depth: 1, latitude: 481, longitude: 781)
Coordinates:
  * longitude  (longitude) float32 -75.0 -74.9167 -74.8333 -74.75 -74.6667 ...
  * latitude   (latitude) float32 25.0 25.0833 25.1667 25.25 25.3333 25.4167 ...
  * depth      (depth) float32 0.494025
  * dayofyear  (dayofyear) int64 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 ...
Data variables:
    thetao     (dayofyear, depth, latitude, longitude) float64 25.06 25.16 ...

What I want to achieve is to construct from the temperature dataset a new one with standardized temperatures. The issue is that I do not know how. The first thing I thought of was to do something like:

new_dset = (dset.groupby("time.dayofyear") - mean) / std

However, the issue now is that I can't undo the "groupby". At this point I have not managed to figure out either how to undo this or, more generally, what the right approach is for performing the operation I want.

documentation usage question

botev

All 14 comments

So you have a two-year temperature field with dimensions [730, 1, 481, 781], plus mean and std data arrays of [366, 1, 481, 781], and you want to normalize the temperature field.

Sorry, I'm not familiar with xarray's groupby functions, but I'll suggest several things to try before the experts jump in.

1. Concatenate two copies of the std/mean fields along dayofyear, and reindex them to the time index from the temperature data. Then you can do (dset - mean) / std.
2. Separate the temperature field into two one-year chunks, reindex time to dayofyear, then do the calculation.
3. Flatten the spatial grid, then use numpy to do the trick.

I'm also interested in the right way to do this using built-in xarray functions. I'm pretty sure there are some cleverer ways.

fischcheng on 19 Jan 2018

Thanks for the suggestions. However, options 2 and 3 are not really options, as afterwards I need to provide the standardized field with the original time index. I'm using xarray for the first time but will try to do the reindexing.

botev on 19 Jan 2018

I ended up doing the following:

# dset, mean, std - all xarray objects as explained above
time_index = dset.time.dt.dayofyear
dset_mean = mean.sel(dayofyear=time_index)
dset_std = std.sel(dayofyear=time_index)
new_dset = ((dset - dset_mean) / dset_std).drop("dayofyear")

One issue, though, is that this is quite bad on memory, as it constructs 3 arrays in memory as large as the original one. If anyone has any suggestions on how to improve this, I would be very grateful. Also, is it possible to compute and store new_dset simultaneously so I don't create it in memory?
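
For readers who want to try the selection approach above without the original files, here is a minimal sketch on toy data. The array sizes, the dimension name `x`, and the random values are assumptions made for illustration; `drop_vars` is used as the newer spelling of `drop`. The final lines cross-check the result against plain NumPy indexing.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy stand-ins (hypothetical sizes and values) for dset, mean and std.
times = pd.date_range("2016-01-01", "2017-12-31", freq="D", name="time")
rng = np.random.default_rng(0)
dset = xr.DataArray(
    rng.normal(size=(times.size, 3)),
    coords={"time": times},
    dims=["time", "x"],
)
clim_coords = {"dayofyear": np.arange(1, 367)}
mean = xr.DataArray(
    rng.normal(size=(366, 3)), coords=clim_coords, dims=["dayofyear", "x"]
)
std = xr.DataArray(
    rng.uniform(0.5, 2.0, size=(366, 3)), coords=clim_coords, dims=["dayofyear", "x"]
)

# Label-based vectorized indexing: pick one climatology row per time step.
time_index = dset.time.dt.dayofyear
standardized = (
    (dset - mean.sel(dayofyear=time_index)) / std.sel(dayofyear=time_index)
).drop_vars("dayofyear")

# Cross-check against plain NumPy indexing (dayofyear labels are 1-based).
rows = time_index.values - 1
expected = (dset.values - mean.values[rows]) / std.values[rows]
assert np.allclose(standardized.values, expected)
```

The result keeps the original time index. On the memory question, one option that is not tested here is to open the inputs as lazy, chunked (dask-backed) arrays, in which case the same expression would only be evaluated chunk by chunk when the result is written out.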

botev on 19 Jan 2018

You can do this in a single step with xarray.apply_ufunc(), which is a sort of more flexible/powerful interface to xarray's broadcasting arithmetic. Extending the toy weather example from the docs:

import xarray as xr
import numpy as np
import pandas as pd

np.random.seed(123)

times = pd.date_range('2000-01-01', '2001-12-31', name='time')
annual_cycle = np.sin(2 * np.pi * (np.array(times.dayofyear) / 365.25 - 0.28))

base = 10 + 15 * annual_cycle.reshape(-1, 1)
tmin_values = base + 3 * np.random.randn(annual_cycle.size, 3)
tmax_values = base + 10 + 3 * np.random.randn(annual_cycle.size, 3)

ds = xr.Dataset({'tmin': (('time', 'location'), tmin_values),
                 'tmax': (('time', 'location'), tmax_values)},
                {'time': times, 'location': ['IA', 'IN', 'IL']})

# new code
ds_mean = ds.groupby('time.month').mean('time')
ds_std = ds.groupby('time.month').std('time')

xr.apply_ufunc(lambda x, m, s: (x - m) / s, ds.groupby('time.month'), ds_mean, ds_std)

The other way (about twice as slow) is to chain two calls to groupby():

(ds.groupby('time.month') - ds_mean).groupby('time.month') / ds_std
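
As a rough check of the chained-groupby form, the sketch below standardizes a toy array by month and confirms that the result is back on the original time axis. The data values are made up for illustration, and the `drop_vars` step is a defensive assumption rather than something the thread requires.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical toy data: two years of daily values at three locations.
times = pd.date_range("2000-01-01", "2001-12-31", name="time")
rng = np.random.default_rng(123)
da = xr.DataArray(
    10 + 3 * rng.standard_normal((times.size, 3)),
    coords={"time": times, "location": ["IA", "IN", "IL"]},
    dims=["time", "location"],
    name="tmin",
)

monthly = da.groupby("time.month")
clim_mean = monthly.mean("time")
clim_std = monthly.std("time")

# First groupby: subtract each month's mean; the result is indexed by time again.
anomalies = monthly - clim_mean
# Drop the helper 'month' coordinate if the first step attached one,
# so the second groupby starts from a clean array.
anomalies = anomalies.drop_vars("month", errors="ignore")
# Second groupby: divide by each month's standard deviation.
standardized = anomalies.groupby("time.month") / clim_std

# Each calendar month should now have mean ~0 and standard deviation ~1.
check = standardized.groupby("time.month")
assert np.allclose(check.mean("time").values, 0.0, atol=1e-9)
assert np.allclose(check.std("time").values, 1.0)
```

Because the group statistics are broadcast back along time, nothing has to be "undone" after the groupby.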

I'll mark this as a documentation issue in case anyone wants to add an example to the docs.

shoyer on 20 Jan 2018

👍1

Example for the docs proposed here:
https://github.com/pydata/xarray/pull/1848

braaannigan on 22 Jan 2018

Thanks a lot for the help!

botev on 22 Jan 2018

I am commenting on this issue, because my findings seem relevant to this example.

I have just encountered an unexpected (to me) behavior of dayofyear.

I have a dataset, ds:

<xarray.Dataset>
Dimensions:  (L: 45, S: 1168)
Coordinates:
  * S        (S) datetime64[ns] 1999-01-01T12:00:00 1999-01-06T12:00:00 ...
  * L        (L) float64 0.0 24.0 48.0 72.0 96.0 120.0 144.0 168.0 192.0 ...
Data variables:
    pr       (S, L) float32 2.0625568e-05 3.5336856e-05 5.2443047e-05 ...
    truth    (S, L) float32 2.0625568e-05 3.5336856e-05 5.2443047e-05 ...

S is my time coordinate. It is daily, but not continuous

<xarray.DataArray 'S' (S: 1168)>
array(['1999-01-01T12:00:00.000000000', '1999-01-06T12:00:00.000000000',
       '1999-01-11T12:00:00.000000000', ..., '2014-12-17T12:00:00.000000000',
       '2014-12-22T12:00:00.000000000', '2014-12-27T12:00:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * S        (S) datetime64[ns] 1999-01-01T12:00:00 1999-01-06T12:00:00 ...

For example for 1999 first three months:

ds.S.sel(S=slice('1999-01-01','1999-03-05'))

<xarray.DataArray 'S' (S: 13)>
array(['1999-01-01T12:00:00.000000000', '1999-01-06T12:00:00.000000000',
       '1999-01-11T12:00:00.000000000', '1999-01-16T12:00:00.000000000',
       '1999-01-21T12:00:00.000000000', '1999-01-26T12:00:00.000000000',
       '1999-01-31T12:00:00.000000000', '1999-02-05T12:00:00.000000000',
       '1999-02-10T12:00:00.000000000', '1999-02-15T12:00:00.000000000',
       '1999-02-20T12:00:00.000000000', '1999-02-25T12:00:00.000000000',
       '1999-03-02T12:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
  * S        (S) datetime64[ns] 1999-01-01T12:00:00 1999-01-06T12:00:00 ...

and for 2008:

broadcasted_data.S.sel(S=slice('2008-01-01','2008-03-05'))

<xarray.DataArray 'S' (S: 13)>
array(['2008-01-01T12:00:00.000000000', '2008-01-06T12:00:00.000000000',
       '2008-01-11T12:00:00.000000000', '2008-01-16T12:00:00.000000000',
       '2008-01-21T12:00:00.000000000', '2008-01-26T12:00:00.000000000',
       '2008-01-31T12:00:00.000000000', '2008-02-05T12:00:00.000000000',
       '2008-02-10T12:00:00.000000000', '2008-02-15T12:00:00.000000000',
       '2008-02-20T12:00:00.000000000', '2008-02-25T12:00:00.000000000',
       '2008-03-02T12:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
  * S        (S) datetime64[ns] 2008-01-01T12:00:00 2008-01-06T12:00:00 ...

Please note, within the non leap (1999) or leap (2008) years, the days are the same. There are 73 S values per year.

However when I groupby('S.dayofyear') things are not aligned anymore starting from March.

For example, if I groupby() and print the value of dayofyear and the grouped values:

for k, gg in ds.groupby('S.dayofyear'):
    print(k)
    print(gg)

.....
51  ## 51st day of the year
<xarray.Dataset>
Dimensions:  (L: 45, S: 16)
Coordinates:
  * S        (S) datetime64[ns] 1999-02-20T12:00:00 2000-02-20T12:00:00 ...
  * L        (L) float64 0.0 24.0 48.0 72.0 96.0 120.0 144.0 168.0 192.0 ...
Data variables:
    pr       (S, L) float32 2.8822698e-05 3.1478736e-05 3.707411e-05 ...
    truth    (S, L) float32 2.8387214e-05 2.8993465e-05 2.8109233e-05 ...
56  ## 56th day of the year
<xarray.Dataset>
Dimensions:  (L: 45, S: 16)
Coordinates:
  * S        (S) datetime64[ns] 1999-02-25T12:00:00 2000-02-25T12:00:00 ...
  * L        (L) float64 0.0 24.0 48.0 72.0 96.0 120.0 144.0 168.0 192.0 ...
Data variables:
    pr       (S, L) float32 3.5827405e-05 2.27847e-05 2.8826753e-05 ...
    truth    (S, L) float32 2.9589286e-05 2.6589936e-05 2.7626802e-05 ...

Up to here everything looks good: I have 16 values (one for each year of data) for each day of the year. But starting with March 2nd, they get split into two groups:

61  ## 61st day of the year
<xarray.Dataset>
Dimensions:  (L: 45, S: 12)
Coordinates:
  * S        (S) datetime64[ns] 1999-03-02T12:00:00 2001-03-02T12:00:00 ...
  * L        (L) float64 0.0 24.0 48.0 72.0 96.0 120.0 144.0 168.0 192.0 ...
Data variables:
    pr       (S, L) float32 2.2245076e-05 2.9928206e-05 3.2708682e-05 ...
    truth    (S, L) float32 2.5899697e-05 2.5815236e-05 2.6628013e-05 ...
62  ## 62nd day of the year
<xarray.Dataset>
Dimensions:  (L: 45, S: 4)
Coordinates:
  * S        (S) datetime64[ns] 2000-03-02T12:00:00 2004-03-02T12:00:00 ...
  * L        (L) float64 0.0 24.0 48.0 72.0 96.0 120.0 144.0 168.0 192.0 ...
Data variables:
    pr       (S, L) float32 2.3905726e-05 2.1646814e-05 1.5209519e-05 ...
    truth    (S, L) float32 2.4452387e-05 2.5048954e-05 2.5876538e-05 ...
66  ## 66th day of the year
<xarray.Dataset>
Dimensions:  (L: 45, S: 12)
Coordinates:
  * S        (S) datetime64[ns] 1999-03-07T12:00:00 2001-03-07T12:00:00 ...
  * L        (L) float64 0.0 24.0 48.0 72.0 96.0 120.0 144.0 168.0 192.0 ...
Data variables:
    pr       (S, L) float32 2.60827e-05 4.9364742e-05 3.838778e-05 ...
    truth    (S, L) float32 2.6537613e-05 2.7840171e-05 2.7700215e-05 ...
67  ## 67th day of the year
<xarray.Dataset>
Dimensions:  (L: 45, S: 4)
Coordinates:
  * S        (S) datetime64[ns] 2000-03-07T12:00:00 2004-03-07T12:00:00 ...
  * L        (L) float64 0.0 24.0 48.0 72.0 96.0 120.0 144.0 168.0 192.0 ...
Data variables:
    pr       (S, L) float32 1.59269e-05 2.7056101e-05 1.8332774e-05 ...
    truth    (S, L) float32 2.1952277e-05 2.7667278e-05 2.5342364e-05 ...

and so on.

This was unexpected to me, and it is not well documented. It means that, especially when we calculate anomalies, we might not be aligning things correctly. Or am I wrong?
Is there a way to group the data by the day of the year so that everything is grouped on 366 days?
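
The split described above can be reproduced with just two dates. This is a minimal sketch: the two-element array is made up, and only the dimension name `S` is taken from the dataset shown earlier.

```python
import pandas as pd
import xarray as xr

# The same calendar date (March 2nd) in a non-leap and a leap year.
dates = pd.to_datetime(["1999-03-02", "2000-03-02"])
da = xr.DataArray([1.0, 2.0], coords={"S": dates}, dims=["S"])

# dayofyear counts days since the start of the year, so Feb 29 shifts
# every later date in a leap year by one.
day_numbers = da.S.dt.dayofyear.values.tolist()
print(day_numbers)  # [61, 62]

# Grouping on dayofyear therefore puts the two dates in different groups.
group_keys = sorted(int(k) for k in da.groupby("S.dayofyear").groups)
print(group_keys)  # [61, 62]
```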

chiaral on 30 Aug 2018

@chiaral You should take a look at CFTimeIndex which specifically was designed to solve this problem: http://xarray.pydata.org/en/stable/time-series.html#non-standard-calendars-and-dates-outside-the-timestamp-valid-range

shoyer on 31 Aug 2018

👍1

@chiaral if I understand correctly, your data does use a standard calendar, but the issue is that you would like to group values based on matching month and day numbers (e.g. all January 1st's, all January 6th's, ..., all March 2nd's etc.) rather than matching "days since December 31st the preceding year," which is what the dayofyear attribute corresponds with. Is that right?

spencerkclark on 1 Sep 2018

Yes, @spencerkclark, that was my initial intent. For some reason (and I understand I was wrong about it) I thought that dayofyear would always align the days on the same grid. To be honest, I had never used it until now, so I wasn't sure how it worked. I was just surprised by that behavior, which I understand is intended. It is just not explained well, IMHO. If we calculate the daily climatology, the 366th day is the 31st of December of every 4 years, right? It just wasn't exactly what I expected, so I thought I'd put a note in this issue, which popped up when I was looking for more details about this attribute.

That said, is there a more suitable attribute for what I want to do? This is maybe not the best place to discuss that; I can send an email to the mailing list.

chiaral on 3 Sep 2018

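No worries @chiaral; I agree on the xarray side this isn't so well documented (you have to follow the link to the pandas description of the datetime components).

Unfortunately there is not a simple attribute for grouping by matching month and day. It is possible to define your own vector of integers for this purpose, however. Perhaps you've already found a workaround, but just in case, here is one way to define a "modified ordinal day" that you can use in a groupby call: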

In [1]: import xarray as xr

In [2]: from datetime import datetime

In [3]: dates = [datetime(1999, 1, 1), datetime(1999, 3, 1),
   ...:          datetime(2000, 1, 1), datetime(2000, 3, 1)]
   ...:

In [4]: da = xr.DataArray([1, 2, 3, 4], coords=[dates], dims=['time'])

In [5]: not_leap_year = xr.DataArray(~da.indexes['time'].is_leap_year, coords=da.coords)

In [6]: march_or_later = da.time.dt.month >= 3

In [7]: ordinal_day = da.time.dt.dayofyear

In [8]: modified_ordinal_day = ordinal_day + (not_leap_year & march_or_later)

In [9]: modified_ordinal_day = modified_ordinal_day.rename('modified_ordinal_day')

In [10]: modified_ordinal_day
Out[10]:
<xarray.DataArray 'modified_ordinal_day' (time: 4)>
array([ 1, 61,  1, 61])
Coordinates:
  * time     (time) datetime64[ns] 1999-01-01 1999-03-01 2000-01-01 2000-03-01

In [11]: da.groupby(modified_ordinal_day).mean('time')
Out[11]:
<xarray.DataArray (modified_ordinal_day: 2)>
array([2., 3.])
Coordinates:
  * modified_ordinal_day  (modified_ordinal_day) int64 1 61

Note if we use the standard ordinal day we get three groups, because of the difference between non-leap and leap years:

In [12]: ordinal_day
Out[12]:
<xarray.DataArray 'dayofyear' (time: 4)>
array([ 1, 60,  1, 61])
Coordinates:
  * time     (time) datetime64[ns] 1999-01-01 1999-03-01 2000-01-01 2000-03-01

In [13]: da.groupby(ordinal_day).mean('time')
Out[13]:
<xarray.DataArray (dayofyear: 3)>
array([2., 2., 4.])
Coordinates:
  * dayofyear  (dayofyear) int64 1 60 61
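
The steps in the session above can be wrapped in a small helper. This is only a sketch: the function name `modified_ordinal_day` and its `dim` argument are hypothetical, and it assumes the time coordinate is backed by a pandas DatetimeIndex (standard calendar).

```python
import numpy as np
import pandas as pd
import xarray as xr


def modified_ordinal_day(da, dim="time"):
    """Day-of-year labels on a 366-day grid (hypothetical helper).

    In non-leap years, dates from March 1st onward are shifted by one so
    that every calendar date gets the same label in every year.
    """
    index = da.indexes[dim]  # assumed to be a pandas DatetimeIndex
    shift = (~index.is_leap_year) & (index.month >= 3)
    values = np.asarray(index.dayofyear) + shift.astype(int)
    return xr.DataArray(
        values, coords={dim: index}, dims=[dim], name="modified_ordinal_day"
    )


times = pd.to_datetime(
    ["1999-02-28", "1999-03-01", "2000-02-29", "2000-03-01", "1999-12-31"]
)
da = xr.DataArray(np.arange(5.0), coords={"time": times}, dims=["time"])

labels = modified_ordinal_day(da)
print(labels.values.tolist())  # [59, 61, 60, 61, 366]

# Both March 1st values (1.0 and 3.0) now fall into the same group.
clim = da.groupby(labels).mean("time")
```

With this labelling, day 60 exists only in leap years (February 29th), and December 31st is always day 366.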

spencerkclark on 3 Sep 2018

👍4

Building on the above example, if you're OK with using a coordinate of strings, the following might be a little simpler way of defining the labels to use for grouping (this is perhaps closer to a single attribute solution):

In [14]: month_day_str = xr.DataArray(da.indexes['time'].strftime('%m-%d'), coords=da.coords,
    ...:                              name='month_day_str')
    ...:

In [15]: da.groupby(month_day_str).mean('time')
Out[15]:
<xarray.DataArray (month_day_str: 2)>
array([2., 3.])
Coordinates:
  * month_day_str  (month_day_str) object '01-01' '03-01'

Note #2090 / #2144 would make this more straightforward.
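
For reference, here is a sketch of the same grouping written with the `dt.strftime` accessor method. It assumes an xarray version recent enough to provide that method; the name `month_day` is chosen here for illustration.

```python
from datetime import datetime

import xarray as xr

dates = [datetime(1999, 1, 1), datetime(1999, 3, 1),
         datetime(2000, 1, 1), datetime(2000, 3, 1)]
da = xr.DataArray([1, 2, 3, 4], coords=[dates], dims=["time"])

# Build 'MM-DD' labels directly from the time coordinate
# (assumes the dt.strftime accessor method is available).
month_day = da.time.dt.strftime("%m-%d").rename("month_day")

# One group per calendar date, regardless of leap years.
clim = da.groupby(month_day).mean("time")
```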

spencerkclark on 3 Sep 2018

👍2

Thanks - I will give this a try!
And thanks for the clarifications.

chiaral on 4 Sep 2018

For anyone stumbling upon this thread in the future: I used the grouping approach suggested above by @spencerkclark on my dataset to calculate a climatology by calendar day, and it works smoothly. The only thing to be careful about is that you can't directly plot the result:

In[1]: da.groupby(month_day_str).mean('time').plot()
Out[1]: TypeError: Plotting requires coordinates to be numeric or dates of type np.datetime64 or datetime.datetime.

To get around it, either group by modified_ordinal_day or convert the grouped coordinate month_day_str back to numeric. However, after doing all this I found out that the CDO function also calculates climatology by the ordinal day of the year. So, to be consistent, I would stick to that method, but it's good to know anyway that there is a way to group by day and month in xarray if required.
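
One possible way to convert the string coordinate back into something plottable is to map each 'MM-DD' label onto a date in a reference leap year. This is a sketch under that assumption: the year 2000 is an arbitrary choice, and `clim_dated` is a name introduced here.

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.to_datetime(["1999-01-01", "1999-03-01", "2000-01-01", "2000-03-01"])
da = xr.DataArray([1.0, 2.0, 3.0, 4.0], coords={"time": times}, dims=["time"])
month_day_str = xr.DataArray(
    times.strftime("%m-%d"), coords={"time": times}, dims=["time"],
    name="month_day_str",
)
clim = da.groupby(month_day_str).mean("time")

# Map each 'MM-DD' label onto a date in a leap reference year (2000 here,
# an arbitrary choice) so that a '02-29' label would remain a valid date.
ref_dates = pd.to_datetime(
    [f"2000-{label}" for label in clim["month_day_str"].values]
)
clim_dated = clim.assign_coords(month_day_str=ref_dates)

# The coordinate is now datetime64, which the plotting code accepts.
assert np.issubdtype(clim_dated["month_day_str"].dtype, np.datetime64)
```

Calling `clim_dated.plot()` should then work like any other time series plot, with the reference year serving only as a placeholder on the axis.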