Xarray: Undesired decoding to timedelta64 (was: units of "seconds" translated to time coordinate)

Created on 10 Oct 2017 · 14 Comments · Source: pydata/xarray

When using open_dataset(), xarray translates data variables with units of "seconds" (for example, measurements of wave period) into time values. I don't believe xarray should treat variables as time coordinates unless their units are of the form "seconds since ...". I have noticed that changing my units to "second", "sec", or "s" prevents xarray from translating the variable to timedelta64 and keeps it float64, as desired. More details and an OPeNDAP example are posted on StackOverflow: https://stackoverflow.com/questions/46552078/xarray-wave-period-in-seconds-ingested-as-timedelta64

CF conventions help wanted


All 14 comments

Thanks for filing an issue -- I usually follow the xarray tag on StackOverflow but missed this one.

I wrote an answer on StackOverflow to explain why this works this way and how to work around it, but as I say in my answer although this is intended behavior I'm open to ideas for improvement. You are not the first person to complain about this, specifically for handling of wave periods. See https://github.com/pydata/xarray/issues/843 and links therein for prior discussion (CC @ocefpaf).

Thanks @shoyer. I understand the issue better now. Had not considered timedeltas. That does conflate the issue. No sure-fire solution, then, so I'll continue with the workaround. Cheers.

On https://stackoverflow.com/a/46675990/2005869, @shoyer explains:

My understanding of CF standard names is that forecast_period should be equal to the difference between time and forecast_reference_time, i.e., forecast_period = time - forecast_reference_time. If you specified your time_offset variable with units in the form "hours", then it would be decoded to timedelta64, along with datetime64 for time and time_run, so xarray's arithmetic would actually satisfy this identity. You might find this useful if you only wanted to include two of these variables and wanted to calculate the third on the fly. On the other hand, you probably don't want to convert the Tper variable to timedelta64. Technically, it is also a time period, but it's not a variable that makes sense to compare to time.
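The identity described above can be illustrated with plain NumPy. This is a minimal sketch with made-up timestamps; the variable names follow the CF standard names in the quote, and the wave period value is hypothetical.

```python
import numpy as np

# Hypothetical forecast: model run at 00:00, valid 6 and 12 hours later.
forecast_reference_time = np.datetime64("2017-10-10T00:00")
time = np.array(["2017-10-10T06:00", "2017-10-10T12:00"], dtype="datetime64[ns]")

# Subtracting datetime64 values yields timedelta64, so a forecast_period
# decoded to timedelta64 is directly comparable with this difference.
forecast_period = time - forecast_reference_time

# Convert back to plain floats in hours for display or storage.
hours = forecast_period / np.timedelta64(1, "h")

# A wave period, by contrast, is more naturally kept as a float in seconds.
wave_period_s = np.array([8.5, 9.0])
```

Here the difference has a timedelta64 dtype, while the wave period stays a float array that can be used in ordinary arithmetic.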

I understand the potential issue here, but I think Xarray should follow CF conventions for time, and only treat variables as time coordinates if they have valid CF time units (<time unit> since <date>).

We know of thousands of datasets (every dataset with waves!) where the current Xarray behavior is a problem.
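The distinction being argued here, between CF reference-time units ("<time unit> since <date>") and a bare time unit such as "seconds", can be sketched in a few lines. The function name `classify_units` and the unit set are illustrative assumptions, not xarray's actual logic.

```python
import re

# Bare time units; an illustrative subset, not xarray's actual list.
TIME_UNITS = {"days", "hours", "minutes", "seconds", "milliseconds"}

# CF reference-time pattern: "<time unit> since <date>".
_SINCE = re.compile(r"^\s*(?P<unit>\w+)\s+since\s+(?P<date>.+)$", re.IGNORECASE)


def classify_units(units: str) -> str:
    """Hypothetical helper: label a units string as datetime, duration, or other."""
    match = _SINCE.match(units)
    if match and match.group("unit").lower() in TIME_UNITS:
        return "datetime"  # e.g. "seconds since 1970-01-01"
    if units.strip().lower() in TIME_UNITS:
        return "duration"  # e.g. "seconds" on a wave-period variable
    return "other"
```

Under the proposal in this comment, only the "datetime" case would be decoded automatically; the "duration" case is the one currently turned into timedelta64.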

> I understand the potential issue here, but I think Xarray should follow CF conventions for time, and only treat variables as time coordinates if they have valid CF time units (<time unit> since <date>).

Rich, I know you've been involved in CF conventions and standard names, but I don't think the CF conventions on time apply directly here. These are "time difference" units, which are a distinct type of quantity. And assuredly the period of a wave is a type of "time difference" as well -- it's just not one that is useful to decode into an array with dtype np.timedelta64. Really this is a limitation of the non-ideal support for units in the NumPy ecosystem.

Is there some other sort of metadata we could use to make this distinction of "physical" vs. "human" time differences? Now is a good time to make changes, since we are on the verge of making a major release (v0.10).

Throwing out some ideas, none of which I particularly like (but to be clear, I don't like the status quo either):

  1. We could unilaterally stop automatic decoding into timedelta64. (We would need to add a separate helper function that could be called to do these conversions afterwards.)
  2. We could add look-up tables that recognize standard_name attributes, either a white-list or black-list for timedelta64 compatible variables. (This would be a first for xarray, and is not something I'm particularly looking forward to maintaining.)
  3. We could decode coordinates and data variables differently, only converting coordinates into timedelta64. (This is not entirely ideal either, since it's easy to switch between the two in xarray's data model.)
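For option 1, the "separate helper function" might look roughly like the sketch below. The name `to_timedelta64` and the unit table are hypothetical and not part of xarray; NaN values are not handled.

```python
import numpy as np

# Nanoseconds per unit for a few common CF unit strings (illustrative subset).
_NS_PER_UNIT = {
    "seconds": 10**9,
    "minutes": 60 * 10**9,
    "hours": 3600 * 10**9,
    "days": 86400 * 10**9,
}


def to_timedelta64(values, units):
    """Hypothetical helper: convert numeric durations to timedelta64[ns]."""
    scaled = np.asarray(values, dtype="float64") * _NS_PER_UNIT[units]
    # Round to whole nanoseconds before casting; NaN is not handled here.
    return np.round(scaled).astype("int64").astype("timedelta64[ns]")
```

A user who wants timedelta64 would then call the helper explicitly, for example `to_timedelta64([1.5, 2], "hours")`, instead of getting the conversion implicitly on open.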

I vote for 1, plus a verbose warning message.

I have never found timedelta64 indices to be particularly useful.

> I have never found timedelta64 indices to be particularly useful.

Same here. :+1: for 1

PS: 2 could be the start of a nice "CF-addon" package for xarray but I don't think it should be in the xarray code.

I don't have a strong opinion here but 1 seems best.

I vote for 1 also. How many makes a quorum? :smile_cat:

OK, sounds like there is consensus on removing this. I would still like there to be an option for doing this sort of decoding, because I'm sure somebody finds this useful (at least I did, back when I wrote it!).

In particular, it would be nice to have some way to round-trip the timedelta64 dtype. A simple way to do this would be to recognize the attribute dtype='timedelta64[ns]' (as an xarray-specific convention) and use that for decoding/encoding timedelta64 dtypes.

My suggested path forward:

  1. Add decoding support for recognizing dtype='timedelta64[ns]' and decoding it into the NumPy dtype. We have some very similar examples already (e.g., for dtype=bool), so this should not be hard.
  2. Write all timedelta64 dtype data in netCDF files by saving the dtype attribute instead of units.
  3. Issue a FutureWarning about what's going on that is triggered whenever unit='time_unit' is detected.
  4. In the next major release of xarray, stop decoding time units.
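Steps 1 and 2 amount to a round trip keyed on a dtype attribute instead of units. The following is a minimal sketch of that idea using plain NumPy and a dict of attributes; `encode_timedelta` and `decode_timedelta` are hypothetical names, not functions in xarray/conventions.py.

```python
import numpy as np


def encode_timedelta(values):
    """Hypothetical encoder: store integer nanoseconds plus a dtype attribute."""
    data = np.asarray(values, dtype="timedelta64[ns]")
    return data.astype("int64"), {"dtype": "timedelta64[ns]"}


def decode_timedelta(data, attrs):
    """Hypothetical decoder: restore timedelta64 only if the attribute is present."""
    if attrs.get("dtype") == "timedelta64[ns]":
        return np.asarray(data).astype("timedelta64[ns]")
    # Anything else, including units="seconds", is left as stored.
    return np.asarray(data)
```

With this convention a wave period carrying only `units="seconds"` stays numeric, while data that xarray itself wrote as timedelta64 is restored on read.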

Anyone interested in taking this on? All the logic can be found in xarray/conventions.py.

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically.

@shoyer I'm hitting this on xarray 0.15.1 so I guess the effort had stalled.

I'm quite happy with coordinates being decoded to timedelta64, but I'd love to have the option to not decode time interval in data variables.

Is there any plan to come back to this? And would the preferred solution still be the abovementioned one?

I'll need to spend some effort fixing our use case, possibly with a horrible work-around, so depending on the complexity of the preferred solution, I may be able to work on a PR for xarray instead.

I still think the ideal path forward is https://github.com/pydata/xarray/issues/1621#issuecomment-339116478, but clearly nobody was excited about taking on this effort :).

I do still think we _probably_ should retain a way to serialize/unserialize timedelta64 data before we switch the default behavior, rather than breaking existing users without any recourse.

That said, we certainly could add an optional flag for disabling decoding to timedelta64 (e.g., decode_timedelta=False in open_dataset/decode_cf) now, without changing anything else in xarray. The default flag switch could be saved until later, when the new timedelta64 serialization (steps 1 and 2 from above) works.
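As a sketch of how the proposed flag would be used: later xarray releases do accept a `decode_timedelta` keyword in `decode_cf` and `open_dataset`, though the version that introduced it is not stated in this thread. The dataset below is made up for illustration.

```python
import numpy as np
import xarray as xr

# Hypothetical wave dataset: peak period stored as floats with units "seconds".
raw = xr.Dataset(
    {"Tper": ("time", np.array([8.5, 9.0, 10.5]), {"units": "seconds"})},
    coords={"time": ("time", np.array([0, 1, 2]), {"units": "hours since 2017-10-10"})},
)

# Opt out of timedelta decoding; the time coordinate is still decoded
# because its units have the "<time unit> since <date>" form.
decoded = xr.decode_cf(raw, decode_timedelta=False)

print(decoded["Tper"].dtype)  # stays numeric
print(decoded["time"].dtype)  # decoded to a datetime64 dtype
```

The same keyword can be passed to `open_dataset` when reading from a file or an OPeNDAP URL.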

Ok, either @aurghs or I will take a shot at implementing decode_timedelta=False in the near future.

> Now is a good time to make changes, since we are on the verge of making a major release (v0.10).

That ship has clearly sailed quite a while ago! 🤣 But I think I speak for many when I say THANK YOU @alexamici for taking this up again. Many people will still be very happy if this gets implemented.
