Xarray: Feature Request: Hierarchical storage and processing in xarray

Created on 1 Jun 2020  路  5Comments  路  Source: pydata/xarray

I am using xarray for processing geospatial data and have encountered two major challenges with existing data structures in xarray:

  • Data arrays stored in an xarray Dataset cannot be grouped into hierarchical levels/logical subsets to reflect the internal organisation of the data. This makes it difficult to identify and process a subset of the data variables that pertain to a specific problem.

  • When two data arrays having a shared dimension but different coordinate values along the dimension are merged into a Dataset, the union of coordinate values from the 2 data arrays becomes the new coordinate set corresponding to that dimension. Consequently, when the value of a variable in the dataset corresponding to a coordinate value is unknown, nan is used as a substitute which results in memory wastage.

I would like to suggest a tree-based data structure for xarray in which the leaves store individual data arrays and the other nodes store the hierarchical information. Since data arrays are stored independently, each dimension only needs to be associated with coordinate values that are valid for that data array.

To meet these requirements, I have implemented a data structure that also supports the below capabilities:

  • Standard xarray methods can be applied to the tree at all hierarchical levels, i.e., when a function is called at a hierarchical level, it is mapped over all data arrays that occur at the leaves under the corresponding node. For example, say I have a tree object (lets call it dt) with child nodes: weather, satellite image and population. Each of these nodes has data arrays/subtrees under it.

Screenshot 2020-06-02 at 2 10 28 AM

The mean over time of all data variables associated with weather can be obtained using dt.weather.mean('time') which applies the function to sea_surface_temperature, dew_point_temperature, wind_speed and pressure.

  • It can be encoded into the netCDF format, like xarray Datasets.
  • It supports item assignment at all hierarchical levels.

I would like to know of the possibility of introducing such a data structure in xarray and the challenges involved in the same.

Most helpful comment

The data model you sketch out here looks very similar to what we discussed in #1092. I agree that the semantics are well defined.

The main question in my mind is whether it would make more sense to make an entirely new data structure (e.g., xarray.TreeDataset) or add in a new feature like groups to the existing xarray.Dataset.

Probably a new data structure would be easier at this point, because would keep Dataset simpler and wouldn't break existing code that works on xarray.Dataset.

All 5 comments

@emilbiju - thanks for opening an issue here. You may want to take a look at the conversation in #1092.

Thanks @jhamman for sharing the link. Here are my thoughts on the same:

For use-cases similar to the one I have mentioned, I think it would be more meaningful to allow the tree structure (calling it Datatree further) to exist as a separate data structure instead of residing within the Dataset. From what I understand, the xarray Dataset would enforce all its component variables to share the same coordinate set for a given dimension name. This would again result in memory wastage with nan values when the value corresponding to a coordinate is unknown.

Besides, xarray only allows attribute access for getting (and not setting) values, but a separate data structure can allow attribute access for setting values as well. For example, the data structure that I have implemented would allow something like dt.weather = dt.weather.mean('time') to alter all the data arrays under the weather node.

I am currently using attribute-based access for accessing child nodes/data arrays in the Datatree as it appears to reflect the tree structure better, but as @shoyer has pointed out, tuple-based access might be easier to use programmatically.

Instead of using netCDF4 groups for encoding the Datatree, I am currently following a simple 3-step process:

  • Combine all the data arrays at the leaves of a Datatree object into a dataset.
  • Add an additional data array to the dataset that would contain an ancestor matrix (or any other array-like representation) that can encode the hierarchical structure with a coordinate set containing names of the tree nodes.
  • Use the xarray.Dataset.to_netcdf method to store it in a netCDF file.

Therefore, within the netCDF file, it would exist just as a Dataset. A specially implemented Datatree.open_datatree method can open the dataset, detect this additional array and recreate the tree structure to instantiate the object. I would like to know if using netCDF4 groups instead provide any advantages over this approach?

Thanks for writing this up @emilbiju . These are very interesting ideas

  1. The nice thing about using NetCDF groups (or HDF5?) is that it is a standard and your data files are readable using other software.

  2. So far, xarray has been reluctant to add "groups" or this kind of hierarchical organization because of all the additional complexity involved (#1092)

  3. That said, there is definitely interest in a package that provides a high-level object composed of multiple xarray datasets (again #1092). So I encourage you to post your code online so others can try it out and iterate.

    a. For example, our friends over at Arviz have a InferenceData structure composed of multiple Datasets that is represented on-disk using NetCDF groups: https://arviz-devs.github.io/arviz/notebooks/XarrayforArviZ.html

image

I would be open to exploring adding a hierarchical data structure into xarray (on an experimental basis, to start), but it would need someone with serious interest and time to make it happen. Certainly there are plenty of use cases across various fields.

The data model you sketch out here looks very similar to what we discussed in #1092. I agree that the semantics are well defined.

The main question in my mind is whether it would make more sense to make an entirely new data structure (e.g., xarray.TreeDataset) or add in a new feature like groups to the existing xarray.Dataset.

Probably a new data structure would be easier at this point, because would keep Dataset simpler and wouldn't break existing code that works on xarray.Dataset.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

ray306 picture ray306  路  4Comments

aseyboldt picture aseyboldt  路  5Comments

Yefee picture Yefee  路  4Comments

zxdawn picture zxdawn  路  3Comments

tfurf picture tfurf  路  4Comments