I am using xarray for processing geospatial data and have encountered two major challenges with existing data structures in xarray:
Data arrays stored in an xarray Dataset cannot be grouped into hierarchical levels/logical subsets to reflect the internal organisation of the data. This makes it difficult to identify and process a subset of the data variables that pertain to a specific problem.
When two data arrays that share a dimension but have different coordinate values along it are merged into a Dataset, the union of the coordinate values from the two data arrays becomes the new coordinate set for that dimension. Consequently, wherever the value of a variable at a coordinate value is unknown, NaN is used as a substitute, which wastes memory.
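The second point can be reproduced with a small sketch. The variable names and values below are made up for illustration; the only assumption about xarray is that `xr.merge` aligns its inputs with an outer join when `join="outer"` is passed.

```python
import numpy as np
import xarray as xr

# Two hypothetical variables that share the "time" dimension
# but are sampled at completely different times.
temperature = xr.DataArray(
    np.array([280.0, 281.0, 282.0]),
    dims=["time"],
    coords={"time": [0, 1, 2]},
    name="temperature",
)
pressure = xr.DataArray(
    np.array([1010.0, 1012.0]),
    dims=["time"],
    coords={"time": [10, 11]},
    name="pressure",
)

# An outer join aligns both variables on the union of the coordinate values.
merged = xr.merge([temperature, pressure], join="outer")

# Count the cells that had to be filled with NaN.
n_missing = sum(
    int(merged[name].isnull().sum().item()) for name in ("temperature", "pressure")
)
n_cells = sum(merged[name].size for name in ("temperature", "pressure"))
print(f"{n_missing} of {n_cells} cells are NaN")
```

Here the merged `time` coordinate has five values, so each variable is padded out to length five even though only five real values exist across both.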
I would like to suggest a tree-based data structure for xarray in which the leaves store individual data arrays and the other nodes store the hierarchical information. Since data arrays are stored independently, each dimension only needs to be associated with coordinate values that are valid for that data array.
To meet these requirements, I have implemented a data structure that also supports the following capabilities:
dt) with child nodes: weather, satellite image and population. Each of these nodes has data arrays/subtrees under it. The mean over time of all data variables associated with weather can be obtained using dt.weather.mean('time') which applies the function to sea_surface_temperature, dew_point_temperature, wind_speed and pressure.
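A minimal sketch of how such a tree could behave is given below. It is not the implementation described in this issue: the `TreeNode` class, its `map` helper, and the sample values are hypothetical, and it assumes every leaf under the node has the dimension being reduced.

```python
import xarray as xr


class TreeNode:
    """Hypothetical tree node whose children are TreeNode or xarray.DataArray objects."""

    def __init__(self, **children):
        self._children = dict(children)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so child names
        # can be used as attributes (e.g. dt.weather.wind_speed).
        try:
            return self.__dict__["_children"][name]
        except KeyError:
            raise AttributeError(name) from None

    def map(self, func):
        """Apply func to every leaf and return a new tree with the same shape."""
        return TreeNode(**{
            name: child.map(func) if isinstance(child, TreeNode) else func(child)
            for name, child in self._children.items()
        })

    def mean(self, dim):
        """Reduce every leaf under this node over `dim`."""
        return self.map(lambda da: da.mean(dim))


# Each leaf keeps its own coordinates, so no NaN padding is needed.
dt = TreeNode(
    weather=TreeNode(
        wind_speed=xr.DataArray(
            [3.0, 5.0, 7.0], dims=["time"], coords={"time": [0, 1, 2]}
        ),
        pressure=xr.DataArray(
            [1000.0, 1010.0], dims=["time"], coords={"time": [0, 5]}
        ),
    ),
)

weather_mean = dt.weather.mean("time")
print(float(weather_mean.wind_speed), float(weather_mean.pressure))
```

Because the reduction returns a new tree rather than modifying the leaves in place, the original `dt` is left untouched.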
I would like to know whether such a data structure could be introduced in xarray and what challenges that would involve.
@emilbiju - thanks for opening an issue here. You may want to take a look at the conversation in #1092.
Thanks @jhamman for sharing the link. Here are my thoughts on the same:
For use cases similar to the one I have mentioned, I think it would be more meaningful to allow the tree structure (called Datatree from here on) to exist as a separate data structure instead of residing within the Dataset. From what I understand, the xarray Dataset would require all its component variables to share the same coordinate set for a given dimension name. This would again waste memory on NaN values wherever the value corresponding to a coordinate is unknown.
Besides, xarray only allows attribute access for getting (and not setting) values, but a separate data structure can allow attribute access for setting values as well. For example, the data structure that I have implemented would allow something like dt.weather = dt.weather.mean('time') to alter all the data arrays under the weather node.
I am currently using attribute-based access for accessing child nodes/data arrays in the Datatree as it appears to reflect the tree structure better, but as @shoyer has pointed out, tuple-based access might be easier to use programmatically.
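To illustrate why tuple-based access can be easier to use programmatically, here is a small sketch that uses nested dictionaries in place of a real tree. The `get_path` and `set_path` helpers are hypothetical, and plain lists stand in for data arrays.

```python
from functools import reduce


def get_path(tree, path):
    """Return the node reached by following the keys in `path` from the root."""
    return reduce(lambda node, key: node[key], path, tree)


def set_path(tree, path, value):
    """Store `value` at `path`, creating intermediate dicts as needed."""
    *parents, last = path
    node = tree
    for key in parents:
        node = node.setdefault(key, {})
    node[last] = value


# Hypothetical tree: plain lists stand in for data arrays.
tree = {"weather": {"wind_speed": [3.0, 5.0, 7.0], "pressure": [1000.0, 1010.0]}}

# Paths are ordinary tuples, so they can be built in a loop or stored in variables,
# which is awkward to do with attribute access.
for name in ("wind_speed", "pressure"):
    values = get_path(tree, ("weather", name))
    set_path(tree, ("weather", "mean", name), sum(values) / len(values))

print(tree["weather"]["mean"])
```

Attribute access and tuple access are not mutually exclusive; a tree class could offer both, with attributes for interactive use and paths for generic code.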
Instead of using netCDF4 groups for encoding the Datatree, I am currently following a simple 3-step process:
Datatree object into a dataset. xarray.Dataset.to_netcdf method to store it in a netCDF file. Therefore, within the netCDF file, it would exist just as a Dataset. A specially implemented Datatree.open_datatree method can open the dataset, detect this additional array and recreate the tree structure to instantiate the object. I would like to know whether using netCDF4 groups instead would provide any advantages over this approach.
Thanks for writing this up, @emilbiju. These are very interesting ideas.
The nice thing about using NetCDF groups (or HDF5?) is that it is a standard and your data files are readable using other software.
So far, xarray has been reluctant to add "groups" or this kind of hierarchical organization because of all the additional complexity involved (#1092).
That said, there is definitely interest in a package that provides a high-level object composed of multiple xarray datasets (again #1092). So I encourage you to post your code online so others can try it out and iterate.
For example, our friends over at ArviZ have an InferenceData structure composed of multiple Datasets that is represented on disk using NetCDF groups: https://arviz-devs.github.io/arviz/notebooks/XarrayforArviZ.html

I would be open to exploring adding a hierarchical data structure into xarray (on an experimental basis, to start), but it would need someone with serious interest and time to make it happen. Certainly there are plenty of use cases across various fields.
The data model you sketch out here looks very similar to what we discussed in #1092. I agree that the semantics are well defined.
The main question in my mind is whether it would make more sense to make an entirely new data structure (e.g., xarray.TreeDataset) or add in a new feature like groups to the existing xarray.Dataset.
A new data structure would probably be easier at this point, because it would keep Dataset simpler and wouldn't break existing code that works on xarray.Dataset.