Xarray: Creating unlimited dimensions with xarray.Dataset.to_netcdf

Created on 29 Aug 2016  ·  18 Comments  ·  Source: pydata/xarray

@shoyer you wrote in a comment on another issue

xray doesn't use or set unlimited dimensions. (It's pretty irrelevant for us, given that NumPy arrays can be stored in either row-major or column-major order.)

I see that xarray does not need UNLIMITED dimensions internally. But I need to create a netCDF file that I subsequently can append to (along the time dimension, in this case). Can this be done?
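
For illustration, a minimal sketch of the kind of workflow being asked for, using netCDF4-python directly rather than xarray (file and variable names here are just placeholders): a file whose time dimension is UNLIMITED can later be reopened and grown along that dimension.

    import numpy as np
    import netCDF4

    # Create a file whose time dimension is UNLIMITED (a record dimension).
    with netCDF4.Dataset("obs.nc", "w") as nc:
        nc.createDimension("time", None)       # None -> unlimited
        nc.createDimension("station", 3)
        time = nc.createVariable("time", "f8", ("time",))
        temp = nc.createVariable("temp", "f4", ("time", "station"))
        time[0] = 0.0
        temp[0, :] = np.zeros(3)

    # Reopen in append mode; writing past the current end grows the record dimension.
    with netCDF4.Dataset("obs.nc", "a") as nc:
        n = len(nc.dimensions["time"])
        nc.variables["time"][n] = 1.0
        nc.variables["temp"][n, :] = np.ones(3)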

Most helpful comment

All - I'm actively working on this. I have an initial implementation and will get it cleaned up here in the next few days. Stay tuned.

All 18 comments

Currently it's not supported, but yes, we could absolutely add it as an option. I would be happy to add this functionality if someone makes a pull request. This won't be very useful for editing the file with xarray, of course, because we don't support editing netCDF files without making a complete copy.

👍 from me. In the broader netCDF ecosystem, it can be pretty important to have record dimensions for various reasons, appending being the main one.

xref: https://github.com/pydata/xarray/issues/678

I'm also a +1 on this.

OK, I'd be up for taking a shot at it.

Since it is per-variable and specific to netCDF, I guess the perfect place to add this is in the encoding dictionary that you can pass to to_netcdf, right? Maybe as key unlimited? E.g.

ds.to_netcdf(encoding={'time': dict(unlimited=True)})

I need to look up whether netCDF allows for defining more than one unlimited dimension, otherwise that must throw an error.

And then it is just about passing None as the length to createDimension, at least in netCDF4 and scipy.io.netcdf. But I did not look into how xarray handles that under the hood.
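
For reference, a quick sketch of that mechanism in both libraries (file names are placeholders): passing None as the length is what creates the unlimited/record dimension.

    import netCDF4
    from scipy.io import netcdf_file

    # netCDF4-python: a size of None makes the dimension unlimited.
    with netCDF4.Dataset("via_netcdf4.nc", "w") as nc:
        nc.createDimension("time", None)
        print(nc.dimensions["time"].isunlimited())   # True

    # scipy.io.netcdf: the same convention marks the record dimension.
    f = netcdf_file("via_scipy.nc", "w")
    f.createDimension("time", None)
    f.close()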

Yes, we could put this in encoding if we want to preserve it through reading/writing files. NetCDF4 supports multiple unlimited dimensions; netCDF3 does not.

I think it makes sense to preserve the UNLIMITED state through read/write. In my case, I subset a netCDF file along lat and lon dimensions, leaving the time dimension untouched and would therefore expect it to pass unchanged through xarray IO (staying UNLIMITED).

However, when the dataset is indexed/subset/resampled along the unlimited dimension, it would make sense that its state is dropped. But that would require a lot of ifs and buts, so I suggest we leave that aside for now.

But maybe the encoding dict is not the way to go after all, since it contains entries per variable, while it is the dimension that must be unlimited.

Currently the dataset variables can be created in any order and their necessary dimensions created whenever needed (in the set_necessary_dimensions function). I would not like to change that logic (e.g. towards creating all dimensions required by all variables first, before adding the data variables).

So how about a new kw argument to to_netcdf, like

ds.to_netcdf(unlimited_dimensions=['time'])

or

ds.to_netcdf(dimension_unlimited={'time': True})

(the second option being better for explicitly setting {'time': False})?

The above solution would not require much more than that. In set_necessary_dimensions,

    def set_necessary_dimensions(self, variable):
        for d, l in zip(variable.dims, variable.shape):
            if d not in self.dimensions:
                self.set_dimension(d, l)

would become

    def set_necessary_dimensions(self, variable):
        for d, l in zip(variable.dims, variable.shape):
            if d in self._unlimited_dimensions:
                l = None  # a length of None makes the backend create the dimension as UNLIMITED
            if d not in self.dimensions:
                self.set_dimension(d, l)
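
Whether that is sufficient depends on the backend; the sketch below assumes (as I believe is the case for the netCDF4 backend) that set_dimension simply forwards the length to createDimension, in which case a length of None already produces an UNLIMITED dimension.

    # Sketch, not the exact xarray source:
    def set_dimension(self, name, length):
        self.ds.createDimension(name, size=length)   # size=None -> unlimited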

@jhamman sorry, I only now saw that you pointed to a previous issue on the same topic (with exactly the same considerations). I did not find that issue when I searched (for "unlimited").

You were against changing to_netcdf then. Are you still?

However, when the dataset is indexed/subset/resampled along the unlimited dimension, it would make sense that its state is dropped. But that would require a lot of ifs and buts, so I suggest we leave that aside for now.

This is exactly how Variable.encoding currently works: any operation that creates a new variable from the original variable drops the encoding.

If we put this encoding information on the variable corresponding to the dimension, any time you save a Dataset using that exact same dimension variable, it would be saved as unlimited size. So if you only modify other dimensions (e.g., with resampling or indexing), the unlimited dimension would indeed persist, as you desire.
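
To make that concrete, a hypothetical usage sketch (the 'unlimited' encoding key is only the proposal under discussion here, not an existing option, and the file names are placeholders):

    # Hypothetical: mark the dimension's coordinate variable in its encoding.
    ds["time"].encoding["unlimited"] = True

    # Indexing along other dimensions leaves the time variable (and its encoding)
    # untouched, so the flag would survive this write.
    ds.isel(lat=slice(0, 10)).to_netcdf("subset.nc")

    # Operations along time create a new time variable and drop its encoding,
    # so the flag would be lost here.
    ds.isel(time=slice(0, 10)).to_netcdf("first_steps.nc")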

@shoyer -

I played around with this a bit yesterday. I have two implementation questions:

  • for a dataset with multiple variables, putting the dimension encoding on the Variable easily leads to conflicts. For example, one variable's encoding may say that dimension x is unlimited while another's doesn't. How should we handle these conflicts?
  • are we opposed to an encoding attribute on the Dataset object?

Agreed, it's awkward to have this information on variables.

I was somewhat opposed to adding more state to the Dataset object but it seems like the necessary solution here. I'm not sure we need it in the Dataset constructor though -- could just have encoding as an attribute you need to modify. Honestly, could probably do the same for DataArray.encoding -- it's pretty low level.

Great to see you are pushing forward on this issue @jhamman and @shoyer. I would really have liked to contribute here, but it seems like there are quite a few design choices to make, which are better left in your hands.

The ability to set an unlimited dimension would be really useful for me as well.

Useful for me too.

For the time being there's also this hack (I haven't tested it):
http://stackoverflow.com/questions/28598485/how-to-convert-fixed-size-dimension-to-unlimited-in-a-netcdf-file

All - I'm actively working on this. I have an initial implementation and will get it cleaned up here in the next few days. Stay tuned.

@jhamman : Great!
@chrisb13 : I've been using the StackOverflow hack and it does work.

For those of you who are interested in this feature, I'd appreciate your feedback on #1170.
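
For reference, the interface that eventually landed via #1170 lets the caller request unlimited dimensions directly at write time; a minimal usage sketch with current xarray releases (the file name is a placeholder):

    # Write the file with "time" as an UNLIMITED dimension.
    ds.to_netcdf("output.nc", unlimited_dims=["time"])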
