Xarray: Creating unlimited dimensions with xarray.Dataset.to_netcdf

Created on 29 Aug 2016  ·  18 Comments  ·  Source: pydata/xarray

@shoyer you wrote in a comment on another issue

xray doesn't use or set unlimited dimensions. (It's pretty irrelevant for us, given that NumPy arrays can be stored in either row-major or column-major order.)

I see that xarray does not need UNLIMITED dimensions internally. But I need to create a netCDF file that I subsequently can append to (along the time dimension, in this case). Can this be done?
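
For illustration, a minimal sketch of the kind of workflow being asked for, using netCDF4-python directly rather than xarray (file and variable names here are just placeholders): a file whose time dimension is UNLIMITED can later be reopened and grown along that dimension.

    import numpy as np
    import netCDF4

    # Create a file whose time dimension is UNLIMITED (a record dimension).
    with netCDF4.Dataset("obs.nc", "w") as nc:
        nc.createDimension("time", None)       # None -> unlimited
        nc.createDimension("station", 3)
        time = nc.createVariable("time", "f8", ("time",))
        temp = nc.createVariable("temp", "f4", ("time", "station"))
        time[0] = 0.0
        temp[0, :] = np.zeros(3)

    # Reopen in append mode; writing past the current end grows the record dimension.
    with netCDF4.Dataset("obs.nc", "a") as nc:
        n = len(nc.dimensions["time"])
        nc.variables["time"][n] = 1.0
        nc.variables["temp"][n, :] = np.ones(3)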

Most helpful comment

All - I'm actively working on this. I have an initial implementation and will get it cleaned up here in the next few days. Stay tuned.

All 18 comments

Currently it's not supported, but yes, we could absolutely add it as an option. I would be happy to add this functionality if someone makes a pull request. This won't be very useful for editing the file with xarray, of course, because we don't support editing netCDF files without making a complete copy.

👍 from me. In the broader netCDF ecosystem, it can be pretty important to have record dimensions for various reasons, appending being the main one.

xref: https://github.com/pydata/xarray/issues/678

I'm also a +1 on this.

OK, I'd be up for taking a shot at it.

Since it is per-variable and specific to netCDF, I guess the perfect place to add this is in the encoding dictionary that you can pass to to_netcdf, right? Maybe as key unlimited? E.g.

ds.to_netcdf(encoding={'time': dict(unlimited=True)})

I need to look up whether netCDF allows for defining more than one unlimited dimension, otherwise that must throw an error.

And then it is just about passing None as the length to createDimension, at least in netCDF4 and scipy.io.netcdf. But I did not look into how xarray handles that under the hood.
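
For reference, a quick sketch of that mechanism in both libraries (file names are placeholders): passing None as the length is what creates the unlimited/record dimension.

    import netCDF4
    from scipy.io import netcdf_file

    # netCDF4-python: a size of None makes the dimension unlimited.
    with netCDF4.Dataset("via_netcdf4.nc", "w") as nc:
        nc.createDimension("time", None)
        print(nc.dimensions["time"].isunlimited())   # True

    # scipy.io.netcdf: the same convention marks the record dimension.
    f = netcdf_file("via_scipy.nc", "w")
    f.createDimension("time", None)
    f.close()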

Yes, we could put this in encoding if we want to preserve it through reading/writing files. NetCDF4 supports multiple unlimited dimensions; netCDF3 does not.

I think it makes sense to preserve the UNLIMITED state through read/write. In my case, I subset a netCDF file along lat and lon dimensions, leaving the time dimension untouched and would therefore expect it to pass unchanged through xarray IO (staying UNLIMITED).

However, when the dataset is indexed/subset/resampled along the unlimited dimension, it would make sense that its state is dropped. But that would require a lot of ifs and buts, so I suggest we leave that aside for now.

But maybe the encoding dict is not the way to go after all, since it contains entries per variable, while it is the dimension that must be unlimited.

Currently the dataset variables can be created in any order and their necessary dimensions created whenever needed (in the set_necessary_dimensions function). I would not like to change that logic (e.g. towards creating all dimensions required by all variables first, before adding the data variables).

So how about a new kw argument to to_netcdf, like

ds.to_netcdf(unlimited_dimensions=['time'])

or

ds.to_netcdf(dimension_unlimited={'time': True})

(the second option being better for explicitly setting {'time': False})?

The above solution would not require much more than that. In set_necessary_dimensions,

    def set_necessary_dimensions(self, variable):
        for d, l in zip(variable.dims, variable.shape):
            if d not in self.dimensions:
                self.set_dimension(d, l)

would become

    def set_necessary_dimensions(self, variable):
        for d, l in zip(variable.dims, variable.shape):
            if d in self._unlimited_dimensions:
                l = None  # a length of None makes the backend create the dimension as UNLIMITED
            if d not in self.dimensions:
                self.set_dimension(d, l)
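
Whether that is sufficient depends on the backend; the sketch below assumes (as I believe is the case for the netCDF4 backend) that set_dimension simply forwards the length to createDimension, in which case a length of None already produces an UNLIMITED dimension.

    # Sketch, not the exact xarray source:
    def set_dimension(self, name, length):
        self.ds.createDimension(name, size=length)   # size=None -> unlimited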

@jhamman sorry, I only now saw that you pointed to a previous issue on the same topic (with exactly the same considerations). I did not find that issue when I searched (for "unlimited").

You were against changing to_netcdf then. Are you still?

However, when the dataset is indexed/subset/resampled along the unlimited dimension, it would make sense that its state is dropped. But that would require a lot of ifs and buts, so I suggest we leave that aside for now.

This is exactly how Variable.encoding currently works: any operation that creates a new variable from the original variable drops the encoding.

If we put this encoding information on the variable corresponding to the dimension, any time you save a Dataset using that exact same dimension variable, it would be saved as unlimited size. So if you only modify other dimensions (e.g., with resampling or indexing), the unlimited dimension would indeed persist, as you desire.
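
To make that concrete, a hypothetical usage sketch (the 'unlimited' encoding key is only the proposal under discussion here, not an existing option, and the file names are placeholders):

    # Hypothetical: mark the dimension's coordinate variable in its encoding.
    ds["time"].encoding["unlimited"] = True

    # Indexing along other dimensions leaves the time variable (and its encoding)
    # untouched, so the flag would survive this write.
    ds.isel(lat=slice(0, 10)).to_netcdf("subset.nc")

    # Operations along time create a new time variable and drop its encoding,
    # so the flag would be lost here.
    ds.isel(time=slice(0, 10)).to_netcdf("first_steps.nc")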

@shoyer -

I played around with this a bit yesterday. I have two implementation questions:

  • for a dataset with multiple variables, putting the dimension encoding on the Variable easily leads to conflicts. For example, one variable's encoding may say that dimension x is unlimited while another's doesn't. How should we handle these conflicts?
  • are we opposed to an encoding attribute on the Dataset object?

Agreed, it's awkward to have this information on variables.

I was somewhat opposed to adding more state to the Dataset object but it seems like the necessary solution here. I'm not sure we need it in the Dataset constructor though -- could just have encoding as an attribute you need to modify. Honestly, could probably do the same for DataArray.encoding -- it's pretty low level.

Great to see you are pushing forward on this issue @jhamman and @shoyer. I would really have liked to contribute here, but it seems like there are quite a few design choices to make, which are better left in your hands.

The ability to set an unlimited dimension would be really useful for me as well.

Useful for me too.

For the time being there's also this hack (I haven't tested it):
http://stackoverflow.com/questions/28598485/how-to-convert-fixed-size-dimension-to-unlimited-in-a-netcdf-file

All - I'm actively working on this. I have an initial implementation and will get it cleaned up here in the next few days. Stay tuned.

@jhamman : Great!
@chrisb13 : I've been using the StackOverflow hack and it does work.

For those of you who are interested in this feature, I'd appreciate your feedback on #1170.
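
For reference, the interface that eventually landed via #1170 lets the caller request unlimited dimensions directly at write time; a minimal usage sketch with current xarray releases (the file name is a placeholder):

    # Write the file with "time" as an UNLIMITED dimension.
    ds.to_netcdf("output.nc", unlimited_dims=["time"])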
