Xarray: Add public API for Dataset._copy_listed

Created on 26 Mar 2020  Â·  13Comments  Â·  Source: pydata/xarray

In my data pipelines, I have been repeatedly burned using indexing notation to grab a few variables from a dataset in the following way:

ds = xr.Dataset(...)
vars = ('a' , 'b', 'c')
ds[vars] # this errors
ds[list(vars)] # this is ok

Moreover, because Dataset__getitem__ is type unstable, it makes it hard to detect this kind of error using mypy, so it often appears 30 minutes into a long data pipeline. It would be great to have a type-stable method that can take any sequence of variable names and return the Dataset consisting of those variables and their coordinates only. In fact, this method already exists, but it currently not public API. Could we make it so? Thanks.

Most helpful comment

IIUC, what we're discussing here is adding a new method that treats all sequences the same (__getitem__ won't be touched, except maybe the type hints).

All 13 comments

Would that be different from ensuring the input is a list?

Moreover, because Dataset__getitem__ is type unstable,

I very much empathize with the pain from methods being type unstable; indeed I think that's one of the biggest benefits of xarray over pandas. Here, it's stable _over the same typed inputs_. i.e. if supplied with a list, it returns with a dataset, otherwise it returns a DataArray. (or am I missing something?)

it makes it hard to detect this kind of error using mypy

Is there a way in mypy we could use something like overload to specify the above contract here, as an alternative to another method?

What's the reasoning for not returning a Dataset when __getitem__ is passed an Iterable like _copy_listed?

I agree, this API is too overloaded. It would be better to have an explicit method for subsetting Dataset variables. Maybe subset or sel_vars?

In early versions of xarray (back when it was called xray), we actually had a select method but I was concerned it was too confusing with indexing: http://xarray.pydata.org/en/v0.1.1/generated/xray.Dataset.select.html#xray.Dataset.select

What's the reasoning for not returning a Dataset when __getitem__ is passed an Iterable like _copy_listed?

The current check uses hashability to determine whether to try to make a DataArray. In theory, you could put a variable with the name ('a', 'b', 'c') into a Dataset.

Is there a way in mypy we could use something like overload to specify the above contract here, as an alternative to another method?

That is correct. The output type is predictable from the inputs types. With #4144, mypy may have a chance at detecting this kind of error. Still, I don't know how many users use type-checking. I expect most will only discover this problem at runtime.

It would be better to have an explicit method for subsetting Dataset variables.

I agree. sel_vars is more clear IMO since subset could apply to the coordinates too e.g. a spatial subset.

Or maybe "get" since it's a synonym of "select" that isn't overloaded with spatial indexing in the code base.

NVM, get is already a method. Maybe we could overload it?

It would be better to have an explicit method for subsetting Dataset variables.

I agree. sel_vars is more clear IMO since subset could apply to the coordinates too e.g. a spatial subset.

We did a similar splitting of functionality recently with drop, into drop_vars and drop_sel.

So this would leave us with:

  • sel/drop_sel for indices
  • sel_vars/drop_vars for variables

The naming doesn't have an obvious pattern here, which seems non-ideal. I can't think of anything much better at the moment, but perhaps it would help to avoid reusing sel. Maybe get_vars or subset_vars?

I think avoiding sel is a good idea (but no strong thoughts here).

how about pick_vars or take_vars?

I do think having a Dataset behave similarly to a dict / Mapping is generally good, and allows new users to bring existing understandings around those data structures to the xarray data model.

I recognize that a hashable iterable (e.g. ('a', 'b', 'c') is an annoying corner case, though.

A level down, re the name — I thought .vars might be decent — but potentially it's too similar to .variables — which is a mapping to the actual variables (and can't take multiple selections)

Would this proposal mean that subsetting variables with __getitem__ would no longer work? If so, I'd make the humble request as a downstream user/library contributor for it to have a very generous deprecation period, since it is core functionality I rely on a lot (and include in many tutorials).

IIUC, what we're discussing here is adding a new method that treats all sequences the same (__getitem__ won't be touched, except maybe the type hints).

At most, I would require using the new method if you want your code to type-check properly.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

benbovy picture benbovy  Â·  3Comments

Yefee picture Yefee  Â·  4Comments

d-chambers picture d-chambers  Â·  4Comments

jbusecke picture jbusecke  Â·  4Comments

zxdawn picture zxdawn  Â·  3Comments