Just curious if anyone here has thoughts on this.
For more context: Awkward is like numpy but for arrays of very arbitrary (dynamic) structure.
I don't know much yet about that library (I've just seen this SciPy 2020 presentation), but now I could imagine using xarray for dealing with labelled collections of geometrical / geospatial objects like polylines or polygons.
At this stage, any integration between xarray and awkward arrays would be something highly experimental, but I think this might be an interesting case for flexible arrays (and possibly flexible indexes) mentioned in the roadmap. There is some discussion here: https://github.com/scikit-hep/awkward-1.0/issues/27.
Does anyone see any other potential use case?
cc @pydata/xarray
I'm linking myself here, to follow this: @jpivarski.
I think that xarray should offer a "compatibility test toolkit" to any numpy-like, NEP18-compatible library that wants to integrate with it.
Instead of xarray carrying one module full of tests specifically for pint, one for sparse, one for cupy, one for awkward, and so on, each of those projects could write a minimal test module like this:
import xarray
import sparse

xarray.testing.test_nep18_module(
    sparse,
    # TODO: lambda to create an array
    # TODO: list of xfails
)
which would automatically expand into a comprehensive suite of tests thanks to pytest parametrize/fixture magic.
This would allow developers of numpy-like libraries to test their package against what's expected from a generic NEP-18 compliant package.
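To make the idea a bit more concrete, here is a rough sketch of what such a toolkit could do under the hood. Everything in it is hypothetical: `run_compat_checks`, the `CHECKS` table and `FakeDuckArray` are names made up for illustration, the checklist is my guess at a reasonable minimum rather than anything taken from xarray, and a real version would presumably generate pytest-parametrized tests instead of looping by hand.

```python
# Hypothetical checklist of duck-array expectations (an assumption for
# illustration, not copied from xarray's source).
CHECKS = {
    "has_shape": lambda a: hasattr(a, "shape"),
    "has_dtype": lambda a: hasattr(a, "dtype"),
    "has_ndim": lambda a: hasattr(a, "ndim"),
    "has_array_function": lambda a: hasattr(a, "__array_function__"),
    "has_array_ufunc": lambda a: hasattr(a, "__array_ufunc__"),
}


def run_compat_checks(make_array, xfails=()):
    """Run every check on a fresh array and return {check_name: status}."""
    report = {}
    for name, check in CHECKS.items():
        passed = bool(check(make_array()))
        if name in xfails:
            # Expected failures are reported separately, as pytest would do.
            report[name] = "xpass" if passed else "xfail"
        else:
            report[name] = "pass" if passed else "fail"
    return report


class FakeDuckArray:
    """Stand-in for a third-party array type; it lacks __array_ufunc__."""

    shape = (2, 3)
    dtype = "float64"
    ndim = 2

    def __array_function__(self, func, types, args, kwargs):
        return NotImplemented


# The downstream project supplies a factory and its list of known failures.
report = run_compat_checks(FakeDuckArray, xfails={"has_array_ufunc"})
for name, status in report.items():
    print(f"{name}: {status}")
```

The point of the design is that the downstream library only provides the array factory and its xfail list, while the list of expectations lives in one place.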
Copied from https://gitter.im/pangeo-data/Lobby:
I've been using Xarray with argopy recently, and the immediate value I see is the documentation of columns, which is semi-lacking in Awkward (one user has been passing this information through an Awkward tree; see scikit-hep/awkward-1.0#422). I should also look into Xarray's indexing, which I've always seen as being the primary difference between NumPy and Pandas. Awkward Array has no indexing, though every node has an optional Identities, which would be used to track such information through Awkward manipulations; Identities would have a bijection with externally supplied indexes. They haven't been used for anything yet.
Although the elevator pitch for Xarray is "n-dimensional Pandas," it's rather different, isn't it? The contextual metadata is more extensive than anything I've seen in Pandas, and Xarray can be partitioned for out-of-core analysis: Xarray wraps Dask, whereas Dask's array collection wraps NumPy. I had trouble getting Pandas to wrap Awkward Array (scikit-hep/awkward-1.0#350), but maybe these won't be issues for Xarray.
One last thing (in this very rambly message): I think the main difficulty would be that Awkward Arrays don't have shape and dtype, since those define a rectilinear array of numbers. The data model is Datashape plus union types. There is a sense in which ndim is defined: the number of nested lists before reaching the first record, which may split it into different depths for each field. But even this can be ill-defined with union types:
>>> import awkward1 as ak
>>> array = ak.Array([1, 2, [3, 4, 5], [[6, 7, 8]]])
>>> array
<Array [1, 2, [3, 4, 5], [[6, 7, 8]]] type='4 * union[int64, var * union[int64, ...'>
>>> array.type
4 * union[int64, var * union[int64, var * int64]]
>>> array.ndim
-1
So if we wanted to have an Xarray of Awkward Arrays, we'd have to take stock of all the assumptions Xarray makes about the arrays it contains.
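For intuition about why ndim comes out as -1 above, here is a toy pure-Python version of the same idea. It is only a sketch on plain nested lists, not how Awkward computes it; `leaf_depths` and `naive_ndim` are hypothetical helper names, and records are ignored entirely.

```python
def leaf_depths(obj, depth=0):
    """Collect the set of nesting depths at which non-list values occur."""
    if isinstance(obj, list):
        depths = set()
        for item in obj:
            depths |= leaf_depths(item, depth + 1)
        return depths
    return {depth}


def naive_ndim(obj):
    """Return the common leaf depth, or -1 if the leaves disagree.

    In this toy version a structure with no leaves at all also gives -1,
    because its depth cannot be determined from the data.
    """
    depths = leaf_depths(obj)
    if len(depths) == 1:
        return next(iter(depths))
    return -1


print(naive_ndim([[1, 2], [3]]))                     # regular nesting
print(naive_ndim([1, 2, [3, 4, 5], [[6, 7, 8]]]))    # mixed depths
```

The first call finds every number at depth 2, so there is a single well-defined answer. The second finds numbers at depths 1, 2 and 3, which is the union-type situation in the example above, so no single ndim exists.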