Xarray: Awkward array backend?

Created on 29 Jul 2020 · 3 Comments · Source: pydata/xarray

Just curious if anyone here has thoughts on this.

For more context: Awkward is like NumPy, but for arrays with arbitrary (dynamic) structure.

I don't know much yet about that library (I've just seen this SciPy 2020 presentation), but now I could imagine using xarray for dealing with labelled collections of geometrical / geospatial objects like polylines or polygons.

At this stage, any integration between xarray and awkward arrays would be something highly experimental, but I think this might be an interesting case for flexible arrays (and possibly flexible indexes) mentioned in the roadmap. There is some discussion here: https://github.com/scikit-hep/awkward-1.0/issues/27.

Does anyone see any other potential use case?

cc @pydata/xarray

All 3 comments

I'm linking myself here, to follow this: @jpivarski.

I think that xarray should offer a "compatibility test toolkit" to any NumPy-like, NEP 18-compatible library that wants to integrate with it.
Instead of having one module full of tests specifically for pint, one for sparse, one for cupy, one for awkward, and so on, those projects could just write a minimal test module like this:

import xarray
import sparse

xarray.testing.test_nep18_module(
    sparse,
    # TODO: lambda to create an array
    # TODO: list of xfails 
)

which would automatically expand into a comprehensive suite of tests thanks to pytest parametrize/fixture magic.
This would allow developers of NumPy-like libraries to test their package against what's expected from a generic NEP 18-compliant package.
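To make the idea a bit more concrete, here is a minimal pure-Python sketch of how such a toolkit could expand one array factory plus a list of xfails into a set of named checks. Everything in it is hypothetical: `xarray.testing.test_nep18_module` is only a proposal in this thread, and `run_compat_checks`, `CHECKS`, the check names, and `ToyArray` are made up for illustration. A real toolkit would presumably be built on pytest and cover far more than attribute checks.

```python
from typing import Callable, Dict, Iterable


# Hypothetical checks; each takes a freshly created array-like object.
def check_has_shape(arr) -> None:
    assert isinstance(arr.shape, tuple)


def check_has_dtype(arr) -> None:
    assert arr.dtype is not None


def check_ndim_matches_shape(arr) -> None:
    assert arr.ndim == len(arr.shape)


def check_array_function_protocol(arr) -> None:
    # NEP 18 libraries implement __array_function__ on the array type.
    assert hasattr(type(arr), "__array_function__")


CHECKS: Dict[str, Callable] = {
    "has_shape": check_has_shape,
    "has_dtype": check_has_dtype,
    "ndim_matches_shape": check_ndim_matches_shape,
    "array_function_protocol": check_array_function_protocol,
}


def run_compat_checks(create_array: Callable, xfails: Iterable[str] = ()) -> Dict[str, str]:
    """Run every check against a new array and report one outcome per check."""
    expected_failures = set(xfails)
    results = {}
    for name, check in CHECKS.items():
        try:
            check(create_array())
            outcome = "pass"
        except (AssertionError, AttributeError):
            outcome = "fail"
        if name in expected_failures:
            # Mirror pytest's vocabulary for expected failures.
            outcome = "xfail" if outcome == "fail" else "xpass"
        results[name] = outcome
    return results


class ToyArray:
    """Stand-in for a third-party array type (not a real library)."""
    shape = (2, 3)
    dtype = "float64"
    ndim = 2


# ToyArray lacks __array_function__, so that check is declared as an xfail.
print(run_compat_checks(ToyArray, xfails=["array_function_protocol"]))
```

The downstream library only supplies the factory and the xfail list; the set of checks lives in one place, which is the point of the proposal.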

Copied from https://gitter.im/pangeo-data/Lobby :

I've been using Xarray with argopy recently, and the immediate value I see is the documentation of columns, which is semi-lacking in Awkward (one user has been passing this information through an Awkward tree as a scikit-hep/awkward-1.0#422). I should also look into Xarray's indexing, which I've always seen as being the primary difference between NumPy and Pandas; Awkward Array has no indexing, though every node has an optional Identities which would be used to track such information through Awkward manipulations. Identities would have a bijection with externally supplied indexes. They haven't been used for anything yet.

Although the elevator pitch for Xarray is "n-dimensional Pandas," it's rather different, isn't it? The contextual metadata is more extensive than anything I've seen in Pandas, and Xarray can be partitioned for out-of-core analysis: Xarray wraps Dask, unlike Dask's array collection, which wraps NumPy. I had trouble getting Pandas to wrap Awkward Arrays (scikit-hep/awkward-1.0#350), but maybe these won't be issues for Xarray.

One last thing (in this very rambly message): I think the main difficulty would be that Awkward Arrays don't have shape and dtype, since those describe a rectilinear array of numbers. The data model is Datashape plus union types. There is a sense in which ndim is defined: the number of nested lists before reaching the first record, which may split into different depths for each field. But even this can be ill-defined with union types:

>>> import awkward1 as ak
>>> array = ak.Array([1, 2, [3, 4, 5], [[6, 7, 8]]])
>>> array
<Array [1, 2, [3, 4, 5], [[6, 7, 8]]] type='4 * union[int64, var * union[int64, ...'>
>>> array.type
4 * union[int64, var * union[int64, var * int64]]
>>> array.ndim
-1

So if we wanted to have an Xarray of Awkward Arrays, we'd have to take stock of all the assumptions Xarray makes about the arrays it contains.
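The ill-defined ndim in the session above can be mimicked with plain Python lists. This is only a toy model and not how Awkward computes `ndim`: it ignores records entirely, and `leaf_depths` and `naive_ndim` are hypothetical helper names. It just counts how deeply each leaf value is nested and gives up (returning -1, as in the output above) when the leaves disagree.

```python
def leaf_depths(value, depth=0):
    """Return the set of list-nesting depths at which non-list leaves occur."""
    if isinstance(value, list):
        depths = set()
        for item in value:
            depths |= leaf_depths(item, depth + 1)
        return depths
    return {depth}


def naive_ndim(value):
    """Return the common nesting depth, or -1 if there is no single depth."""
    depths = leaf_depths(value)
    # Empty input or leaves at several depths: no well-defined answer.
    return depths.pop() if len(depths) == 1 else -1


# Ragged but uniformly nested: every leaf sits two lists deep.
print(naive_ndim([[1, 2], [3]]))  # 2

# The union-typed example from above: leaves sit at depths 1, 2 and 3.
print(naive_ndim([1, 2, [3, 4, 5], [[6, 7, 8]]]))  # -1
```

A ragged array can still have a meaningful depth, so variable-length lists alone are not the problem; it is the mix of depths under a union type that leaves nothing sensible to report.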
