Xarray: Performance: numpy indexes small amounts of data 1000 faster than xarray

Created on 4 Mar 2019  Â·  28Comments  Â·  Source: pydata/xarray

Machine learning applications often require iterating over every index along some of the dimensions of a dataset. For instance, iterating over all the (lat, lon) pairs in a 4D dataset with dimensions (time, level, lat, lon). Unfortunately, this is very slow with xarray objects compared to numpy (or h5py) arrays. When the Pangeo machine learning working group met today, we found that several of us have struggled with this.

I made some simplified benchmarks, which show that xarray is about 1000 times slower than numpy when repeatedly grabbing a small amount of data from an array. This is a problem with both isel or [] indexing. After doing some profiling, the main culprits seem to be xarray routines like _validate_indexers and _broadcast_indexes.

While python will always be slower than C when iterating over an array in this fashion, I would hope that xarray could be nearly as fast as numpy. I am not sure what the best way to improve this is though.

performance

Most helpful comment

I simplified the benchmark:

from itertools import product

import numpy as np
import xarray as xr


shape = (10, 10, 10, 10)
index = (0, 0, 0, 0)
np_arr = np.ones(shape)
arr = xr.DataArray(np_arr)
named_index = dict(zip(arr.dims, index))

print(index)
print(named_index)

%timeit -n 1000 arr[index]
%timeit -n 1000 arr.isel(**named_index)
%timeit -n 1000 np_arr[index]
(0, 0, 0, 0)
{'dim_0': 0, 'dim_1': 0, 'dim_2': 0, 'dim_3': 0}
90.8 µs ± 5.12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
88.5 µs ± 2.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
115 ns ± 6.71 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)



md5-c5f4b9f64f84b3ae233eea205758fe9e



5680003 function calls (5630003 primitive calls) in 1.890 seconds

Ordered by: cumulative time

ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 1.890 1.890 {built-in method builtins.exec}
1 0.009 0.009 1.890 1.890 :1()
10000 0.011 0.000 1.881 0.000 dataarray.py:629(__getitem__)
10000 0.030 0.000 1.801 0.000 dataarray.py:988(isel)
10000 0.084 0.000 1.567 0.000 dataset.py:1842(isel)
10000 0.094 0.000 0.570 0.000 dataset.py:1746(_validate_indexers)
10000 0.029 0.000 0.375 0.000 variable.py:960(isel)
10000 0.013 0.000 0.319 0.000 variable.py:666(__getitem__)
20000 0.014 0.000 0.251 0.000 dataset.py:918(_replace_with_new_dims)
50000 0.028 0.000 0.245 0.000 variable.py:272(__init__)
10000 0.035 0.000 0.211 0.000 variable.py:487(_broadcast_indexes)
1140000/1100000 0.100 0.000 0.168 0.000 {built-in method builtins.isinstance}
10000 0.050 0.000 0.157 0.000 dataset.py:1802(_get_indexers_coords_and_indexes)
20000 0.025 0.000 0.153 0.000 dataset.py:868(_replace)
50000 0.085 0.000 0.152 0.000 variable.py:154(as_compatible_data)
```

Time breakdown:

Total | 1.881
-- | --
DataArray.__getitem__ | 0.080
DataArray.isel (_to_temp_dataset roundtrip) | 0.234
Dataset.isel | 0.622
Dataset._validate_indexers | 0.570
Variable.isel | 0.056
Variable.__getitem__ | 0.319

I can spot a few low-hanging fruits there:

  • huge amount of time spent on _validate_indexers
  • Why is variable__init__ being called 5 times?!? I expected 0.
  • The bench strongly hints at the fact that we're creating on the fly dummy IndexVariables
  • We're casting the DataArray to a Dataset, converting the positional index to a dict, then converting it back to positional for each variable. Maybe it's a good idea to rewrite DataArray.sel/isel so that they don't use _to_temp_dataset?

So in short while I don't think we can feasibly close the order-of-magnitude gap (800x) with numpy, I suspect we could get at least a 5x speedup here.

All 28 comments

cc @rabernat

While python will always be slower than C when iterating over an array in this fashion, I would hope that xarray could be nearly as fast as numpy. I am not sure what the best way to improve this is though.

I'm sure it's possible to optimize this significantly, but short of rewriting this logic in a lower level language it's pretty much impossible to match the speed of NumPy.

This benchmark might give some useful context:

def dummy_isel(*args, **kwargs):
  pass

def index_dummy(named_indices, arr):
    for named_index in named_indices:
        dummy_isel(arr, **named_index)
%%timeit -n 10
index_dummy(named_indices, arr)

On my machine, this is already twice as slow as your NumPy benchmark (497 µs vs 251 µs) , and all it's doing is parsing *args and **kwargs! Every Python function/method call involving keyword arguments adds about 0.5 ns of overhead, because the highly optimized dict is (relatively) slow compared to positional arguments. In my experience it is almost impossible to get the overhead of a Python function call below a few microseconds.

Right now we're at about 130 µs per indexing operation. In the best case, we might make this 10x faster but even that would be quite challenging, e.g., consider that even creating a DataArray takes about 20 µs.

Thanks so much @shoyer. I didn't realize there was that much overhead for a single function call. OTOH, 2x slower than numpy would be way better than 1000x.

After looking at the profiling info more, I tend to agree with your 10x maximum speed-up. A couple of particularly slow functions (e.g. Dataset._validate_indexers) account for about 75% of run time. However, the remaining 25% is split across several other pure python routines.

To be clear, pull requests improving performance (without significantly loss of readability) would be very welcome. Be sure to include a new benchmark in our benchmark suite.

Thanks for the benchmarks @nbren12, and for the clear explanation @shoyer

While we could do some performance work on that loop, I think we're likely to see a material change by enabling the external library to access directly from the array, without a looped python call. That's consistent with the ideas @jhamman had a few days ago.

@max-sixty I tend to agree this use case could be outside of the scope of xarray. It sounds like significant progress might require re-implementing core xarray objects in C/Cython. Without more than 10x improvement, I would probably just continue using numpy arrays.

You can always use xarray to process the data, and then extract the underlying array (da.values) for passing into something expecting an numpy array / for running fast(ish) loops (we do this frequently).

Sure, I've been using that as a workaround as well. Unfortunately, that approach throws away all the nice info (e.g. metadata, coordinate) that xarray objects have and requires duplicating much of xarray's indexing logic.

To put the relative speed of numpy access into perspective, I found this insightful: https://jakevdp.github.io/blog/2012/08/08/memoryview-benchmarks/ (it's now a few years out of date, but I think the fundamentals still stand)

Pasted from there:

Summary
Here are the timing results we've seen above:

Python + numpy: 6510 ms
Cython + numpy: 668 ms
Cython + memviews (slicing): 22 ms
Cython + raw pointers: 2.47 ms
Cython + memviews (no slicing): 2.45 ms

So if we're running an inner loop on an array, accessing it using numpy in python is an order of magnitude slower than accessing it using numpy in C (and that's an order of magnitude slower than using a slice, and that's an order of magnitude slower than using raw pointers)

So - let's definitely speed xarray up (your benchmarks are excellent, thank you again, and I think you're right there are opportunities for significant increases). But where speed is paramount above all else, we shouldn't use _any_ access in python, let alone the niceties of xarray access.

Cython + memoryviews isn't quite the right comparison here. I'm sure ordering here is correct, but relative magnitude of the performance difference should be smaller.

Xarray's core is bottlenecked on:

  1. Overhead of abstraction with normal Python operations (e.g., function calls) in non-numeric code (all the heavy numerics is offloaded to NumPy or pandas).
  2. The dynamic nature of our APIs, which means we need to do lots of type checking. Notice how high up builtins.isinstance appears in that performance profile!

C++ offers very low-cost abstraction but dynamism is still slow. Even then, compilers are much better at speeding up tight numeric loops than complex domain logic.

As a point of reference, it would be interesting to see these performance numbers running pypy, which I think should be able to handle everything in xarray. You'll note that pypy is something like 7x faster than CPython in their benchmark suite, which I suspect is closer to what we'd see if we wrote xarray's core in a language like C++, e.g., as Python interface to xframe.

Cython + memoryviews isn't quite the right comparison here.

Right, tbc, I'm only referring to the top two lines of the pasted benchmark; i.e. once we enter python (even if only to access a numpy array) we're already losing a lot of the speed relative to the loop staying in C / Cython. So even if xarray were a python front-end to a C++ library, it still wouldn't be competitive if performance were paramount.
...unless pypy sped that up; I'd be v interested to see.

It might be interesting to see, if pythran is an alternative to Cython. It seems like it handles high level numpy quite well, and would retain the readability of Python. Of course, it has its own issues...

But it seems like other libraries like e.g. scikit-image made some good experience with it.

Sadly I can't be of much help, as I lack experience (and most importantly time).

Pythran supports Python 2.7 and also has a decent Python 3 support.
[...]
Pythran now supports Python3 and can be installed as a regular Python3 program. Note however that Python3 support is still in early stage and compilation failure may happen. Report them!

This is _not_ a great start :(

It's the first time I hear about Pythran. At first sight it looks somewhat like a hybrid between Cython (for the ahead-of-time transpiling to C++) and numba (for having python-compatible syntax).

That said, I didn't see anything that hints at potential speedups on the python boilerplate code.

I already had experience with compiling pure-python code (tight __iter__ methods) with Cython, and got around 30% performance boost which - while nothing to scoff at - is not life-changing either.

This said, I'd have to spend more time on it to get a more informed opinion.

At first sight it looks somewhat like a hybrid between Cython (for the ahead-of-time transpiling to C++) and numba (for having python-compatible syntax).

Not really. Pythran always releases the GIL and does a bunch of optimizations between transpilation and compilations.

A good approach would be try out different compilers and see what performance is obtained, without losing readability (https://github.com/pydata/xarray/issues/2799#issuecomment-469444519). See scikit-image/scikit-image/issues/4199 where the package transonic was being experimentally tested to replace Cython-only code with python code + type hints. As a bonus, you get to switch between Cython, Pythran and Numba,

I simplified the benchmark:

from itertools import product

import numpy as np
import xarray as xr


shape = (10, 10, 10, 10)
index = (0, 0, 0, 0)
np_arr = np.ones(shape)
arr = xr.DataArray(np_arr)
named_index = dict(zip(arr.dims, index))

print(index)
print(named_index)

%timeit -n 1000 arr[index]
%timeit -n 1000 arr.isel(**named_index)
%timeit -n 1000 np_arr[index]
(0, 0, 0, 0)
{'dim_0': 0, 'dim_1': 0, 'dim_2': 0, 'dim_3': 0}
90.8 µs ± 5.12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
88.5 µs ± 2.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
115 ns ± 6.71 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)



md5-c5f4b9f64f84b3ae233eea205758fe9e



5680003 function calls (5630003 primitive calls) in 1.890 seconds

Ordered by: cumulative time

ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 1.890 1.890 {built-in method builtins.exec}
1 0.009 0.009 1.890 1.890 :1()
10000 0.011 0.000 1.881 0.000 dataarray.py:629(__getitem__)
10000 0.030 0.000 1.801 0.000 dataarray.py:988(isel)
10000 0.084 0.000 1.567 0.000 dataset.py:1842(isel)
10000 0.094 0.000 0.570 0.000 dataset.py:1746(_validate_indexers)
10000 0.029 0.000 0.375 0.000 variable.py:960(isel)
10000 0.013 0.000 0.319 0.000 variable.py:666(__getitem__)
20000 0.014 0.000 0.251 0.000 dataset.py:918(_replace_with_new_dims)
50000 0.028 0.000 0.245 0.000 variable.py:272(__init__)
10000 0.035 0.000 0.211 0.000 variable.py:487(_broadcast_indexes)
1140000/1100000 0.100 0.000 0.168 0.000 {built-in method builtins.isinstance}
10000 0.050 0.000 0.157 0.000 dataset.py:1802(_get_indexers_coords_and_indexes)
20000 0.025 0.000 0.153 0.000 dataset.py:868(_replace)
50000 0.085 0.000 0.152 0.000 variable.py:154(as_compatible_data)
```

Time breakdown:

Total | 1.881
-- | --
DataArray.__getitem__ | 0.080
DataArray.isel (_to_temp_dataset roundtrip) | 0.234
Dataset.isel | 0.622
Dataset._validate_indexers | 0.570
Variable.isel | 0.056
Variable.__getitem__ | 0.319

I can spot a few low-hanging fruits there:

  • huge amount of time spent on _validate_indexers
  • Why is variable__init__ being called 5 times?!? I expected 0.
  • The bench strongly hints at the fact that we're creating on the fly dummy IndexVariables
  • We're casting the DataArray to a Dataset, converting the positional index to a dict, then converting it back to positional for each variable. Maybe it's a good idea to rewrite DataArray.sel/isel so that they don't use _to_temp_dataset?

So in short while I don't think we can feasibly close the order-of-magnitude gap (800x) with numpy, I suspect we could get at least a 5x speedup here.

All those integer indexes were cast into Variables. #3375 stops that.

After #3375:

1.371 | TOTAL
-- | --
0.082 | DataArray.__getitem__
0.217 | DataArray.isel (_to_temp_dataset roundtrip)
0.740 | Dataset.isel
0.056 | Variable.isel
0.276 | Variable.__getitem__

The offending lines in Dataset.isel are these, and I strongly suspect they are improvable:

https://github.com/pydata/xarray/blob/4254b4af33843f711459e5242018cd1d678ad3a0/xarray/core/dataset.py#L1922-L1930

Great analysis, thanks

Do we have any idea of which of those lines are offending? I used a tool line_profiler a while ago, but maybe we know already (I'm guessing it's the two _replace_with_new_dims lines?)

I tried playing around with pypy 3.6.
Big fat disclaimer: I did not run any of the xarray unit tests. Expect trouble if you do.

1.

#!/bin/bash

set -o errexit
set -o pipefail
set -o nounset
set -o xtrace

tar -xvjf Downloads/pypy3.6-v7.1.1-linux64.tar.bz2
cd pypy3.6-v7.1.1-linux64/bin
./pypy3 -m ensurepip
./pip3.6 install -U pip wheel
./pip list | awk 'NR > 2 {print $1}' | grep -v greenlet | xargs ./pip install -U

# sudo apt-get install libopenblas-dev gfortran
./pip install numpy pandas xarray
  1. to work around https://bitbucket.org/pypy/pypy/issues/3087/collectionsabc-__init_subclass__-failure, edit xarray/core/common.py and delete AttrAccessMixin.__init_subclass__

  2. timeit is unreliable in pypy. I modified the benchmark as follows:

import time

import numpy as np
import xarray as xr


shape = (10, 10, 10, 10)
index = (0, 0, 0, 0)
np_arr = np.ones(shape)
arr = xr.DataArray(np_arr)


N = 10000

def bench_slice(obj):
    for _ in range(4):
        t0 = time.time()
        for _ in range(N):
            obj[index]
        t1 = time.time()
        t_ns = (t1 - t0) / N * 1e9
        print(f"{t_ns:6.0f} ns {obj.__class__.__name__}")

bench_slice(arr)
bench_slice(np_arr)

Benchmark outputs:
CPython 3.7:

 93496 ns DataArray
 92732 ns DataArray
 92560 ns DataArray
 93427 ns DataArray
   119 ns ndarray
   121 ns ndarray
   122 ns ndarray
   119 ns ndarray

PyPy 7.1 3.6:

113273 ns DataArray
 38543 ns DataArray
 34797 ns DataArray
 39453 ns DataArray
   386 ns ndarray
   289 ns ndarray
   329 ns ndarray
   413 ns ndarray

Big important reminder: all results are for a very small array. I would expect the gap between CPython and pypy to get narrower in % (both for numpy and xarray) as the array size gets larger and more time is spent in the pure C numpy code.

I suspect system jitter in the profiling as the time for Dataset.isel went up. It would be useful to run sudo python -m pyperf system tune before running profiler/benchmarks.

Hmm, slicing should basically be a no-op.

The fact that xarray makes it about 100x slower is a real killer. It seems from this conversation that it might be hard to workaround

import xarray as xr
import numpy as np

n = np.zeros(shape=(1024, 1024))

x = xr.DataArray(n, dims=('y', 'x'))

the_slice = np.s_[256:512, 256:512]

%timeit n[the_slice]
%timeit x[the_slice]
186 ns ± 0.778 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
70.3 µs ± 593 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

TBC I think there's plenty we could do with relatively little complexity to speed up indexing operations on DataArrays. As an example, we could avoid the roundtrip to a temporary Dataset.

That's a different problem from making xarray as fast as indexing a numpy array, or allowing libraries to iterate through a DataArray in a hot loop.

Sure, I just wanted to make the note that this operation should be more or less constant time, as opposed to dependent on the size of the array. Somebody had mentionned it should increase with the size of the array.

Sure, I just wanted to make the note that this operation should be more or less constant time, as opposed to dependent on the size of the array.

Yes, I think this is still the case for slicing in xarray. There's just much larger constant overhead than in NumPy. (And this is difficult to fix short of rewriting xarray's core in C.)

One note: if you're indexing into a dataarray and don't care about the coords, index into the variable. 2x numpy time, rather than 30x:

In [26]: da = xr.tutorial.open_dataset('air_temperature')['air']

In [27]: da
Out[27]:
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)>
[3869000 values with dtype=float32]
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]



In [20]: %timeit da.variable[0]
28.2 µs ± 2.29 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [21]: %timeit da[0]
459 µs ± 37.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [22]: %timeit da.variable.values[0]
14.1 µs ± 183 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

This variable workaround is awesome @max-sixty. Are there any guidelines on when to use Variable vs DataArray? Some calculations (e.g. fast difference and derivatives/stencil operations) seem cleaner without explicit coordinate labels.

That's great that's helpful @nbren12 . Maybe we should add to docs (we don't really have a performance section at the moment, maybe we start something on performance tips?)

There's some info on the differences in the Terminology that @gwgundersen wrote: https://github.com/pydata/xarray/blob/master/doc/terminology.rst#L18

Essentially: by indexing on the variable, you ignore the coordinates, and so skip a bunch of code that takes the object apart and puts it back together. A variable is much more similar to a numpy array, so you can't do sel, for example.

3533 closes the gap between DataArray and numpy from 500x slower to "just" 100x slower :)

Was this page helpful?
0 / 5 - 0 ratings

Related issues

jhamman picture jhamman  Â·  5Comments

zxdawn picture zxdawn  Â·  3Comments

phausamann picture phausamann  Â·  3Comments

andrewpauling picture andrewpauling  Â·  3Comments

d-chambers picture d-chambers  Â·  4Comments