import timeit
import numpy as np
from pandas import DataFrame
from xray import Dataset, DataArray
df = DataFrame({"a": np.r_[np.arange(500.), np.arange(500.)],
"b": np.arange(1000.)})
print(timeit.repeat('df.groupby("a").agg("mean")', globals={"df": df}, number=10))
print(timeit.repeat('df.groupby("a").agg(np.mean)', globals={"df": df, "np": np}, number=10))
ds = Dataset({"a": DataArray(np.r_[np.arange(500.), np.arange(500.)]),
"b": DataArray(np.arange(1000.))})
print(timeit.repeat('ds.groupby("a").mean()', globals={"ds": ds}, number=10))
This outputs
[0.010462284000823274, 0.009770361997652799, 0.01081446700845845]
[0.02622630601399578, 0.024328112005605362, 0.018717073995503597]
[2.2804569930012804, 2.1666158599982737, 2.2688316510029836]
i.e. xray's groupby is ~100 times slower than pandas' (and ~200 times slower than passing "mean" to pandas' groupby, which I assume involves some specialization).
(This is the actual order of magnitude of the data size and redundancy I want to handle, i.e. thousands of points with very limited duplication.)
Yes, I'm afraid this is a known issue. Grouped aggregations are currently implemented with a loop in pure Python, which, of course, is pretty slow.
I've done some exploratory work to rewrite them in Numba, which shows some encouraging preliminary results:
from numba import guvectorize, jit
import pandas as pd
import numpy as np

@guvectorize(['(float64[:], int64[:], float64[:])'],
             '(x),(x),(y)', nopython=True)
def _grouped_mean(values, int_labels, target):
    # Accumulate NaN-skipping sums and counts per group, then normalize in place.
    count = np.zeros(len(target), np.int64)
    for i in range(len(values)):
        val = values[i]
        if not np.isnan(val):
            lab = int_labels[i]
            target[lab] += val
            count[lab] += 1
    target /= count

def move_axis_to_end(array, axis):
    array = np.asarray(array)
    return np.rollaxis(array, axis, start=array.ndim)

def grouped_mean(values, by, axis=-1):
    # Map group labels to integers, move the grouped axis last, and reduce.
    int_labels, uniques = pd.factorize(by, sort=True)
    values = move_axis_to_end(values, axis)
    target = np.zeros(values.shape[:-1] + uniques.shape)
    _grouped_mean(values, int_labels, target)
    return target, uniques

values = np.random.RandomState(0).rand(int(1e6))
values[::50] = np.nan
by = np.random.randint(50, size=int(1e6))
df = pd.DataFrame({'x': values, 'y': by})
np.testing.assert_allclose(grouped_mean(values, by)[0], df.groupby('y')['x'].mean())
%timeit grouped_mean(values, by)  # 100 loops, best of 3: 15.3 ms per loop
%timeit df.groupby('y').mean()    # 10 loops, best of 3: 21.4 ms per loop
Unfortunately, I'm unlikely to have time to work on this in the near future. If you or anyone else is interested in taking the lead on this, it would be greatly appreciated!
Note that we can't reuse the routines from pandas because they are only designed for 1D or at most 2D data.
In my case I could just switch to pandas, so I'll leave it as it is for now.
Yes, switching to pandas for these operations is certainly a recommended approach :).
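For example, one way to do that switch (a minimal sketch, reusing the small Dataset from the first post; the choice of grouping column is just that example's) is to convert to a DataFrame and aggregate there:
import pandas as pd
# Flatten the Dataset into a pandas DataFrame and let pandas do the groupby.
df = ds.to_dataframe()
result = df.groupby("a").mean()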
Perhaps worth mentioning in the docs? The difference turned out to be a major bottleneck in my code.
Agreed! If you'd like to make a pull request that would be greatly appreciated.
Another approach here (rather than writing something new with Numba) would be to write a pure NumPy engine for groupby that relies on reordering data and np.add.accumulate. This could yield performance within a factor of 2-3x of pandas. See this comment for an example:
https://github.com/numpy/numpy/issues/7265#issuecomment-198796408
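To illustrate the general idea: sort the values by group label so each group is contiguous, then do a segmented reduction. (The linked comment mentions np.add.accumulate; this sketch uses the closely related np.add.reduceat, and the function name is just a placeholder, not anything in xarray.)
import numpy as np

def grouped_mean_numpy(values, by):
    # Sort so that members of each group become contiguous.
    order = np.argsort(by, kind="mergesort")
    sorted_by = by[order]
    sorted_vals = values[order]
    # Start index of each group in the sorted arrays.
    starts = np.r_[0, np.flatnonzero(sorted_by[1:] != sorted_by[:-1]) + 1]
    uniques = sorted_by[starts]
    # Segmented sums and counts, then the per-group mean.
    sums = np.add.reduceat(sorted_vals, starts)
    counts = np.diff(np.r_[starts, len(sorted_vals)])
    return sums / counts, uniques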
In case anyone gets here by Googling something like "xarray groupby slow" and you loaded data from a netCDF file, be aware that slowness you see in groupby aggregation on a Dataset or DataArray may actually be due not to this issue but to the lazy loading that's done by default. This can be fixed by calling .load() on the Dataset or DataArray. See the Tip about lazy loading at http://xarray.pydata.org/en/stable/io.html#netcdf.
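A minimal sketch of that fix (the filename and the grouping key below are placeholders, not from this issue):
import xarray as xr

ds = xr.open_dataset("data.nc")   # "data.nc" is a placeholder; values are loaded lazily by default
ds.load()                          # pull the data into memory once, before grouping
monthly = ds.groupby("time.month").mean()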
I had a look at functions such as np.add.at, which can be much faster than a hand-rolled solution. The aggregate function from the numpy-groupies package is even faster (25x faster than np.add.at in my case).
Maybe xarray's groupby functionality could rely on such an efficient package.
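For reference, here is a small sketch of both approaches on made-up data, assuming the third-party numpy-groupies package is installed (the sizes and group count are arbitrary):
import numpy as np
import numpy_groupies as npg  # third-party "numpy-groupies" package

rng = np.random.RandomState(0)
values = rng.rand(10000)
group_idx = rng.randint(50, size=10000)

# Grouped mean with np.add.at (unbuffered in-place scatter-add).
sums = np.zeros(50)
np.add.at(sums, group_idx, values)
counts = np.bincount(group_idx, minlength=50)
means_at = sums / counts

# The same aggregation through numpy-groupies.
means_npg = npg.aggregate(group_idx, values, func="mean", size=50)
np.testing.assert_allclose(means_at, means_npg)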