Xarray: open_mfdataset in v.0.11.1 is very slow

Created on 8 Jan 2019 · 6Comments · Source: pydata/xarray

I have several repositories corresponding to different time periods of time.
Each repository contains several netCDF files with different variables.
Here is a simplified example: path/period1/ contains varsABC_period1.nc and varsDEF_period1.nc, and path/period2/ contains varsABC_period2.nc and varsDEF_period2.nc. All variables have the dimension T (time), but the other dimensions can be different (e.g., A has dimensions [T,X1,Y1], B has dimensions [T,X1,Y2], ...).
Before version v0.11.1 a was able to easily open a dataset as follow:

ds=xr.open_mfdataset('path\period*\*.nc', concat_dim='T')

However, open_mfdataset now fails. It actually takes forever without returning any output, so my guess is that it's just very very slow.
I've noticed that I'm able to create the dataset concatenating the files with the same variables first, then merging them:

dsABC=xr.open_mfdataset('path\period*\varsABC*.nc', concat_dim='T')
dsDEF=xr.open_mfdataset('path\period*\varsDEF*.nc', concat_dim='T')
ds=xr.merge([dsABC, dsDEF])

I can't find what's the update in v0.11.1 that is causing this behavior. Any idea?

Output of `xr.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.6 | packaged by conda-forge | (default, Oct 12 2018, 14:08:43)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]
python-bits: 64
OS: Linux
OS-release: 3.10.0-862.3.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
libhdf5: 1.10.4
libnetcdf: 4.6.2

xarray: 0.11.2
pandas: 0.23.4
numpy: 1.14.3
scipy: 1.1.0
netCDF4: 1.4.2
pydap: None
h5netcdf: None
h5py: 2.9.0
Nio: None
zarr: None
cftime: 1.0.3.4
PseudonetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.2.1
cyordereddict: None
dask: 1.0.0
distributed: 1.25.2
matplotlib: 3.0.0
cartopy: None
seaborn: 0.9.0
setuptools: 40.4.3
pip: 18.1
conda: 4.5.12
pytest: 3.8.2
IPython: 7.0.1
sphinx: 1.8.1

performance

Source

malmans2

All 6 comments

I'm not sure what might be causing this, but I wonder if you could help narrow it down a bit?

Can you for example see if it's making it past here? That would at least tell us if it is opening each of the datasets okay.

(Or even better: post some example datasets which will cause this problem?)

TomNicholas on 8 Jan 2019

I've created a little script to reproduce the problem.
@TomNicholas it looks like datasets are opened correctly. The problem arises when open_mfdatasets calls _auto_combine. Indeed, _auto_combine was introduced in v0.11.1

import numpy as np
import xarray as xr
import os
Tsize=100; T = np.arange(Tsize);
Xsize=900; X = np.arange(Xsize);
Ysize=800; Y = np.arange(Ysize) 
data = np.random.randn(Tsize, Xsize, Ysize)
for i in range(2):

    # Create 2 datasets with different variables
    dsA = xr.Dataset({'A': xr.DataArray(data, coords={'T': T+i*Tsize}, dims=('T', 'X', 'Y'))})
    dsB = xr.Dataset({'B': xr.DataArray(data, coords={'T': T+i*Tsize}, dims=('T', 'X', 'Y'))})

    # Save datasets in one folder
    dsA.to_netcdf('dsA'+str(i)+'.nc')
    dsB.to_netcdf('dsB'+str(i)+'.nc')

    # Save datasets in two folders
    dirname='rep'+str(i)
    os.mkdir(dirname)
    dsA.to_netcdf(dirname+'/'+'dsA'+str(i)+'.nc')
    dsB.to_netcdf(dirname+'/'+'dsB'+str(i)+'.nc')

Fast if netCDFs are stored in one folder:

%%time
ds_1folder = xr.open_mfdataset('*.nc',  concat_dim='T')

CPU times: user 49.9 ms, sys: 5.06 ms, total: 55 ms
Wall time: 59.7 ms

Slow if netCDFs are stored in several folders:

%%time
ds_2folders = xr.open_mfdataset('rep*/*.nc', concat_dim='T')

CPU times: user 8.6 s, sys: 5.95 s, total: 14.6 s
Wall time: 10.3 s

Fast if files containing different variables are opened separately, then merged:

%%time
ds_A = xr.open_mfdataset('rep*/dsA*.nc', concat_dim='T')
ds_B = xr.open_mfdataset('rep*/dsB*.nc', concat_dim='T')
ds_merged = xr.merge([ds_A, ds_B])

CPU times: user 33.8 ms, sys: 3.7 ms, total: 37.5 ms
Wall time: 34.5 ms

malmans2 on 14 Jan 2019

❤1

@malmans2 thanks for this reproducible test case!

From xarray's perspective, the difference is the order in which the arrays are concatenated/processed. This is determined by sorting the (globbed) file names:

In [16]: sorted(glob.glob('rep*/*.nc'))
Out[16]: ['rep0/dsA0.nc', 'rep0/dsB0.nc', 'rep1/dsA1.nc', 'rep1/dsB1.nc']

In [17]: sorted(glob.glob('*.nc'))
Out[17]: ['dsA0.nc', 'dsA1.nc', 'dsB0.nc', 'dsB1.nc']

It appears that the slow case [A0, B0, A1, B1] now requires computing data with dask, whereas [A0, A1, B0, B1] does not.

I suspect the issue is that we're now using some different combination of merge/concat. In particular it looks like the compute is being triggered from within merge. This sort of makes sense: if we're using merge instead of concat for joining along the dimension T, that is super slow because that goes through a path that checks arrays for conflicting values by loading data into memory (even though in this case that isn't possible, because the original coordinates were not overlapping).

We could (and should) optimize this path in merge to avoid eagerly loading data, but the immediate fix here is probably to make sure we're using concat instead of merge.

shoyer on 15 Jan 2019

Yes thankyou @malmans2, this is very helpful!

I suspect the issue is that we're now using some different combination of merge/concat.

This was very puzzling because the code is supposed to split the datasets up according to their data variables, which means merge won't be used to concatenate and this should be fast, as before.

But I found the problem! In _auto_combine_1d I should have sorted the datasets before attempting to group them by data variable, i.e. I needed a line

sorted_datasets = sorted(datasets, key=lambda ds: tuple(sorted(ds)))

before

grouped = itertools.groupby(sorted_datasets, key=lambda ds: tuple(sorted(ds)))

With this change then I get

# No longer slow if netCDFs are stored in several folders:    
%timeit ds_2folders = xr.open_mfdataset('rep*/*.nc', concat_dim='T')

9.35 ms ± 433 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Without this pre-sorting, itertools.groupby isn't guaranteed to do the grouping properly (in contrast to itertoolz.groupby which is), and as a result wasn't necessarily grouping the datasets by their variables. Then it wouldn't have finished concatenating along the dimension 'T' before it tried to merge everything back together.

Whether or not groupby sorted properly depended on the order of datasets in the input to groupby, which eventually depended on the way they were loaded (as the example in this issue makes clear).

The reason this mistake got past the unit tests is that auto_combine still gives the correct result in every case! Merge will still combine these datasets, but it will load their values in first to check that's okay, which was why it was ~1000 times slower. None of the unit tests checked performance though, and the tests I wrote were all supposed to be very fast, so this slowdown wasn't noticeable in any of them.

TomNicholas on 15 Jan 2019

❤1

I checked PR #2678 with the data that originated the issue and it fixes the problem!

malmans2 on 15 Jan 2019

🎉1

We have airspeedvelocity performance tests. I don't know there's one for auto_combine but maybe you can add one @TomNicholas