I have encountered some major performance bottlenecks in trying to write and then read multi-file netcdf datasets.
I start with an xarray dataset created by xgcm with the following repr:
<xarray.Dataset>
Dimensions: (XC: 400, XG: 400, YC: 400, YG: 400, Z: 40, Zl: 40, Zp1: 41, Zu: 40, layer_1TH_bounds: 43, layer_1TH_center: 42, layer_1TH_interface: 41, time: 1566)
Coordinates:
iter (time) int64 8294400 8294976 8295552 8296128 ...
* time (time) int64 8294400 8294976 8295552 8296128 ...
* XC (XC) >f4 2500.0 7500.0 12500.0 17500.0 22500.0 ...
* YG (YG) >f4 0.0 5000.0 10000.0 15000.0 20000.0 25000.0 ...
* XG (XG) >f4 0.0 5000.0 10000.0 15000.0 20000.0 25000.0 ...
* YC (YC) >f4 2500.0 7500.0 12500.0 17500.0 22500.0 ...
* Zu (Zu) >f4 -10.0 -20.0 -30.0 -42.0 -56.0 -72.0 -91.0 ...
* Zl (Zl) >f4 0.0 -10.0 -20.0 -30.0 -42.0 -56.0 -72.0 ...
* Zp1 (Zp1) >f4 0.0 -10.0 -20.0 -30.0 -42.0 -56.0 -72.0 ...
* Z (Z) >f4 -5.0 -15.0 -25.0 -36.0 -49.0 -64.0 -81.5 ...
rAz (YG, XG) >f4 2.5e+07 2.5e+07 2.5e+07 2.5e+07 ...
dyC (YG, XC) >f4 5000.0 5000.0 5000.0 5000.0 5000.0 ...
rAw (YC, XG) >f4 2.5e+07 2.5e+07 2.5e+07 2.5e+07 ...
dxC (YC, XG) >f4 5000.0 5000.0 5000.0 5000.0 5000.0 ...
dxG (YG, XC) >f4 5000.0 5000.0 5000.0 5000.0 5000.0 ...
dyG (YC, XG) >f4 5000.0 5000.0 5000.0 5000.0 5000.0 ...
rAs (YG, XC) >f4 2.5e+07 2.5e+07 2.5e+07 2.5e+07 ...
Depth (YC, XC) >f4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
rA (YC, XC) >f4 2.5e+07 2.5e+07 2.5e+07 2.5e+07 ...
PHrefF (Zp1) >f4 0.0 98.1 196.2 294.3 412.02 549.36 706.32 ...
PHrefC (Z) >f4 49.05 147.15 245.25 353.16 480.69 627.84 ...
drC (Zp1) >f4 5.0 10.0 10.0 11.0 13.0 15.0 17.5 20.5 ...
drF (Z) >f4 10.0 10.0 10.0 12.0 14.0 16.0 19.0 22.0 ...
hFacC (Z, YC, XC) >f4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
hFacW (Z, YC, XG) >f4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
hFacS (Z, YG, XC) >f4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
* layer_1TH_bounds (layer_1TH_bounds) >f4 -0.2 0.0 0.2 0.4 0.6 0.8 1.0 ...
* layer_1TH_interface (layer_1TH_interface) >f4 0.0 0.2 0.4 0.6 0.8 1.0 ...
* layer_1TH_center (layer_1TH_center) float32 -0.1 0.1 0.3 0.5 0.7 0.9 ...
Data variables:
T (time, Z, YC, XC) float32 0.0 0.0 0.0 0.0 0.0 0.0 ...
U (time, Z, YC, XG) float32 0.0 0.0 0.0 0.0 0.0 0.0 ...
V (time, Z, YG, XC) float32 0.0 0.0 0.0 0.0 0.0 0.0 ...
S (time, Z, YC, XC) float32 0.0 0.0 0.0 0.0 0.0 0.0 ...
Eta (time, YC, XC) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
W (time, Zl, YC, XC) float32 -0.0 -0.0 -0.0 -0.0 -0.0 ...
An important point to note is that there are lots of "non-dimension coordinates" corresponding to various parameters of the numerical grid.
I save this dataset to a multi-file netCDF dataset as follows:
iternums, datasets = zip(*ds.groupby('time'))
paths = [outdir + 'xmitgcm_data.%010d.nc' % it for it in iternums]
xr.save_mfdataset(datasets, paths)
This takes many hours to run, since it has to read and write all the data. (I think there are some performance issues here too, related to how dask schedules the read / write tasks, but that is probably a separate issue.)
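For reference, here is a toy, in-memory reproduction of the save pattern above (hypothetical data, three timesteps instead of 1566). It shows what groupby('time') actually yields: one single-timestep Dataset per group, which save_mfdataset would then write to one file each.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for the real dataset: one data variable, three timesteps.
ds = xr.Dataset(
    {'T': (('time', 'XC'), np.arange(6.0).reshape(3, 2))},
    coords={'time': [10, 20, 30], 'XC': [2500.0, 7500.0]},
)

# groupby('time') yields (time_value, single-timestep Dataset) pairs.
iternums, datasets = zip(*ds.groupby('time'))
print([int(it) for it in iternums])  # [10, 20, 30]

# Depending on the xarray version, the 'time' dimension may be squeezed out
# of each piece -- which is why reloading later needs concat_dim='time'.
# paths = ['xmitgcm_data.%010d.nc' % it for it in iternums]
# xr.save_mfdataset(datasets, paths)
```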
Then I try to re-load this dataset
ds_nc = xr.open_mfdataset('xmitgcm_data.*.nc')
This raises an error:
ValueError: too many different dimensions to concatenate: {'YG', 'Z', 'Zl', 'Zp1', 'layer_1TH_interface', 'YC', 'XC', 'layer_1TH_center', 'Zu', 'layer_1TH_bounds', 'XG'}
I need to specify concat_dim='time' in order to properly concatenate the data. It seems like this should be unnecessary, since I am reading back data that was just written with xarray, but I understand why: the dimensions of the Data Variables in each file are just Z, YC, XC, with no time dimension. Once I do that, it works, but it takes 18 minutes to load the dataset. I assume this is because it has to check the compatibility of all the non-dimension coordinates.
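One workaround sketch (hedged: GRID_COORDS is just a sample of the static grid coordinates from the repr above, not a complete list): dropping them in a preprocess callback means open_mfdataset never has to compare them across all 1566 files.

```python
import numpy as np
import xarray as xr

# Hypothetical subset of the static (non-dimension) grid coordinates.
GRID_COORDS = ['rA', 'Depth', 'hFacC']

def drop_grid(ds):
    """Remove static grid coordinates that are identical in every file."""
    return ds.drop_vars([c for c in GRID_COORDS if c in ds.coords])

# The actual reload would look like (paths as above):
# ds_nc = xr.open_mfdataset('xmitgcm_data.*.nc', concat_dim='time',
#                           preprocess=drop_grid)

# Demonstrated on a small in-memory stand-in for one file:
one_file = xr.Dataset(
    {'T': (('YC', 'XC'), np.zeros((2, 2)))},
    coords={'YC': [2500.0, 7500.0], 'XC': [2500.0, 7500.0],
            'rA': (('YC', 'XC'), np.full((2, 2), 2.5e7))},
)
print(sorted(drop_grid(one_file).coords))  # ['XC', 'YC']
```

The dropped coordinates can always be merged back in afterwards from a single file.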
I just thought I would document this, because 18 minutes seems way too long to load a dataset.
My strong suspicion is that the bottleneck here is xarray checking all the coordinates for equality in concat, when deciding whether to add a "time" dimension or not.
Try passing coords='minimal' and see if that speeds things up. See the concat documentation for details:
http://xarray.pydata.org/en/stable/generated/xarray.concat.html#xarray.concat
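An in-memory sketch of that suggestion: with coords='minimal', concat only concatenates coordinates that already contain the concat dimension, so a static grid coordinate like rA is carried through from the first dataset instead of being compared element-wise and stacked along time.

```python
import numpy as np
import xarray as xr

# Three single-timestep datasets, each carrying the same static 'rA' coordinate.
pieces = [
    xr.Dataset(
        {'T': (('YC', 'XC'), np.full((2, 2), float(t)))},
        coords={'YC': [2500.0, 7500.0], 'XC': [2500.0, 7500.0],
                'rA': (('YC', 'XC'), np.full((2, 2), 2.5e7)),
                'time': t},
    )
    for t in range(3)
]

# coords='minimal' skips the per-coordinate "does this vary?" comparison.
combined = xr.concat(pieces, dim='time', coords='minimal')
print(combined.sizes['time'])  # 3
print(combined['rA'].dims)     # ('YC', 'XC')
```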
This was a convenient check for small/in-memory datasets but possibly it's not a good one going forward. It's generally slow to load all the coordinate data for comparisons, but it's even worse with the current implementation, which computes pair-wise comparisons of arrays with dask instead of doing them in parallel all at once.
coords is not a valid kwarg for open_mfdataset
http://xarray.pydata.org/en/latest/generated/xarray.open_mfdataset.html#xarray-open-mfdataset
Indeed, it's not. We should add some way to pipe these arguments through auto_combine on to concat.
This sounds like the kind of thing I could manage.
I'm running into the same problem as Ryan, regarding the ValueError. However, when I try the same fix
ds = xr.open_mfdataset(files[:4], concat_dim = 'time')
I get the error
TypeError: Must pass list-like as `names`.
Apologies if this should be on a different chain, but any idea what might be going on?
@karenamckinnon could you please share a traceback for the error?
TypeErrorTraceback (most recent call last)
<ipython-input-152-52564f498ac3> in <module>()
----> 1 ds = xr.open_mfdataset(files[:2], concat_dim='time')
/glade/u/home/mckinnon/.local/lib/python2.7/site-packages/xarray/backends/api.pyc in open_mfdataset(paths, chunks, concat_dim, preprocess, engine, lock, **kwargs)
304 datasets = [preprocess(ds) for ds in datasets]
305
--> 306 combined = auto_combine(datasets, concat_dim=concat_dim)
307 combined._file_obj = _MultiFileCloser(file_objs)
308 return combined
/glade/u/home/mckinnon/.local/lib/python2.7/site-packages/xarray/core/combine.pyc in auto_combine(datasets, concat_dim)
376 grouped = itertoolz.groupby(lambda ds: tuple(sorted(ds.data_vars)),
377 datasets).values()
--> 378 concatenated = [_auto_concat(ds, dim=concat_dim) for ds in grouped]
379 merged = merge(concatenated)
380 return merged
/glade/u/home/mckinnon/.local/lib/python2.7/site-packages/xarray/core/combine.pyc in _auto_concat(datasets, dim)
327 'explicitly')
328 dim, = concat_dims
--> 329 return concat(datasets, dim=dim)
330
331
/glade/u/home/mckinnon/.local/lib/python2.7/site-packages/xarray/core/combine.pyc in concat(objs, dim, data_vars, coords, compat, positions, indexers, mode, concat_over)
115 raise TypeError('can only concatenate xarray Dataset and DataArray '
116 'objects, got %s' % type(first_obj))
--> 117 return f(objs, dim, data_vars, coords, compat, positions)
118
119
/glade/u/home/mckinnon/.local/lib/python2.7/site-packages/xarray/core/combine.pyc in _dataset_concat(datasets, dim, data_vars, coords, compat, positions)
206 dim, coord = _calc_concat_dim_coord(dim)
207 datasets = [as_dataset(ds) for ds in datasets]
--> 208 datasets = align(*datasets, join='outer', copy=False, exclude=[dim])
209
210 concat_over = _calc_concat_over(datasets, dim, data_vars, coords)
/glade/u/home/mckinnon/.local/lib/python2.7/site-packages/xarray/core/alignment.pyc in align(*objects, **kwargs)
78 all_indexes = defaultdict(list)
79 for obj in objects:
---> 80 for dim, index in iteritems(obj.indexes):
81 if dim not in exclude:
82 all_indexes[dim].append(index)
/glade/apps/opt/python/2.7.7/gnu-westmere/4.8.2/lib/python2.7/_abcoll.pyc in iteritems(self)
385 'D.iteritems() -> an iterator over the (key, value) items of D'
386 for key in self:
--> 387 yield (key, self[key])
388
389 def keys(self):
/glade/u/home/mckinnon/.local/lib/python2.7/site-packages/xarray/core/coordinates.pyc in __getitem__(self, key)
245 def __getitem__(self, key):
246 if key in self:
--> 247 return self._variables[key].to_index()
248 else:
249 raise KeyError(key)
/glade/u/home/mckinnon/.local/lib/python2.7/site-packages/xarray/core/variable.pyc in to_index(self)
1173 index = index.set_names(valid_level_names)
1174 else:
-> 1175 index = index.set_names(self.name)
1176 return index
1177
/glade/apps/opt/pandas/0.14.0/gnu/4.8.2/lib/python2.7/site-packages/pandas-0.14.0-py2.7-linux-x86_64.egg/pandas/core/index.pyc in set_names(self, names, inplace)
380 """
381 if not com.is_list_like(names):
--> 382 raise TypeError("Must pass list-like as `names`.")
383 if inplace:
384 idx = self
TypeError: Must pass list-like as `names`.
@karenamckinnon From your traceback, it looks like you're using pandas 0.14, but xarray requires at least pandas 0.15.
Got it, thanks @shoyer ! In case this happens again, which component of the traceback provided that information to you?
@karenamckinnon In this case, it was in the file paths, i.e., /glade/apps/opt/pandas/0.14.0/gnu/4.8.2/lib/python2.7/site-packages/pandas-0.14.0-py2.7-linux-x86_64.egg/pandas/core/index.pyc
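A quick stdlib-only check along the same lines (version_tuple and meets_minimum are hypothetical helpers, not xarray API): parse the installed version string and compare it against xarray's minimum, which was pandas >= 0.15 at the time of this thread.

```python
def version_tuple(version_string):
    """'0.14.0' -> (0, 14, 0); non-numeric suffixes are ignored."""
    parts = []
    for part in version_string.split('.'):
        digits = ''.join(ch for ch in part if ch.isdigit())
        if digits:
            parts.append(int(digits))
    return tuple(parts)

def meets_minimum(installed, required='0.15.0'):
    # Tuple comparison gives correct ordering: (0, 14, 0) < (0, 15, 0).
    return version_tuple(installed) >= version_tuple(required)

# The pandas 0.14.0 from the traceback fails the check:
print(meets_minimum('0.14.0'))  # False
print(meets_minimum('0.19.2'))  # True

# In practice, just check:  import pandas; print(pandas.__version__)
```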
In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.
If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically.