Describe the bug
Inconsistent behavior of the std aggregation: it raises the following error on a large dataframe, but works with smaller data.
ValueError: Length mismatch: Expected axis has 104 elements, new values have 114131 elements
df.groupby("halo_id_mock").normv.std()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-2-2d9f7e6c940e> in <module>
----> 1 res.compute()
/conda/envs/rapids/lib/python3.6/site-packages/dask/base.py in compute(self, **kwargs)
163 dask.base.compute
164 """
--> 165 (result,) = compute(self, traverse=False, **kwargs)
166 return result
167
/conda/envs/rapids/lib/python3.6/site-packages/dask/base.py in compute(*args, **kwargs)
434 keys = [x.__dask_keys__() for x in collections]
435 postcomputes = [x.__dask_postcompute__() for x in collections]
--> 436 results = schedule(dsk, keys, **kwargs)
437 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
438
/conda/envs/rapids/lib/python3.6/site-packages/distributed/client.py in get(self, dsk, keys, restrictions, loose_restrictions, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
2574 should_rejoin = False
2575 try:
-> 2576 results = self.gather(packed, asynchronous=asynchronous, direct=direct)
2577 finally:
2578 for f in futures.values():
/conda/envs/rapids/lib/python3.6/site-packages/distributed/client.py in gather(self, futures, errors, direct, asynchronous)
1872 direct=direct,
1873 local_worker=local_worker,
-> 1874 asynchronous=asynchronous,
1875 )
1876
/conda/envs/rapids/lib/python3.6/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
767 else:
768 return sync(
--> 769 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
770 )
771
/conda/envs/rapids/lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
333 if error[0]:
334 typ, exc, tb = error[0]
--> 335 raise exc.with_traceback(tb)
336 else:
337 return result[0]
/conda/envs/rapids/lib/python3.6/site-packages/distributed/utils.py in f()
317 if callback_timeout is not None:
318 future = gen.with_timeout(timedelta(seconds=callback_timeout), future)
--> 319 result[0] = yield future
320 except Exception as exc:
321 error[0] = sys.exc_info()
/conda/envs/rapids/lib/python3.6/site-packages/tornado/gen.py in run(self)
733
734 try:
--> 735 value = future.result()
736 except Exception:
737 exc_info = sys.exc_info()
/conda/envs/rapids/lib/python3.6/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
1728 exc = CancelledError(key)
1729 else:
-> 1730 raise exception.with_traceback(traceback)
1731 raise exc
1732 if errors == "skip":
/conda/envs/rapids/lib/python3.6/site-packages/dask/utils.py in apply()
29 return func(*args, **kwargs)
30 else:
---> 31 return func(*args)
32
33
/conda/envs/rapids/lib/python3.6/site-packages/dask/dataframe/groupby.py in _var_chunk()
312
313 # ipdb.set_trace()
--> 314 x2.index = x.index
315 ret = concat([x, x2, n1], axis=1)
316 return ret
/conda/envs/rapids/lib/python3.6/site-packages/cudf/core/dataframe.py in __setattr__()
308 try:
309 object.__getattribute__(self, key)
--> 310 object.__setattr__(self, key, col)
311 return
312 except AttributeError:
/conda/envs/rapids/lib/python3.6/site-packages/cudf/core/dataframe.py in index()
1195 "have %d elements" % (old_length, new_length)
1196 )
-> 1197 raise ValueError(msg)
1198
1199 # try to build an index from generic _index
ValueError: Length mismatch: Expected axis has 104 elements, new values have 114131 elements
Steps/Code to reproduce bug
The aggregation works on synthetic data, but breaks in the actual workflow.
Expected behavior
Compute the std for the column.
# In the same container, this works
import cudf
import dask_cudf

df = cudf.DataFrame({'a': [1., 3, 4, 1.], 'b': [4., 5, 6, -10], 'c': [6., 7., 5., 10.]})
ddf = dask_cudf.from_cudf(df, npartitions=2)
ddf.groupby('a').b.std()
Environment overview
Environment details
***OS Information***
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.6 LTS"
NAME="Ubuntu"
VERSION="16.04.6 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.6 LTS"
VERSION_ID="16.04"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
Linux aff06c35ee06 4.4.0-154-generic #181-Ubuntu SMP Tue Jun 25 05:29:03 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
***GPU Information***
Tue Dec 17 23:27:36 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104 Driver Version: 410.104 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:3B:00.0 Off | 0 |
| N/A 51C P0 29W / 70W | 5392MiB / 15079MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla T4 Off | 00000000:5E:00.0 Off | 0 |
| N/A 58C P0 28W / 70W | 14379MiB / 15079MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla T4 Off | 00000000:AF:00.0 Off | 0 |
| N/A 50C P0 29W / 70W | 14393MiB / 15079MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla T4 Off | 00000000:D8:00.0 Off | 0 |
| N/A 48C P0 29W / 70W | 14403MiB / 15079MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
***CMake***
/conda/envs/rapids/bin/cmake
cmake version 3.16.1
CMake suite maintained and supported by Kitware (kitware.com/cmake).
***g++***
/usr/bin/g++
g++ (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
***nvcc***
/usr/local/cuda/bin/nvcc
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
***Python***
/conda/envs/rapids/bin/python
Python 3.6.7
Hello - the short answer is that SeriesGroupBy does not support std at this time, only DataFrameGroupBy.
df.groupby("halo_id_mock").normv is a SeriesGroupBy, which is supposed to throw a NotImplementedError. I'm not sure why it isn't; it comes down to some esoterica in how dask calls groupby compared to how cudf implements it. I'm still looking into it, but if you can do df[["halo_id_mock", "normv"]].groupby("halo_id_mock").std().normv then you should get the same results.
@thomcom Dask doesn't use the underlying cudf std aggregation at all. It has its own implementation that calculates variance/std from per-chunk sums and a few follow-up operations.
Dask's std and var implementations (computing var on each chunk and then combining the results) seem to be the lines of interest.
Side note: I found this link on how groupby aggregations work across chunks; variance seems to be implemented in a similar way: https://docs.dask.org/en/latest/dataframe-groupby.html#aggregate
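For context, a minimal sketch of the chunked-variance idea (an illustration, not dask's actual code): each chunk contributes its sum, sum of squares, and count, and the partials are combined at the end.

import numpy as np

def var_chunk(x):
    # Per-chunk partials: sum, sum of squares, element count
    return x.sum(), (x ** 2).sum(), len(x)

def var_combine(parts, ddof=1):
    # var = (sum(x^2) - sum(x)^2 / n) / (n - ddof)
    s = sum(p[0] for p in parts)
    s2 = sum(p[1] for p in parts)
    n = sum(p[2] for p in parts)
    return (s2 - s ** 2 / n) / (n - ddof)

data = np.random.rand(100)
parts = [var_chunk(c) for c in np.array_split(data, 4)]
assert np.isclose(var_combine(parts), data.var(ddof=1))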
Thanks @ayushdg, you're right of course. :) This is a dask issue that originates in the groupby.py::_var_chunk function, I think. Is there a way we can debug this during the .compute() call, @quasiben?
I would recommend running with compute(scheduler='single-threaded') in an IPython session. When you hit the error, you can execute %debug and move back through the code to see what is going on.
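For example, in IPython (assuming res is the failing computation from the report):

res.compute(scheduler='single-threaded')  # reproduces the ValueError synchronously
# After the traceback prints, start the post-mortem debugger:
# %debug
# Then use `u` / `d` to move up and down the stack, e.g. into _var_chunk.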
Rule!
The problem is integer overflow. :guitar:
_var_chunk squares the columns of df, of course. If those happen to be very large ints, say from a product ID database, their squares overflow int64, and the groupby of the squared dataframe no longer has the same number of groups as the groupby of the unsquared one. If you convert the column with .astype('str') first, the problem goes away.
I'm not sure if we should also handle this at another level, as it may occur frequently.
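A minimal illustration of the wraparound, using a made-up value rather than the reporter's data:

import numpy as np

k = np.array([3_100_000_000], dtype=np.int64)  # > sqrt(int64 max), about 3.037e9
print(k ** 2)  # int64 overflow wraps: prints [-8836744073709551616], not 9.61e18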
Suggest something like:
# sketch of a guard against squaring overflow; df is the chunk being squared
import numpy as np
if (df.max() >= np.iinfo(np.int64).max ** 0.5).sum() > 0:
    raise ValueError("column values too large to square without int64 overflow")
PR for dask coming up.