Describe the bug
Inconsistent behavior of the std aggregation: it raises the following error on a large dataframe, but works with smaller data.
ValueError: Length mismatch: Expected axis has 104 elements, new values have 114131 elements
df.groupby("halo_id_mock").normv.std()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-2-2d9f7e6c940e> in <module>
----> 1 res.compute()
/conda/envs/rapids/lib/python3.6/site-packages/dask/base.py in compute(self, **kwargs)
163 dask.base.compute
164 """
--> 165 (result,) = compute(self, traverse=False, **kwargs)
166 return result
167
/conda/envs/rapids/lib/python3.6/site-packages/dask/base.py in compute(*args, **kwargs)
434 keys = [x.__dask_keys__() for x in collections]
435 postcomputes = [x.__dask_postcompute__() for x in collections]
--> 436 results = schedule(dsk, keys, **kwargs)
437 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
438
/conda/envs/rapids/lib/python3.6/site-packages/distributed/client.py in get(self, dsk, keys, restrictions, loose_restrictions, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
2574 should_rejoin = False
2575 try:
-> 2576 results = self.gather(packed, asynchronous=asynchronous, direct=direct)
2577 finally:
2578 for f in futures.values():
/conda/envs/rapids/lib/python3.6/site-packages/distributed/client.py in gather(self, futures, errors, direct, asynchronous)
1872 direct=direct,
1873 local_worker=local_worker,
-> 1874 asynchronous=asynchronous,
1875 )
1876
/conda/envs/rapids/lib/python3.6/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
767 else:
768 return sync(
--> 769 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
770 )
771
/conda/envs/rapids/lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
333 if error[0]:
334 typ, exc, tb = error[0]
--> 335 raise exc.with_traceback(tb)
336 else:
337 return result[0]
/conda/envs/rapids/lib/python3.6/site-packages/distributed/utils.py in f()
317 if callback_timeout is not None:
318 future = gen.with_timeout(timedelta(seconds=callback_timeout), future)
--> 319 result[0] = yield future
320 except Exception as exc:
321 error[0] = sys.exc_info()
/conda/envs/rapids/lib/python3.6/site-packages/tornado/gen.py in run(self)
733
734 try:
--> 735 value = future.result()
736 except Exception:
737 exc_info = sys.exc_info()
/conda/envs/rapids/lib/python3.6/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
1728 exc = CancelledError(key)
1729 else:
-> 1730 raise exception.with_traceback(traceback)
1731 raise exc
1732 if errors == "skip":
/conda/envs/rapids/lib/python3.6/site-packages/dask/utils.py in apply()
29 return func(*args, **kwargs)
30 else:
---> 31 return func(*args)
32
33
/conda/envs/rapids/lib/python3.6/site-packages/dask/dataframe/groupby.py in _var_chunk()
312
313 # ipdb.set_trace()
--> 314 x2.index = x.index
315 ret = concat([x, x2, n1], axis=1)
316 return ret
/conda/envs/rapids/lib/python3.6/site-packages/cudf/core/dataframe.py in __setattr__()
308 try:
309 object.__getattribute__(self, key)
--> 310 object.__setattr__(self, key, col)
311 return
312 except AttributeError:
/conda/envs/rapids/lib/python3.6/site-packages/cudf/core/dataframe.py in index()
1195 "have %d elements" % (old_length, new_length)
1196 )
-> 1197 raise ValueError(msg)
1198
1199 # try to build an index from generic _index
ValueError: Length mismatch: Expected axis has 104 elements, new values have 114131 elements
Steps/Code to reproduce bug
The aggregation works on synthetic data, but breaks in the actual workflow.
Expected behavior
Compute the std for the column.
# In the same container, this works
import cudf
import dask_cudf

df = cudf.DataFrame({'a': [1., 3, 4, 1.], 'b': [4., 5, 6, -10], 'c': [6., 7., 5., 10.]})
ddf = dask_cudf.from_cudf(df, npartitions=2)
ddf.groupby('a').b.std()
Environment overview
Environment details
***OS Information***
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.6 LTS"
NAME="Ubuntu"
VERSION="16.04.6 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.6 LTS"
VERSION_ID="16.04"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
Linux aff06c35ee06 4.4.0-154-generic #181-Ubuntu SMP Tue Jun 25 05:29:03 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
***GPU Information***
Tue Dec 17 23:27:36 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104 Driver Version: 410.104 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:3B:00.0 Off | 0 |
| N/A 51C P0 29W / 70W | 5392MiB / 15079MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla T4 Off | 00000000:5E:00.0 Off | 0 |
| N/A 58C P0 28W / 70W | 14379MiB / 15079MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla T4 Off | 00000000:AF:00.0 Off | 0 |
| N/A 50C P0 29W / 70W | 14393MiB / 15079MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla T4 Off | 00000000:D8:00.0 Off | 0 |
| N/A 48C P0 29W / 70W | 14403MiB / 15079MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
***CMake***
/conda/envs/rapids/bin/cmake
cmake version 3.16.1
CMake suite maintained and supported by Kitware (kitware.com/cmake).
***g++***
/usr/bin/g++
g++ (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
***nvcc***
/usr/local/cuda/bin/nvcc
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
***Python***
/conda/envs/rapids/bin/python
Python 3.6.7
Hello - the short answer is that SeriesGroupBy does not support std at this time, only DataFrameGroupBy.
df.groupby("halo_id_mock").normv is a SeriesGroupBy, which is supposed to throw a NotImplementedError. I'm not sure why it isn't; it comes down to some esoterica in how dask calls groupby compared to how cudf implements it. I'm still looking into it, but if you can do df[["halo_id_mock", "normv"]].groupby("halo_id_mock").std().normv then you should get the same results.
@thomcom Dask doesn't use the underlying cudf std aggregation at all. It has its own implementation that calculates variance/std from per-chunk sums and a few follow-up operations.
Dask's std and var implementations (computing var on each chunk and then combining the results) seem to be the lines of interest.
Side note: I found this link on how groupby aggregations work across chunks; variance seems to be implemented in a similar way: https://docs.dask.org/en/latest/dataframe-groupby.html#aggregate
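For context, a minimal sketch of the chunked-variance idea (an illustration, not dask's actual code): each chunk contributes its sum, sum of squares, and count, and the partials are combined at the end.

import numpy as np

def var_chunk(x):
    # Per-chunk partials: sum, sum of squares, element count
    return x.sum(), (x ** 2).sum(), len(x)

def var_combine(parts, ddof=1):
    # var = (sum(x^2) - sum(x)^2 / n) / (n - ddof)
    s = sum(p[0] for p in parts)
    s2 = sum(p[1] for p in parts)
    n = sum(p[2] for p in parts)
    return (s2 - s ** 2 / n) / (n - ddof)

data = np.random.rand(100)
parts = [var_chunk(c) for c in np.array_split(data, 4)]
assert np.isclose(var_combine(parts), data.var(ddof=1))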
Thanks @ayushdg, you're right of course. :) This is a dask issue that originates in the groupby.py::_var_chunk function, I think. Is there a way we can debug this during the .compute() call, @quasiben?
I would recommend running with compute(scheduler='single-threaded') in an IPython session. When you hit the error, you can execute %debug and move back through the code to see what is going on.
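For example, in IPython (assuming res is the failing computation from the report):

res.compute(scheduler='single-threaded')  # reproduces the ValueError synchronously
# After the traceback prints, start the post-mortem debugger:
# %debug
# Then use `u` / `d` to move up and down the stack, e.g. into _var_chunk.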
Rule!
The problem is integer overflow. :guitar:
_var_chunk squares the columns of df, of course. If those happen to be very large ints, say from a product ID database, their squares overflow int64, and the groupby of the squared dataframe no longer has the same number of groups as the groupby of the unsquared one. If you convert the column with .astype('str') first, the problem goes away.
I'm not sure if we should also handle this at another level, as it may occur frequently.
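A minimal illustration of the wraparound, using a made-up value rather than the reporter's data:

import numpy as np

k = np.array([3_100_000_000], dtype=np.int64)  # > sqrt(int64 max), about 3.037e9
print(k ** 2)  # int64 overflow wraps: prints [-8836744073709551616], not 9.61e18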
Suggest something like:
# sketch of a guard against squaring overflow; df is the chunk being squared
import numpy as np
if (df.max() >= np.iinfo(np.int64).max ** 0.5).sum() > 0:
    raise ValueError("column values too large to square without int64 overflow")
PR for dask coming up.