cudf: [BUG] groupby aggregation is causing out of memory

Created on 24 Oct 2019 · 13 comments · Source: rapidsai/cudf

I create a DataFrame this way (it has only 3 groups):

import cudf
import numpy as np

df = cudf.DataFrame()

for i in range(5):
    df[f'col_{i}'] = np.random.randint(3, size=100_000_000, dtype='int8')

df.dtypes
Out[1]: 
col_0    int8
col_1    int8
col_2    int8
col_3    int8
col_4    int8
dtype: object

And after that, I am trying to:

agg = {
    'col_1': ['mean'],
    'col_2': ['mean'],
    'col_3': ['mean'],
}

df.groupby('col_0').agg(agg)
Out[2]: 
          col_1     col_2     col_3
col_0                              
0      1.000180  1.000014  0.999990
1      1.000027  1.000018  0.999688
2      0.999865  0.999915  0.999913

This works, and I end up with about 0.8 GB of GPU memory used.

But if I try to get the mean of col_4 as well, I get an out-of-memory error:

agg = {
    'col_1': ['mean'],
    'col_2': ['mean'],
    'col_3': ['mean'],
    'col_4': ['mean'],
}

%time df.groupby('col_0').agg(agg)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<timed eval> in <module>

~/miniconda3/envs/test/lib/python3.7/site-packages/cudf/groupby/groupby.py in agg(self, func)
     44 
     45     def agg(self, func):
---> 46         return self._apply_aggregation(func)
     47 
     48     def size(self):

~/miniconda3/envs/test/lib/python3.7/site-packages/cudf/groupby/groupby.py in _apply_aggregation(self, agg)
     97         Applies the aggregation function(s) ``agg`` on all columns
     98         """
---> 99         result = self._groupby.compute_result(agg)
    100         nvtx_range_pop()
    101         return result

~/miniconda3/envs/test/lib/python3.7/site-packages/cudf/groupby/groupby.py in compute_result(self, agg)
    233 
    234         out_key_columns, out_value_columns = _groupby_engine(
--> 235             self.key_columns, self.value_columns, aggs_as_list, self.sort
    236         )
    237 

~/miniconda3/envs/test/lib/python3.7/site-packages/cudf/groupby/groupby.py in _groupby_engine(key_columns, value_columns, aggs, sort)
    413     """
    414     out_key_columns, out_value_columns = cpp_apply_groupby(
--> 415         key_columns, value_columns, aggs
    416     )
    417 

cudf/bindings/groupby/groupby.pyx in cudf.bindings.groupby.groupby.apply_groupby()

RuntimeError: RMM error encountered at: /conda/conda-bld/libcudf_1566415000697/work/cpp/src/column/legacy/column.cpp:222: 4 RMM_ERROR_OUT_OF_MEMORY

Does anybody know why one additional int8 column in the aggregation blows up the memory requirement from 0.8 GB to 11+ GB?


My wooden workstation:

CPU: Intel i7-4790
RAM: 16GB 1600MHz
GPU: NVidia GTX 1080 Ti

Labels: ? - Needs Triage, bug

All 13 comments

Hi @GarrisonD; I'm not able to reproduce locally.

Could you please post the output of the script print_env.sh:

https://github.com/rapidsai/cudf/blob/branch-0.11/print_env.sh

(just download the script and run `bash /path/to/print_env.sh`)

I had this issue with cudf-0.9. I will try it with cudf-0.10 now and let you know if I still have it.

With cudf-0.10 I still have this issue, but instead of the error I now get a dead Jupyter Notebook kernel.

@shwina Which GPU do you use? Probably your GPU has more memory, so you can't reproduce it...

***git***
Not inside a git repository

***OS Information***
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.3 LTS"
NAME="Ubuntu"
VERSION="18.04.3 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.3 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
Linux data-scientist-pc 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

***GPU Information***
Sun Oct 27 11:22:05 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   35C    P8    19W / 250W |      0MiB / 11175MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

***CPU***
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               60
Model name:          Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
Stepping:            3
CPU MHz:             3889.560
CPU max MHz:         4000.0000
CPU min MHz:         800.0000
BogoMIPS:            7200.21
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            8192K
NUMA node0 CPU(s):   0-7
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts md_clear flush_l1d

***CMake***

***g++***
/usr/bin/g++
g++ (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


***nvcc***

***Python***
/home/data-scientist/miniconda3/envs/rapids-ai/bin/python
Python 3.7.3

***Environment Variables***
PATH                            : /home/data-scientist/miniconda3/envs/rapids-ai/bin:/home/data-scientist/miniconda3/condabin:/home/data-scientist/bin:/home/data-scientist/miniconda3/bin:/bin:/usr/local/bin:/home/data-scientist/bin:/home/data-scientist/miniconda3/bin:/bin:/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games
LD_LIBRARY_PATH                 :
NUMBAPRO_NVVM                   :
NUMBAPRO_LIBDEVICE              :
CONDA_PREFIX                    : /home/data-scientist/miniconda3/envs/rapids-ai
PYTHON_PATH                     :

***conda packages***
/home/data-scientist/miniconda3/condabin/conda
# packages in environment at /home/data-scientist/miniconda3/envs/rapids-ai:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main
arrow-cpp                 0.14.1           py37h5ac5442_4    conda-forge
attrs                     19.3.0                     py_0    conda-forge
backcall                  0.1.0                      py_0    conda-forge
bleach                    3.1.0                      py_0    conda-forge
boost-cpp                 1.70.0               h8e57a91_2    conda-forge
brotli                    1.0.7             he1b5a44_1000    conda-forge
bzip2                     1.0.8                h516909a_1    conda-forge
c-ares                    1.15.0            h516909a_1001    conda-forge
ca-certificates           2019.9.11            hecc5488_0    conda-forge
certifi                   2019.9.11                py37_0    conda-forge
cudatoolkit               10.1.168                      0
cudf                      0.10.0                   py37_0    rapidsai
cython                    0.29.13          py37he1b5a44_0    conda-forge
decorator                 4.4.0                      py_0    conda-forge
defusedxml                0.6.0                      py_0    conda-forge
dlpack                    0.2                  he1b5a44_1    conda-forge
double-conversion         3.1.5                he1b5a44_1    conda-forge
entrypoints               0.3                   py37_1000    conda-forge
fastavro                  0.22.5           py37h516909a_0    conda-forge
fsspec                    0.5.2                      py_0    conda-forge
gflags                    2.2.2             he1b5a44_1001    conda-forge
glog                      0.4.0                he1b5a44_1    conda-forge
grpc-cpp                  1.23.0               h18db393_0    conda-forge
icu                       64.2                 he1b5a44_1    conda-forge
importlib_metadata        0.23                     py37_0    conda-forge
ipykernel                 5.1.3            py37h5ca1d4c_0    conda-forge
ipython                   7.9.0            py37h5ca1d4c_0    conda-forge
ipython_genutils          0.2.0                      py_1    conda-forge
jedi                      0.15.1                   py37_0    conda-forge
jinja2                    2.10.3                     py_0    conda-forge
json5                     0.8.5                      py_0    conda-forge
jsonschema                3.1.1                    py37_0    conda-forge
jupyter_client            5.3.3                    py37_1    conda-forge
jupyter_core              4.5.0                      py_0    conda-forge
jupyterlab                1.1.4                      py_0    conda-forge
jupyterlab_server         1.0.6                      py_0    conda-forge
jupytext                  1.2.4                         0    conda-forge
libblas                   3.8.0               14_openblas    conda-forge
libcblas                  3.8.0               14_openblas    conda-forge
libcudf                   0.10.0               cuda10.1_0    rapidsai
libevent                  2.1.10               h72c5cf5_0    conda-forge
libffi                    3.2.1             he1b5a44_1006    conda-forge
libgcc-ng                 9.1.0                hdf63c60_0
libgfortran-ng            7.3.0                hdf63c60_2    conda-forge
liblapack                 3.8.0               14_openblas    conda-forge
libllvm8                  8.0.1                hc9558a2_0    conda-forge
libnvstrings              0.10.0               cuda10.1_0    rapidsai
libopenblas               0.3.7                h6e990d7_2    conda-forge
libprotobuf               3.8.0                h8b12597_0    conda-forge
librmm                    0.10.0               cuda10.1_0    rapidsai
libsodium                 1.0.17               h516909a_0    conda-forge
libstdcxx-ng              9.1.0                hdf63c60_0
llvmlite                  0.30.0           py37h8b12597_0    conda-forge
lz4-c                     1.8.3             he1b5a44_1001    conda-forge
markupsafe                1.1.1            py37h14c3975_0    conda-forge
mistune                   0.8.4           py37h14c3975_1000    conda-forge
more-itertools            7.2.0                      py_0    conda-forge
nbconvert                 5.6.1                    py37_0    conda-forge
nbformat                  4.4.0                      py_1    conda-forge
ncurses                   6.1               hf484d3e_1002    conda-forge
notebook                  6.0.1                    py37_0    conda-forge
numba                     0.46.0           py37hb3f55d8_0    conda-forge
numpy                     1.17.3           py37h95a1406_0    conda-forge
nvstrings                 0.10.0                   py37_0    rapidsai
openssl                   1.1.1c               h516909a_0    conda-forge
pandas                    0.24.2           py37hb3f55d8_0    conda-forge
pandoc                    2.7.3                         0    conda-forge
pandocfilters             1.4.2                      py_1    conda-forge
parquet-cpp               1.5.1                         2    conda-forge
parso                     0.5.1                      py_0    conda-forge
pexpect                   4.7.0                    py37_0    conda-forge
pickleshare               0.7.5                 py37_1000    conda-forge
pip                       19.3.1                   py37_0    conda-forge
prometheus_client         0.7.1                      py_0    conda-forge
prompt_toolkit            2.0.10                     py_0    conda-forge
ptyprocess                0.6.0                   py_1001    conda-forge
pyarrow                   0.14.1           py37h8b68381_2    conda-forge
pygments                  2.4.2                      py_0    conda-forge
pyrsistent                0.15.4           py37h516909a_0    conda-forge
python                    3.7.3                h33d41f4_1    conda-forge
python-dateutil           2.8.0                      py_0    conda-forge
pytz                      2019.3                     py_0    conda-forge
pyyaml                    5.1.2            py37h516909a_0    conda-forge
pyzmq                     18.1.0           py37h1768529_0    conda-forge
re2                       2019.09.01           he1b5a44_0    conda-forge
readline                  8.0                  hf8c457e_0    conda-forge
rmm                       0.10.0                   py37_0    rapidsai
send2trash                1.5.0                      py_0    conda-forge
setuptools                41.4.0                   py37_0    conda-forge
six                       1.12.0                py37_1000    conda-forge
snappy                    1.1.7             he1b5a44_1002    conda-forge
sqlite                    3.30.1               hcee41ef_0    conda-forge
terminado                 0.8.2                    py37_0    conda-forge
testpath                  0.4.2                   py_1001    conda-forge
thrift-cpp                0.12.0            hf3afdfd_1004    conda-forge
tk                        8.6.9             hed695b0_1003    conda-forge
tornado                   6.0.3            py37h516909a_0    conda-forge
traitlets                 4.3.3                    py37_0    conda-forge
uriparser                 0.9.3                he1b5a44_1    conda-forge
wcwidth                   0.1.7                      py_1    conda-forge
webencodings              0.5.1                      py_1    conda-forge
wheel                     0.33.6                   py37_0    conda-forge
xz                        5.2.4             h14c3975_1001    conda-forge
yaml                      0.1.7             h14c3975_1001    conda-forge
zeromq                    4.3.2                he1b5a44_2    conda-forge
zipp                      0.6.0                      py_0    conda-forge
zlib                      1.2.11            h516909a_1006    conda-forge
zstd                      1.4.0                h3b9ef0a_0    conda-forge

@shwina I see a momentary spike in memory while performing the groupby that is ~10x the memory usage of the DataFrame itself (a 780 MB df vs. a 7.5-8 GB spike). Not sure whether this memory overhead is expected. Note: the peak is momentary and can easily be missed by nvidia-smi; running the groupby in a loop, or using the pynvml dashboard, makes it more evident.

Interesting - thank you @ayushdg and @GarrisonD. @jrhemstad is this expected on the libcudf side?

It is expected for there to be an increased memory spike while performing a groupby. Doing some rough back-of-the-napkin math...

If `n` is the number of rows, `k` is the number of key columns, and `a` is the number of aggregations, then the _additional_ temporary memory usage should be:

k*n + a*n + 2*n

or n * (k + a + 2)

If any of the `a` aggregations is a `mean`, add an extra factor of `n` for each one.
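That rule of thumb can be sketched as a tiny helper (hypothetical name; it counts elements in units of n, not bytes, matching the rough model above):

```python
def estimate_temp_elements(n, k, aggs):
    """Rough temporary-storage estimate for a groupby, per the model above.

    n: number of rows, k: number of key columns, aggs: list of aggregation
    names. Each 'mean' counts double, since it computes a sum and a count
    under the hood. Result is in units of n-sized elements, not bytes.
    """
    a = sum(2 if agg == 'mean' else 1 for agg in aggs)
    return n * (k + a + 2)

# The original example: 1 key column, 4 mean aggregations.
estimate_temp_elements(100_000_000, 1, ['mean'] * 4)  # n * (1 + 8 + 2) = 11n
```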

Using the original example:

agg = {
    'col_1': ['mean'],
    'col_2': ['mean'],
    'col_3': ['mean'],
    'col_4': ['mean'],
}
%time df.groupby('col_0').agg(agg)

k = 1
a = 4 (but `mean`, so really, a = 8)
n * (1 + 8 + 2)

So a ~10x spike in memory usage sounds about right for this example. The exact amount depends on the data type being used, but we're in the right neighborhood.

@jrhemstad Thanks for the explanation!

@GarrisonD one way to reduce the amount of temporary memory needed is to do fewer aggregations in the same groupby. It'll be slower, but at least you won't OOM.
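A minimal sketch of that workaround: split the aggregation dict and run one groupby per value column, then join the partial results. Shown here with pandas so it runs anywhere; the cudf calls have the same shape, but the column names and sizes are only illustrative.

```python
import numpy as np
import pandas as pd  # stand-in for cudf in this sketch

rng = np.random.default_rng(0)
df = pd.DataFrame({f'col_{i}': rng.integers(3, size=1_000).astype('int8')
                   for i in range(5)})

agg = {f'col_{i}': ['mean'] for i in range(1, 5)}

# One groupby call per value column instead of one big call: slower, but
# the temporary memory for each call covers a single aggregation at a time.
partials = [df.groupby('col_0').agg({col: fns}) for col, fns in agg.items()]
result = pd.concat(partials, axis=1)
```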

Thanks @jrhemstad for the great explanation. Just wanted to clarify something to ensure that we don't need to profile this further. The math explained above yields an output of 11 * n where n is number of rows. The ~10x spike I see is w.r.t the original memory usage of the df which itself is 5 * n (n rows and 5 columns). By that logic should the memory spike not be in the range of ( ( 11 + 5) * n / (5*n) ) ~= 3x the original df?

Thanks @jrhemstad for the great explanation. Just wanted to clarify something to ensure that we don't need to profile this further. The math explained above yields an output of 11 * n where n is number of rows. The ~10x spike I see is w.r.t the original memory usage of the df which itself is 5 * n (n rows and 5 columns). By that logic should the memory spike not be in the range of ( ( 11 + 5) * n / (5*n) ) ~= 3x the original df?

Yeah, what was hidden in my explanation is the data type (and therefore its size), which impacts the measured difference in memory consumption.

So to be more explicit...

The size of our inputs are:

4 columns of int8 => (n * 4 * 1) bytes

Temporary memory overhead:

Let's break k*n + a*n + 2*n into pieces.

k*n

This depends on the key columns. In this case, that's a single int8, so this is just n bytes

2*n

This is constant independent of the size of the keys or aggregations, and it is 16*n bytes (for the hash map).

a*n
This depends on the aggregations. When you compute a mean aggregation, it computes the count in an int32 and the sum in an int64 under the covers, and then computes the mean out-of-place into a double.

Therefore, each mean aggregation requires n*4 + n*8 + n*8, or 20n bytes. We're computing 4 means, so 80n bytes.

Adding it all up: 80n + 16n + n == 97n

Temporary / Input => 97n / 4n => ~25x memory overhead!

Now then, you don't actually see the full 25x overhead because not all of the temporary memory exists at once (some of it is freed before the rest is allocated). So 10x sounds about right for the observable overhead.
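Redoing that arithmetic in plain Python (n = number of rows from the original example):

```python
n = 100_000_000                   # rows in the original example

inputs  = 4 * 1 * n               # 4 int8 value columns being aggregated
keys    = 1 * 1 * n               # k*n: one int8 key column
hashmap = 16 * n                  # the "2*n" term: 16 bytes per row for the hash map
means   = 4 * (4 + 8 + 8) * n     # a*n: per mean, count(int32) + sum(int64) + mean(float64)

temp = keys + hashmap + means     # 97n bytes of temporary storage
ratio = temp / inputs             # ~24-25x the input size at peak
```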

Amazing explanation! Thanks a lot @jrhemstad 😄
