I create a data frame this way (the key column col_0 has only 3 distinct values, so only 3 groups):
import cudf
import numpy as np
df = cudf.DataFrame()
for i in range(5):
    df[f'col_{i}'] = np.random.randint(3, size=100_000_000, dtype='int8')
df.dtypes
Out[1]:
col_0 int8
col_1 int8
col_2 int8
col_3 int8
col_4 int8
dtype: object
Then I run this aggregation:
agg = {
    'col_1': ['mean'],
    'col_2': ['mean'],
    'col_3': ['mean'],
}
df.groupby('col_0').agg(agg)
Out[2]:
col_1 col_2 col_3
col_0
0 1.000180 1.000014 0.999990
1 1.000027 1.000018 0.999688
2 0.999865 0.999915 0.999913
This completes with about 0.8 GB of GPU memory used.
But if I request the mean of col_4 as well, I get an out-of-memory error:
agg = {
    'col_1': ['mean'],
    'col_2': ['mean'],
    'col_3': ['mean'],
    'col_4': ['mean'],
}
%time df.groupby('col_0').agg(agg)
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<timed eval> in <module>
~/miniconda3/envs/test/lib/python3.7/site-packages/cudf/groupby/groupby.py in agg(self, func)
44
45 def agg(self, func):
---> 46 return self._apply_aggregation(func)
47
48 def size(self):
~/miniconda3/envs/test/lib/python3.7/site-packages/cudf/groupby/groupby.py in _apply_aggregation(self, agg)
97 Applies the aggregation function(s) ``agg`` on all columns
98 """
---> 99 result = self._groupby.compute_result(agg)
100 nvtx_range_pop()
101 return result
~/miniconda3/envs/test/lib/python3.7/site-packages/cudf/groupby/groupby.py in compute_result(self, agg)
233
234 out_key_columns, out_value_columns = _groupby_engine(
--> 235 self.key_columns, self.value_columns, aggs_as_list, self.sort
236 )
237
~/miniconda3/envs/test/lib/python3.7/site-packages/cudf/groupby/groupby.py in _groupby_engine(key_columns, value_columns, aggs, sort)
413 """
414 out_key_columns, out_value_columns = cpp_apply_groupby(
--> 415 key_columns, value_columns, aggs
416 )
417
cudf/bindings/groupby/groupby.pyx in cudf.bindings.groupby.groupby.apply_groupby()
RuntimeError: RMM error encountered at: /conda/conda-bld/libcudf_1566415000697/work/cpp/src/column/legacy/column.cpp:222: 4 RMM_ERROR_OUT_OF_MEMORY
Does anybody know why one additional int8 column in the aggregation blows up the memory requirement from 0.8 GB to more than 11 GB?
My modest workstation:
CPU: Intel i7-4790
RAM: 16GB 1600MHz
GPU: NVidia GTX 1080 Ti
Hi @GarrisonD; I'm not able to reproduce locally.
Could you please post the output of the script print_env.sh:
https://github.com/rapidsai/cudf/blob/branch-0.11/print_env.sh
(just download the script, and run bash /path/to/print_env.sh)
I had this issue with cudf-0.9. I will try it with cudf-0.10 now and let you know if I still have it.
With cudf-0.10 I still have this issue, but instead of the error I now get a dead Jupyter Notebook kernel.
@shwina Which GPU do you use? Perhaps your GPU has more memory, which is why you can't reproduce it...
***git***
Not inside a git repository
***OS Information***
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.3 LTS"
NAME="Ubuntu"
VERSION="18.04.3 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.3 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
Linux data-scientist-pc 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
***GPU Information***
Sun Oct 27 11:22:05 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50 Driver Version: 430.50 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:01:00.0 Off | N/A |
| 0% 35C P8 19W / 250W | 0MiB / 11175MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
***CPU***
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 60
Model name: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
Stepping: 3
CPU MHz: 3889.560
CPU max MHz: 4000.0000
CPU min MHz: 800.0000
BogoMIPS: 7200.21
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 8192K
NUMA node0 CPU(s): 0-7
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
***CMake***
***g++***
/usr/bin/g++
g++ (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
***nvcc***
***Python***
/home/data-scientist/miniconda3/envs/rapids-ai/bin/python
Python 3.7.3
***Environment Variables***
PATH : /home/data-scientist/miniconda3/envs/rapids-ai/bin:/home/data-scientist/miniconda3/condabin:/home/data-scientist/bin:/home/data-scientist/miniconda3/bin:/bin:/usr/local/bin:/home/data-scientist/bin:/home/data-scientist/miniconda3/bin:/bin:/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games
LD_LIBRARY_PATH :
NUMBAPRO_NVVM :
NUMBAPRO_LIBDEVICE :
CONDA_PREFIX : /home/data-scientist/miniconda3/envs/rapids-ai
PYTHON_PATH :
***conda packages***
/home/data-scientist/miniconda3/condabin/conda
# packages in environment at /home/data-scientist/miniconda3/envs/rapids-ai:
#
# Name Version Build Channel
_libgcc_mutex 0.1 main
arrow-cpp 0.14.1 py37h5ac5442_4 conda-forge
attrs 19.3.0 py_0 conda-forge
backcall 0.1.0 py_0 conda-forge
bleach 3.1.0 py_0 conda-forge
boost-cpp 1.70.0 h8e57a91_2 conda-forge
brotli 1.0.7 he1b5a44_1000 conda-forge
bzip2 1.0.8 h516909a_1 conda-forge
c-ares 1.15.0 h516909a_1001 conda-forge
ca-certificates 2019.9.11 hecc5488_0 conda-forge
certifi 2019.9.11 py37_0 conda-forge
cudatoolkit 10.1.168 0
cudf 0.10.0 py37_0 rapidsai
cython 0.29.13 py37he1b5a44_0 conda-forge
decorator 4.4.0 py_0 conda-forge
defusedxml 0.6.0 py_0 conda-forge
dlpack 0.2 he1b5a44_1 conda-forge
double-conversion 3.1.5 he1b5a44_1 conda-forge
entrypoints 0.3 py37_1000 conda-forge
fastavro 0.22.5 py37h516909a_0 conda-forge
fsspec 0.5.2 py_0 conda-forge
gflags 2.2.2 he1b5a44_1001 conda-forge
glog 0.4.0 he1b5a44_1 conda-forge
grpc-cpp 1.23.0 h18db393_0 conda-forge
icu 64.2 he1b5a44_1 conda-forge
importlib_metadata 0.23 py37_0 conda-forge
ipykernel 5.1.3 py37h5ca1d4c_0 conda-forge
ipython 7.9.0 py37h5ca1d4c_0 conda-forge
ipython_genutils 0.2.0 py_1 conda-forge
jedi 0.15.1 py37_0 conda-forge
jinja2 2.10.3 py_0 conda-forge
json5 0.8.5 py_0 conda-forge
jsonschema 3.1.1 py37_0 conda-forge
jupyter_client 5.3.3 py37_1 conda-forge
jupyter_core 4.5.0 py_0 conda-forge
jupyterlab 1.1.4 py_0 conda-forge
jupyterlab_server 1.0.6 py_0 conda-forge
jupytext 1.2.4 0 conda-forge
libblas 3.8.0 14_openblas conda-forge
libcblas 3.8.0 14_openblas conda-forge
libcudf 0.10.0 cuda10.1_0 rapidsai
libevent 2.1.10 h72c5cf5_0 conda-forge
libffi 3.2.1 he1b5a44_1006 conda-forge
libgcc-ng 9.1.0 hdf63c60_0
libgfortran-ng 7.3.0 hdf63c60_2 conda-forge
liblapack 3.8.0 14_openblas conda-forge
libllvm8 8.0.1 hc9558a2_0 conda-forge
libnvstrings 0.10.0 cuda10.1_0 rapidsai
libopenblas 0.3.7 h6e990d7_2 conda-forge
libprotobuf 3.8.0 h8b12597_0 conda-forge
librmm 0.10.0 cuda10.1_0 rapidsai
libsodium 1.0.17 h516909a_0 conda-forge
libstdcxx-ng 9.1.0 hdf63c60_0
llvmlite 0.30.0 py37h8b12597_0 conda-forge
lz4-c 1.8.3 he1b5a44_1001 conda-forge
markupsafe 1.1.1 py37h14c3975_0 conda-forge
mistune 0.8.4 py37h14c3975_1000 conda-forge
more-itertools 7.2.0 py_0 conda-forge
nbconvert 5.6.1 py37_0 conda-forge
nbformat 4.4.0 py_1 conda-forge
ncurses 6.1 hf484d3e_1002 conda-forge
notebook 6.0.1 py37_0 conda-forge
numba 0.46.0 py37hb3f55d8_0 conda-forge
numpy 1.17.3 py37h95a1406_0 conda-forge
nvstrings 0.10.0 py37_0 rapidsai
openssl 1.1.1c h516909a_0 conda-forge
pandas 0.24.2 py37hb3f55d8_0 conda-forge
pandoc 2.7.3 0 conda-forge
pandocfilters 1.4.2 py_1 conda-forge
parquet-cpp 1.5.1 2 conda-forge
parso 0.5.1 py_0 conda-forge
pexpect 4.7.0 py37_0 conda-forge
pickleshare 0.7.5 py37_1000 conda-forge
pip 19.3.1 py37_0 conda-forge
prometheus_client 0.7.1 py_0 conda-forge
prompt_toolkit 2.0.10 py_0 conda-forge
ptyprocess 0.6.0 py_1001 conda-forge
pyarrow 0.14.1 py37h8b68381_2 conda-forge
pygments 2.4.2 py_0 conda-forge
pyrsistent 0.15.4 py37h516909a_0 conda-forge
python 3.7.3 h33d41f4_1 conda-forge
python-dateutil 2.8.0 py_0 conda-forge
pytz 2019.3 py_0 conda-forge
pyyaml 5.1.2 py37h516909a_0 conda-forge
pyzmq 18.1.0 py37h1768529_0 conda-forge
re2 2019.09.01 he1b5a44_0 conda-forge
readline 8.0 hf8c457e_0 conda-forge
rmm 0.10.0 py37_0 rapidsai
send2trash 1.5.0 py_0 conda-forge
setuptools 41.4.0 py37_0 conda-forge
six 1.12.0 py37_1000 conda-forge
snappy 1.1.7 he1b5a44_1002 conda-forge
sqlite 3.30.1 hcee41ef_0 conda-forge
terminado 0.8.2 py37_0 conda-forge
testpath 0.4.2 py_1001 conda-forge
thrift-cpp 0.12.0 hf3afdfd_1004 conda-forge
tk 8.6.9 hed695b0_1003 conda-forge
tornado 6.0.3 py37h516909a_0 conda-forge
traitlets 4.3.3 py37_0 conda-forge
uriparser 0.9.3 he1b5a44_1 conda-forge
wcwidth 0.1.7 py_1 conda-forge
webencodings 0.5.1 py_1 conda-forge
wheel 0.33.6 py37_0 conda-forge
xz 5.2.4 h14c3975_1001 conda-forge
yaml 0.1.7 h14c3975_1001 conda-forge
zeromq 4.3.2 he1b5a44_2 conda-forge
zipp 0.6.0 py_0 conda-forge
zlib 1.2.11 h516909a_1006 conda-forge
zstd 1.4.0 h3b9ef0a_0 conda-forge
@shwina I see a momentary spike in memory while performing the groupby which is ~10x the memory usage of the DataFrame itself (a 780 MB df vs. a 7.5-8 GB spike). I'm not sure whether this memory overhead is expected. Note: the peak is momentary and can easily be missed by nvidia-smi; running it in a loop, or using the pynvml dashboard, makes it more evident.
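For anyone who wants to catch the transient peak themselves, here is a minimal monitoring sketch, assuming pynvml is installed; it just polls NVML in a background thread while the aggregation runs:
# Sketch: poll GPU memory in a background thread to catch the
# short-lived peak that a single nvidia-smi call would miss.
import threading
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
peak = 0
stop = threading.Event()

def watch():
    global peak
    while not stop.is_set():
        used = pynvml.nvmlDeviceGetMemoryInfo(handle).used  # bytes
        peak = max(peak, used)
        time.sleep(0.01)  # 10 ms polling; the spike is momentary

t = threading.Thread(target=watch)
t.start()
df.groupby('col_0').agg(agg)  # df and agg as defined above
stop.set()
t.join()
print(f'peak GPU memory: {peak / 2**30:.2f} GiB')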
Interesting - thank you @ayushdg and @GarrisonD. @jrhemstad is this expected on the libcudf side?
It is expected for there to be an increased memory spike while performing a groupby. Doing some rough back-of-the-napkin math...
If `n` is the number of rows, `k` is the number of key columns, and `a` is the number of aggregations, then the _additional_ temporary memory usage should be:
k*n + a*n + 2*n
or n * (k + a + 2)
If any of the `a` aggregations is `mean`, add an extra factor of `n` for each one.
Using the original example:
agg = {
    'col_1': ['mean'],
    'col_2': ['mean'],
    'col_3': ['mean'],
    'col_4': ['mean'],
}
%time df.groupby('col_0').agg(agg)
k = 1
a = 4 (but `mean`, so really, a = 8)
n * (1 + 8 + 2)
So a ~10x spike in memory usage sounds about right for this example. The exact amount depends on the data type being used, but we're in the right neighborhood.
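To make the rough formula easy to reuse, here is a tiny sketch (my own helper, not a cudf API) that applies n * (k + a + 2) with each mean counted twice:
# Sketch: rough count of temporary "elements" per the formula above.
# Each 'mean' counts as two aggregations (count + sum under the hood).
def rough_temp_elements(n, k, aggs):
    a = sum(2 if agg == 'mean' else 1 for agg in aggs)
    return n * (k + a + 2)

print(rough_temp_elements(100_000_000, k=1, aggs=['mean'] * 4))
# 1,100,000,000 -- i.e. n * 11 for this example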
@jrhemstad Thanks for the explanation!
@GarrisonD one way to reduce the amount of temporary memory needed is to do fewer aggregations in the same groupby. It'll be slower, but at least you won't OOM.
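Something like the sketch below (a hypothetical loop, not a cudf-specific API; the results are kept in a plain dict to avoid assuming anything about concatenation semantics):
# Sketch of the workaround: one aggregation per groupby call, so the
# temporary memory of each call stays small. Slower, but avoids the OOM.
results = {}
for col in ['col_1', 'col_2', 'col_3', 'col_4']:
    results[col] = df.groupby('col_0').agg({col: ['mean']})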
Thanks @jrhemstad for the great explanation. Just wanted to clarify something to ensure that we don't need to profile this further. The math above yields 11 * n, where n is the number of rows. The ~10x spike I see is relative to the original memory usage of the df, which itself is 5 * n (n rows, 5 columns). By that logic, shouldn't the memory spike be in the range of (11 + 5) * n / (5 * n) ~= 3x the original df?
Yeah, what was hidden in my explanation is the data type (and therefore its size), which impacts the measured difference in memory consumption.
So to be more explicit...
The size of our input is:
4 columns of int8 => (n * 4 * 1) bytes
Temporary memory overhead:
Let's break k*n + a*n + 2*n into pieces.
k*n
This depends on the key columns. In this case, that's a single int8, so this is just n bytes.
2*n
This is constant independent of the size of the keys or aggregations, and it is 16*n bytes (for the hash map).
a*n
This depends on the aggregations. When you compute a mean aggregation, it computes the count in an int32 and sum in an int64 under the covers and then computes the mean out of place into a double.
Therefore, each mean aggregation requires n*4 + n*8 + n*8, or 20n bytes. We're computing 4 means, so 80n bytes.
Adding it all up: 80n + 16n + n == 97n
Temporary / Input => 97n / 4n => ~25x memory overhead!
Now then, you don't actually see 25x overhead because not all of the temporary memory exists at once (some of it is freed before the rest is allocated). So 10x sounds about right for the observable overhead.
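Re-doing that arithmetic in a quick script (nothing cudf-specific, just the byte counts from above):
# Byte-level arithmetic from the breakdown above, for n = 100M rows,
# one int8 key column, and four mean aggregations.
n = 100_000_000
key_bytes   = n * 1                   # k*n: one int8 key column
map_bytes   = n * 16                  # 2*n term: hash map, 16 bytes/row
mean_bytes  = 4 * n * (4 + 8 + 8)     # per mean: int32 count + int64 sum + double result
temp_bytes  = key_bytes + map_bytes + mean_bytes   # == 97n
input_bytes = n * 4 * 1               # four int8 value columns
print(temp_bytes / input_bytes)       # 24.25  (~25x overhead)
print(temp_bytes / 2**30)             # ~9.03  GiB of temporaries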
Amazing explanation! Thanks a lot @jrhemstad 😄