Cudf: [BUG] /opt/conda/envs/rapids/conda-bld/xgboost_1603491651651/work/src/c_api/../data/../common/device_helpers.cuh:400: Memory allocation error on worker 0: std::bad_alloc: CUDA error at: ../include/rmm/mr/device/cuda_memory_resource.hpp:68: cudaErrorMemoryAllocation out of memory

Created on 17 Nov 2020 · 7 comments · Source: rapidsai/cudf

ENVIRONMENT

CODE

  • I am just trying to build a training set and a test set for the model
    1st data package - train_data = xgboost.DMatrix(data=X_train, label=y_train)
    2nd data package - test_data = xgboost.DMatrix(data=X_test, label=y_test) (a couple of cells further down; they are not executed together)

ERROR

---------------------------------------------------------------------------
XGBoostError                              Traceback (most recent call last)
<ipython-input-25-7bd66d4fabf4> in <module>
      1 #train = xgboost.DMatrix(data=X, label=y) #ORIGINAL
----> 2 test_data = xgboost.DMatrix(data=X_test, label=y_test)

~/anaconda3/envs/rapids/lib/python3.7/site-packages/xgboost/core.py in __init__(self, data, label, weight, base_margin, missing, silent, feature_names, feature_types, nthread, enable_categorical)
    448             feature_names=feature_names,
    449             feature_types=feature_types,
--> 450             enable_categorical=enable_categorical)
    451         assert handle is not None
    452         self.handle = handle

~/anaconda3/envs/rapids/lib/python3.7/site-packages/xgboost/data.py in dispatch_data_backend(data, missing, threads, feature_names, feature_types, enable_categorical)
    543     if _is_cudf_df(data):
    544         return _from_cudf_df(data, missing, threads, feature_names,
--> 545                              feature_types)
    546     if _is_cudf_ser(data):
    547         return _from_cudf_df(data, missing, threads, feature_names,

~/anaconda3/envs/rapids/lib/python3.7/site-packages/xgboost/data.py in _from_cudf_df(data, missing, nthread, feature_names, feature_types)
    400             ctypes.c_float(missing),
    401             ctypes.c_int(nthread),
--> 402             ctypes.byref(handle)))
    403     return handle, feature_names, feature_types
    404 

~/anaconda3/envs/rapids/lib/python3.7/site-packages/xgboost/core.py in _check_call(ret)
    184     """
    185     if ret != 0:
--> 186         raise XGBoostError(py_str(_LIB.XGBGetLastError()))
    187 
    188 

XGBoostError: [12:32:18] /opt/conda/envs/rapids/conda-bld/xgboost_1603491651651/work/src/c_api/../data/../common/device_helpers.cuh:400: Memory allocation error on worker 0: std::bad_alloc: CUDA error at: ../include/rmm/mr/device/cuda_memory_resource.hpp:68: cudaErrorMemoryAllocation out of memory
- Free memory: 1539047424
- Requested memory: 3091258960

Stack trace:
  [bt] (0) /home/ubuntu/anaconda3/envs/rapids/lib/libxgboost.so(+0x13674f) [0x7fad04f7274f]
  [bt] (1) /home/ubuntu/anaconda3/envs/rapids/lib/libxgboost.so(dh::detail::ThrowOOMError(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long)+0x3ad) [0x7fad05190b0d]
  [bt] (2) /home/ubuntu/anaconda3/envs/rapids/lib/libxgboost.so(dh::detail::XGBDefaultDeviceAllocatorImpl<xgboost::Entry>::allocate(unsigned long)+0x1df) [0x7fad051ac11f]
  [bt] (3) /home/ubuntu/anaconda3/envs/rapids/lib/libxgboost.so(thrust::detail::vector_base<xgboost::Entry, dh::detail::XGBDefaultDeviceAllocatorImpl<xgboost::Entry> >::fill_insert(thrust::detail::normal_iterator<thrust::device_ptr<xgboost::Entry> >, unsigned long, xgboost::Entry const&)+0x26d) [0x7fad051d0d0d]
  [bt] (4) /home/ubuntu/anaconda3/envs/rapids/lib/libxgboost.so(xgboost::HostDeviceVector<xgboost::Entry>::Resize(unsigned long, xgboost::Entry)+0xc9) [0x7fad051d1cc9]
  [bt] (5) /home/ubuntu/anaconda3/envs/rapids/lib/libxgboost.so(xgboost::data::SimpleDMatrix::SimpleDMatrix<xgboost::data::CudfAdapter>(xgboost::data::CudfAdapter*, float, int)+0x3df) [0x7fad052259cf]
  [bt] (6) /home/ubuntu/anaconda3/envs/rapids/lib/libxgboost.so(xgboost::DMatrix* xgboost::DMatrix::Create<xgboost::data::CudfAdapter>(xgboost::data::CudfAdapter*, float, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long)+0x133) [0x7fad051f3aa3]
  [bt] (7) /home/ubuntu/anaconda3/envs/rapids/lib/libxgboost.so(XGDMatrixCreateFromArrayInterfaceColumns+0xc6) [0x7fad0518c286]
  [bt] (8) /home/ubuntu/anaconda3/envs/rapids/lib/python3.7/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7fae60078630]

CODE 2: If I clear everything out, restart the notebook, and execute them together in one cell.

train_data = xgboost.DMatrix(data=X_train, label=y_train) 
test_data = xgboost.DMatrix(data=X_test, label=y_test) 

ERROR 2

---------------------------------------------------------------------------
XGBoostError                              Traceback (most recent call last)
<ipython-input-20-f0c3710678a8> in <module>
      1 #train = xgboost.DMatrix(data=X, label=y) #ORIGINAL
      2 train_data = xgboost.DMatrix(data=X_train, label=y_train)
----> 3 test_data = xgboost.DMatrix(data=X_test, label=y_test)

~/anaconda3/envs/rapids/lib/python3.7/site-packages/xgboost/core.py in __init__(self, data, label, weight, base_margin, missing, silent, feature_names, feature_types, nthread, enable_categorical)
    448             feature_names=feature_names,
    449             feature_types=feature_types,
--> 450             enable_categorical=enable_categorical)
    451         assert handle is not None
    452         self.handle = handle

~/anaconda3/envs/rapids/lib/python3.7/site-packages/xgboost/data.py in dispatch_data_backend(data, missing, threads, feature_names, feature_types, enable_categorical)
    543     if _is_cudf_df(data):
    544         return _from_cudf_df(data, missing, threads, feature_names,
--> 545                              feature_types)
    546     if _is_cudf_ser(data):
    547         return _from_cudf_df(data, missing, threads, feature_names,

~/anaconda3/envs/rapids/lib/python3.7/site-packages/xgboost/data.py in _from_cudf_df(data, missing, nthread, feature_names, feature_types)
    400             ctypes.c_float(missing),
    401             ctypes.c_int(nthread),
--> 402             ctypes.byref(handle)))
    403     return handle, feature_names, feature_types
    404 

~/anaconda3/envs/rapids/lib/python3.7/site-packages/xgboost/core.py in _check_call(ret)
    184     """
    185     if ret != 0:
--> 186         raise XGBoostError(py_str(_LIB.XGBGetLastError()))
    187 
    188 

XGBoostError: [15:20:36] /opt/conda/envs/rapids/conda-bld/xgboost_1603491651651/work/src/c_api/../data/../common/device_helpers.cuh:400: Memory allocation error on worker 0: std::bad_alloc: CUDA error at: ../include/rmm/mr/device/cuda_memory_resource.hpp:68: cudaErrorMemoryAllocation out of memory
- Free memory: 3015442432
- Requested memory: 3091258960

Stack trace:
  [bt] (0) /home/ubuntu/anaconda3/envs/rapids/lib/libxgboost.so(+0x13674f) [0x7f7eea73674f]
  [bt] (1) /home/ubuntu/anaconda3/envs/rapids/lib/libxgboost.so(dh::detail::ThrowOOMError(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long)+0x3ad) [0x7f7eea954b0d]
  [bt] (2) /home/ubuntu/anaconda3/envs/rapids/lib/libxgboost.so(dh::detail::XGBDefaultDeviceAllocatorImpl<xgboost::Entry>::allocate(unsigned long)+0x1df) [0x7f7eea97011f]
  [bt] (3) /home/ubuntu/anaconda3/envs/rapids/lib/libxgboost.so(thrust::detail::vector_base<xgboost::Entry, dh::detail::XGBDefaultDeviceAllocatorImpl<xgboost::Entry> >::fill_insert(thrust::detail::normal_iterator<thrust::device_ptr<xgboost::Entry> >, unsigned long, xgboost::Entry const&)+0x26d) [0x7f7eea994d0d]
  [bt] (4) /home/ubuntu/anaconda3/envs/rapids/lib/libxgboost.so(xgboost::HostDeviceVector<xgboost::Entry>::Resize(unsigned long, xgboost::Entry)+0xc9) [0x7f7eea995cc9]
  [bt] (5) /home/ubuntu/anaconda3/envs/rapids/lib/libxgboost.so(xgboost::data::SimpleDMatrix::SimpleDMatrix<xgboost::data::CudfAdapter>(xgboost::data::CudfAdapter*, float, int)+0x3df) [0x7f7eea9e99cf]
  [bt] (6) /home/ubuntu/anaconda3/envs/rapids/lib/libxgboost.so(xgboost::DMatrix* xgboost::DMatrix::Create<xgboost::data::CudfAdapter>(xgboost::data::CudfAdapter*, float, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long)+0x133) [0x7f7eea9b7aa3]
  [bt] (7) /home/ubuntu/anaconda3/envs/rapids/lib/libxgboost.so(XGDMatrixCreateFromArrayInterfaceColumns+0xc6) [0x7f7eea950286]
  [bt] (8) /home/ubuntu/anaconda3/envs/rapids/lib/python3.7/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7f8044f8d630]
Labels: bug, cuDF (Python), invalid

All 7 comments

Ultimately this is just an out of memory error: cudaErrorMemoryAllocation out of memory

I would suggest trying a larger GPU with 32GB of GPU memory.

@kkraus14

Ultimately this is just an out of memory error: cudaErrorMemoryAllocation out of memory

I would suggest trying a larger GPU with 32GB of GPU memory.

3 091 258 960 bytes -> ≈3 GB
3 015 442 432 bytes -> ≈3 GB
And this GPU has 16 GB of VRAM

For ERROR 2: It looks like you have at least X_train, X_test, and train_data in GPU memory when you try to create test_data which causes the OOM. Add in needing some temporary space for calculations and you can very quickly hit the 16GB limit.
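Spelling out the byte counts reported in the two tracebacks above makes the arithmetic concrete (a quick sanity check, nothing more):

```python
GIB = 2 ** 30  # bytes per GiB

requested = 3_091_258_960   # "Requested memory" in both tracebacks
free_run1 = 1_539_047_424   # "Free memory" in ERROR
free_run2 = 3_015_442_432   # "Free memory" in ERROR 2

print(f"requested: {requested / GIB:.2f} GiB")  # ~2.88 GiB
print(f"free #1:   {free_run1 / GIB:.2f} GiB")  # ~1.43 GiB
print(f"free #2:   {free_run2 / GIB:.2f} GiB")  # ~2.81 GiB

# In both runs the request exceeds the free memory, hence the OOM.
assert requested > free_run1 and requested > free_run2
```

So even in the clean-restart run the allocation was only about 76 MB short of fitting, which is why freeing a few references first makes the difference.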

@kkraus14

Q1.) How can I delete things from GPU memory in JupyterLab so there is enough space for the next cell?
Q2.) Is it OK to keep the data in a dask dataframe and only use the 16 GB of VRAM at training and testing time?

How can I delete things from GPU memory in JupyterLab so there is enough space for the next cell?

Generally, you just want to make sure you don't have Python variables referring to GPU-backed objects lying around. If you do, Python can't garbage collect them and we can't free the GPU memory. Additionally, in Jupyter, instead of ending a cell with a bare a, use print(a): evaluating a bare a causes Jupyter to hold a reference to it (in its output cache), which prevents garbage collection.
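The reference-counting behaviour can be demonstrated without a GPU. In the sketch below, GpuBacked is just a stand-in for a GPU-backed object such as a cuDF DataFrame (a CPU-only illustration, not cuDF code):

```python
import gc
import weakref

class GpuBacked:
    """Stand-in for an object whose destruction would free GPU memory."""
    pass

obj = GpuBacked()
watcher = weakref.ref(obj)   # observe the object without keeping it alive

alias = obj                  # a second name bound to the same object
del obj
gc.collect()
print(watcher() is None)     # False: `alias` still holds a reference

del alias                    # drop the last reference
gc.collect()
print(watcher() is None)     # True: object collected, memory reclaimable
```

This is exactly why a lingering reference, such as a Jupyter output-cache entry from evaluating a bare variable, keeps GPU memory pinned.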

Is it OK to keep the data in a dask dataframe and only use the 16 GB of VRAM at training and testing time?

That will do all of the dataframe computation on the CPU instead of GPU.

@kkraus14

Generally, you just want to make sure you don't have Python variables referring to GPU-backed objects lying around. If you do, Python can't garbage collect them and we can't free the GPU memory. Additionally, in Jupyter, instead of ending a cell with a bare a, use print(a): evaluating a bare a causes Jupyter to hold a reference to it (in its output cache), which prevents garbage collection.

I just want to make sure I understand it correctly.

EXAMPLE
Cells remark jupyter notebook cells

1.cell
data = cudf.read_csv('X_train.csv.txt', delimiter=',', skiprows=1, names=colNames, dtype=['float64', 'float64', 'float64'])

2.cell

X = data.iloc[:, 1:500]
y = data.iloc[:, 0]

3.cell
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size = 0.25, random_state = 0, shuffle=True)

??? 4.cell
At this point, how can I clean 'data' out of GPU memory?

5.cell
train_data = xgboost.DMatrix(data=X_train, label=y_train)

6.cell
MACHINE LEARNING

??? 7.cell
At this point, how can I clean 'train_data' out of GPU memory?

8.cell
test_data = xgboost.DMatrix(data=X_test, label=y_test)

....

For cell 4, you could do something like:

data = X = y = None

or

del data
del X
del y

The same approach would be taken in cell 7 to clear the other variables.
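Put together, the cleanup cells might look like the sketch below (a runnable CPU-only sketch; the variable names are the hypothetical ones from the notebook cells above, stood in by plain objects):

```python
import gc

# Stand-ins for the GPU-backed objects created in the earlier cells.
data, X, y = object(), object(), object()
train_data = object()

# Cell 4: drop the raw frame and its slices once train/test are split out.
data = X = y = None

# Cell 7: drop the training DMatrix before building test_data.
del train_data

gc.collect()  # collect anything stuck in reference cycles
print("cleanup done")
```

CPython frees most objects as soon as their last reference disappears; gc.collect() only matters for reference cycles, but it is a cheap safety net to run before a large GPU allocation.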
