Cudf: [BUG] /opt/conda/envs/rapids/conda-bld/xgboost_1603491651651/work/src/c_api/../data/../common/device_helpers.cuh:400: Memory allocation error on worker 0: std::bad_alloc: CUDA error at: ../include/rmm/mr/device/cuda_memory_resource.hpp:68: cudaErrorMemoryAllocation out of memory

Created on 17 Nov 2020 · 7 comments · Source: rapidsai/cudf

ENVIRONMENT

CODE

  • I am just trying to build a training set and a test set for the model
    1st data package - train_data = xgboost.DMatrix(data=X_train, label=y_train)
    2nd data package - test_data = xgboost.DMatrix(data=X_test, label=y_test) (a couple of cells further down; they are not executed together)

ERROR

---------------------------------------------------------------------------
XGBoostError                              Traceback (most recent call last)
<ipython-input-25-7bd66d4fabf4> in <module>
      1 #train = xgboost.DMatrix(data=X, label=y) #ORIGINAL
----> 2 test_data = xgboost.DMatrix(data=X_test, label=y_test)

~/anaconda3/envs/rapids/lib/python3.7/site-packages/xgboost/core.py in __init__(self, data, label, weight, base_margin, missing, silent, feature_names, feature_types, nthread, enable_categorical)
    448             feature_names=feature_names,
    449             feature_types=feature_types,
--> 450             enable_categorical=enable_categorical)
    451         assert handle is not None
    452         self.handle = handle

~/anaconda3/envs/rapids/lib/python3.7/site-packages/xgboost/data.py in dispatch_data_backend(data, missing, threads, feature_names, feature_types, enable_categorical)
    543     if _is_cudf_df(data):
    544         return _from_cudf_df(data, missing, threads, feature_names,
--> 545                              feature_types)
    546     if _is_cudf_ser(data):
    547         return _from_cudf_df(data, missing, threads, feature_names,

~/anaconda3/envs/rapids/lib/python3.7/site-packages/xgboost/data.py in _from_cudf_df(data, missing, nthread, feature_names, feature_types)
    400             ctypes.c_float(missing),
    401             ctypes.c_int(nthread),
--> 402             ctypes.byref(handle)))
    403     return handle, feature_names, feature_types
    404 

~/anaconda3/envs/rapids/lib/python3.7/site-packages/xgboost/core.py in _check_call(ret)
    184     """
    185     if ret != 0:
--> 186         raise XGBoostError(py_str(_LIB.XGBGetLastError()))
    187 
    188 

XGBoostError: [12:32:18] /opt/conda/envs/rapids/conda-bld/xgboost_1603491651651/work/src/c_api/../data/../common/device_helpers.cuh:400: Memory allocation error on worker 0: std::bad_alloc: CUDA error at: ../include/rmm/mr/device/cuda_memory_resource.hpp:68: cudaErrorMemoryAllocation out of memory
- Free memory: 1539047424
- Requested memory: 3091258960

Stack trace:
  [bt] (0) /home/ubuntu/anaconda3/envs/rapids/lib/libxgboost.so(+0x13674f) [0x7fad04f7274f]
  [bt] (1) /home/ubuntu/anaconda3/envs/rapids/lib/libxgboost.so(dh::detail::ThrowOOMError(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long)+0x3ad) [0x7fad05190b0d]
  [bt] (2) /home/ubuntu/anaconda3/envs/rapids/lib/libxgboost.so(dh::detail::XGBDefaultDeviceAllocatorImpl<xgboost::Entry>::allocate(unsigned long)+0x1df) [0x7fad051ac11f]
  [bt] (3) /home/ubuntu/anaconda3/envs/rapids/lib/libxgboost.so(thrust::detail::vector_base<xgboost::Entry, dh::detail::XGBDefaultDeviceAllocatorImpl<xgboost::Entry> >::fill_insert(thrust::detail::normal_iterator<thrust::device_ptr<xgboost::Entry> >, unsigned long, xgboost::Entry const&)+0x26d) [0x7fad051d0d0d]
  [bt] (4) /home/ubuntu/anaconda3/envs/rapids/lib/libxgboost.so(xgboost::HostDeviceVector<xgboost::Entry>::Resize(unsigned long, xgboost::Entry)+0xc9) [0x7fad051d1cc9]
  [bt] (5) /home/ubuntu/anaconda3/envs/rapids/lib/libxgboost.so(xgboost::data::SimpleDMatrix::SimpleDMatrix<xgboost::data::CudfAdapter>(xgboost::data::CudfAdapter*, float, int)+0x3df) [0x7fad052259cf]
  [bt] (6) /home/ubuntu/anaconda3/envs/rapids/lib/libxgboost.so(xgboost::DMatrix* xgboost::DMatrix::Create<xgboost::data::CudfAdapter>(xgboost::data::CudfAdapter*, float, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long)+0x133) [0x7fad051f3aa3]
  [bt] (7) /home/ubuntu/anaconda3/envs/rapids/lib/libxgboost.so(XGDMatrixCreateFromArrayInterfaceColumns+0xc6) [0x7fad0518c286]
  [bt] (8) /home/ubuntu/anaconda3/envs/rapids/lib/python3.7/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7fae60078630]

CODE 2: If I clear everything out, restart the notebook, and execute them together in one cell.

train_data = xgboost.DMatrix(data=X_train, label=y_train) 
test_data = xgboost.DMatrix(data=X_test, label=y_test) 

ERROR 2

---------------------------------------------------------------------------
XGBoostError                              Traceback (most recent call last)
<ipython-input-20-f0c3710678a8> in <module>
      1 #train = xgboost.DMatrix(data=X, label=y) #ORIGINAL
      2 train_data = xgboost.DMatrix(data=X_train, label=y_train)
----> 3 test_data = xgboost.DMatrix(data=X_test, label=y_test)

~/anaconda3/envs/rapids/lib/python3.7/site-packages/xgboost/core.py in __init__(self, data, label, weight, base_margin, missing, silent, feature_names, feature_types, nthread, enable_categorical)
    448             feature_names=feature_names,
    449             feature_types=feature_types,
--> 450             enable_categorical=enable_categorical)
    451         assert handle is not None
    452         self.handle = handle

~/anaconda3/envs/rapids/lib/python3.7/site-packages/xgboost/data.py in dispatch_data_backend(data, missing, threads, feature_names, feature_types, enable_categorical)
    543     if _is_cudf_df(data):
    544         return _from_cudf_df(data, missing, threads, feature_names,
--> 545                              feature_types)
    546     if _is_cudf_ser(data):
    547         return _from_cudf_df(data, missing, threads, feature_names,

~/anaconda3/envs/rapids/lib/python3.7/site-packages/xgboost/data.py in _from_cudf_df(data, missing, nthread, feature_names, feature_types)
    400             ctypes.c_float(missing),
    401             ctypes.c_int(nthread),
--> 402             ctypes.byref(handle)))
    403     return handle, feature_names, feature_types
    404 

~/anaconda3/envs/rapids/lib/python3.7/site-packages/xgboost/core.py in _check_call(ret)
    184     """
    185     if ret != 0:
--> 186         raise XGBoostError(py_str(_LIB.XGBGetLastError()))
    187 
    188 

XGBoostError: [15:20:36] /opt/conda/envs/rapids/conda-bld/xgboost_1603491651651/work/src/c_api/../data/../common/device_helpers.cuh:400: Memory allocation error on worker 0: std::bad_alloc: CUDA error at: ../include/rmm/mr/device/cuda_memory_resource.hpp:68: cudaErrorMemoryAllocation out of memory
- Free memory: 3015442432
- Requested memory: 3091258960

Stack trace:
  [bt] (0) /home/ubuntu/anaconda3/envs/rapids/lib/libxgboost.so(+0x13674f) [0x7f7eea73674f]
  [bt] (1) /home/ubuntu/anaconda3/envs/rapids/lib/libxgboost.so(dh::detail::ThrowOOMError(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long)+0x3ad) [0x7f7eea954b0d]
  [bt] (2) /home/ubuntu/anaconda3/envs/rapids/lib/libxgboost.so(dh::detail::XGBDefaultDeviceAllocatorImpl<xgboost::Entry>::allocate(unsigned long)+0x1df) [0x7f7eea97011f]
  [bt] (3) /home/ubuntu/anaconda3/envs/rapids/lib/libxgboost.so(thrust::detail::vector_base<xgboost::Entry, dh::detail::XGBDefaultDeviceAllocatorImpl<xgboost::Entry> >::fill_insert(thrust::detail::normal_iterator<thrust::device_ptr<xgboost::Entry> >, unsigned long, xgboost::Entry const&)+0x26d) [0x7f7eea994d0d]
  [bt] (4) /home/ubuntu/anaconda3/envs/rapids/lib/libxgboost.so(xgboost::HostDeviceVector<xgboost::Entry>::Resize(unsigned long, xgboost::Entry)+0xc9) [0x7f7eea995cc9]
  [bt] (5) /home/ubuntu/anaconda3/envs/rapids/lib/libxgboost.so(xgboost::data::SimpleDMatrix::SimpleDMatrix<xgboost::data::CudfAdapter>(xgboost::data::CudfAdapter*, float, int)+0x3df) [0x7f7eea9e99cf]
  [bt] (6) /home/ubuntu/anaconda3/envs/rapids/lib/libxgboost.so(xgboost::DMatrix* xgboost::DMatrix::Create<xgboost::data::CudfAdapter>(xgboost::data::CudfAdapter*, float, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long)+0x133) [0x7f7eea9b7aa3]
  [bt] (7) /home/ubuntu/anaconda3/envs/rapids/lib/libxgboost.so(XGDMatrixCreateFromArrayInterfaceColumns+0xc6) [0x7f7eea950286]
  [bt] (8) /home/ubuntu/anaconda3/envs/rapids/lib/python3.7/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7f8044f8d630]
Labels: bug, cuDF (Python), invalid

All 7 comments

Ultimately this is just an out of memory error: cudaErrorMemoryAllocation out of memory

I would suggest trying a larger GPU with 32GB of GPU memory.

@kkraus14

Ultimately this is just an out of memory error: cudaErrorMemoryAllocation out of memory

I would suggest trying a larger GPU with 32GB of GPU memory.

3 091 258 960 bytes -> ≈3 GB
3 015 442 432 bytes -> ≈3 GB
And this GPU has 16 GB of VRAM

For ERROR 2: It looks like you have at least X_train, X_test, and train_data in GPU memory when you try to create test_data which causes the OOM. Add in needing some temporary space for calculations and you can very quickly hit the 16GB limit.
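Spelling out the byte counts reported in the two tracebacks above makes the arithmetic concrete (a quick sanity check, nothing more):

```python
GIB = 2 ** 30  # bytes per GiB

requested = 3_091_258_960   # "Requested memory" in both tracebacks
free_run1 = 1_539_047_424   # "Free memory" in ERROR
free_run2 = 3_015_442_432   # "Free memory" in ERROR 2

print(f"requested: {requested / GIB:.2f} GiB")  # ~2.88 GiB
print(f"free #1:   {free_run1 / GIB:.2f} GiB")  # ~1.43 GiB
print(f"free #2:   {free_run2 / GIB:.2f} GiB")  # ~2.81 GiB

# In both runs the request exceeds the free memory, hence the OOM.
assert requested > free_run1 and requested > free_run2
```

So even in the clean-restart run the allocation was only about 76 MB short of fitting, which is why freeing a few references first makes the difference.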

@kkraus14

Q1.) How can I delete things from GPU memory in JupyterLab so there is enough space for the next cell?
Q2.) Is it OK to keep the data in a dask dataframe and only use the 16 GB of VRAM at training and testing time?

How can I delete things from GPU memory in JupyterLab so there is enough space for the next cell?

Generally, you just want to make sure you don't have Python variables referring to GPU-backed objects lying around. If you do, Python can't garbage collect them and we can't free the GPU memory. Additionally, in Jupyter, instead of ending a cell with a bare a, use print(a): evaluating a bare a causes Jupyter to hold a reference to it (in its output cache), which prevents garbage collection.
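The reference-counting behaviour can be demonstrated without a GPU. In the sketch below, GpuBacked is just a stand-in for a GPU-backed object such as a cuDF DataFrame (a CPU-only illustration, not cuDF code):

```python
import gc
import weakref

class GpuBacked:
    """Stand-in for an object whose destruction would free GPU memory."""
    pass

obj = GpuBacked()
watcher = weakref.ref(obj)   # observe the object without keeping it alive

alias = obj                  # a second name bound to the same object
del obj
gc.collect()
print(watcher() is None)     # False: `alias` still holds a reference

del alias                    # drop the last reference
gc.collect()
print(watcher() is None)     # True: object collected, memory reclaimable
```

This is exactly why a lingering reference, such as a Jupyter output-cache entry from evaluating a bare variable, keeps GPU memory pinned.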

Is it OK to keep the data in a dask dataframe and only use the 16 GB of VRAM at training and testing time?

That will do all of the dataframe computation on the CPU instead of GPU.

@kkraus14

Generally, you just want to make sure you don't have Python variables referring to GPU-backed objects lying around. If you do, Python can't garbage collect them and we can't free the GPU memory. Additionally, in Jupyter, instead of ending a cell with a bare a, use print(a): evaluating a bare a causes Jupyter to hold a reference to it (in its output cache), which prevents garbage collection.

I just want to make sure I understand it correctly.

EXAMPLE
Cells remark jupyter notebook cells

1.cell
data = cudf.read_csv('X_train.csv.txt', delimiter=',', skiprows=1, names=colNames, dtype=['float64', 'float64', 'float64'])

2.cell

X = data.iloc[:, 1:500]
y = data.iloc[:, 0]

3.cell
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size = 0.25, random_state = 0, shuffle=True)

??? 4.cell
At this point, how can I clean 'data' out of GPU memory?

5.cell
train_data = xgboost.DMatrix(data=X_train, label=y_train)

6.cell
MACHINE LEARNING

??? 7.cell
At this point, how can I clean 'train_data' out of GPU memory?

8.cell
test_data = xgboost.DMatrix(data=X_test, label=y_test)

....

For cell 4, you could do something like:

data = X = y = None

or

del data
del X
del y

The same approach would be taken in cell 7 to clear the other variables.
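Put together, the cleanup cells might look like the sketch below (a runnable CPU-only sketch; the variable names are the hypothetical ones from the notebook cells above, stood in by plain objects):

```python
import gc

# Stand-ins for the GPU-backed objects created in the earlier cells.
data, X, y = object(), object(), object()
train_data = object()

# Cell 4: drop the raw frame and its slices once train/test are split out.
data = X = y = None

# Cell 7: drop the training DMatrix before building test_data.
del train_data

gc.collect()  # collect anything stuck in reference cycles
print("cleanup done")
```

CPython frees most objects as soon as their last reference disappears; gc.collect() only matters for reference cycles, but it is a cheap safety net to run before a large GPU allocation.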
