OS: Ubuntu 16.04.5
CUDA: release 9.2, V9.2.148
xgboost: 0.80
I'm using skopt.gp_minimize to optimize xgboost's parameters on multiple GPUs. After some iterations I get an OOM error like this:
terminate called after throwing an instance of 'thrust::system::detail::bad_alloc'
terminate called after throwing an instance of 'thrust::system::detail::bad_alloc'
what(): what(): std::bad_alloc: out of memory
std::bad_alloc: out of memory
ion.py", line 404, in _send_bytes
self._send(header + buf)
File "/home/wuyh/anaconda3/lib/python3.7/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/wuyh/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/home/wuyh/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/home/wuyh/anaconda3/lib/python3.7/multiprocessing/pool.py", line 132, in worker
put((job, i, (False, wrapped)))
File "/home/wuyh/anaconda3/lib/python3.7/site-packages/sklearn/externals/joblib/pool.py", line 386, in put
return send(obj)
File "/home/wuyh/anaconda3/lib/python3.7/site-packages/sklearn/externals/joblib/pool.py", line 372, in send
self._writer.send_bytes(buffer.getvalue())
File "/home/wuyh/anaconda3/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/home/wuyh/anaconda3/lib/python3.7/multiprocessing/connection.py", line 404, in _send_bytes
self._send(header + buf)
File "/home/wuyh/anaconda3/lib/python3.7/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Can you try using version 0.81? We have a bug fix related to memory usage: #3635.
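(If it helps, you can confirm which version the Python process actually loads with:)

import xgboost
print(xgboost.__version__)  # expect '0.81' after upgrading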
@hcho3 thanks for your advice. However, after upgrading xgboost from 0.80 to 0.81, a different error is thrown:
terminate called after throwing an instance of 'dmlc::Error'
terminate called after throwing an instance of 'dmlc::Error'
what(): [10:12:10] /workspace/include/xgboost/./../../src/common/common.h:41: /workspace/src/tree/updater_gpu_hist.cu: 279: invalid argument
Stack trace returned 7 entries:
[bt] (0) /home/wuyh/anaconda3/xgboost/libxgboost.so(dmlc::StackTrace()+0x3d) [0x7fa131bc0b0d]
[bt] (1) /home/wuyh/anaconda3/xgboost/libxgboost.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x18) [0x7fa131bc0f08]
[bt] (2) /home/wuyh/anaconda3/xgboost/libxgboost.so(+0x34faa0) [0x7fa131e10aa0]
[bt] (3) /home/wuyh/anaconda3/xgboost/libxgboost.so(+0x3517bb) [0x7fa131e127bb]
[bt] (4) /home/wuyh/anaconda3/bin/../lib/libgomp.so.1(+0x11bef) [0x7fa138207bef]
[bt] (5) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7fa1cec0c6ba]
[bt] (6) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fa1ce94241d]
what(): [10:12:10] /workspace/include/xgboost/./../../src/common/common.h:41: /workspace/src/tree/updater_gpu_hist.cu: 279: invalid argument
Stack trace returned 10 entries:
[bt] (0) /home/wuyh/anaconda3/xgboost/libxgboost.so(dmlc::StackTrace()+0x3d) [0x7fa131bc0b0d]
[bt] (1) /home/wuyh/anaconda3/xgboost/libxgboost.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x18) [0x7fa131bc0f08]
[bt] (2) /home/wuyh/anaconda3/xgboost/libxgboost.so(+0x34faa0) [0x7fa131e10aa0]
[bt] (3) /home/wuyh/anaconda3/xgboost/libxgboost.so(+0x3517bb) [0x7fa131e127bb]
[bt] (4) /home/wuyh/anaconda3/xgboost/libxgboost.so(void dh::ExecuteShards
[bt] (5) /home/wuyh/anaconda3/xgboost/libxgboost.so(xgboost::tree::GPUHistMaker::BuildHistLeftRight(int, int, int)+0x249) [0x7fa131e28599]
[bt] (6) /home/wuyh/anaconda3/xgboost/libxgboost.so(xgboost::tree::GPUHistMaker::UpdateTree(xgboost::HostDeviceVector
[bt] (7) /home/wuyh/anaconda3/xgboost/libxgboost.so(xgboost::tree::GPUHistMaker::Update(xgboost::HostDeviceVector
[bt] (8) /home/wuyh/anaconda3/xgboost/libxgboost.so(xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector
[bt] (9) /home/wuyh/anaconda3/xgboost/libxgboost.so(xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix, xgboost::HostDeviceVector
Can you post the full script?
OK. The server has 4 GPUs (Tesla V100), and the training dataset's shape is (888096, 60), which is 410.088649 MB in size.
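For context, one way to check that figure with pandas; the zero-filled frame below is just a stand-in for the real data:

import numpy as np
import pandas as pd

# A float64 frame of the reported shape holds 888096 * 60 * 8 bytes,
# roughly 406 MiB -- in the same ballpark as the ~410 MB quoted above.
train = pd.DataFrame(np.zeros((888096, 60)))
print('%.2f MiB' % (train.memory_usage(deep=True).sum() / 1024 ** 2))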
The key xgboost-related script snippet:
import gc
from collections import deque

import numpy as np
from sklearn.model_selection import cross_val_score
from skopt import gp_minimize
from skopt.space import Integer, Real
from skopt.utils import use_named_args
from xgboost import XGBRegressor

# get_dataframe, preporcessing, PurgedTimeSeriesSplit, MyCallback and
# base_url are defined elsewhere in the full script.

if __name__ == "__main__":
    train = get_dataframe(base_url)  # get training datasets
    preporcessing(train)  # do feature engineering
    pp_xgb = {'predictor': 'cpu_predictor', 'tree_method': 'gpu_hist',
              'n_gpus': -1, 'gpu_id': 0, 'n_jobs': 2, 'max_bin': 63}
    reg = XGBRegressor(**pp_xgb)
    space = [Integer(5, 25, name='max_depth'),
             Real(.005, .1, 'log-uniform', name='learning_rate'),
             Integer(800, 1000, name='n_estimators'),
             Real(.05, 1, 'log-uniform', name='gamma'),
             Real(1e-9, 1., 'log-uniform', name='reg_alpha'),
             Real(1e-9, 1000, 'log-uniform', name='reg_lambda'),
             Real(.6, 1., 'log-uniform', name='colsample_bytree'),
             Real(.6, 1., 'log-uniform', name='subsample')]
    feature_selected = train.columns[:-6]
    X = train[feature_selected]
    q = deque(maxlen=15)
    for period in [7, 14, 21]:
        q.extend([1000000] * 50)  # init deque for each period
        RET_AF = 'Y{:d}'.format(period)
        y = train[RET_AF]
        cv = PurgedTimeSeriesSplit(n_splits=2, period=period)  # self-defined CV

        @use_named_args(space)
        def objective(**params):
            reg.set_params(**params)
            return -np.mean(cross_val_score(reg, X, y, cv=cv, n_jobs=-1,
                                            verbose=1, pre_dispatch=1,
                                            scoring="neg_mean_squared_error"))

        # optimizing
        mycallback = MyCallback(50)
        res_gp = gp_minimize(objective, space, n_calls=50, callback=mycallback)
        # logging infos
        gc.enable()
        del res_gp
        gc.collect()
You are using a max_depth range of 5-25. Every one-step increase in max_depth doubles the size of the tree, essentially doubling the memory requirements of the algorithm.
We've seen this problem before, and I'd recommend we deprecate that parameter in favor of a power-of-two maximum leaf count that makes the growth in computation cost/memory/model size explicit, or at least make it very clear that changing this setting from something like 10 to 15 is _not_ a 50% increase in resources but rather a 32x increase (1024 leaves vs. 32768).
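To make the arithmetic concrete (plain Python, nothing from the thread): a complete binary tree of depth d has 2**d leaves, so resources scale exponentially with depth, not linearly.

# Leaf count of a complete binary tree is 2**depth.
for depth in (10, 15, 25):
    print(depth, 2 ** depth)
# 10 ->     1024
# 15 ->    32768  (32x the leaves of depth 10, not +50%)
# 25 -> 33554432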
@thvasilo thanks, maybe max_depth is the key to this problem rather than xgboost itself; I'll try it out later...
@wuyunhua I'd say the general recommendation is to not go above 10 for this parameter. It's probably preferable to add more trees (iterations) if you find that your model is underfitting, though given the sensitivity of the algorithm that's unlikely (it's much easier to over-fit than to under-fit).
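For illustration, a minimal sketch of how the search space above might be tightened along these lines; the bounds are hypothetical, not taken from this thread:

from skopt.space import Integer, Real

# Hypothetical bounds: cap max_depth at 10 and widen n_estimators instead.
space = [Integer(3, 10, name='max_depth'),         # was Integer(5, 25, ...)
         Integer(800, 2000, name='n_estimators'),  # add trees rather than depth
         Real(.005, .1, 'log-uniform', name='learning_rate')]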