XGBoost doesn't release GPU memory after training/predicting a model on large data.
Every further rerun of .fit allocates more memory until the kernel eventually crashes because the GPU runs out of memory.
Operating System: Ubuntu 16.04 on PowerPC
Compiler:
Package used (python/R/jvm/C++): python
xgboost version used:
If installing from source, please provide the commit hash (git rev-parse HEAD): 84ab74f3a56739829b03161fb9c249f3a760a518
The following code should not cause issues, but it leads to out-of-memory errors if you run it twice. You might have to decrease the repeat number for the data depending on how much GPU memory you have (16 GB on my side).
import numpy as np
import xgboost as xgb
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.datasets import dump_svmlight_file
from sklearn.externals import joblib
from sklearn.metrics import precision_score
iris = datasets.load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# use DMatrix for xgboost
dtrain = xgb.DMatrix(X_train.repeat(300000,axis=0), label=y_train.repeat(300000))
dtest = xgb.DMatrix(X_test.repeat(300000,axis=0), label=y_test.repeat(300000))
# set xgboost params
param = {
    'tree_method': 'gpu_exact',
    'max_depth': 3,                 # the maximum depth of each tree
    'eta': 0.3,                     # the training step for each iteration
    'silent': 1,                    # logging mode - quiet
    'objective': 'multi:softprob',  # error evaluation for multiclass training
    'num_class': 3,                 # the number of classes that exist in this dataset
    'n_jobs': 10}
num_round = 20 # the number of training iterations
#------------- numpy array ------------------
# training and testing - numpy matrices
bst = xgb.train(param, dtrain, num_round)
preds = bst.predict(dtest)
Can you try calling bst.__delete__() after each round? Python is garbage collected, so it may keep the booster object around. If the error persists after this, then it may be a bug.
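For example, a minimal sketch of that workaround, reusing param/dtrain/dtest/num_round from the script above (whether this actually releases the device memory is exactly what is in question here):

import gc

for run in range(2):                          # the second run is what previously hit OOM
    bst = xgb.train(param, dtrain, num_round)
    preds = bst.predict(dtest)
    del bst                                   # drop the only reference to the booster
    gc.collect()                              # force collection so the underlying handle is freed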
Closing as no response. Can reopen if the issue persists.
Sorry, I haven't gotten around to testing it on the original system. I will give it a try and see what happens.
Okay. I have called bst.__del__(), which seems to work. Two things to note:
1. If bst.__del__() is called before .predict(), the kernel dies and the core is dumped (it makes sense that it won't work, but the kernel death could be prevented by some check, I assume).
2. The training data stays in GPU memory until __del__() is called, which means that if your training + inference data exceed GPU memory you will get OOM even though the individual datasets might fit into memory on their own. That seems limiting, since there is no need to keep the training data in GPU memory after training is completed. The .predict() method, on the other hand, purges its data after the call.
This raises a question - is there any way to purge the data off the GPU but keep the trained model?
P.S. I'm by no means an expert in how things are handled in this amazing package, so I will understand if it is necessary to keep the training data around after .fit is complete.
Saving the model, deleting the booster, and then loading the model again should achieve this.
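Roughly, a sketch of that sequence, reusing the names from the reproduction script above (the file name is arbitrary):

import gc

bst.save_model('xgb_model.bin')                # persist the trained model to disk
bst.__del__()                                  # explicitly free the booster handle (should release its GPU memory)
del bst
gc.collect()

bst = xgb.Booster(model_file='xgb_model.bin')  # reload the model; the training data is no longer held
preds = bst.predict(dtest)                     # predict with the reloaded model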
Sounds good, thanks for the help!
I am having what appears to be the same problem, but using R. I'm not sure what the equivalent of "deleting the booster" in R would be, since what is returned in R is considered a model object. There also does not appear to be a close match to the bst.__del__()
call in Python. Any suggestions for what might work in a similar manner to purge the data off the GPU would be much appreciated.
Since this is a closely-related issue, I'm hoping to piggyback on this ticket rather than opening a nearly-duplicate ticket.
@jpbowman01 "deleting the booster" in R would be
rm(bst)
gc()
I have the same problem. I tried the delete trick, but it does not work:
bst.__delete__()
'Booster' object has no attribute '__delete__'
@aliyesilkanat Typo above: it needs to be bst.__del__().
Nonetheless, it is not working for me: single process, applying .__del__(), and nvidia-smi even shows the GPU memory being cleared, yet I still run into this issue quite predictably. I have compiled with different NVIDIA drivers, GCCs, Linux headers, and CMake versions. I don't understand why this issue is closed.
se-I, I had the same problem and was able to solve it by calling the garbage collector, gc.collect(), after the del command.
I also have this problem on a Windows machine, with xgboost 0.7 and tree_method='gpu_hist' (the GPU memory does not get released if, for example, xgbRegressor.fit finishes successfully but some post-processing results in a Python error).
del xgbRegressor
gc.collect()
does not seem to release the GPU memory (but a kernel restart does :).
Trying to call bst.__del__(), I get an exception: 'XGBRegressor' object has no attribute '__del__'.
I run my models with {'predictor': 'cpu_predictor'} (partly due to issue 3756), and so would like to free GPU memory as soon as training is finished. That way I would be able to test more hyper-parameter sets in parallel.
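For what it's worth, here is a rough sketch of the save/reload workaround applied to the sklearn wrapper, reusing the iris data from the script at the top of the thread and assuming your version exposes XGBRegressor.get_booster(); the parameter values and file name are placeholders, and this is untested on the Windows/0.7 setup above:

import gc
import xgboost as xgb

reg = xgb.XGBRegressor(tree_method='gpu_hist', predictor='cpu_predictor')
reg.fit(X_train, y_train)                       # training happens on the GPU

reg.get_booster().save_model('xgb_reg.bin')     # persist the underlying booster
del reg                                         # drop the wrapper (and the booster it holds)
gc.collect()                                    # try to force the GPU memory to be released

bst = xgb.Booster(model_file='xgb_reg.bin')     # reload for CPU-side prediction
preds = bst.predict(xgb.DMatrix(X_test))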