Hi,
I created an XGBoost model with data subsampling (subsample < 1) and implemented a loop that trains it with different random_state values and keeps the best-performing initialization.
This is the simplified Python code:
import xgboost as xgb

# setup train_x, train_y, eval_set, eval_metric #
accuracies = []
best = None
for i in range(100):
    variation = xgb.XGBClassifier(random_state=i, n_jobs=-1)
    variation.fit(train_x, train_y, eval_set=eval_set, eval_metric=eval_metric)
    accuracy = 1 - variation.evals_result()['validation_1']['error'][-1]
    if best is None or accuracy > max(accuracies):
        best = variation
    accuracies.append(accuracy)
It works as expected, but if I monitor the RAM during the process, the usage keeps growing after each iteration, even though I assign a new XGBClassifier to the same variation variable. Python should free the unreferenced memory each time and keep the memory usage roughly constant, right?
This is the screenshot of System Monitor Resources:
Computation starts at the first CPU spike at 100% on the left; a new iteration starts at each drop in CPU usage, while RAM usage steps up accordingly but never goes down.
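For reference, the growth can also be logged from inside the loop instead of reading it off System Monitor; here is a minimal sketch (it assumes the third-party psutil package, and log_rss is just a hypothetical helper name):

import os
import psutil

process = psutil.Process(os.getpid())

def log_rss(iteration):
    # resident set size of this Python process, in MiB
    rss_mb = process.memory_info().rss / 1024 ** 2
    print('iteration {}: RSS = {:.1f} MiB'.format(iteration, rss_mb))

Calling log_rss(i) at the end of each iteration shows the same step-wise increase as in the screenshot.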
Is this a bug in resource deallocation inside XGBClassifier, or am I missing something?
Many thanks in advance,
Cheers
Can you try inserting del best under the condition if best is None or accuracy > max(accuracies), so as to explicitly remove the previous object?
Thanks for the suggestion! However, I cannot delete best, because it is needed in the code following the snippet... maybe you are referring to the temporary variation?
I modified the code like this:
# setup train_x, train_y, eval_set, eval_metric #
accuracies = []
best = None
for i in range(100):
    variation = xgb.XGBClassifier(random_state=i, n_jobs=-1)
    variation.fit(train_x, train_y, eval_set=eval_set, eval_metric=eval_metric)
    accuracy = 1 - variation.evals_result()['validation_1']['error'][-1]
    if best is None or accuracy > max(accuracies):
        best = variation
    accuracies.append(accuracy)
    del variation
But unfortunately RAM usage continues to grow... :(
Are you able to reproduce this?
Can you post a toy dataset so that I can reproduce it too?
Sure, I prepared a small HDF5 dataset together with a toy.py script to replicate this behaviour. It is not as evident as with the large dataset, but the RAM keeps growing in this case, too.
Let me know if you need more information.
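For anyone without the attachment, a toy HDF5 dataset of this kind can be written and read back with h5py along these lines (illustrative sketch only; the file and dataset names here are hypothetical and may differ from the attached toy.py):

import h5py
import numpy as np

# write a small binary-classification toy dataset
with h5py.File('toy.h5', 'w') as f:
    f.create_dataset('train_x', data=np.random.randn(10000, 50))
    f.create_dataset('train_y', data=np.random.randint(0, 2, size=10000))

# read it back for the training loop
with h5py.File('toy.h5', 'r') as f:
    train_x = f['train_x'][:]
    train_y = f['train_y'][:]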
Hi, any news about this potential issue?
Regards
Not yet. Marked as blocking. Thanks for your patience.
I ran some more tests and can confirm that the issue is still present even when the exact greedy algorithm is used instead of fast histograms.
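For reference, I toggle between the two via the tree_method parameter of the sklearn wrapper; a minimal sketch, reusing the same train_x / eval_set as in the earlier snippets:

# run the same loop body once with exact greedy and once with histograms
for tree_method in ('exact', 'hist'):
    model = xgb.XGBClassifier(tree_method=tree_method, random_state=0, n_jobs=-1)
    model.fit(train_x, train_y, eval_set=eval_set, eval_metric=eval_metric)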
Sorry for the long wait. I'm refactoring the IO logic right now. Will start squashing bugs once it's sorted out.
@GuidoBartoli How many classes do you have? We use one stack of trees for each class. If you have lots of classes, then the memory consumed by trees can be significant.
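A quick way to gauge that is to count the trees in one of your fitted models; a rough sketch (get_booster() and get_dump() are the standard Booster APIs, variation being any model fitted in your loop):

# one tree per class per boosting round, so the dump length grows
# roughly as n_estimators * n_classes for multi-class objectives
booster = variation.get_booster()
print('trees in model:', len(booster.get_dump()))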
I ran the following and did not observe the leak on master branch f24be2efb479872da2613b40fbf04a77f961d250:
import xgboost as xgb
import numpy as np

kRows = 200000
kCols = 1000
kClasses = 8

train_x = np.random.randn(kRows, kCols)
train_y = np.random.randint(0, kClasses, size=kRows)
print(train_y.shape)

eval_x = np.random.randn(kRows, kCols)
eval_y = np.random.randint(0, kClasses, size=kRows)
print(eval_y.shape)

eval_set = [(train_x, train_y), (eval_x, eval_y)]
eval_metric = ['merror', 'mlogloss']

accuracies = []
best = None
for i in range(100):
    # tested with slow default tree_method `approx' and faster `hist'
    # and `gpu_hist'
    variation = xgb.XGBClassifier(random_state=i, n_jobs=12)
    variation.fit(train_x, train_y, eval_set=eval_set, eval_metric=eval_metric)
    print(variation.evals_result())
    accuracy = 1 - variation.evals_result()['validation_1']['merror'][-1]
    if best is None or accuracy > max(accuracies):
        best = variation
    accuracies.append(accuracy)
    del variation
@GuidoBartoli How many classes do you have? We use one stack of trees for each class. If you have lots of classes, then the memory consumed by trees can be significant.
I tried with binary classification; you can take a look at the code and dataset in the provided toy example.
Hi trivialfis,
I used the following code, adapted from your snippet to look more like our original use case. I tried various combinations of dataset size, jobs, and histograms, but in every case the memory keeps growing, even though there is a usage drop at regular intervals, as you can see in the attached screenshot.
At the moment I cannot recompile the latest master branch on my machine to reproduce this behavior; can you try this exact code on that version?
import xgboost as xgb
import numpy as np

samples = 50000
features = 100
classes = 2
variations = 100
histograms = False
n_jobs = 4

print('Generating dataset ({}x{}x{})...'.format(samples, features, classes))
train_x = np.random.randn(samples, features)
train_y = np.random.randint(0, classes, size=samples)
eval_x = np.random.randn(samples, features)
eval_y = np.random.randint(0, classes, size=samples)
eval_set = [(train_x, train_y), (eval_x, eval_y)]

if classes == 2:
    objective = 'binary:logistic'
    eval0 = 'error'
    eval1 = 'logloss'
else:
    objective = 'multi:softmax'
    eval0 = 'merror'
    eval1 = 'mlogloss'
eval_metric = [eval0, eval1]
kwargs = {'tree_method': 'hist', 'grow_policy': 'depthwise'} if histograms else {}

accuracies = []
best = None
for i in range(variations):
    print('Training variation #{}/{}...'.format(i, variations))
    variation = xgb.XGBClassifier(objective=objective, random_state=i,
                                  n_jobs=n_jobs, **kwargs)
    variation.fit(train_x, train_y, eval_set=eval_set,
                  eval_metric=eval_metric, verbose=False)
    accuracy = 1 - variation.evals_result()['validation_1'][eval0][-1]
    if best is None or accuracy > max(accuracies):
        best = variation
    accuracies.append(accuracy)
    print('> Accuracy = {}'.format(accuracy))
    del variation
@GuidoBartoli
At the moment I cannot recompile the latest master branch on my machine to reproduce this behavior; can you try this exact code on that version?
You can install the nightly build by running
pip3 install https://s3-us-west-2.amazonaws.com/xgboost-nightly-builds/xgboost-1.0.0_SNAPSHOT%2B96cd7ec2bbdec1addf81b1ca2adb13c9155e32f3-py2.py3-none-manylinux1_x86_64.whl
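After installing, you can check that the snapshot wheel is the one actually being imported:

import xgboost
print(xgboost.__version__)  # should report the 1.0.0 snapshot version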
Ok, thanks, I installed the 1.0.0_SNAPSHOT version in my virtual environment using pip.
I ran the previous snippet on that version, but unfortunately I see a similar behavior (purple line in the graph).
I took another screenshot of the memory usage after 33 variations.
Will take a look again. Thanks for the follow up.
@GuidoBartoli I looked again; switching to my laptop helped. The magic is here: notice the gc.collect() call, which forces Python to run garbage collection.
import xgboost as xgb
import numpy as np
import gc

samples = 50000
features = 100
classes = 2
variations = 100
histograms = False
n_jobs = 4

print('Generating dataset ({}x{}x{})...'.format(samples, features, classes))
train_x = np.random.randn(samples, features)
train_y = np.random.randint(0, classes, size=samples)
eval_x = np.random.randn(samples, features)
eval_y = np.random.randint(0, classes, size=samples)
eval_set = [(train_x, train_y), (eval_x, eval_y)]

if classes == 2:
    objective = 'binary:logistic'
    eval0 = 'error'
    eval1 = 'logloss'
else:
    objective = 'multi:softmax'
    eval0 = 'merror'
    eval1 = 'mlogloss'
eval_metric = [eval0, eval1]
kwargs = {'tree_method': 'hist', 'grow_policy': 'depthwise'} if histograms else {}
print('classes:', classes)

accuracies = []
best = None
for i in range(variations):
    print('Training variation #{}/{}...'.format(i, variations))
    variation = xgb.XGBClassifier(objective=objective, random_state=i,
                                  n_jobs=n_jobs, **kwargs)
    variation.fit(train_x, train_y, eval_set=eval_set,
                  eval_metric=eval_metric, verbose=False)
    accuracy = 1 - variation.evals_result()['validation_1'][eval0][-1]
    if best is None or accuracy > max(accuracies):
        best = variation
    accuracies.append(accuracy)
    print('> Accuracy = {}'.format(accuracy))
    del variation
    gc.collect()  # force Python to reclaim the memory of the previous model
Awesome, that was it! :trophy:
And there is more: with forced garbage collection, "del variation" is not needed anymore, since a new instance is assigned to the same variable and Python recycles the memory occupied by the previous one (as expected in my first comment in this issue).
Many thanks! :+1:
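For the record, the loop body then reduces to something like this (same setup variables as in the snippets above):

import gc

for i in range(variations):
    variation = xgb.XGBClassifier(random_state=i, n_jobs=n_jobs)
    variation.fit(train_x, train_y, eval_set=eval_set,
                  eval_metric=eval_metric, verbose=False)
    accuracy = 1 - variation.evals_result()['validation_1'][eval0][-1]
    if best is None or accuracy > max(accuracies):
        best = variation
    accuracies.append(accuracy)
    gc.collect()  # explicit collection keeps the RSS flat across iterations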
For anyone who encounters this issue: it has been mitigated and the fix is expected to be part of 1.1.