Hi,
I created an XGBoost model with data subsampling (subsample < 1) and implemented a loop that trains it with different random_state values and keeps the best-performing initialization.
This is the simplified Python code:
import xgboost as xgb

# setup train_x, train_y, eval_set, eval_metric #
accuracies = []
best = None
for i in range(100):
    variation = xgb.XGBClassifier(random_state=i, n_jobs=-1)
    variation.fit(train_x, train_y, eval_set=eval_set, eval_metric=eval_metric)
    accuracy = 1 - variation.evals_result()['validation_1']['error'][-1]
    if best is None or accuracy > max(accuracies):
        best = variation
    accuracies.append(accuracy)
It works as expected, but if I monitor the RAM during the process, the usage keeps growing after each iteration, even though I assign a new XGBClassifier to the same variation variable. Python should free the unreferenced memory each time and keep the memory usage roughly constant, right?
This is the screenshot of System Monitor Resources:
Computation starts at the first CPU spike at 100% on the left; a new iteration starts at each drop in CPU usage, while RAM usage steps up accordingly but never goes down.
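For reference, the growth can also be logged from inside the loop instead of reading it off System Monitor; here is a minimal sketch (it assumes the third-party psutil package, and log_rss is just a hypothetical helper name):

import os
import psutil

process = psutil.Process(os.getpid())

def log_rss(iteration):
    # resident set size of this Python process, in MiB
    rss_mb = process.memory_info().rss / 1024 ** 2
    print('iteration {}: RSS = {:.1f} MiB'.format(iteration, rss_mb))

Calling log_rss(i) at the end of each iteration shows the same step-wise increase as in the screenshot.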
Is this a bug in resource deallocation inside XGBClassifier, or am I missing something?
Many thanks in advance,
Cheers
Can you try inserting del best under the condition if best is None or accuracy > max(accuracies), so as to explicitly remove the previous object?
Thanks for the suggestion! However, I cannot delete best, because it is needed in the code following the snippet... maybe you are referring to the temporary variation?
I modified the code like this:
# setup train_x, train_y, eval_set, eval_metric #
accuracies = []
best = None
for i in range(100):
    variation = xgb.XGBClassifier(random_state=i, n_jobs=-1)
    variation.fit(train_x, train_y, eval_set=eval_set, eval_metric=eval_metric)
    accuracy = 1 - variation.evals_result()['validation_1']['error'][-1]
    if best is None or accuracy > max(accuracies):
        best = variation
    accuracies.append(accuracy)
    del variation
But unfortunately RAM usage continues to grow... :(
Are you able to reproduce this?
Can you post a toy dataset so that I can reproduce it too?
Sure, I prepared a small HDF5 dataset together with a toy.py script to replicate this behaviour. It is not as evident as with the large dataset, but the RAM keeps growing in this case, too.
Let me know if you need more information.
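For anyone without the attachment, a toy HDF5 dataset of this kind can be written and read back with h5py along these lines (illustrative sketch only; the file and dataset names here are hypothetical and may differ from the attached toy.py):

import h5py
import numpy as np

# write a small binary-classification toy dataset
with h5py.File('toy.h5', 'w') as f:
    f.create_dataset('train_x', data=np.random.randn(10000, 50))
    f.create_dataset('train_y', data=np.random.randint(0, 2, size=10000))

# read it back for the training loop
with h5py.File('toy.h5', 'r') as f:
    train_x = f['train_x'][:]
    train_y = f['train_y'][:]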
Hi, any news about this potential issue?
Regards
Not yet. Marked as blocking. Thanks for your patience.
I ran some more tests and can confirm that the issue is still present even when the exact greedy algorithm is used instead of fast histograms.
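For reference, I toggle between the two via the tree_method parameter of the sklearn wrapper; a minimal sketch, reusing the same train_x / eval_set as in the earlier snippets:

# run the same loop body once with exact greedy and once with histograms
for tree_method in ('exact', 'hist'):
    model = xgb.XGBClassifier(tree_method=tree_method, random_state=0, n_jobs=-1)
    model.fit(train_x, train_y, eval_set=eval_set, eval_metric=eval_metric)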
Sorry for the long wait. I'm refactoring the IO logic right now. Will start squashing bugs once it's sorted out.
@GuidoBartoli How many classes do you have? We use one stack of trees for each class. If you have lots of classes, then the memory consumed by trees can be significant.
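A quick way to gauge that is to count the trees in one of your fitted models; a rough sketch (get_booster() and get_dump() are the standard Booster APIs, variation being any model fitted in your loop):

# one tree per class per boosting round, so the dump length grows
# roughly as n_estimators * n_classes for multi-class objectives
booster = variation.get_booster()
print('trees in model:', len(booster.get_dump()))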
I ran the following and did not observe the leak on master branch f24be2efb479872da2613b40fbf04a77f961d250:
import xgboost as xgb
import numpy as np

kRows = 200000
kCols = 1000
kClasses = 8

train_x = np.random.randn(kRows, kCols)
train_y = np.random.randint(0, kClasses, size=kRows)
print(train_y.shape)

eval_x = np.random.randn(kRows, kCols)
eval_y = np.random.randint(0, kClasses, size=kRows)
print(eval_y.shape)

eval_set = [(train_x, train_y), (eval_x, eval_y)]
eval_metric = ['merror', 'mlogloss']

accuracies = []
best = None
for i in range(100):
    # tested with slow default tree_method `approx' and faster `hist'
    # and `gpu_hist'
    variation = xgb.XGBClassifier(random_state=i, n_jobs=12)
    variation.fit(train_x, train_y, eval_set=eval_set, eval_metric=eval_metric)
    print(variation.evals_result())
    accuracy = 1 - variation.evals_result()['validation_1']['merror'][-1]
    if best is None or accuracy > max(accuracies):
        best = variation
    accuracies.append(accuracy)
    del variation
@GuidoBartoli How many classes do you have? We use one stack of trees for each class. If you have lots of classes, then the memory consumed by trees can be significant.
I tried with binary classification; you can take a look at the code and dataset in the provided toy example.
Hi trivialfis,
I used the following code, adapted from your snippet to look more like our original use case. I tried various combinations of dataset size, jobs, and histograms, but in every case the memory keeps growing, even though there is a usage drop at regular intervals, as you can see in the attached screenshot.
At the moment I cannot recompile the latest master branch on my machine to reproduce this behavior; can you try this exact code on that version?
import xgboost as xgb
import numpy as np

samples = 50000
features = 100
classes = 2
variations = 100
histograms = False
n_jobs = 4

print('Generating dataset ({}x{}x{})...'.format(samples, features, classes))
train_x = np.random.randn(samples, features)
train_y = np.random.randint(0, classes, size=samples)
eval_x = np.random.randn(samples, features)
eval_y = np.random.randint(0, classes, size=samples)
eval_set = [(train_x, train_y), (eval_x, eval_y)]

if classes == 2:
    objective = 'binary:logistic'
    eval0 = 'error'
    eval1 = 'logloss'
else:
    objective = 'multi:softmax'
    eval0 = 'merror'
    eval1 = 'mlogloss'
eval_metric = [eval0, eval1]
kwargs = {'tree_method': 'hist', 'grow_policy': 'depthwise'} if histograms else {}

accuracies = []
best = None
for i in range(variations):
    print('Training variation #{}/{}...'.format(i, variations))
    variation = xgb.XGBClassifier(objective=objective, random_state=i,
                                  n_jobs=n_jobs, **kwargs)
    variation.fit(train_x, train_y, eval_set=eval_set,
                  eval_metric=eval_metric, verbose=False)
    accuracy = 1 - variation.evals_result()['validation_1'][eval0][-1]
    if best is None or accuracy > max(accuracies):
        best = variation
    accuracies.append(accuracy)
    print('> Accuracy = {}'.format(accuracy))
    del variation
@GuidoBartoli
At the moment I cannot recompile the latest master branch on my machine to reproduce this behavior; can you try this exact code on that version?
You can install the nightly build by running
pip3 install https://s3-us-west-2.amazonaws.com/xgboost-nightly-builds/xgboost-1.0.0_SNAPSHOT%2B96cd7ec2bbdec1addf81b1ca2adb13c9155e32f3-py2.py3-none-manylinux1_x86_64.whl
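After installing, you can check that the snapshot wheel is the one actually being imported:

import xgboost
print(xgboost.__version__)  # should report the 1.0.0 snapshot version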
Ok, thanks, I installed the 1.0.0_SNAPSHOT version in my virtual environment using pip.
I ran the previous snippet on that version, but unfortunately I see a similar behavior (purple line in the graph).
I took another screenshot of the memory usage after 33 variations.
Will take a look again. Thanks for the follow up.
@GuidoBartoli I looked again; switching to my laptop helped. The magic is here: notice the gc.collect() call, which forces Python to run garbage collection.
import xgboost as xgb
import numpy as np
import gc

samples = 50000
features = 100
classes = 2
variations = 100
histograms = False
n_jobs = 4

print('Generating dataset ({}x{}x{})...'.format(samples, features, classes))
train_x = np.random.randn(samples, features)
train_y = np.random.randint(0, classes, size=samples)
eval_x = np.random.randn(samples, features)
eval_y = np.random.randint(0, classes, size=samples)
eval_set = [(train_x, train_y), (eval_x, eval_y)]

if classes == 2:
    objective = 'binary:logistic'
    eval0 = 'error'
    eval1 = 'logloss'
else:
    objective = 'multi:softmax'
    eval0 = 'merror'
    eval1 = 'mlogloss'
eval_metric = [eval0, eval1]
kwargs = {'tree_method': 'hist', 'grow_policy': 'depthwise'} if histograms else {}
print('classes:', classes)

accuracies = []
best = None
for i in range(variations):
    print('Training variation #{}/{}...'.format(i, variations))
    variation = xgb.XGBClassifier(objective=objective, random_state=i,
                                  n_jobs=n_jobs, **kwargs)
    variation.fit(train_x, train_y, eval_set=eval_set,
                  eval_metric=eval_metric, verbose=False)
    accuracy = 1 - variation.evals_result()['validation_1'][eval0][-1]
    if best is None or accuracy > max(accuracies):
        best = variation
    accuracies.append(accuracy)
    print('> Accuracy = {}'.format(accuracy))
    del variation
    gc.collect()  # force Python to reclaim the memory of the previous model
Awesome, that was it! :trophy:
And there is more: with forced garbage collection, "del variation" is not needed anymore, since a new instance is assigned to the same variable and Python recycles the memory occupied by the previous one (as expected in my first comment in this issue).
Many thanks! :+1:
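For the record, the loop body then reduces to something like this (same setup variables as in the snippets above):

import gc

for i in range(variations):
    variation = xgb.XGBClassifier(random_state=i, n_jobs=n_jobs)
    variation.fit(train_x, train_y, eval_set=eval_set,
                  eval_metric=eval_metric, verbose=False)
    accuracy = 1 - variation.evals_result()['validation_1'][eval0][-1]
    if best is None or accuracy > max(accuracies):
        best = variation
    accuracies.append(accuracy)
    gc.collect()  # explicit collection keeps the RSS flat across iterations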
For anyone who encounters this issue: it has been mitigated and the fix is expected to be part of 1.1.