Update (2018/08/01): I would like to provide an update, as I was new to Keras when I posted the question. Currently only the TensorFlow backend supports proper cleanup of the session, which can be done by calling K.clear_session(). This removes EVERYTHING from memory (models, optimizer objects, and anything that holds tensors internally), so there is no way to remove a specific stale model. This is not a bug in Keras but a limitation of the backends.
I am working on a pipeline that takes a pre-trained model, splits it, caches the intermediate results of the bottom layers, fine-tunes the top and merges bottom & top back. I do 2 passes of the above using different splits & optimizers. This helps me speed up the training by a factor of 3x instead of freezing the bottom layers.
As you can see, the above process initializes many models which are later discarded. Unfortunately, their weights seem to remain in GPU memory, and after a couple of passes I get an out-of-memory exception: "ResourceExhaustedError (see above for traceback): OOM when allocating tensor".
Is there a way to remove stale models from GPU memory? I tried del and calling Python's gc, but neither worked. Closing/clearing the session is not possible, as this is all part of a single pipeline. My backend is TensorFlow.
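One way to see why del plus gc.collect() is not enough: Python does free the wrapper object, but the GPU allocation lives in the backend session, which Python's garbage collector never touches. A minimal stdlib sketch (with a plain object standing in for the model, since this box may not have Keras installed) to verify that the Python side really is collected:

```python
import gc
import weakref

class FakeModel:
    """Stand-in for a Keras model; holds no GPU memory."""
    pass

model = FakeModel()
ref = weakref.ref(model)  # lets us observe collection without keeping model alive

del model
gc.collect()

# The Python object is gone -- so if GPU memory is still held,
# it is the backend session (not Python) that is holding it.
print(ref() is None)  # True
```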
Here is a simplified pseudo-code of the process:
model = load_pretrained_model()
bottom, top = split_model(model)  # bottom and top each get a fresh copy of the weights
del model
gc.collect()
intermediate_results = bottom.predict(data)
top.fit(intermediate_results)
del intermediate_results, data
model = merge_model(top, bottom)  # OOM exception happens here
del top, bottom
gc.collect()
Hi, I have a similar issue even when just retraining the same model (alternating between model.fit and model.fit_generator). Since I keep all the weights and the batch sizes are all equal, there should be no reason for it to consume more memory.
The only hacky/terrible solution that seems to work involves checkpointing the models you want to keep, cleaning up all the memory, and reloading the models from disk. Does anyone know a better way? Perhaps you can drop specific graphs or variables?
m = Model(.....)
m.save(tmp_model_name)
del m
K.clear_session()
m = load_model(tmp_model_name)
The memory is not released immediately after calling K.clear_session(). I guess one needs to run load_model afterwards?
Hit the same problem. K.clear_session() doesn't work.
It would be nice to have some guidance on this issue from folks who have dealt with it more elegantly than the save model/delete model/clear session/load model hack. In my view, this is pretty important for reproducibility in Keras.
Does this work?
import gc
K.clear_session()
gc.collect()
Same problem here. I'm using an EC2 instance with 100 GB RAM and a Tesla M60 GPU. I wrote a simple iterative loop where I'd perform a grid search on my hyperparameters and validate them on a small subset of my training data. However, I can't do this due to the constant OOM errors, and quite frankly, manual sequential tuning is getting on my nerves. Is there any concrete way to clear the GPU memory utilized by Keras in-code? I don't want to keep restarting my kernel every time.
Just FYI, I run watch -d nvidia-smi to keep track of the GPU memory usage.
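If you want the number in-code rather than eyeballing watch, you can parse nvidia-smi's CSV query output. A hedged sketch: the query flags are real nvidia-smi options, but the helper name and the canned sample string are made up, and the fallback argument exists only so the code can run on a machine without a GPU.

```python
import csv
import io
import subprocess

def gpu_memory_used_mib(sample_output=None):
    """Return per-GPU used memory in MiB by querying nvidia-smi.

    Pass sample_output (a string) to parse canned output instead of
    shelling out -- handy on machines without a GPU.
    """
    if sample_output is None:
        sample_output = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.used",
             "--format=csv,noheader,nounits"],
            text=True,
        )
    return [int(row[0]) for row in csv.reader(io.StringIO(sample_output))]

# Canned output from a hypothetical 8 GB card that Keras has filled up:
print(gpu_memory_used_mib("7621\n"))  # [7621]
```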
I load a model into memory for the first time and Keras utilizes all of the GPU's 8 GB of memory. Even after calling K.clear_session(), del K, or gc.collect(), the stale model is not cleared from memory.
Does anyone have a concrete solution/workaround to this?
Check the bottom of #2102 and #9379.
Try clear_session() before del model, the hypothesis being that the model is still needed by clear_session().
The memory is not released immediately after calling K.clear_session(). I guess one needs to run load_model afterwards?
You saved my day!
But btw, how do you know that?
K.clear_session() followed by del model after each training cycle worked for me.
I'm not sure why, but it worked for me when I added all three of these lines:
K.clear_session()
gc.collect()
del model
Is there a way to tell which tensorflow variables are associated with a specific model?
I'd like to only clear the variables associated with a specific model, and then delete it.
@zachmayer
Is there a way to tell which tensorflow variables are associated with a specific model?
I'd like to only clear the variables associated with a specific model, and then delete it.
Just a guess: what about creating a separate session for each model and then using the methods mentioned above to clear that specific session? Presumably, you would then clean only the variables you want.
This strategy is currently working for me on a Lambda machine; I am using only one GPU at a time:
https://stackoverflow.com/a/61252435/12763497
It is a community wiki answer, so please feel free to edit it if you find anything else out. I do still have memory leaks, but they are eliminated by using calls to multiprocessing.Process with a timeout (which does require estimating the maximum duration of each model training/validation run).
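For reference, the multiprocessing.Process-with-timeout pattern mentioned above looks roughly like this. It is a sketch, not the linked answer verbatim: train_one_model is a hypothetical placeholder that in practice would build, fit, and save a Keras model, so that all memory (including GPU allocations) is returned to the OS when the child process exits.

```python
import multiprocessing as mp

def train_one_model(config, result_queue):
    # Placeholder for: build model, fit, evaluate, save to disk.
    # Because this runs in a child process, ALL of its memory
    # (including any GPU allocations) is released when it exits.
    result_queue.put({"config": config, "val_loss": 0.123})  # dummy result

if __name__ == "__main__":
    results = mp.Queue()
    for config in [{"lr": 1e-3}, {"lr": 1e-4}]:
        p = mp.Process(target=train_one_model, args=(config, results))
        p.start()
        p.join(timeout=60)  # estimated max duration of one training run
        if p.is_alive():    # hung past the timeout: kill it and move on
            p.terminate()
            p.join()
        else:
            print(results.get())
```

The timeout is the price of this approach: a run that legitimately needs longer than the estimate gets killed, so the estimate has to be generous.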