Update (2018/08/01): I would like to provide an update, as I was new to Keras when I posted the question. Currently only the TensorFlow backend supports proper cleanup of the session, which can be done by calling K.clear_session(). This removes EVERYTHING from memory (models, optimizer objects, and anything that holds tensors internally), so there is no way to remove a specific stale model. This is not a bug in Keras but a limitation of the backends.
I am working on a pipeline that takes a pre-trained model, splits it, caches the intermediate results of the bottom layers, fine-tunes the top and merges bottom & top back. I do 2 passes of the above using different splits & optimizers. This helps me speed up the training by a factor of 3x instead of freezing the bottom layers.
As you can see, the above process initializes many models which are later discarded. Unfortunately, their weights seem to remain in GPU memory, and after a couple of passes I get an out-of-memory exception: "ResourceExhaustedError (see above for traceback): OOM when allocating tensor".
Is there a way to remove stale models from GPU memory? I tried del and calling Python's gc, but neither worked. Closing/clearing the session is not possible, as this is all part of a single pipeline. My backend is TensorFlow.
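One way to see why del plus gc.collect() is not enough: Python does free the wrapper object, but the GPU allocation lives in the backend session, which Python's garbage collector never touches. A minimal stdlib sketch (with a plain object standing in for the model, since this box may not have Keras installed) to verify that the Python side really is collected:

```python
import gc
import weakref

class FakeModel:
    """Stand-in for a Keras model; holds no GPU memory."""
    pass

model = FakeModel()
ref = weakref.ref(model)  # lets us observe collection without keeping model alive

del model
gc.collect()

# The Python object is gone -- so if GPU memory is still held,
# it is the backend session (not Python) that is holding it.
print(ref() is None)  # True
```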
Here is a simplified pseudo-code of the process:
model = load_pretrained_model()
bottom, top = split_model(model)  # bottom and top each get a fresh copy of the weights
del model
gc.collect()
intermediate_results = bottom.predict(data)
top.fit(intermediate_results)
del intermediate_results, data
model = merge_model(top, bottom)  # OOM exception happens here
del top, bottom
gc.collect()
Hi, I have a similar issue even when just retraining the same model (alternating between model.fit and model.fit_generator). Since I keep all the weights and the batch sizes are all equal, there should be no reason for it to consume more memory.
The only hacky/terrible solution that seems to work involves checkpointing the models you want to keep, cleaning up all the memory, and reloading the models from disk. Does anyone know a better way? Perhaps you can drop specific graphs or variables?
m = Model(.....)
m.save(tmp_model_name)
del m
K.clear_session()
m = load_model(tmp_model_name)
The memory is not released immediately after calling K.clear_session(). I guess one needs to run load_model afterwards?
Hit the same problem. K.clear_session() doesn't work.
It would be nice to have some guidance on this issue from folks who have dealt with it more elegantly than the save model/delete model/clear session/load model hack. In my view, this is pretty important for reproducibility in Keras.
Does this work?
import gc
K.clear_session()
gc.collect()
Same problem here. I'm using an EC2 instance with 100 GB RAM and a Tesla M60 GPU. I wrote a simple iterative loop where I'd perform a grid search on my hyperparameters and validate them on a small subset of my training data. However, I can't do this due to the constant OOM errors, and quite frankly, manual sequential tuning is getting on my nerves. Is there any concrete way to clear the GPU memory utilized by Keras in-code? I don't want to keep restarting my kernel every time.
Just FYI, I run watch -d nvidia-smi to keep track of the GPU memory usage.
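If you want the number in-code rather than eyeballing watch, you can parse nvidia-smi's CSV query output. A hedged sketch: the query flags are real nvidia-smi options, but the helper name and the canned sample string are made up, and the fallback argument exists only so the code can run on a machine without a GPU.

```python
import csv
import io
import subprocess

def gpu_memory_used_mib(sample_output=None):
    """Return per-GPU used memory in MiB by querying nvidia-smi.

    Pass sample_output (a string) to parse canned output instead of
    shelling out -- handy on machines without a GPU.
    """
    if sample_output is None:
        sample_output = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.used",
             "--format=csv,noheader,nounits"],
            text=True,
        )
    return [int(row[0]) for row in csv.reader(io.StringIO(sample_output))]

# Canned output from a hypothetical 8 GB card that Keras has filled up:
print(gpu_memory_used_mib("7621\n"))  # [7621]
```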
I load a model into memory for the first time and Keras utilizes all of the GPU's 8 GB of memory. Even after calling K.clear_session(), del K, or gc.collect(), the stale model is not cleared from memory.
Does anyone have a concrete solution/workaround to this?
Check the bottom of #2102 and #9379.
Try clear_session() before del model, the hypothesis being that the model is still needed by clear_session().
The memory is not released immediately after calling K.clear_session(). I guess one needs to run load_model afterwards?
You saved my day!
But btw, how do you know that?
K.clear_session() followed by del model after each training cycle worked for me.
I'm not sure why, but it worked for me when I added all three of these lines:
K.clear_session()
gc.collect()
del model
Is there a way to tell which tensorflow variables are associated with a specific model?
I'd like to only clear the variables associated with a specific model, and then delete it.
@zachmayer
Is there a way to tell which tensorflow variables are associated with a specific model?
I'd like to only clear the variables associated with a specific model, and then delete it.
Just a guess: what about creating a separate session for each model and then using the methods mentioned above to clear that specific session? Presumably, you would then clean only the variables you want.
This strategy is currently working for me on a Lambda machine; I am using only one GPU at a time:
https://stackoverflow.com/a/61252435/12763497
It is a community wiki answer, so please feel free to edit it if you find anything else out. I do still have memory leaks, but they are eliminated by using calls to multiprocessing.Process with a timeout (which does require estimating the maximum duration of each model training/validation run).
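For reference, the multiprocessing.Process-with-timeout pattern mentioned above looks roughly like this. It is a sketch, not the linked answer verbatim: train_one_model is a hypothetical placeholder that in practice would build, fit, and save a Keras model, so that all memory (including GPU allocations) is returned to the OS when the child process exits.

```python
import multiprocessing as mp

def train_one_model(config, result_queue):
    # Placeholder for: build model, fit, evaluate, save to disk.
    # Because this runs in a child process, ALL of its memory
    # (including any GPU allocations) is released when it exits.
    result_queue.put({"config": config, "val_loss": 0.123})  # dummy result

if __name__ == "__main__":
    results = mp.Queue()
    for config in [{"lr": 1e-3}, {"lr": 1e-4}]:
        p = mp.Process(target=train_one_model, args=(config, results))
        p.start()
        p.join(timeout=60)  # estimated max duration of one training run
        if p.is_alive():    # hung past the timeout: kill it and move on
            p.terminate()
            p.join()
        else:
            print(results.get())
```

The timeout is the price of this approach: a run that legitimately needs longer than the estimate gets killed, so the estimate has to be generous.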