There is a slow memory leak of about 600 MB-1 GB per hour.
I tried changing some parameters to save memory, as suggested in earlier issues, but they don't help:
train_config: {
  batch_size: 24
  batch_queue_capacity: 10
  num_batch_queue_threads: 8
  ...
}
train_input_reader: {
  ...
  num_readers: 10
}
eval_input_reader: {
  num_readers: 1
}
I've got this problem too when I use estimator-based training.
I haven't hit it while using slim-based training.
I wonder if evaluating while training is what causes this.
I have the same problem. In my case, I added a line plt.close(fig)
in the object detection source code; after that it can at least train for about 200k steps without being killed by OOM. You can refer to this issue for details.
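For context, the fix in that issue targets matplotlib figures that are rendered inside a tf.py_func and never closed, so every summary step leaks one figure. A minimal sketch of the pattern (the function and variable names are illustrative, not the actual object detection code):

import matplotlib
matplotlib.use('Agg')   # headless backend; avoids pulling in Tk
import matplotlib.pyplot as plt
import numpy as np

def render_values_as_image(values):
  """Illustrative only: draw a plot and return it as a uint8 image array."""
  fig = plt.figure(frameon=False)
  plt.plot(np.sort(values), np.arange(values.size) / float(values.size))
  fig.canvas.draw()
  width, height = fig.get_size_inches() * fig.get_dpi()
  image = np.fromstring(fig.canvas.tostring_rgb(), dtype='uint8').reshape(
      1, int(height), int(width), 3)
  plt.close(fig)   # without this line the figure is never freed
  return image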
@wxzs5 thank you, I'll try this.
By the way, the training process never got killed in my case. However, it would occupy all available memory, which makes the PC extremely slow.
Any update on this? I am facing the exact same issue when training a Mask R-CNN model.
Still got this issue.
I found that lowering the frequency of checkpoint saving and evaluation does make the leak much slower, but that's all.
At least training won't stop while I'm at work.
Issue 3603 is just part of the problem. I added plt.close(fig) in two places, which only slowed the leak a bit.
I wonder how Google is not bothered by this (given it has been there for almost half a year!).
I experience the same problem too.
Same here. This is probably a bug. Can't locate the leak source though.
UPDATE:
Memory still leaks with tf version 1.10.
Workaround:
When training is killed because of out-of-memory, I can pick up the training process by setting the fine_tune_checkpoint entry in the config file and recalculating the decayed learning rate by hand. However, it consistently shows worse results than training straight through, if you look at the tensorboard graphs.
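For anyone trying the same workaround, a rough sketch of the relevant pipeline config entries (the path and values are placeholders, not from this thread):

train_config: {
  batch_size: 24
  fine_tune_checkpoint: "/path/to/previous_run/model.ckpt-XXXX"  # placeholder: last checkpoint written before the OOM kill
  optimizer {
    ...
    # set the initial learning rate here to the decayed value the schedule
    # had reached at that step
  }
  ...
}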
I tried the eval_interval_secs flag in the config, in order to slow down the eval process, but it is not implemented.
Following #5144 to implement eval_interval_secs still does NOT work.
Hi, @liuchang8am. I'm the guy who committed #5144.
It worked in tensorflow r1.9, but they changed the way the estimator triggers evaluation in 1.10.
See https://github.com/tensorflow/tensorflow/commit/3edb609926f2521c726737fc1efeae1572dc6581#diff-bc4a1638bbcd88997adf5e723b8609c7 for details.
Simply speaking: evaluation is skipped as long as it is still within the minimum time interval between evaluations (throttle_secs).
Here is my workaround:
https://github.com/tensorflow/models/blob/7b3046768d2d6ea3f306b6ee7b62c02c9ca128a1/research/object_detection/model_main.py#L55
or simply pass throttle_secs to EvalSpec here:
https://github.com/tensorflow/models/blob/7b3046768d2d6ea3f306b6ee7b62c02c9ca128a1/research/object_detection/model_lib.py#L609
For example:
I set RunConfig.save_checkpoints_steps to 5000 and EvalSpec.throttle_secs to 864000 (10 days)
to skip most evaluations other than the first and last one. Then I run evaluation on every saved checkpoint after training ends or while training is stopped.
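A minimal sketch of what those two settings look like in estimator code (simplified; model_fn, train_input_fn and eval_input_fn are assumed to be defined elsewhere, and the paths and step counts are placeholders, this is not the actual model_main.py):

import tensorflow as tf

# Save a checkpoint every 5000 steps.
run_config = tf.estimator.RunConfig(
    model_dir='/path/to/model_dir',          # placeholder
    save_checkpoints_steps=5000)

# train_and_evaluate only evaluates when a new checkpoint appears AND at least
# throttle_secs have passed since the last evaluation, so a huge value skips
# almost every intermediate evaluation.
eval_spec = tf.estimator.EvalSpec(
    input_fn=eval_input_fn,                  # assumed to exist
    throttle_secs=864000)                    # 10 days

train_spec = tf.estimator.TrainSpec(
    input_fn=train_input_fn,                 # assumed to exist
    max_steps=200000)                        # placeholder

estimator = tf.estimator.Estimator(model_fn=model_fn, config=run_config)
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)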
As for the training process getting killed: I just restart training with exactly the same command and zero changes to my config. It automatically detects the last checkpoint and resumes training from it.
@bleqdyce Hi, I'd like to check: does setting RunConfig.save_checkpoints_steps=k mean a model checkpoint is saved every k steps? If not, how can we achieve that? Thanks
Hi, @swg209. It does mean a model checkpoint is saved in model_dir every k steps in Tensorflow r1.10.
At least it works like that on my PC =P; check the official doc for more detail.
@bleqdyce Thanks for your help. When I run model_main.py directly, your way works. But when I run legacy/train.py, it doesn't save checkpoints. Did you hit the same issue?
legacy/train.py uses slim as its tensorflow wrapper.
If you want to use the legacy version, you may want to check out tf slim. I'm not familiar with slim, but according to this, you can pass save_interval_secs to slim.learning.train in legacy/trainer.py:
https://github.com/tensorflow/models/blob/7b3046768d2d6ea3f306b6ee7b62c02c9ca128a1/research/object_detection/legacy/trainer.py#L402
save_interval_secs defines "How often, in seconds, to save the model to logdir", as documented in tensorflow/contrib/slim/python/slim/learning.py#L607.
Hope this helps, but I can't guarantee it =P
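If it helps, here is a rough sketch of where save_interval_secs would plug into that slim.learning.train call (the surrounding variables such as train_tensor, train_dir, session_config, summary_op and saver come from the existing trainer.py code; the values are placeholders):

import tensorflow as tf
slim = tf.contrib.slim

# Only the two save_* arguments are the point here; everything else mirrors
# the existing call in legacy/trainer.py.
slim.learning.train(
    train_tensor,
    logdir=train_dir,
    session_config=session_config,
    summary_op=summary_op,
    save_summaries_secs=120,
    save_interval_secs=3600,   # placeholder: write a checkpoint once per hour
    saver=saver)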
Sorry for the late response! We removed the usage of "add_cdf_image_summary" in a PR; you're encouraged to try again once it gets in.
following
Now that the PR is merged, has anybody checked whether the memory leak still exists?
@zishanahmed08 With tensorflow 1.11.0-rc2 and the latest from models, the memory leak is gone for me.
@YijinLiu @zishanahmed08
I tried again with the latest tensorflow/models repo in docker, using the image tensorflow/tensorflow:1.11.0-rc2-devel-gpu, and it still leaks (~500 MB) every time evaluation finishes.
Please re-open the issue: it's still there with tf 1.11.0 (from pip install).
I am using ssd_resnet50_v1_fpn with the same parameters as for the pets example except:
The process goes from 13 GB of CPU RAM at the start of training to 31 GB after 2h (35100 global_step). The GPU RAM does not change from 5.4 GB (I have the gpu option allow_growth activated).
I throttled eval to every 1 day (although I still get one eval run after the first checkpoint is saved), so basically there is no eval step during those 2h of training.
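(For reference, allow_growth is the standard per-process GPU memory option; a minimal sketch of one way to set it, assuming it is passed to the estimator via session_config. It only affects GPU allocation, not the host-RAM leak described above.)

import tensorflow as tf

# Let the GPU allocator grow on demand instead of reserving all GPU memory
# up front.
session_config = tf.ConfigProto()
session_config.gpu_options.allow_growth = True

run_config = tf.estimator.RunConfig(
    model_dir='/path/to/model_dir',   # placeholder
    session_config=session_config)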
@mhtrinh @bleqdyce Could you try the following patch and see whether it fixes it? This is the only change in my client that might be related to the leak. My training could run for 2 days without any memory issue.
diff --git a/research/object_detection/utils/visualization_utils.py b/research/object_detection/utils/visualization_utils.py
index f40f9c30..2ffa322a 100644
--- a/research/object_detection/utils/visualization_utils.py
+++ b/research/object_detection/utils/visualization_utils.py
@@ -701,6 +701,7 @@ def add_cdf_image_summary(values, name):
     width, height = fig.get_size_inches() * fig.get_dpi()
     image = np.fromstring(fig.canvas.tostring_rgb(), dtype='uint8').reshape(
         1, int(height), int(width), 3)
+    plt.close(fig)
     return image
   cdf_plot = tf.py_func(cdf_plot, [values], tf.uint8)
   tf.summary.image(name, cdf_plot)
@@ -730,6 +731,7 @@ def add_hist_image_summary(values, bins, name):
     image = np.fromstring(
         fig.canvas.tostring_rgb(), dtype='uint8').reshape(
             1, int(height), int(width), 3)
+    plt.close(fig)
     return image
   hist_plot = tf.py_func(hist_plot, [values, bins], tf.uint8)
   tf.summary.image(name, hist_plot)
@YijinLiu: I applied your patch and it makes the training process crash right after the first checkpoint save:
INFO:tensorflow:global_step/sec: 5.43315
INFO:tensorflow:loss = 0.48430496, step = 3040 (1.841 sec)
INFO:tensorflow:global_step/sec: 5.44665
INFO:tensorflow:loss = 0.6008992, step = 3050 (1.836 sec)
INFO:tensorflow:Saving checkpoints for 3060 into models/ssd_resnet50_v1_fpn.wholeCow.20181004.patch/model.ckpt.
Exception ignored in: <bound method Image.__del__ of <tkinter.PhotoImage object at 0x7ff03e1c03c8>>
Traceback (most recent call last):
File "/usr/lib/python3.6/tkinter/__init__.py", line 3504, in __del__
self.tk.call('image', 'delete', self.name)
RuntimeError: main thread is not in main loop
Exception ignored in: <bound method Image.__del__ of <tkinter.PhotoImage object at 0x7fefb3ee4470>>
Traceback (most recent call last):
[...]
RuntimeError: main thread is not in main loop
Exception ignored in: <bound method Image.__del__ of <tkinter.PhotoImage object at 0x7fefb863fac8>>
Traceback (most recent call last):
File "/usr/lib/python3.6/tkinter/__init__.py", line 3504, in __del__
self.tk.call('image', 'delete', self.name)
RuntimeError: main thread is not in main loop
Tcl_AsyncDelete: async handler deleted by the wrong thread
./startTraining.sh: line 50: 5651 Aborted (core dumped) python3 $tensorflowModel/object_detection/model_main.py --model_dir=$outDir --pipeline_config_path=$config
The same error happens if I resume training: it picks up at the last checkpoint, trains for a while until the first checkpoint save, and then crashes.
This issue may be a duplicate of #5296: in my case the memory leak does not happen with Faster R-CNN, only with SSD models.
I use a Faster R-CNN model and my training still gets killed!
Training MobileNetV2-SSD models always gets killed by OOM on my PC, with tf 1.9.0 from pip install.