There is a slow memory leak of about 600 MB-1 GB per hour.
I tried changing some parameters to save memory, as suggested in earlier issues, but they don't help:
train_config: {
  batch_size: 24
  batch_queue_capacity: 10
  num_batch_queue_threads: 8
  ...
}
train_input_reader: {
  ...
  num_readers: 10
}
eval_input_reader: {
  num_readers: 1
}
I've got this problem too when I use estimator-based training.
I haven't hit it while using slim-based training.
I wonder if evaluating while training is what causes this.
I have the same problem. In my case, I added a line plt.close(fig)
in the object detection source code; after that it can at least train for about 200k steps without being killed by OOM. You can refer to this issue for details.
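For context, the fix in that issue targets matplotlib figures that are rendered inside a tf.py_func and never closed, so every summary step leaks one figure. A minimal sketch of the pattern (the function and variable names are illustrative, not the actual object detection code):

import matplotlib
matplotlib.use('Agg')   # headless backend; avoids pulling in Tk
import matplotlib.pyplot as plt
import numpy as np

def render_values_as_image(values):
  """Illustrative only: draw a plot and return it as a uint8 image array."""
  fig = plt.figure(frameon=False)
  plt.plot(np.sort(values), np.arange(values.size) / float(values.size))
  fig.canvas.draw()
  width, height = fig.get_size_inches() * fig.get_dpi()
  image = np.fromstring(fig.canvas.tostring_rgb(), dtype='uint8').reshape(
      1, int(height), int(width), 3)
  plt.close(fig)   # without this line the figure is never freed
  return image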
@wxzs5 thank you, I'll try this.
By the way, the training process never got killed in my case. However, it would occupy all available memory, which makes the PC extremely slow.
Any update on this? I am facing the exact same issue when training a Mask R-CNN model.
Still got this issue.
I found that lowering the frequency of checkpoint saving and evaluation does make the leak much slower, but that's all.
At least training won't stop while I'm at work.
Issue 3603 is just part of the problem. I added plt.close(fig) in two places, which only slowed the leak a bit.
I wonder how Google is not bothered by this (given it has been there for almost half a year!).
I experience the same problem too.
Same here. This is probably a bug. Can't locate the leak source though.
UPDATE:
Memory still leaks with tf version 1.10.
Workaround:
When training is killed because of out-of-memory, I can pick up the training process by setting the fine_tune_checkpoint entry in the config file and recalculating the decayed learning rate by hand. However, it consistently shows worse results than training straight through, if you look at the tensorboard graphs.
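For anyone trying the same workaround, a rough sketch of the relevant pipeline config entries (the path and values are placeholders, not from this thread):

train_config: {
  batch_size: 24
  fine_tune_checkpoint: "/path/to/previous_run/model.ckpt-XXXX"  # placeholder: last checkpoint written before the OOM kill
  optimizer {
    ...
    # set the initial learning rate here to the decayed value the schedule
    # had reached at that step
  }
  ...
}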
I tried the eval_interval_secs flag in the config, in order to slow down the eval process, but it is not implemented.
Following #5144 to implement eval_interval_secs still does NOT work.
Hi, @liuchang8am. I'm the guy who committed #5144.
It worked in tensorflow r1.9, but they changed the way the estimator triggers evaluation in 1.10.
See https://github.com/tensorflow/tensorflow/commit/3edb609926f2521c726737fc1efeae1572dc6581#diff-bc4a1638bbcd88997adf5e723b8609c7 for details.
Simply speaking: evaluation is skipped as long as it is still within the minimum time interval between evaluations (throttle_secs).
Here is my workaround:
https://github.com/tensorflow/models/blob/7b3046768d2d6ea3f306b6ee7b62c02c9ca128a1/research/object_detection/model_main.py#L55
or simply pass throttle_secs to EvalSpec here:
https://github.com/tensorflow/models/blob/7b3046768d2d6ea3f306b6ee7b62c02c9ca128a1/research/object_detection/model_lib.py#L609
For example:
I set RunConfig.save_checkpoints_steps to 5000 and EvalSpec.throttle_secs to 864000 (10 days)
to skip most evaluations other than the first and last one. Then I run evaluation on every saved checkpoint after training ends or while training is stopped.
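A minimal sketch of what those two settings look like in estimator code (simplified; model_fn, train_input_fn and eval_input_fn are assumed to be defined elsewhere, and the paths and step counts are placeholders, this is not the actual model_main.py):

import tensorflow as tf

# Save a checkpoint every 5000 steps.
run_config = tf.estimator.RunConfig(
    model_dir='/path/to/model_dir',          # placeholder
    save_checkpoints_steps=5000)

# train_and_evaluate only evaluates when a new checkpoint appears AND at least
# throttle_secs have passed since the last evaluation, so a huge value skips
# almost every intermediate evaluation.
eval_spec = tf.estimator.EvalSpec(
    input_fn=eval_input_fn,                  # assumed to exist
    throttle_secs=864000)                    # 10 days

train_spec = tf.estimator.TrainSpec(
    input_fn=train_input_fn,                 # assumed to exist
    max_steps=200000)                        # placeholder

estimator = tf.estimator.Estimator(model_fn=model_fn, config=run_config)
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)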
As for the training process getting killed: I just restart training with exactly the same command and zero changes to my config. It automatically detects the last checkpoint and resumes training from it.
@bleqdyce Hi, I'd like to check: does setting RunConfig.save_checkpoints_steps=k mean a model checkpoint is saved every k steps? If not, how can we achieve that? Thanks
Hi, @swg209. It does mean a model checkpoint is saved in model_dir every k steps in Tensorflow r1.10.
At least it works like that on my PC =P; check the official doc for more detail.
@bleqdyce Thanks for your help. When I run model_main.py directly, your way works. But when I run legacy/train.py, it doesn't save checkpoints. Did you hit the same issue?
legacy/train.py uses slim as its tensorflow wrapper.
If you want to use the legacy version, you may want to check out tf slim. I'm not familiar with slim, but according to this, you can pass save_interval_secs to slim.learning.train in legacy/trainer.py:
https://github.com/tensorflow/models/blob/7b3046768d2d6ea3f306b6ee7b62c02c9ca128a1/research/object_detection/legacy/trainer.py#L402
save_interval_secs defines "How often, in seconds, to save the model to logdir", as documented in tensorflow/contrib/slim/python/slim/learning.py#L607.
Hope this helps, but I can't guarantee it =P
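If it helps, here is a rough sketch of where save_interval_secs would plug into that slim.learning.train call (the surrounding variables such as train_tensor, train_dir, session_config, summary_op and saver come from the existing trainer.py code; the values are placeholders):

import tensorflow as tf
slim = tf.contrib.slim

# Only the two save_* arguments are the point here; everything else mirrors
# the existing call in legacy/trainer.py.
slim.learning.train(
    train_tensor,
    logdir=train_dir,
    session_config=session_config,
    summary_op=summary_op,
    save_summaries_secs=120,
    save_interval_secs=3600,   # placeholder: write a checkpoint once per hour
    saver=saver)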
Sorry for the late response! We removed the usage of "add_cdf_image_summary" in a PR; you're encouraged to try again once it gets in.
following
Now that the PR is merged, has anybody checked whether the memory leak still exists?
@zishanahmed08 With tensorflow 1.11.0-rc2 and the latest from models, the memory leak is gone for me.
@YijinLiu @zishanahmed08
I tried again with the latest tensorflow/models repo in docker, using the image tensorflow/tensorflow:1.11.0-rc2-devel-gpu, and it still leaks (~500 MB) every time evaluation finishes.
Please re-open the issue: it's still there with tf 1.11.0 (from pip install).
I am using ssd_resnet50_v1_fpn with the same parameters as for the pets example except:
The process goes from 13 GB of CPU RAM at the start of training to 31 GB after 2h (35100 global_step). The GPU RAM does not change from 5.4 GB (I have the gpu option allow_growth activated).
I throttled eval to every 1 day (although I still get one eval run after the first checkpoint is saved), so basically there is no eval step during those 2h of training.
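(For reference, allow_growth is the standard per-process GPU memory option; a minimal sketch of one way to set it, assuming it is passed to the estimator via session_config. It only affects GPU allocation, not the host-RAM leak described above.)

import tensorflow as tf

# Let the GPU allocator grow on demand instead of reserving all GPU memory
# up front.
session_config = tf.ConfigProto()
session_config.gpu_options.allow_growth = True

run_config = tf.estimator.RunConfig(
    model_dir='/path/to/model_dir',   # placeholder
    session_config=session_config)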
@mhtrinh @bleqdyce Could you try the following patch and see whether it fixes it? This is the only change in my client that might be related to the leak. My training could run for 2 days without any memory issue.
diff --git a/research/object_detection/utils/visualization_utils.py b/research/object_detection/utils/visualization_utils.py
index f40f9c30..2ffa322a 100644
--- a/research/object_detection/utils/visualization_utils.py
+++ b/research/object_detection/utils/visualization_utils.py
@@ -701,6 +701,7 @@ def add_cdf_image_summary(values, name):
     width, height = fig.get_size_inches() * fig.get_dpi()
     image = np.fromstring(fig.canvas.tostring_rgb(), dtype='uint8').reshape(
         1, int(height), int(width), 3)
+    plt.close(fig)
     return image
   cdf_plot = tf.py_func(cdf_plot, [values], tf.uint8)
   tf.summary.image(name, cdf_plot)
@@ -730,6 +731,7 @@ def add_hist_image_summary(values, bins, name):
     image = np.fromstring(
         fig.canvas.tostring_rgb(), dtype='uint8').reshape(
             1, int(height), int(width), 3)
+    plt.close(fig)
     return image
   hist_plot = tf.py_func(hist_plot, [values, bins], tf.uint8)
   tf.summary.image(name, hist_plot)
@YijinLiu: I applied your patch and it makes the training process crash right after the first checkpoint save:
INFO:tensorflow:global_step/sec: 5.43315
INFO:tensorflow:loss = 0.48430496, step = 3040 (1.841 sec)
INFO:tensorflow:global_step/sec: 5.44665
INFO:tensorflow:loss = 0.6008992, step = 3050 (1.836 sec)
INFO:tensorflow:Saving checkpoints for 3060 into models/ssd_resnet50_v1_fpn.wholeCow.20181004.patch/model.ckpt.
Exception ignored in: <bound method Image.__del__ of <tkinter.PhotoImage object at 0x7ff03e1c03c8>>
Traceback (most recent call last):
File "/usr/lib/python3.6/tkinter/__init__.py", line 3504, in __del__
self.tk.call('image', 'delete', self.name)
RuntimeError: main thread is not in main loop
Exception ignored in: <bound method Image.__del__ of <tkinter.PhotoImage object at 0x7fefb3ee4470>>
Traceback (most recent call last):
[...]
RuntimeError: main thread is not in main loop
Exception ignored in: <bound method Image.__del__ of <tkinter.PhotoImage object at 0x7fefb863fac8>>
Traceback (most recent call last):
File "/usr/lib/python3.6/tkinter/__init__.py", line 3504, in __del__
self.tk.call('image', 'delete', self.name)
RuntimeError: main thread is not in main loop
Tcl_AsyncDelete: async handler deleted by the wrong thread
./startTraining.sh: line 50: 5651 Aborted (core dumped) python3 $tensorflowModel/object_detection/model_main.py --model_dir=$outDir --pipeline_config_path=$config
The same error happens if I resume training: it picks up at the last checkpoint, trains for a while until the first checkpoint save, and then crashes.
This issue may be a duplicate of #5296: in my case the memory leak does not happen with Faster R-CNN, only with SSD models.
I use a Faster R-CNN model and my training still gets killed!
Training MobileNetV2-SSD models always gets killed by OOM on my PC, with tf 1.9.0 from pip install.