Models: Resource exhausted: OOM when allocating tensor with shape[32,960,10,10]

Created on 11 May 2020 · 4 comments · Source: tensorflow/models

I got this error during object detection training with the ssd_mobilenet_v2_quantized_300x300_coco model.
I am running the command below to start the training:
python ../../models/research/object_detection/model_main.py --pipeline_config_path=./ssd_mobilenet_v2_quantized_300x300_coco.config --model_dir=./training/ --num_train_steps=2000000 --sample_1_of_n_eval_examples=1 --alsologtostderr
Training was going fine until step 47900; after that I got this error:

File "/home/saini/.virtualenvs/cv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[32,960,10,10] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node gradients/AddN_162-1-TransposeNHWCToNCHW-LayoutOptimizer}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[Loss/Cast_232/_16919]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[32,960,10,10] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node gradients/AddN_162-1-TransposeNHWCToNCHW-LayoutOptimizer}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.
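
The traceback's hint about report_tensor_allocations_upon_oom can be acted on from the Estimator-based training loop with a small SessionRunHook. The sketch below is only an illustration, not part of the Object Detection API; the hook class name and the idea of attaching it to the TrainSpec built in model_lib.py are assumptions on my part.

import tensorflow as tf

class ReportOOMAllocationsHook(tf.train.SessionRunHook):
    """Hypothetical hook: ask every session.run() to dump the list of
    allocated tensors if an OOM occurs, as suggested by the error hint."""

    def before_run(self, run_context):
        run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)
        # No extra fetches are needed; only the RunOptions matter here.
        return tf.train.SessionRunArgs(fetches=[], options=run_options)

# Assumed usage (sketch): pass the hook to the TrainSpec in model_lib.py:
# train_spec = tf.estimator.TrainSpec(
#     input_fn=train_input_fn,
#     max_steps=train_steps,
#     hooks=[ReportOOMAllocationsHook()])

With this in place, the next OOM should print which tensors were holding GPU memory, which helps distinguish a genuine capacity problem from a leak or fragmentation issue.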

OS: Ubuntu 18.04
TensorFlow: 1.14.0 (GPU)
CUDA: 10.0
cuDNN: 7.6
Batch Size: 32

Following are the changes I have made to the default TF Object Detection API:

model_lib.py
tf.estimator.EvalSpec(
    name=eval_spec_name,
    input_fn=eval_input_fn,
    steps=None,
    throttle_secs=172800,
    exporters=exporter))
eval.proto
optional uint32 eval_interval_secs = 3 [default = 172800];  // original default = 600
model_main.py
config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir, save_checkpoints_steps=5000)
Labels: research, bug

Most helpful comment

I changed the batch size to 16, and then everything worked fine.

All 4 comments

Note: The error occurred after 47900 steps. My question is why the error appears only after 47900 steps and not in the initial steps.

OOM means out of memory.
Maybe at the initial steps there is still free memory, but as the number of steps grows, more memory is used. I think that's why it happens.

@VismayTandel Yes, you are right: OOM means out of memory.
But the image size and batch size stay the same throughout training, so how is it possible that later steps need more memory?
Furthermore, I am not running any other program that uses GPU memory. That's why it seems quite strange to me that the GPU runs out of memory in the middle of training.
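
One mitigation that is sometimes tried in this situation (a sketch only, not a confirmed fix for this issue) is to stop TensorFlow from pre-allocating nearly all GPU memory at startup, so that actual usage can be watched with nvidia-smi as training progresses. The session_config below would sit next to the RunConfig change already listed above; the exact values shown are assumptions.

import tensorflow as tf

# Sketch: let the GPU allocator grow on demand instead of grabbing
# (almost) all device memory up front.
gpu_options = tf.GPUOptions(allow_growth=True)
session_config = tf.ConfigProto(gpu_options=gpu_options)

config = tf.estimator.RunConfig(
    model_dir='./training/',       # same model_dir as in the training command
    save_checkpoints_steps=5000,   # matches the change listed in the issue
    session_config=session_config)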

I changed the batch size to 16, and then everything worked fine.
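
For anyone else hitting this: with the TF1 Object Detection API the training batch size lives in the train_config block of the pipeline file (here ssd_mobilenet_v2_quantized_300x300_coco.config). A minimal sketch of the relevant change, assuming the rest of the config stays untouched:

train_config {
  batch_size: 16  # reduced from 32 to fit in GPU memory
  # ... remaining train_config fields unchanged ...
}

Halving the batch size roughly halves the activation memory needed per step, which usually gives enough headroom when the OOM only appears intermittently.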
