I tried to run models/tutorials/image/cifar10/cifar10_train.py and let it run for about a day on my PC (Windows 10, tensorflow-gpu 1.2). After
2017-07-20 13:58:20.441224: step 941580, loss = 0.14 (3076.2 examples/sec; 0.042 sec/batch)
I got this error:
2017-07-20 13:58:20.791379: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\framework\op_kernel.cc:1158] Resource exhausted: OOM when allocating tensor with shape[2304,384]
Traceback (most recent call last):
File "D:\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1139, in _do_call
return fn(*args)
File "D:\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1121, in _run_fn
status, run_metadata)
File "D:\Anaconda3\lib\contextlib.py", line 66, in __exit__
next(self.gen)
File "D:\Anaconda3\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[2304,384]
[[Node: ExponentialMovingAverage/AssignMovingAvg_4/sub_1 = Sub[T=DT_FLOAT, _class=["loc:@local3/weights"], _device="/job:localhost/replica:0/task:0/cpu:0"](local3/weights/ExponentialMovingAverage/read, local3/weights/read)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:/Users/Hoda/Documents/GitHub/models/tutorials/image/cifar10/cifar10_train.py", line 127, in <module>
tf.app.run()
File "D:\Anaconda3\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "C:/Users/Hoda/Documents/GitHub/models/tutorials/image/cifar10/cifar10_train.py", line 123, in main
train()
File "C:/Users/Hoda/Documents/GitHub/models/tutorials/image/cifar10/cifar10_train.py", line 115, in train
mon_sess.run(train_op)
File "D:\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py", line 505, in run
run_metadata=run_metadata)
File "D:\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py", line 842, in run
run_metadata=run_metadata)
File "D:\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py", line 798, in run
return self._sess.run(*args, **kwargs)
File "D:\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py", line 952, in run
run_metadata=run_metadata)
File "D:\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py", line 798, in run
return self._sess.run(*args, **kwargs)
File "D:\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 789, in run
run_metadata_ptr)
File "D:\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 997, in _run
feed_dict_string, options, run_metadata)
File "D:\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1132, in _do_run
target_list, options, run_metadata)
File "D:\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1152, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[2304,384]
[[Node: ExponentialMovingAverage/AssignMovingAvg_4/sub_1 = Sub[T=DT_FLOAT, _class=["loc:@local3/weights"], _device="/job:localhost/replica:0/task:0/cpu:0"](local3/weights/ExponentialMovingAverage/read, local3/weights/read)]]
Caused by op 'ExponentialMovingAverage/AssignMovingAvg_4/sub_1', defined at:
File "C:/Users/Hoda/Documents/GitHub/models/tutorials/image/cifar10/cifar10_train.py", line 127, in <module>
tf.app.run()
File "D:\Anaconda3\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "C:/Users/Hoda/Documents/GitHub/models/tutorials/image/cifar10/cifar10_train.py", line 123, in main
train()
File "C:/Users/Hoda/Documents/GitHub/models/tutorials/image/cifar10/cifar10_train.py", line 79, in train
train_op = cifar10.train(loss, global_step)
File "C:\Users\Hoda\Documents\GitHub\models\tutorials\image\cifar10\cifar10.py", line 373, in train
variables_averages_op = variable_averages.apply(tf.trainable_variables())
File "D:\Anaconda3\lib\site-packages\tensorflow\python\training\moving_averages.py", line 392, in apply
self._averages[var], var, decay, zero_debias=zero_debias))
File "D:\Anaconda3\lib\site-packages\tensorflow\python\training\moving_averages.py", line 72, in assign_moving_average
update_delta = (variable - value) * decay
File "D:\Anaconda3\lib\site-packages\tensorflow\python\ops\variables.py", line 694, in _run_op
return getattr(ops.Tensor, operator)(a._AsTensor(), *args)
File "D:\Anaconda3\lib\site-packages\tensorflow\python\ops\math_ops.py", line 838, in binary_op_wrapper
return func(x, y, name=name)
File "D:\Anaconda3\lib\site-packages\tensorflow\python\ops\gen_math_ops.py", line 2501, in _sub
result = _op_def_lib.apply_op("Sub", x=x, y=y, name=name)
File "D:\Anaconda3\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 767, in apply_op
op_def=op_def)
File "D:\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 2510, in create_op
original_op=self._default_original_op, op_def=op_def)
File "D:\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 1273, in __init__
self._traceback = _extract_stack()
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[2304,384]
[[Node: ExponentialMovingAverage/AssignMovingAvg_4/sub_1 = Sub[T=DT_FLOAT, _class=["loc:@local3/weights"], _device="/job:localhost/replica:0/task:0/cpu:0"](local3/weights/ExponentialMovingAverage/read, local3/weights/read)]]
How can I fix it? And do I have to run it again from scratch, or is the previous result saved?
The most expedient way is probably to reduce the batch size. It'll run slower, but use less memory.
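For the CIFAR-10 tutorial specifically, the batch size is exposed as a flag (defined in cifar10.py with a default of 128), so you don't have to touch the model code. Here is a minimal sketch of how that flag mechanism works in TF 1.x; the flag name matches the tutorial, the rest is simplified:

```python
# Sketch only: mimics how the tutorial wires up batch_size via tf.app.flags.
# In practice you would just relaunch the tutorial script with a smaller value,
# e.g.  python cifar10_train.py --batch_size=64
import tensorflow as tf

FLAGS = tf.app.flags.FLAGS
tf.app.flags.DEFINE_integer('batch_size', 64,  # tutorial default is 128
                            """Number of images to process in a batch.""")

def main(_):
  # Activation memory per step grows roughly linearly with batch_size,
  # which is why halving it often gets past the OOM.
  print('Training would use batch_size =', FLAGS.batch_size)

if __name__ == '__main__':
  tf.app.run()  # parses --batch_size from the command line
```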
Are you saying that there is a memory leak?
Thanks for the answer.
I changed the batch_size from 128 to 64 and now it's running! I am running it on my PC, so it might take a while!
I don't know about a memory leak, but I guess it ran out of memory. I have 16 GB of RAM and a GeForce GTX 970 graphics card.
Does it lose the previous training, or does the network get better each time we run cifar10_train.py?
Thanks, I reduced the batch size and it worked!
I got a precision of:
2017-07-21 21:42:04.630874: precision @ 1 = 0.859
Yay!!
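For context on where that number comes from: it's the output of the tutorial's cifar10_eval.py, which reports top-1 precision, i.e. the fraction of test images whose true label is the highest-scoring logit. A self-contained sketch of that metric on dummy data (the real script feeds CIFAR-10 logits and labels from a checkpoint):

```python
# Sketch of the "precision @ 1" computation, TF 1.x style, on fake data.
import numpy as np
import tensorflow as tf

logits = tf.constant(np.random.randn(8, 10), dtype=tf.float32)        # fake scores
labels = tf.constant(np.random.randint(0, 10, size=8), dtype=tf.int32)
top_k_op = tf.nn.in_top_k(logits, labels, 1)  # True where argmax(logits) == label

with tf.Session() as sess:
    correct = sess.run(top_k_op)
    print('precision @ 1 = %.3f' % (correct.sum() / float(correct.size)))
```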
I'm having the same issue; you can find more details below:
https://github.com/tensorflow/tensorflow/issues/4735#issuecomment-427580320
Please look into it.
> The most expedient way is probably to reduce the batch size. It'll run slower, but use less memory.
I am using the CPU to train the model.
I have already set the batch size to 1 and resized the images to 200 x 200.
It is still throwing a Resource Exhausted error.
Please help.
I am having the same issue even with a reduced batch size.
Same here. The batch size was already 1, I've changed the fixed_shape_resizer to 500x500 (using the Faster R-CNN models), and
session_config = tf.ConfigProto()
session_config.gpu_options.per_process_gpu_memory_fraction = 0.3
config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir, session_config=session_config)
is also set.
But it keeps showing this (with SSD ResNet, same error):
2019-07-08 18:37:10.194834: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at cwise_ops_common.cc:70 : Resource exhausted: OOM when allocating tensor with shape[100,51150] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
until it breaks automatically.
I can only train SSD MobileNet. Very confusing.
I'm using a GTX 1060 6GB and an RTX 2070; both give the same error.
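A side note on the config above: besides capping a fixed fraction, TF 1.x also lets the allocator grow GPU memory on demand, which is sometimes a better fit than a hard 0.3 cap on a 6-8 GB card. A minimal sketch, assuming the same Estimator-based setup (model_dir omitted to keep it self-contained):

```python
# Sketch: let TensorFlow allocate GPU memory incrementally instead of
# reserving a fixed 30% slice up front (TF 1.x API).
import tensorflow as tf

session_config = tf.ConfigProto()
session_config.gpu_options.allow_growth = True  # grab memory only as needed

# Same RunConfig pattern as in the comment above, minus model_dir.
config = tf.estimator.RunConfig(session_config=session_config)
print(config.session_config.gpu_options.allow_growth)
```

Also note that an OOM with a 0.3 cap may simply mean the graph needs more than the roughly 2 GB that cap allows on a 6 GB card, so raising or removing the fraction is worth trying as well.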
Hello, I'm facing the same problem when training with the kangaroo dataset.
Reducing the training batch size from 16 to 4 has not changed anything.
CPU: 8 GB RAM + Intel(R) UHD Graphics 630
GPU: GeForce GTX 1050 3 GB, Windows 10 + Anaconda
From running:
import tensorflow as tf
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
I get the response:
GPU:0 with 2131 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0, compute capability: 6.1)
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 18102239670215265869
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 2235275673
locality {
bus_id: 1
links {
}
}
incarnation: 6041356209009565047
physical_device_desc: "device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0, compute capability: 6.1"
]
Thanks for your help and advice.
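Side note: the memory_limit field in that listing is the number to watch; it is roughly how much GPU memory TensorFlow can actually use (about 2.1 GB here, i.e. less than the card's 3 GB because the display and other processes take their share). A quick sketch to print it per GPU:

```python
# Print how much GPU memory TensorFlow reports as usable (TF 1.x).
from tensorflow.python.client import device_lib

for dev in device_lib.list_local_devices():
    if dev.device_type == 'GPU':
        print('%s: %.0f MB usable' % (dev.name, dev.memory_limit / 1024.0 ** 2))
```

With only about 2 GB usable, most detection models are likely to OOM unless the batch size and input resolution are kept very small.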
The issue is still there; not sure why the ticket is closed. There must be a leak, since if you reboot the machine it goes away.