I tried to run models/tutorials/image/cifar10/cifar10_train.py and let it run for about a day on my PC (Windows 10, tensorflow-gpu 1.2). After
2017-07-20 13:58:20.441224: step 941580, loss = 0.14 (3076.2 examples/sec; 0.042 sec/batch)
I got this error:
2017-07-20 13:58:20.791379: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\framework\op_kernel.cc:1158] Resource exhausted: OOM when allocating tensor with shape[2304,384]
Traceback (most recent call last):
File "D:\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1139, in _do_call
return fn(*args)
File "D:\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1121, in _run_fn
status, run_metadata)
File "D:\Anaconda3\lib\contextlib.py", line 66, in __exit__
next(self.gen)
File "D:\Anaconda3\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[2304,384]
[[Node: ExponentialMovingAverage/AssignMovingAvg_4/sub_1 = Sub[T=DT_FLOAT, _class=["loc:@local3/weights"], _device="/job:localhost/replica:0/task:0/cpu:0"](local3/weights/ExponentialMovingAverage/read, local3/weights/read)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:/Users/Hoda/Documents/GitHub/models/tutorials/image/cifar10/cifar10_train.py", line 127, in <module>
tf.app.run()
File "D:\Anaconda3\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "C:/Users/Hoda/Documents/GitHub/models/tutorials/image/cifar10/cifar10_train.py", line 123, in main
train()
File "C:/Users/Hoda/Documents/GitHub/models/tutorials/image/cifar10/cifar10_train.py", line 115, in train
mon_sess.run(train_op)
File "D:\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py", line 505, in run
run_metadata=run_metadata)
File "D:\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py", line 842, in run
run_metadata=run_metadata)
File "D:\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py", line 798, in run
return self._sess.run(*args, **kwargs)
File "D:\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py", line 952, in run
run_metadata=run_metadata)
File "D:\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py", line 798, in run
return self._sess.run(*args, **kwargs)
File "D:\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 789, in run
run_metadata_ptr)
File "D:\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 997, in _run
feed_dict_string, options, run_metadata)
File "D:\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1132, in _do_run
target_list, options, run_metadata)
File "D:\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1152, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[2304,384]
[[Node: ExponentialMovingAverage/AssignMovingAvg_4/sub_1 = Sub[T=DT_FLOAT, _class=["loc:@local3/weights"], _device="/job:localhost/replica:0/task:0/cpu:0"](local3/weights/ExponentialMovingAverage/read, local3/weights/read)]]
Caused by op 'ExponentialMovingAverage/AssignMovingAvg_4/sub_1', defined at:
File "C:/Users/Hoda/Documents/GitHub/models/tutorials/image/cifar10/cifar10_train.py", line 127, in <module>
tf.app.run()
File "D:\Anaconda3\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "C:/Users/Hoda/Documents/GitHub/models/tutorials/image/cifar10/cifar10_train.py", line 123, in main
train()
File "C:/Users/Hoda/Documents/GitHub/models/tutorials/image/cifar10/cifar10_train.py", line 79, in train
train_op = cifar10.train(loss, global_step)
File "C:\Users\Hoda\Documents\GitHub\models\tutorials\image\cifar10\cifar10.py", line 373, in train
variables_averages_op = variable_averages.apply(tf.trainable_variables())
File "D:\Anaconda3\lib\site-packages\tensorflow\python\training\moving_averages.py", line 392, in apply
self._averages[var], var, decay, zero_debias=zero_debias))
File "D:\Anaconda3\lib\site-packages\tensorflow\python\training\moving_averages.py", line 72, in assign_moving_average
update_delta = (variable - value) * decay
File "D:\Anaconda3\lib\site-packages\tensorflow\python\ops\variables.py", line 694, in _run_op
return getattr(ops.Tensor, operator)(a._AsTensor(), *args)
File "D:\Anaconda3\lib\site-packages\tensorflow\python\ops\math_ops.py", line 838, in binary_op_wrapper
return func(x, y, name=name)
File "D:\Anaconda3\lib\site-packages\tensorflow\python\ops\gen_math_ops.py", line 2501, in _sub
result = _op_def_lib.apply_op("Sub", x=x, y=y, name=name)
File "D:\Anaconda3\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 767, in apply_op
op_def=op_def)
File "D:\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 2510, in create_op
original_op=self._default_original_op, op_def=op_def)
File "D:\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 1273, in __init__
self._traceback = _extract_stack()
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[2304,384]
[[Node: ExponentialMovingAverage/AssignMovingAvg_4/sub_1 = Sub[T=DT_FLOAT, _class=["loc:@local3/weights"], _device="/job:localhost/replica:0/task:0/cpu:0"](local3/weights/ExponentialMovingAverage/read, local3/weights/read)]]
How can I fix it? And do I have to run it again from scratch, or is the previous result saved?
The most expedient way is probably to reduce the batch size. It'll run slower, but use less memory.
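For the CIFAR-10 tutorial specifically, the batch size is exposed as a flag (defined in cifar10.py with a default of 128), so you don't have to touch the model code. Here is a minimal sketch of how that flag mechanism works in TF 1.x; the flag name matches the tutorial, the rest is simplified:

```python
# Sketch only: mimics how the tutorial wires up batch_size via tf.app.flags.
# In practice you would just relaunch the tutorial script with a smaller value,
# e.g.  python cifar10_train.py --batch_size=64
import tensorflow as tf

FLAGS = tf.app.flags.FLAGS
tf.app.flags.DEFINE_integer('batch_size', 64,  # tutorial default is 128
                            """Number of images to process in a batch.""")

def main(_):
  # Activation memory per step grows roughly linearly with batch_size,
  # which is why halving it often gets past the OOM.
  print('Training would use batch_size =', FLAGS.batch_size)

if __name__ == '__main__':
  tf.app.run()  # parses --batch_size from the command line
```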
Are you saying that there is a memory leak?
Thanks for the answer.
I changed the batch_size from 128 to 64 and now it's running! I am running it on my PC, so it might take a while!
I don't know about a memory leak, but I guess it ran out of memory. I have 16 GB of RAM and a GeForce GTX 970 graphics card.
Does it lose the previous training, or does the network get better each time we run cifar10_train.py?
Thanks, I reduced the batch size and it worked!
I got a precision of:
2017-07-21 21:42:04.630874: precision @ 1 = 0.859
Yay!!
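For context on where that number comes from: it's the output of the tutorial's cifar10_eval.py, which reports top-1 precision, i.e. the fraction of test images whose true label is the highest-scoring logit. A self-contained sketch of that metric on dummy data (the real script feeds CIFAR-10 logits and labels from a checkpoint):

```python
# Sketch of the "precision @ 1" computation, TF 1.x style, on fake data.
import numpy as np
import tensorflow as tf

logits = tf.constant(np.random.randn(8, 10), dtype=tf.float32)        # fake scores
labels = tf.constant(np.random.randint(0, 10, size=8), dtype=tf.int32)
top_k_op = tf.nn.in_top_k(logits, labels, 1)  # True where argmax(logits) == label

with tf.Session() as sess:
    correct = sess.run(top_k_op)
    print('precision @ 1 = %.3f' % (correct.sum() / float(correct.size)))
```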
I'm having the same issue; you can find more details below:
https://github.com/tensorflow/tensorflow/issues/4735#issuecomment-427580320
Please look into it.
> The most expedient way is probably to reduce the batch size. It'll run slower, but use less memory.
I am using the CPU to train the model.
I have already set the batch size to 1 and resized the images to 200 x 200.
It is still throwing a Resource Exhausted error.
Please help.
I am having the same issue even with a reduced batch size.
Same here. The batch size was already 1, I've changed the fixed_shape_resizer to 500x500 (using the Faster R-CNN models), and
session_config = tf.ConfigProto()
session_config.gpu_options.per_process_gpu_memory_fraction = 0.3
config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir, session_config=session_config)
is also set.
But it keeps showing this (with SSD ResNet, same error):
2019-07-08 18:37:10.194834: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at cwise_ops_common.cc:70 : Resource exhausted: OOM when allocating tensor with shape[100,51150] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
until it breaks automatically.
I can only train SSD MobileNet. Very confusing.
I'm using a GTX 1060 6GB and an RTX 2070; both give the same error.
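A side note on the config above: besides capping a fixed fraction, TF 1.x also lets the allocator grow GPU memory on demand, which is sometimes a better fit than a hard 0.3 cap on a 6-8 GB card. A minimal sketch, assuming the same Estimator-based setup (model_dir omitted to keep it self-contained):

```python
# Sketch: let TensorFlow allocate GPU memory incrementally instead of
# reserving a fixed 30% slice up front (TF 1.x API).
import tensorflow as tf

session_config = tf.ConfigProto()
session_config.gpu_options.allow_growth = True  # grab memory only as needed

# Same RunConfig pattern as in the comment above, minus model_dir.
config = tf.estimator.RunConfig(session_config=session_config)
print(config.session_config.gpu_options.allow_growth)
```

Also note that an OOM with a 0.3 cap may simply mean the graph needs more than the roughly 2 GB that cap allows on a 6 GB card, so raising or removing the fraction is worth trying as well.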
Hello, I'm facing the same problem when training with the kangaroo dataset.
Reducing the training batch size from 16 to 4 has not changed anything.
CPU: 8 GB RAM + Intel(R) UHD Graphics 630
GPU: GeForce GTX 1050 3 GB, Windows 10 + Anaconda
From running:
import tensorflow as tf
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
I get the response:
GPU:0 with 2131 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0, compute capability: 6.1)
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 18102239670215265869
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 2235275673
locality {
bus_id: 1
links {
}
}
incarnation: 6041356209009565047
physical_device_desc: "device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0, compute capability: 6.1"
]
Thanks for your help and advice.
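Side note: the memory_limit field in that listing is the number to watch; it is roughly how much GPU memory TensorFlow can actually use (about 2.1 GB here, i.e. less than the card's 3 GB because the display and other processes take their share). A quick sketch to print it per GPU:

```python
# Print how much GPU memory TensorFlow reports as usable (TF 1.x).
from tensorflow.python.client import device_lib

for dev in device_lib.list_local_devices():
    if dev.device_type == 'GPU':
        print('%s: %.0f MB usable' % (dev.name, dev.memory_limit / 1024.0 ** 2))
```

With only about 2 GB usable, most detection models are likely to OOM unless the batch size and input resolution are kept very small.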
The issue is still there; not sure why the ticket is closed. There must be a leak, since if you reboot the machine it goes away.