Hi everyone. I am facing an issue when trying to train my network in DeepLabCut:
ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[1,2048,62,110] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node gradients/pose/locref_pred/block4/conv2d_transpose_grad/Conv2D}}]]
Your Operating system and DeepLabCut version
I work on Windows 10 with Anaconda3 2019.10, a GeForce RTX 2060, and Nvidia driver version 442.19, and I used the easy install option for GPU (so I am using DeepLabCut 2.1.5). TensorFlow version: 1.14.0
Describe the problem
I am facing an issue when trying to train my network in DeepLabCut. I always get the same error (printed below). I already tried reducing the batch size to 1, but it didn't help. nvidia-smi shows that a few other processes are using the GPU, and utilization is at 92% (see screenshot below). I have also set the following:
import tensorflow as tf  # TensorFlow 1.x API
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)
Error:
In [10]: deeplabcut.train_network(config_path, gputouse=0)
Config:
{'all_joints': [[0],
[1],
[2],
[3],
[4],
[5],
[6],
[7],
[8],
[9],
[10],
[11],
[12],
[13]],
'all_joints_names': ['l-tail-tip',
'l-back-toe_tip',
'l-front-toe-tip',
'l-nose',
'd-tail-tip',
'd-back-right',
'd-back-left',
'd-front-right',
'd-front-left',
'd-nose',
'r-tail-tip',
'r-back-toe_tip',
'r-front-toe-tip',
'r-nose'],
'batch_size': 1,
'bottomheight': 400,
'crop': True,
'crop_pad': 0,
'cropratio': 0.4,
'dataset': 'training-datasets\\iteration-0\\UnaugmentedDataSet_TESTMar3\\TEST_REBECCAWEBER95shuffle1.mat',
'dataset_type': 'default',
'deterministic': False,
'display_iters': 1000,
'fg_fraction': 0.25,
'global_scale': 0.8,
'init_weights': 'c:\\users\\deeplabcut2\\anaconda3\\envs\\dlc208_1\\lib\\site-packages\\deeplabcut\\pose_estimation_tensorflow\\models\\pretrained\\resnet_v1_50.ckpt',
'intermediate_supervision': False,
'intermediate_supervision_layer': 12,
'leftwidth': 400,
'location_refinement': True,
'locref_huber_loss': True,
'locref_loss_weight': 0.05,
'locref_stdev': 7.2801,
'log_dir': 'log',
'max_input_size': 1500,
'mean_pixel': [123.68, 116.779, 103.939],
'metadataset': 'training-datasets\\iteration-0\\UnaugmentedDataSet_TESTMar3\\Documentation_data-TEST_95shuffle1.pickle',
'min_input_size': 64,
'minsize': 100,
'mirror': False,
'multi_step': [[0.005, 400], [0.02, 20000], [0.002, 30000], [0.001, 40000]],
'net_type': 'resnet_50',
'num_joints': 14,
'optimizer': 'sgd',
'pos_dist_thresh': 17,
'project_path': 'C:\\Users\\DeepLabCut2\\Desktop\\020320\\Analysis\\TEST-REBECCAWEBER-2020-03-03',
'regularize': False,
'rightwidth': 400,
'save_iters': 50000,
'scale_jitter_lo': 0.5,
'scale_jitter_up': 1.25,
'scoremap_dir': 'test',
'shuffle': True,
'snapshot_prefix': 'C:\\Users\\DeepLabCut2\\Desktop\\020320\\Analysis\\TEST-REBECCAWEBER-2020-03-03\\dlc-models\\iteration-0\\TESTMar3-trainset95shuffle1\\train\\snapshot',
'stride': 8.0,
'topheight': 400,
'weigh_negatives': False,
'weigh_only_present_joints': False,
'weigh_part_predictions': False,
'weight_decay': 0.0001}
Starting with standard pose-dataset loader.
...
2020-03-05 14:47:58.495499: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at conv_ops.cc:486 : Resource exhausted: OOM when allocating tensor with shape[1,2048,70,124] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
---------------------------------------------------------------------------
ResourceExhaustedError Traceback (most recent call last)
c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\client\session.py in _do_call(self, fn, *args)
1355 try:
-> 1356 return fn(*args)
1357 except errors.OpError as e:
c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\client\session.py in _run_fn(feed_dict, fetch_list, target_list, options, run_metadata)
1340 return self._call_tf_sessionrun(
-> 1341 options, feed_dict, fetch_list, target_list, run_metadata)
1342
c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\client\session.py in _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list, run_metadata)
1428 self._session, options, feed_dict, fetch_list, target_list,
-> 1429 run_metadata)
1430
ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[1,2048,70,124] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node gradients/pose/locref_pred/block4/conv2d_transpose_grad/Conv2D}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[train_op/control_dependency/_621]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(1) Resource exhausted: OOM when allocating tensor with shape[1,2048,70,124] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node gradients/pose/locref_pred/block4/conv2d_transpose_grad/Conv2D}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
ResourceExhaustedError Traceback (most recent call last)
<ipython-input-10-8587d1e65ef1> in <module>()
----> 1 deeplabcut.train_network(config_path, gputouse=0)
c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\deeplabcut\pose_estimation_tensorflow\training.py in train_network(config, shuffle, trainingsetindex, gputouse, max_snapshots_to_keep, autotune, displayiters, saveiters, maxiters)
95 train(str(poseconfigfile),displayiters,saveiters,maxiters,max_to_keep=max_snapshots_to_keep) #pass on path and file name for pose_cfg.yaml!
96 except BaseException as e:
---> 97 raise e
98 finally:
99 os.chdir(str(start_path))
c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\deeplabcut\pose_estimation_tensorflow\training.py in train_network(config, shuffle, trainingsetindex, gputouse, max_snapshots_to_keep, autotune, displayiters, saveiters, maxiters)
93
94 try:
---> 95 train(str(poseconfigfile),displayiters,saveiters,maxiters,max_to_keep=max_snapshots_to_keep) #pass on path and file name for pose_cfg.yaml!
96 except BaseException as e:
97 raise e
c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\deeplabcut\pose_estimation_tensorflow\train.py in train(config_yaml, displayiters, saveiters, maxiters, max_to_keep)
156 current_lr = lr_gen.get_lr(it)
157 [_, loss_val, summary] = sess.run([train_op, total_loss, merged_summaries],
--> 158 feed_dict={learning_rate: current_lr})
159 cum_loss += loss_val
160 train_writer.add_summary(summary, it)
c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\client\session.py in run(self, fetches, feed_dict, options, run_metadata)
948 try:
949 result = self._run(None, fetches, feed_dict, options_ptr,
--> 950 run_metadata_ptr)
951 if run_metadata:
952 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)
c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\client\session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
1171 if final_fetches or final_targets or (handle and feed_dict_tensor):
1172 results = self._do_run(handle, final_targets, final_fetches,
-> 1173 feed_dict_tensor, options, run_metadata)
1174 else:
1175 results = []
c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\client\session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
1348 if handle is None:
1349 return self._do_call(_run_fn, feeds, fetches, targets, options,
-> 1350 run_metadata)
1351 else:
1352 return self._do_call(_prun_fn, handle, feeds, fetches)
c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\client\session.py in _do_call(self, fn, *args)
1368 pass
1369 message = error_interpolation.interpolate(message, self._graph)
-> 1370 raise type(e)(node_def, op, message)
1371
1372 def _extend_graph(self):
ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[1,2048,70,124] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node gradients/pose/locref_pred/block4/conv2d_transpose_grad/Conv2D (defined at c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\deeplabcut\pose_estimation_tensorflow\train.py:90) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[train_op/control_dependency/_621]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(1) Resource exhausted: OOM when allocating tensor with shape[1,2048,70,124] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node gradients/pose/locref_pred/block4/conv2d_transpose_grad/Conv2D (defined at c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\deeplabcut\pose_estimation_tensorflow\train.py:90) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
0 successful operations.
0 derived errors ignored.
Errors may have originated from an input operation.
Input Source operations connected to node gradients/pose/locref_pred/block4/conv2d_transpose_grad/Conv2D:
pose/locref_pred/block4/weights/read (defined at c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\deeplabcut\pose_estimation_tensorflow\nnet\pose_net.py:30)
Input Source operations connected to node gradients/pose/locref_pred/block4/conv2d_transpose_grad/Conv2D:
pose/locref_pred/block4/weights/read (defined at c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\deeplabcut\pose_estimation_tensorflow\nnet\pose_net.py:30)
Original stack trace for 'gradients/pose/locref_pred/block4/conv2d_transpose_grad/Conv2D':
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\DeepLabCut2\Anaconda3\envs\dlc208_2\Scripts\ipython.exe\__main__.py", line 7, in <module>
sys.exit(start_ipython())
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\IPython\__init__.py", line 125, in start_ipython
return launch_new_instance(argv=argv, **kwargs)
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\traitlets\config\application.py", line 664, in launch_instance
app.start()
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\IPython\terminal\ipapp.py", line 353, in start
self.shell.mainloop()
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\IPython\terminal\interactiveshell.py", line 459, in mainloop
self.interact()
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\IPython\terminal\interactiveshell.py", line 450, in interact
self.run_cell(code, store_history=True)
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\IPython\core\interactiveshell.py", line 2683, in run_cell
interactivity=interactivity, compiler=compiler, result=result)
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\IPython\core\interactiveshell.py", line 2793, in run_ast_nodes
if self.run_code(code, result):
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\IPython\core\interactiveshell.py", line 2847, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-10-8587d1e65ef1>", line 1, in <module>
deeplabcut.train_network(config_path, gputouse=0)
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\deeplabcut\pose_estimation_tensorflow\training.py", line 95, in train_network
train(str(poseconfigfile),displayiters,saveiters,maxiters,max_to_keep=max_snapshots_to_keep) #pass on path and file name for pose_cfg.yaml!
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\deeplabcut\pose_estimation_tensorflow\train.py", line 119, in train
learning_rate, train_op = get_optimizer(total_loss, cfg)
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\deeplabcut\pose_estimation_tensorflow\train.py", line 90, in get_optimizer
train_op = slim.learning.create_train_op(loss_op, optimizer)
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\contrib\slim\python\slim\learning.py", line 442, in create_train_op
check_numerics=check_numerics)
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\contrib\training\python\training\training.py", line 450, in create_train_op
colocate_gradients_with_ops=colocate_gradients_with_ops)
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\training\optimizer.py", line 512, in compute_gradients
colocate_gradients_with_ops=colocate_gradients_with_ops)
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\ops\gradients_impl.py", line 158, in gradients
unconnected_gradients)
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\ops\gradients_util.py", line 731, in _GradientsHelper
lambda: grad_fn(op, *out_grads))
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\ops\gradients_util.py", line 403, in _MaybeCompile
return grad_fn() # Exit early
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\ops\gradients_util.py", line 731, in <lambda>
lambda: grad_fn(op, *out_grads))
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\ops\nn_grad.py", line 65, in _Conv2DBackpropInputGrad
data_format=op.get_attr("data_format").decode())
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\ops\gen_nn_ops.py", line 1071, in conv2d
data_format=data_format, dilations=dilations, name=name)
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\util\deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\framework\ops.py", line 3616, in create_op
op_def=op_def)
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\framework\ops.py", line 2005, in __init__
self._traceback = tf_stack.extract_stack()
...which was originally created as op 'pose/locref_pred/block4/conv2d_transpose', defined at:
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
[elided 11 identical lines from previous traceback]
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\deeplabcut\pose_estimation_tensorflow\training.py", line 95, in train_network
train(str(poseconfigfile),displayiters,saveiters,maxiters,max_to_keep=max_snapshots_to_keep) #pass on path and file name for pose_cfg.yaml!
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\deeplabcut\pose_estimation_tensorflow\train.py", line 105, in train
losses = pose_net(cfg).train(batch)
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\deeplabcut\pose_estimation_tensorflow\nnet\pose_net.py", line 108, in train
heads = self.get_net(batch[Batch.inputs])
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\deeplabcut\pose_estimation_tensorflow\nnet\pose_net.py", line 95, in get_net
return self.prediction_layers(net, end_points)
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\deeplabcut\pose_estimation_tensorflow\nnet\pose_net.py", line 83, in prediction_layers
cfg.num_joints * 2)
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\deeplabcut\pose_estimation_tensorflow\nnet\pose_net.py", line 30, in prediction_layer
scope='block4')
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\contrib\framework\python\ops\arg_scope.py", line 182, in func_with_args
return func(*args, **current_args)
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\contrib\layers\python\layers\layers.py", line 1417, in convolution2d_transpose
outputs = layer.apply(inputs)
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 1479, in apply
return self.__call__(inputs, *args, **kwargs)
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\layers\base.py", line 537, in __call__
outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 634, in __call__
outputs = call_fn(inputs, *args, **kwargs)
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\autograph\impl\api.py", line 146, in wrapper
), args, kwargs)
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\autograph\impl\api.py", line 446, in converted_call
return _call_unconverted(f, args, kwargs)
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\autograph\impl\api.py", line 253, in _call_unconverted
return f(*args, **kwargs)
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\keras\layers\convolutional.py", line 821, in call
dilation_rate=self.dilation_rate)
File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\keras\backend.py", line 4582, in conv2d_transpose
data_format=tf_data_format)
Can anyone help me?
Your GPU looks rather small. You could try lowering the 'global_scale' value in the pose_cfg.yaml under dlc-models/.../train.
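For anyone who prefers to script that change, here is a minimal sketch using only the standard library. It assumes `global_scale` appears as a plain top-level `key: value` line in pose_cfg.yaml; the helper name `set_top_level_value` and the new value 0.5 are illustrative and not part of DeepLabCut. Any trailing comment on the edited line would be dropped.

```python
import re


def set_top_level_value(yaml_text, key, value):
    """Return yaml_text with the single top-level `key: ...` line set to value."""
    pattern = re.compile(
        rf"^({re.escape(key)}[ \t]*:[ \t]*)[^\r\n]*", flags=re.MULTILINE
    )
    new_text, count = pattern.subn(lambda m: m.group(1) + str(value), yaml_text)
    if count != 1:
        raise KeyError(f"expected one top-level {key!r} line, found {count}")
    return new_text


# Stand-in for the contents of pose_cfg.yaml (only three keys shown).
example = "batch_size: 1\nglobal_scale: 0.8\nmax_input_size: 1500\n"
updated = set_top_level_value(example, "global_scale", 0.5)  # 0.5 is illustrative
print(updated)
```

To apply it to a real file, read the text of the train/pose_cfg.yaml for your shuffle, pass it through the helper, and write the result back. Editing the file by hand in a text editor works just as well.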
Your images are too large for that GPU. I would not force allow_growth = True; it crashes out, as you can see. You should downsample your videos first, but since the frames are already labeled, you should, as Jessy says, lower global_scale.
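To put a number on "too large", the sketch below computes the size of the single tensor named in the traceback (shape [1,2048,70,124], type float, i.e. 4 bytes per element) and shows how it shrinks if both spatial dimensions are scaled down. It assumes the feature-map height and width scale roughly in proportion to global_scale; the helper names and the 0.8 to 0.5 change are illustrative. This is one allocation among many during training, so it is not an estimate of total GPU memory use.

```python
def tensor_mib(shape, bytes_per_element=4):
    """Size in MiB of a dense tensor; 4 bytes per element for float32."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element / 2**20


def rescale_spatial(shape, factor):
    """Scale the last two (spatial) dimensions of a 4-D shape by factor."""
    batch, channels, height, width = shape
    return (batch, channels, max(1, int(height * factor)), max(1, int(width * factor)))


failed = (1, 2048, 70, 124)   # shape reported in the traceback
print(tensor_mib(failed))     # 67.8125

# Hypothetical change: global_scale lowered from 0.8 to 0.5.
smaller = rescale_spatial(failed, 0.5 / 0.8)
print(smaller, tensor_mib(smaller))
```

Because the element count depends on height times width, halving the scale cuts this tensor to about a quarter of its size.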
Also note that this file sets the maximum input image size ('max_input_size': 1500), so you should edit that too; otherwise, large images are not used for training.
Well, he could also use 'max_input_size' to filter out images that are too large...
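As a rough illustration of that kind of filter, the sketch below treats an image as usable when its scaled area is at most max_input_size squared. That is an assumption about how the check works, not a copy of DeepLabCut's code: the exact rule (area or side length, with or without scale jitter) should be confirmed in the dataset loader source. The function name and the frame sizes are hypothetical; 0.8 and 1500 come from the posted config.

```python
def fits_max_input_size(width, height, scale, max_input_size):
    """True if the scaled image area is at most max_input_size squared."""
    scaled_area = (width * scale) * (height * scale)
    return scaled_area <= max_input_size * max_input_size


# Hypothetical frame sizes; scale 0.8 and limit 1500 are from the posted config.
for width, height in [(1920, 1080), (3000, 2000)]:
    print((width, height), fits_max_input_size(width, height, 0.8, 1500))
```

Under this assumed rule, lowering max_input_size would skip the largest frames during training, at the cost of never training on them.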
Any updates, @Zoerew?