Deeplabcut: Error in deeplabcut.train_network: Resource exhausted: OOM when allocating tensor with shape [1,2048,62,110].

Created on 5 Mar 2020 · 4 Comments · Source: DeepLabCut/DeepLabCut

Hi everyone. I am facing an issue when trying to train my network in deeplabcut:

ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[1,2048,62,110] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node gradients/pose/locref_pred/block4/conv2d_transpose_grad/Conv2D}}]]
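
For a sense of scale, the size of the tensor named in the message can be worked out from its shape. The sketch below assumes a dense tensor and takes "type float" to mean 32-bit floats (4 bytes per element); `tensor_mib` is a helper name made up for this illustration, not part of DeepLabCut or TensorFlow. The second shape is the one that appears in the full traceback further down.

```python
from math import prod

BYTES_PER_FLOAT32 = 4  # "type float" in the TensorFlow message is a 32-bit float


def tensor_mib(shape, bytes_per_element=BYTES_PER_FLOAT32):
    """Return the size of a dense tensor with the given shape, in MiB."""
    return prod(shape) * bytes_per_element / 2**20


print(tensor_mib([1, 2048, 62, 110]))  # 53.28125
print(tensor_mib([1, 2048, 70, 124]))  # 67.8125
```

Both come out at well under 100 MiB, so the tensor in the message is most likely just the allocation that no longer fit, rather than the only thing filling the GPU memory.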

Your Operating system and DeepLabCut version
I work on Windows 10 with Anaconda3 2019.10, a GeForce RTX 2060 and Nvidia driver version 442.19, and used the easy install option for GPU (so using DeepLabCut 2.1.5). TensorFlow version: 1.14.0

Describe the problem
I am facing an issue when trying to train my network in DeepLabCut. I always get the same error (printed below). I already tried reducing the batch size to 1, but it didn't help. NVIDIA-SMI reports that a few other processes are using the GPU and shows 92% utilization (see screenshot below). And

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)

is also set.

Error:

In [10]: deeplabcut.train_network(config_path, gputouse=0)
Config:
{'all_joints': [[0],
                [1],
                [2],
                [3],
                [4],
                [5],
                [6],
                [7],
                [8],
                [9],
                [10],
                [11],
                [12],
                [13]],
 'all_joints_names': ['l-tail-tip',
                      'l-back-toe_tip',
                      'l-front-toe-tip',
                      'l-nose',
                      'd-tail-tip',
                      'd-back-right',
                      'd-back-left',
                      'd-front-right',
                      'd-front-left',
                      'd-nose',
                      'r-tail-tip',
                      'r-back-toe_tip',
                      'r-front-toe-tip',
                      'r-nose'],
 'batch_size': 1,
 'bottomheight': 400,
 'crop': True,
 'crop_pad': 0,
 'cropratio': 0.4,
 'dataset': 'training-datasets\\iteration-0\\UnaugmentedDataSet_TESTMar3\\TEST_REBECCAWEBER95shuffle1.mat',
 'dataset_type': 'default',
 'deterministic': False,
 'display_iters': 1000,
 'fg_fraction': 0.25,
 'global_scale': 0.8,
 'init_weights': 'c:\\users\\deeplabcut2\\anaconda3\\envs\\dlc208_1\\lib\\site-packages\\deeplabcut\\pose_estimation_tensorflow\\models\\pretrained\\resnet_v1_50.ckpt',
 'intermediate_supervision': False,
 'intermediate_supervision_layer': 12,
 'leftwidth': 400,
 'location_refinement': True,
 'locref_huber_loss': True,
 'locref_loss_weight': 0.05,
 'locref_stdev': 7.2801,
 'log_dir': 'log',
 'max_input_size': 1500,
 'mean_pixel': [123.68, 116.779, 103.939],
 'metadataset': 'training-datasets\\iteration-0\\UnaugmentedDataSet_TESTMar3\\Documentation_data-TEST_95shuffle1.pickle',
 'min_input_size': 64,
 'minsize': 100,
 'mirror': False,
 'multi_step': [[0.005, 400], [0.02, 20000], [0.002, 30000], [0.001, 40000]],
 'net_type': 'resnet_50',
 'num_joints': 14,
 'optimizer': 'sgd',
 'pos_dist_thresh': 17,
 'project_path': 'C:\\Users\\DeepLabCut2\\Desktop\\020320\\Analysis\\TEST-REBECCAWEBER-2020-03-03',
 'regularize': False,
 'rightwidth': 400,
 'save_iters': 50000,
 'scale_jitter_lo': 0.5,
 'scale_jitter_up': 1.25,
 'scoremap_dir': 'test',
 'shuffle': True,
 'snapshot_prefix': 'C:\\Users\\DeepLabCut2\\Desktop\\020320\\Analysis\\TEST-REBECCAWEBER-2020-03-03\\dlc-models\\iteration-0\\TESTMar3-trainset95shuffle1\\train\\snapshot',
 'stride': 8.0,
 'topheight': 400,
 'weigh_negatives': False,
 'weigh_only_present_joints': False,
 'weigh_part_predictions': False,
 'weight_decay': 0.0001}
Starting with standard pose-dataset loader.

...


2020-03-05 14:47:58.495499: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at conv_ops.cc:486 : Resource exhausted: OOM when allocating tensor with shape[1,2048,70,124] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
---------------------------------------------------------------------------
ResourceExhaustedError                    Traceback (most recent call last)
c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\client\session.py in _do_call(self, fn, *args)
   1355     try:
-> 1356       return fn(*args)
   1357     except errors.OpError as e:

c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\client\session.py in _run_fn(feed_dict, fetch_list, target_list, options, run_metadata)
   1340       return self._call_tf_sessionrun(
-> 1341           options, feed_dict, fetch_list, target_list, run_metadata)
   1342

c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\client\session.py in _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list, run_metadata)
   1428         self._session, options, feed_dict, fetch_list, target_list,
-> 1429         run_metadata)
   1430

ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[1,2048,70,124] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node gradients/pose/locref_pred/block4/conv2d_transpose_grad/Conv2D}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[train_op/control_dependency/_621]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[1,2048,70,124] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node gradients/pose/locref_pred/block4/conv2d_transpose_grad/Conv2D}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

ResourceExhaustedError                    Traceback (most recent call last)
<ipython-input-10-8587d1e65ef1> in <module>()
----> 1 deeplabcut.train_network(config_path, gputouse=0)

c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\deeplabcut\pose_estimation_tensorflow\training.py in train_network(config, shuffle, trainingsetindex, gputouse, max_snapshots_to_keep, autotune, displayiters, saveiters, maxiters)
     95           train(str(poseconfigfile),displayiters,saveiters,maxiters,max_to_keep=max_snapshots_to_keep) #pass on path and file name for pose_cfg.yaml!
     96       except BaseException as e:
---> 97           raise e
     98       finally:
     99           os.chdir(str(start_path))

c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\deeplabcut\pose_estimation_tensorflow\training.py in train_network(config, shuffle, trainingsetindex, gputouse, max_snapshots_to_keep, autotune, displayiters, saveiters, maxiters)
     93
     94       try:
---> 95           train(str(poseconfigfile),displayiters,saveiters,maxiters,max_to_keep=max_snapshots_to_keep) #pass on path and file name for pose_cfg.yaml!
     96       except BaseException as e:
     97           raise e

c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\deeplabcut\pose_estimation_tensorflow\train.py in train(config_yaml, displayiters, saveiters, maxiters, max_to_keep)
    156         current_lr = lr_gen.get_lr(it)
    157         [_, loss_val, summary] = sess.run([train_op, total_loss, merged_summaries],
--> 158                                           feed_dict={learning_rate: current_lr})
    159         cum_loss += loss_val
    160         train_writer.add_summary(summary, it)

c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\client\session.py in run(self, fetches, feed_dict, options, run_metadata)
    948     try:
    949       result = self._run(None, fetches, feed_dict, options_ptr,
--> 950                          run_metadata_ptr)
    951       if run_metadata:
    952         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\client\session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
   1171     if final_fetches or final_targets or (handle and feed_dict_tensor):
   1172       results = self._do_run(handle, final_targets, final_fetches,
-> 1173                              feed_dict_tensor, options, run_metadata)
   1174     else:
   1175       results = []

c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\client\session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
   1348     if handle is None:
   1349       return self._do_call(_run_fn, feeds, fetches, targets, options,
-> 1350                            run_metadata)
   1351     else:
   1352       return self._do_call(_prun_fn, handle, feeds, fetches)

c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\client\session.py in _do_call(self, fn, *args)
   1368           pass
   1369       message = error_interpolation.interpolate(message, self._graph)
-> 1370       raise type(e)(node_def, op, message)
   1371
   1372   def _extend_graph(self):

ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[1,2048,70,124] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[node gradients/pose/locref_pred/block4/conv2d_transpose_grad/Conv2D (defined at c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\deeplabcut\pose_estimation_tensorflow\train.py:90) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[train_op/control_dependency/_621]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[1,2048,70,124] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[node gradients/pose/locref_pred/block4/conv2d_transpose_grad/Conv2D (defined at c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\deeplabcut\pose_estimation_tensorflow\train.py:90) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node gradients/pose/locref_pred/block4/conv2d_transpose_grad/Conv2D:
 pose/locref_pred/block4/weights/read (defined at c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\deeplabcut\pose_estimation_tensorflow\nnet\pose_net.py:30)

Input Source operations connected to node gradients/pose/locref_pred/block4/conv2d_transpose_grad/Conv2D:
 pose/locref_pred/block4/weights/read (defined at c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\deeplabcut\pose_estimation_tensorflow\nnet\pose_net.py:30)

Original stack trace for 'gradients/pose/locref_pred/block4/conv2d_transpose_grad/Conv2D':
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\DeepLabCut2\Anaconda3\envs\dlc208_2\Scripts\ipython.exe\__main__.py", line 7, in <module>
    sys.exit(start_ipython())
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\IPython\__init__.py", line 125, in start_ipython
    return launch_new_instance(argv=argv, **kwargs)
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\traitlets\config\application.py", line 664, in launch_instance
    app.start()
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\IPython\terminal\ipapp.py", line 353, in start
    self.shell.mainloop()
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\IPython\terminal\interactiveshell.py", line 459, in mainloop
    self.interact()
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\IPython\terminal\interactiveshell.py", line 450, in interact
    self.run_cell(code, store_history=True)
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\IPython\core\interactiveshell.py", line 2683, in run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\IPython\core\interactiveshell.py", line 2793, in run_ast_nodes
    if self.run_code(code, result):
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\IPython\core\interactiveshell.py", line 2847, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-10-8587d1e65ef1>", line 1, in <module>
    deeplabcut.train_network(config_path, gputouse=0)
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\deeplabcut\pose_estimation_tensorflow\training.py", line 95, in train_network
    train(str(poseconfigfile),displayiters,saveiters,maxiters,max_to_keep=max_snapshots_to_keep) #pass on path and file name for pose_cfg.yaml!
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\deeplabcut\pose_estimation_tensorflow\train.py", line 119, in train
    learning_rate, train_op = get_optimizer(total_loss, cfg)
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\deeplabcut\pose_estimation_tensorflow\train.py", line 90, in get_optimizer
    train_op = slim.learning.create_train_op(loss_op, optimizer)
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\contrib\slim\python\slim\learning.py", line 442, in create_train_op
    check_numerics=check_numerics)
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\contrib\training\python\training\training.py", line 450, in create_train_op
    colocate_gradients_with_ops=colocate_gradients_with_ops)
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\training\optimizer.py", line 512, in compute_gradients
    colocate_gradients_with_ops=colocate_gradients_with_ops)
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\ops\gradients_impl.py", line 158, in gradients
    unconnected_gradients)
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\ops\gradients_util.py", line 731, in _GradientsHelper
    lambda: grad_fn(op, *out_grads))
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\ops\gradients_util.py", line 403, in _MaybeCompile
    return grad_fn()  # Exit early
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\ops\gradients_util.py", line 731, in <lambda>
    lambda: grad_fn(op, *out_grads))
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\ops\nn_grad.py", line 65, in _Conv2DBackpropInputGrad
    data_format=op.get_attr("data_format").decode())
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\ops\gen_nn_ops.py", line 1071, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\util\deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\framework\ops.py", line 3616, in create_op
    op_def=op_def)
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\framework\ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()

...which was originally created as op 'pose/locref_pred/block4/conv2d_transpose', defined at:
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
[elided 11 identical lines from previous traceback]
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\deeplabcut\pose_estimation_tensorflow\training.py", line 95, in train_network
    train(str(poseconfigfile),displayiters,saveiters,maxiters,max_to_keep=max_snapshots_to_keep) #pass on path and file name for pose_cfg.yaml!
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\deeplabcut\pose_estimation_tensorflow\train.py", line 105, in train
    losses = pose_net(cfg).train(batch)
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\deeplabcut\pose_estimation_tensorflow\nnet\pose_net.py", line 108, in train
    heads = self.get_net(batch[Batch.inputs])
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\deeplabcut\pose_estimation_tensorflow\nnet\pose_net.py", line 95, in get_net
    return self.prediction_layers(net, end_points)
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\deeplabcut\pose_estimation_tensorflow\nnet\pose_net.py", line 83, in prediction_layers
    cfg.num_joints * 2)
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\deeplabcut\pose_estimation_tensorflow\nnet\pose_net.py", line 30, in prediction_layer
    scope='block4')
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\contrib\framework\python\ops\arg_scope.py", line 182, in func_with_args
    return func(*args, **current_args)
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\contrib\layers\python\layers\layers.py", line 1417, in convolution2d_transpose
    outputs = layer.apply(inputs)
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 1479, in apply
    return self.__call__(inputs, *args, **kwargs)
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\layers\base.py", line 537, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 634, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\autograph\impl\api.py", line 146, in wrapper
    ), args, kwargs)
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\autograph\impl\api.py", line 446, in converted_call
    return _call_unconverted(f, args, kwargs)
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\autograph\impl\api.py", line 253, in _call_unconverted
    return f(*args, **kwargs)
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\keras\layers\convolutional.py", line 821, in call
    dilation_rate=self.dilation_rate)
  File "c:\users\deeplabcut2\anaconda3\envs\dlc208_1\lib\site-packages\tensorflow\python\keras\backend.py", line 4582, in conv2d_transpose
    data_format=tf_data_format)

Can anyone help me?

[Screenshots: CONFIG, Cmd_Terminal, tensorflotraining]

Zoerew

All 4 comments

Your GPU looks rather small. You could try lowering the 'global_scale' variable in the pose_cfg.yaml under dlc-models/././train.

jeylau on 5 Mar 2020

👍2
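
As a rough illustration of why lowering global_scale helps: in a fully convolutional network the activation memory grows approximately with the number of input pixels, which scales with the square of the scale factor. The sketch below uses the values from the posted config (global_scale 0.8, scale_jitter_lo 0.5, scale_jitter_up 1.25) and assumes that the loader multiplies global_scale by a random jitter factor drawn between the two jitter bounds. That assumption and both helper names are illustrative, not taken from the DeepLabCut source.

```python
def effective_scale_range(global_scale, jitter_lo, jitter_up):
    """Smallest and largest factor an image may be resized by (assumed rule)."""
    return global_scale * jitter_lo, global_scale * jitter_up


def relative_pixels(new_scale, old_scale):
    """Pixel count at new_scale relative to the pixel count at old_scale."""
    return (new_scale / old_scale) ** 2


# Values from the posted config.
lo, hi = effective_scale_range(0.8, 0.5, 1.25)
print(f"effective scale range: {lo:.2f} to {hi:.2f}")

# Candidate replacement values for global_scale (hypothetical choices).
for candidate in (0.6, 0.5, 0.4):
    share = relative_pixels(candidate, 0.8)
    print(f"global_scale={candidate}: about {share:.0%} of the pixels at 0.8")
```

If the jitter assumption holds, it would also explain why the tensor shape in the error differs between runs.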

Your images are too large for that GPU; I would not force allow_growth = True, as this crashes out, as you see. You should downsample your videos first, but since you have already labeled, as Jessy says, you should edit global_scale to be smaller.

Also note that the max input image size is set in this file ('max_input_size': 1500), so you should edit that too; otherwise large images are not used for training.
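
To make the interaction between the scale and max_input_size concrete, here is a minimal sketch. It assumes the loader skips an image when its scaled area exceeds max_input_size squared; that rule, the function name and the 1920x1080 frame size are assumptions for illustration and should be checked against the installed DeepLabCut version.

```python
def is_used_for_training(width, height, scale, max_input_size=1500):
    """Assumed rule: keep an image only if its scaled area fits in max_input_size**2."""
    scaled_area = (width * scale) * (height * scale)
    return scaled_area <= max_input_size ** 2


# Hypothetical 1920x1080 frame; the real frame size is not given in the thread.
frame = (1920, 1080)

for scale in (1.0, 0.8, 0.5):
    for limit in (1500, 1000):
        kept = is_used_for_training(*frame, scale, max_input_size=limit)
        print(f"scale={scale}, max_input_size={limit}: used={kept}")
```

Under this assumption, a smaller scale lets large frames pass the size check while also reducing the memory needed per image.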