models: OOM during training in object detection with batch size > 12

@jch1, could you take a look?

aselle on 26 Jul 2017

I have the same problem

hnsywangxin on 28 Jul 2017

What dataset are you using? Could you post your configs as well?

derekjchow on 28 Jul 2017

@derekjchow I am using my own dataset, with about 25,000 images in train.record. Each image is no more than 900 KB, and train.record is 3.2 GB in total. Here is my config:

model {
ssd {
num_classes: 26
box_coder {
faster_rcnn_box_coder {
y_scale: 10.0
x_scale: 10.0
height_scale: 5.0
width_scale: 5.0
}
}
matcher {
argmax_matcher {
matched_threshold: 0.5
unmatched_threshold: 0.5
ignore_thresholds: false
negatives_lower_than_unmatched: true
force_match_for_each_row: true
}
}
similarity_calculator {
iou_similarity {
}
}
anchor_generator {
ssd_anchor_generator {
num_layers: 6
min_scale: 0.2
max_scale: 0.95
aspect_ratios: 1.0
aspect_ratios: 2.0
aspect_ratios: 0.5
aspect_ratios: 3.0
aspect_ratios: 0.3333
reduce_boxes_in_lowest_layer: true
}
}
image_resizer {
fixed_shape_resizer {
height: 500
width: 500
}
}
box_predictor {
convolutional_box_predictor {
min_depth: 0
max_depth: 0
num_layers_before_predictor: 0
use_dropout: false
dropout_keep_probability: 0.8
kernel_size: 3
box_code_size: 4
apply_sigmoid_to_scores: false
conv_hyperparams {
activation: RELU_6,
regularizer {
l2_regularizer {
weight: 0.00004
}
}
initializer {
truncated_normal_initializer {
stddev: 0.03
mean: 0.0
}
}
}
}
}
feature_extractor {
type: 'ssd_inception_v2'
min_depth: 16
depth_multiplier: 1.0
conv_hyperparams {
activation: RELU_6,
regularizer {
l2_regularizer {
weight: 0.00004
}
}
initializer {
truncated_normal_initializer {
stddev: 0.03
mean: 0.0
}
}
batch_norm {
train: true,
scale: true,
center: true,
decay: 0.9997,
epsilon: 0.001,
}
}
}
loss {
classification_loss {
weighted_sigmoid {
anchorwise_output: true
}
}
localization_loss {
weighted_smooth_l1 {
anchorwise_output: true
}
}
hard_example_miner {
num_hard_examples: 3000
iou_threshold: 0.99
loss_type: CLASSIFICATION
max_negatives_per_positive: 3
min_negatives_per_image: 0
}
classification_weight: 1.0
localization_weight: 1.0
}
normalize_loss_by_num_matches: true
post_processing {
batch_non_max_suppression {
score_threshold: 1e-8
iou_threshold: 0.6
max_detections_per_class: 100
max_total_detections: 100
}
score_converter: SIGMOID
}
}
}

train_config: {
batch_size: 8
optimizer {
rms_prop_optimizer: {
learning_rate: {
manual_step_learning_rate {
initial_learning_rate: 0.004
schedule {
step: 0
learning_rate: 0.004
}
schedule {
step: 50000
learning_rate: 0.001
}
schedule {
step: 200000
learning_rate: 0.0004
}
schedule {
step: 300000
learning_rate: 0.00004
}
schedule {
step: 500000
learning_rate: 0.00001
}
schedule {
step: 800000
learning_rate: 0.000001
}
}
}
momentum_optimizer_value: 0.9
decay: 0.9
epsilon: 1.0
}
}
fine_tune_checkpoint: "/mnt/raid4/home/zhangyi/modelsmaster/modelsmaster/object_detection/premodel/ssd_inception_v2_coco_11_06_2017/model.ckpt"
from_detection_checkpoint: true
data_augmentation_options {
random_horizontal_flip {
}
}
data_augmentation_options {
ssd_random_crop {
}
}

}

train_input_reader: {

tf_record_input_reader {
input_path: "/data/public/image/data7.24/train.record"
}
label_map_path: "/mnt/raid4/home/zhangyi/modelsmaster/modelsmaster/object_detection/data/data26_label_map.pbtxt"
}

eval_config: {
num_examples: 1778
}

eval_input_reader: {
tf_record_input_reader {
input_path: "/data/public/image/data7.24/test1.record"
}
label_map_path: "/mnt/raid4/home/zhangyi/modelsmaster/modelsmaster/object_detection/data/data26_label_map_eng.pbtxt"
shuffle: false
num_readers: 1
}

System information
OS Platform and Distribution: Ubuntu 16.04.2 LTS (Xenial Xerus)
TensorFlow version: 1.1.0
GPU memory: 11.7 GB

Is there anything wrong in my setup that could cause this problem?

zhangyilalala on 31 Jul 2017

What size are the images in your dataset? We find that datasets with very large images (1920x1080) tend to hit OOM. Prescaling the images in the dataset to smaller resolutions can help.
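To make the effect of prescaling concrete, here is a minimal sketch in plain Python. It assumes images are decoded to 3-channel uint8 arrays before any resizing happens, which is an assumption about the input pipeline rather than something confirmed in this thread. `scaled_size` and `decoded_bytes` are hypothetical helper names, and 500 is simply the resizer size from the config above.

```python
def scaled_size(width, height, max_side):
    """Return (width, height) scaled so the longer side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # never upscale
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def decoded_bytes(width, height, channels=3, bytes_per_value=1):
    """Size of one decoded image in memory (uint8 RGB by default)."""
    return width * height * channels * bytes_per_value


# Compare a 1920x1080 image with the same image prescaled to 500 on its long side.
full = decoded_bytes(1920, 1080)
small = decoded_bytes(*scaled_size(1920, 1080, 500))
print(full, small, round(full / small, 1))
```

Under these assumptions the prescaled image is more than ten times smaller once decoded, which is roughly the headroom the input queue gets back per image.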

derekjchow on 1 Aug 2017

@derekjchow Thanks a lot. I have resized the images, and the batch size can now go up to 32.

zhangyilalala on 2 Aug 2017

I am facing a similar issue. Any help, please?

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[18816,20000] [[Node: logits/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"](Reshape, logits/weights/read)]] [[Node: Adam/update/_962 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_5661_Adam/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

I tried setting the allocator type option as explained here, but no luck:

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allocator_type = 'BFC'  # request the best-fit-with-coalescing allocator
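
The failing allocation in the error above can be sized from its shape alone. This is a minimal sketch, assuming a dense tensor with 4 bytes per value (the node is reported as DT_FLOAT); `tensor_bytes` is a hypothetical helper, not part of TensorFlow.

```python
from functools import reduce
from operator import mul


def tensor_bytes(shape, bytes_per_value=4):
    """Bytes needed for a dense tensor; 4 bytes per value matches DT_FLOAT."""
    return reduce(mul, shape, 1) * bytes_per_value


# Shape taken from the ResourceExhaustedError message: shape[18816,20000].
needed = tensor_bytes([18816, 20000])
print(needed, "bytes =", round(needed / 2**30, 2), "GiB")
```

A single tensor of this size needs well over a gigabyte, so changing the allocator type is unlikely to help on its own; shrinking either dimension reduces the allocation proportionally.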

jollysean on 29 Aug 2017

@jollysean could you describe your dataset? As mentioned, we recommend pre-shrinking very large images.

derekjchow on 29 Aug 2017

I am training the skip-thought vectors model on BookCorpus. For this I reduced the dimensionality to suit my needs.

jollysean on 29 Aug 2017

Hi,

I face this issue when running evaluation (eval.py) on the GPU. Training runs fine in another thread; the error only appears when I run the eval script. The images in the dataset are on the order of 1300 x 1300.
Running on an NVIDIA Titan X.

shresthamalik on 8 Sep 2017

@shresthamalik It could be that the train.py script captures all the GPU memory, even though it doesn't use all of it. There was a solution posted around here, along these lines (in train.py):

def main(_):
    gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.8)  # leave for eval
    sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
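
As a rough illustration of what a fraction like 0.8 leaves free, here is a small arithmetic sketch. The 12 GiB card size is a hypothetical figure chosen for the example, and `split_gpu_memory` is a made-up helper; it only models the budget, not how TensorFlow actually reserves memory.

```python
def split_gpu_memory(total_gib, train_fraction):
    """Return (GiB reserved for training, GiB left for other processes)."""
    if not 0.0 < train_fraction <= 1.0:
        raise ValueError("train_fraction must be in (0, 1]")
    train_gib = total_gib * train_fraction
    return train_gib, total_gib - train_gib


# Hypothetical 12 GiB card with the 0.8 fraction from the snippet above.
train_gib, eval_gib = split_gpu_memory(total_gib=12.0, train_fraction=0.8)
print(f"train: {train_gib:.1f} GiB, left for eval: {eval_gib:.1f} GiB")
```

If the eval job needs more than the remainder, the fraction given to training has to come down further.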

cipri-tom on 23 Oct 2017

I have a GTX 1060 6GB.
I resized all my images to 300x300.
In the config:

    image_resizer {
      fixed_shape_resizer {
        height: 300
        width: 300
      }
    }

and it still works only with batch_size 1.
Higher values (2, 8, 12, 24, 32) cause OOM errors.

anonym24 on 9 Nov 2018

Even with batch_size 1, retraining SSD MobileNet fails after some steps (~100):

INFO:tensorflow:global step 171: loss = 14.8309 (0.360 sec/step)
I1109 12:00:57.197321  7740 tf_logging.py:115] global step 171: loss = 14.8309 (0.360 sec/step)
INFO:tensorflow:global step 172: loss = 11.7885 (0.351 sec/step)
I1109 12:00:57.549896  7740 tf_logging.py:115] global step 172: loss = 11.7885 (0.351 sec/step)
INFO:tensorflow:global step 173: loss = 12.5532 (0.369 sec/step)
I1109 12:00:57.919557  7740 tf_logging.py:115] global step 173: loss = 12.5532 (0.369 sec/step)
INFO:tensorflow:global step 174: loss = 13.3306 (0.328 sec/step)
I1109 12:00:58.248665  7740 tf_logging.py:115] global step 174: loss = 13.3306 (0.328 sec/step)
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Incompatible shapes: [2,1917] vs. [3,1]
         [[node Loss/Match/cond/mul_4 (defined at C:\tensorflow1\models\research\object_detection\matchers\argmax_matcher.py:175)  = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Loss/Match/cond/one_hot, Loss/Match/cond/Cast_2)]]
         [[{{node gradients/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise_2_Conv2d_5_3x3_s2_128/BatchNorm/FusedBatchNorm_grad/FusedBatchNormGrad/_1497}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2718_...chNormGrad", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'Loss/Match/cond/mul_4', defined at:
  File "train.py", line 184, in <module>
    tf.app.run()
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\platform\app.py", line 125, in run
    _sys.exit(main(argv))
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\util\deprecation.py", line 306, in new_func
    return func(*args, **kwargs)
  File "train.py", line 180, in main
    graph_hook_fn=graph_rewriter_fn)
  File "C:\tensorflow1\models\research\object_detection\legacy\trainer.py", line 290, in train
    clones = model_deploy.create_clones(deploy_config, model_fn, [input_queue])
  File "C:\tensorflow1\models\research\slim\deployment\model_deploy.py", line 193, in create_clones
    outputs = model_fn(*args, **kwargs)
  File "C:\tensorflow1\models\research\object_detection\legacy\trainer.py", line 205, in _create_losses
    losses_dict = detection_model.loss(prediction_dict, true_image_shapes)
  File "C:\tensorflow1\models\research\object_detection\meta_architectures\ssd_meta_arch.py", line 680, in loss
    keypoints, weights)
  File "C:\tensorflow1\models\research\object_detection\meta_architectures\ssd_meta_arch.py", line 853, in _assign_targets
    groundtruth_weights_list)
  File "C:\tensorflow1\models\research\object_detection\core\target_assigner.py", line 483, in batch_assign_targets
    anchors, gt_boxes, gt_class_targets, unmatched_class_label, gt_weights)
  File "C:\tensorflow1\models\research\object_detection\core\target_assigner.py", line 182, in assign
    valid_rows=tf.greater(groundtruth_weights, 0))
  File "C:\tensorflow1\models\research\object_detection\core\matcher.py", line 241, in match
    return Match(self._match(similarity_matrix, valid_rows),
  File "C:\tensorflow1\models\research\object_detection\matchers\argmax_matcher.py", line 194, in _match
    _match_when_rows_are_non_empty, _match_when_rows_are_empty)
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\util\deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\ops\control_flow_ops.py", line 2086, in cond
    orig_res_t, res_t = context_t.BuildCondBranch(true_fn)
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\ops\control_flow_ops.py", line 1930, in BuildCondBranch
    original_result = fn()
  File "C:\tensorflow1\models\research\object_detection\matchers\argmax_matcher.py", line 175, in _match_when_rows_are_non_empty
    tf.cast(tf.expand_dims(valid_rows, axis=-1), dtype=tf.float32))
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\ops\math_ops.py", line 866, in binary_op_wrapper
    return func(x, y, name=name)
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\ops\math_ops.py", line 1131, in _mul_dispatch
    return gen_math_ops.mul(x, y, name=name)
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\ops\gen_math_ops.py", line 5358, in mul
    "Mul", x=x, y=y, name=name)
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\util\deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\framework\ops.py", line 3274, in create_op
    op_def=op_def)
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\framework\ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Incompatible shapes: [2,1917] vs. [3,1]
         [[node Loss/Match/cond/mul_4 (defined at C:\tensorflow1\models\research\object_detection\matchers\argmax_matcher.py:175)  = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Loss/Match/cond/one_hot, Loss/Match/cond/Cast_2)]]
         [[{{node gradients/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise_2_Conv2d_5_3x3_s2_128/BatchNorm/FusedBatchNorm_grad/FusedBatchNormGrad/_1497}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2718_...chNormGrad", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

I1109 12:00:58.279880  7740 tf_logging.py:115] Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Incompatible shapes: [2,1917] vs. [3,1]
         [[node Loss/Match/cond/mul_4 (defined at C:\tensorflow1\models\research\object_detection\matchers\argmax_matcher.py:175)  = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Loss/Match/cond/one_hot, Loss/Match/cond/Cast_2)]]
         [[{{node gradients/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise_2_Conv2d_5_3x3_s2_128/BatchNorm/FusedBatchNorm_grad/FusedBatchNormGrad/_1497}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2718_...chNormGrad", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'Loss/Match/cond/mul_4', defined at:
  File "train.py", line 184, in <module>
    tf.app.run()
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\platform\app.py", line 125, in run
    _sys.exit(main(argv))
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\util\deprecation.py", line 306, in new_func
    return func(*args, **kwargs)
  File "train.py", line 180, in main
    graph_hook_fn=graph_rewriter_fn)
  File "C:\tensorflow1\models\research\object_detection\legacy\trainer.py", line 290, in train
    clones = model_deploy.create_clones(deploy_config, model_fn, [input_queue])
  File "C:\tensorflow1\models\research\slim\deployment\model_deploy.py", line 193, in create_clones
    outputs = model_fn(*args, **kwargs)
  File "C:\tensorflow1\models\research\object_detection\legacy\trainer.py", line 205, in _create_losses
    losses_dict = detection_model.loss(prediction_dict, true_image_shapes)
  File "C:\tensorflow1\models\research\object_detection\meta_architectures\ssd_meta_arch.py", line 680, in loss
    keypoints, weights)
  File "C:\tensorflow1\models\research\object_detection\meta_architectures\ssd_meta_arch.py", line 853, in _assign_targets
    groundtruth_weights_list)
  File "C:\tensorflow1\models\research\object_detection\core\target_assigner.py", line 483, in batch_assign_targets
    anchors, gt_boxes, gt_class_targets, unmatched_class_label, gt_weights)
  File "C:\tensorflow1\models\research\object_detection\core\target_assigner.py", line 182, in assign
    valid_rows=tf.greater(groundtruth_weights, 0))
  File "C:\tensorflow1\models\research\object_detection\core\matcher.py", line 241, in match
    return Match(self._match(similarity_matrix, valid_rows),
  File "C:\tensorflow1\models\research\object_detection\matchers\argmax_matcher.py", line 194, in _match
    _match_when_rows_are_non_empty, _match_when_rows_are_empty)
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\util\deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\ops\control_flow_ops.py", line 2086, in cond
    orig_res_t, res_t = context_t.BuildCondBranch(true_fn)
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\ops\control_flow_ops.py", line 1930, in BuildCondBranch
    original_result = fn()
  File "C:\tensorflow1\models\research\object_detection\matchers\argmax_matcher.py", line 175, in _match_when_rows_are_non_empty
    tf.cast(tf.expand_dims(valid_rows, axis=-1), dtype=tf.float32))
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\ops\math_ops.py", line 866, in binary_op_wrapper
    return func(x, y, name=name)
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\ops\math_ops.py", line 1131, in _mul_dispatch
    return gen_math_ops.mul(x, y, name=name)
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\ops\gen_math_ops.py", line 5358, in mul
    "Mul", x=x, y=y, name=name)
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\util\deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\framework\ops.py", line 3274, in create_op
    op_def=op_def)
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\framework\ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Incompatible shapes: [2,1917] vs. [3,1]
         [[node Loss/Match/cond/mul_4 (defined at C:\tensorflow1\models\research\object_detection\matchers\argmax_matcher.py:175)  = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Loss/Match/cond/one_hot, Loss/Match/cond/Cast_2)]]
         [[{{node gradients/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise_2_Conv2d_5_3x3_s2_128/BatchNorm/FusedBatchNorm_grad/FusedBatchNormGrad/_1497}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2718_...chNormGrad", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Traceback (most recent call last):
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\client\session.py", line 1334, in _do_call
    return fn(*args)
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\client\session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\client\session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [2,1917] vs. [3,1]
         [[{{node Loss/Match/cond/mul_4}} = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Loss/Match/cond/one_hot, Loss/Match/cond/Cast_2)]]
         [[{{node gradients/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise_2_Conv2d_5_3x3_s2_128/BatchNorm/FusedBatchNorm_grad/FusedBatchNormGrad/_1497}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2718_...chNormGrad", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 184, in <module>
    tf.app.run()
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\platform\app.py", line 125, in run
    _sys.exit(main(argv))
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\util\deprecation.py", line 306, in new_func
    return func(*args, **kwargs)
  File "train.py", line 180, in main
    graph_hook_fn=graph_rewriter_fn)
  File "C:\tensorflow1\models\research\object_detection\legacy\trainer.py", line 415, in train
    saver=saver)
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\contrib\slim\python\slim\learning.py", line 770, in train
    sess, train_op, global_step, train_step_kwargs)
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\contrib\slim\python\slim\learning.py", line 487, in train_step
    run_metadata=run_metadata)
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\client\session.py", line 929, in run
    run_metadata_ptr)
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\client\session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\client\session.py", line 1328, in _do_run
    run_metadata)
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\client\session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [2,1917] vs. [3,1]
         [[node Loss/Match/cond/mul_4 (defined at C:\tensorflow1\models\research\object_detection\matchers\argmax_matcher.py:175)  = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Loss/Match/cond/one_hot, Loss/Match/cond/Cast_2)]]
         [[{{node gradients/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise_2_Conv2d_5_3x3_s2_128/BatchNorm/FusedBatchNorm_grad/FusedBatchNormGrad/_1497}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2718_...chNormGrad", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'Loss/Match/cond/mul_4', defined at:
  File "train.py", line 184, in <module>
    tf.app.run()
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\platform\app.py", line 125, in run
    _sys.exit(main(argv))
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\util\deprecation.py", line 306, in new_func
    return func(*args, **kwargs)
  File "train.py", line 180, in main
    graph_hook_fn=graph_rewriter_fn)
  File "C:\tensorflow1\models\research\object_detection\legacy\trainer.py", line 290, in train
    clones = model_deploy.create_clones(deploy_config, model_fn, [input_queue])
  File "C:\tensorflow1\models\research\slim\deployment\model_deploy.py", line 193, in create_clones
    outputs = model_fn(*args, **kwargs)
  File "C:\tensorflow1\models\research\object_detection\legacy\trainer.py", line 205, in _create_losses
    losses_dict = detection_model.loss(prediction_dict, true_image_shapes)
  File "C:\tensorflow1\models\research\object_detection\meta_architectures\ssd_meta_arch.py", line 680, in loss
    keypoints, weights)
  File "C:\tensorflow1\models\research\object_detection\meta_architectures\ssd_meta_arch.py", line 853, in _assign_targets
    groundtruth_weights_list)
  File "C:\tensorflow1\models\research\object_detection\core\target_assigner.py", line 483, in batch_assign_targets
    anchors, gt_boxes, gt_class_targets, unmatched_class_label, gt_weights)
  File "C:\tensorflow1\models\research\object_detection\core\target_assigner.py", line 182, in assign
    valid_rows=tf.greater(groundtruth_weights, 0))
  File "C:\tensorflow1\models\research\object_detection\core\matcher.py", line 241, in match
    return Match(self._match(similarity_matrix, valid_rows),
  File "C:\tensorflow1\models\research\object_detection\matchers\argmax_matcher.py", line 194, in _match
    _match_when_rows_are_non_empty, _match_when_rows_are_empty)
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\util\deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\ops\control_flow_ops.py", line 2086, in cond
    orig_res_t, res_t = context_t.BuildCondBranch(true_fn)
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\ops\control_flow_ops.py", line 1930, in BuildCondBranch
    original_result = fn()
  File "C:\tensorflow1\models\research\object_detection\matchers\argmax_matcher.py", line 175, in _match_when_rows_are_non_empty
    tf.cast(tf.expand_dims(valid_rows, axis=-1), dtype=tf.float32))
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\ops\math_ops.py", line 866, in binary_op_wrapper
    return func(x, y, name=name)
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\ops\math_ops.py", line 1131, in _mul_dispatch
    return gen_math_ops.mul(x, y, name=name)
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\ops\gen_math_ops.py", line 5358, in mul
    "Mul", x=x, y=y, name=name)
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\util\deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\framework\ops.py", line 3274, in create_op
    op_def=op_def)
  File "C:\Users\Admin\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\framework\ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Incompatible shapes: [2,1917] vs. [3,1]
         [[node Loss/Match/cond/mul_4 (defined at C:\tensorflow1\models\research\object_detection\matchers\argmax_matcher.py:175)  = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Loss/Match/cond/one_hot, Loss/Match/cond/Cast_2)]]
         [[{{node gradients/FeatureExtractor/MobilenetV1/Conv2d_13_pointwise_2_Conv2d_5_3x3_s2_128/BatchNorm/FusedBatchNorm_grad/FusedBatchNormGrad/_1497}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2718_...chNormGrad", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
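
The final error in this log is a shape mismatch rather than an OOM. Reading the traceback, the [3,1] operand comes from `tf.expand_dims(valid_rows, axis=-1)`, where `valid_rows` is derived from `groundtruth_weights`, while the [2,1917] operand is `Loss/Match/cond/one_hot`. One plausible reading, which is an assumption and not confirmed anywhere in this thread, is that a training example carries 2 boxes but 3 per-box weights. A minimal sketch of the broadcasting rule that rejects this pair, using a hypothetical helper `broadcast_shape`:

```python
from itertools import zip_longest


def broadcast_shape(a, b):
    """Return the broadcast shape of a and b, or None if they are incompatible."""
    result = []
    # Compare dimensions from the trailing axis backwards, padding with 1.
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x == y or y == 1:
            result.append(x)
        elif x == 1:
            result.append(y)
        else:
            return None  # e.g. 2 vs 3: neither equal nor 1
    return tuple(reversed(result))


print(broadcast_shape([2, 1917], [3, 1]))  # shapes from the error message
print(broadcast_shape([3, 1917], [3, 1]))  # consistent leading dimension
```

The first pair fails on the leading dimension (2 vs. 3), so checking that every per-box field in the records has the same length may be a useful first step.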

anonym24 on 9 Nov 2018

I have GTX 1060 6GB
I resized all my images to 300x300
in config:
    image_resizer {
      fixed_shape_resizer {
        height: 300
        width: 300
      }
    }
and it still works only with batch_size value 1
higher values than 1 (2, 8, 12, 24, 32) cause OOM errors

Hi. Did you solve your OOM errors?

HaFred on 21 Oct 2019
