Caused by op u'Loss/BoxClassifierLoss/Loss/sub', defined at:
File "object_detection/train.py", line 198, in
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "object_detection/train.py", line 194, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/home/wangxiaopeng/models-master/object_detection/trainer.py", line 192, in train
clones = model_deploy.create_clones(deploy_config, model_fn, [input_queue])
File "/home/wangxiaopeng/models-master/slim/deployment/model_deploy.py", line 193, in create_clones
outputs = model_fn(*args, **kwargs)
File "/home/wangxiaopeng/models-master/object_detection/trainer.py", line 133, in _create_losses
losses_dict = detection_model.loss(prediction_dict)
File "/home/wangxiaopeng/models-master/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 1173, in loss
groundtruth_classes_with_background_list))
File "/home/wangxiaopeng/models-master/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 1329, in _loss_box_classifier
batch_reg_targets, weights=batch_reg_weights) / normalizer
File "/home/wangxiaopeng/models-master/object_detection/core/losses.py", line 71, in __call__
return self._compute_loss(prediction_tensor, target_tensor, **params)
File "/home/wangxiaopeng/models-master/object_detection/core/losses.py", line 157, in _compute_loss
diff = prediction_tensor - target_tensor
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 846, in binary_op_wrapper
return func(x, y, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 2582, in _sub
result = _op_def_lib.apply_op("Sub", x=x, y=y, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2528, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1203, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): Incompatible shapes: [1,63,4] vs. [1,64,4]
[[Node: Loss/BoxClassifierLoss/Loss/sub = Sub[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](Loss/BoxClassifierLoss/Reshape_9, Loss/BoxClassifierLoss/stack_4)]]
[[Node: clone_loss/_1631 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_6487_clone_loss", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
I hit that same issue the other day. The problem was in my label map: I'd started my label id values at 0, while I should have started them at 1.
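For reference, this is roughly what a corrected label map looks like; a minimal sketch with a hypothetical class name and output path (id 0 is reserved for the implicit background class, so the first real class gets id 1):

# Minimal sketch: write a label map whose ids start at 1.
# The class name and the output path are hypothetical placeholders.
label_map_text = """
item {
  id: 1
  name: 'my_class'
}
"""

with open('label_map.pbtxt', 'w') as f:
  f.write(label_map_text)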
@lgutzwil Thank you for your answer; your method is correct.
close #24
@jch1 we should add a check to prevent these errors.
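Something along these lines could catch it early; a rough sketch only, not the actual check in the repo (it reuses the existing label_map_util loader, and the example path is hypothetical):

# Rough sketch of the kind of check that could be added: fail fast if any
# label map id is smaller than 1, since id 0 is reserved for the background class.
from object_detection.utils import label_map_util

def validate_label_map_ids(label_map_path):
  """Raises ValueError if any item id in the label map is smaller than 1."""
  label_map = label_map_util.load_labelmap(label_map_path)
  for item in label_map.item:
    if item.id < 1:
      raise ValueError(
          'Label map item "%s" has id %d; ids must start at 1 because id 0 '
          'is reserved for the background class.' % (item.name, item.id))

# Hypothetical usage:
# validate_label_map_ids('/path/to/label_map.pbtxt')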
I have the same error; unfortunately, changing the class ID does not work for me.
I am using the faster_rcnn_inception_resnet_v2 model with only one class.
When the class ID is set to 0, I get the following error:
InvalidArgumentError (see above for traceback): Incompatible shapes: [1,63,4] vs. [1,64,4]
[[Node: gradients/Loss/BoxClassifierLoss/Loss/sub_grad/BroadcastGradientArgs = BroadcastGradientArgs[T=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/Loss/BoxClassifierLoss/Loss/sub_grad/Shape, gradients/Loss/BoxClassifierLoss/Loss/sub_grad/Shape_1)]]
[[Node: gradients/SecondStageFeatureExtractor/InceptionResnetV2/Repeat/block8_9/Branch_1/Conv2d_0a_1x1/convolution_grad/tuple/control_dependency_1/_4867 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_12578_gradients/SecondStageFeatureExtractor/InceptionResnetV2/Repeat/block8_9/Branch_1/Conv2d_0a_1x1/convolution_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
When I instead set the class ID to 1, I get the following error:
InvalidArgumentError (see above for traceback): Incompatible shapes: [1,62,4] vs. [1,64,4]
[[Node: gradients/Loss/BoxClassifierLoss/Loss/sub_grad/BroadcastGradientArgs = BroadcastGradientArgs[T=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/Loss/BoxClassifierLoss/Loss/sub_grad/Shape, gradients/Loss/BoxClassifierLoss/Loss/sub_grad/Shape_1)]]
[[Node: gradients/SecondStageFeatureExtractor/InceptionResnetV2/Mixed_7a/Branch_1/Conv2d_0a_1x1/BatchNorm/batchnorm/sub_grad/tuple/control_dependency/_5039 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_16360_gradients/SecondStageFeatureExtractor/InceptionResnetV2/Mixed_7a/Branch_1/Conv2d_0a_1x1/BatchNorm/batchnorm/sub_grad/tuple/control_dependency", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
I should mention that I am generating my own tfrecords, and the label for the class in the examples is set to 0.
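For context, here is a simplified sketch of how one of my examples is built; the feature keys are the standard ones the API's decoder expects, the surrounding code is hypothetical, and 'image/object/class/label' is where the offending 0 ends up:

import tensorflow as tf

# Simplified sketch (hypothetical helper): builds one tf.train.Example.
# 'image/object/class/label' must hold one-based ids matching the label map.
def build_example(encoded_jpeg, xmins, xmaxs, ymins, ymaxs, class_ids, class_names):
  return tf.train.Example(features=tf.train.Features(feature={
      'image/encoded': tf.train.Feature(
          bytes_list=tf.train.BytesList(value=[encoded_jpeg])),
      'image/format': tf.train.Feature(
          bytes_list=tf.train.BytesList(value=[b'jpeg'])),
      'image/object/bbox/xmin': tf.train.Feature(
          float_list=tf.train.FloatList(value=xmins)),
      'image/object/bbox/xmax': tf.train.Feature(
          float_list=tf.train.FloatList(value=xmaxs)),
      'image/object/bbox/ymin': tf.train.Feature(
          float_list=tf.train.FloatList(value=ymins)),
      'image/object/bbox/ymax': tf.train.Feature(
          float_list=tf.train.FloatList(value=ymaxs)),
      'image/object/class/label': tf.train.Feature(
          int64_list=tf.train.Int64List(value=class_ids)),
      'image/object/class/text': tf.train.Feature(
          bytes_list=tf.train.BytesList(value=class_names)),
  }))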
Here is the config I use:
model {
  faster_rcnn {
    num_classes: 1
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 600
        max_dimension: 1024
      }
    }
    feature_extractor {
      type: 'faster_rcnn_inception_resnet_v2'
      first_stage_features_stride: 16
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 8
        width_stride: 8
      }
    }
    first_stage_atrous_rate: 2
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.7
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 17
    maxpool_kernel_size: 1
    maxpool_stride: 1
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}
train_config: {
  batch_size: 1
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.0003
          schedule {
            step: 0
            learning_rate: .0003
          }
          schedule {
            step: 900000
            learning_rate: .00003
          }
          schedule {
            step: 1200000
            learning_rate: .000003
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}
train_input_reader: {
  tf_record_input_reader {
    input_path: "/workspace/data/2/train_*.tfrecord"
  }
  label_map_path: "/workspace/data/2/label_map.pbtxt"
}
eval_config: {
  num_examples: 30000
}
eval_input_reader: {
  tf_record_input_reader {
    input_path: "/workspace/data/2/validation_*.tfrecord"
  }
  label_map_path: "/workspace/data/2/label_map.pbtxt"
  shuffle: false
  num_readers: 1
}
After regenerating the tfrecords with the label ids one-based, it's working.
So, the thing to keep in mind when using a custom dataset/config: label ids must start at 1, both in the label map and in the class labels written to the tfrecords. A quick sanity check is sketched below.
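This sketch scans a tfrecord shard and reports the smallest class label, which should be 1, never 0 (the shard path is hypothetical):

import tensorflow as tf

# Sketch: iterate over a tfrecord shard and return the smallest
# 'image/object/class/label' value found.
def min_class_id(tfrecord_path):
  smallest = None
  for record in tf.python_io.tf_record_iterator(tfrecord_path):
    example = tf.train.Example()
    example.ParseFromString(record)
    labels = example.features.feature['image/object/class/label'].int64_list.value
    for label in labels:
      smallest = label if smallest is None else min(smallest, label)
  return smallest

# Hypothetical shard name under the glob used in the config:
# print(min_class_id('/workspace/data/2/train_00000.tfrecord'))  # expect 1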
I've been fighting this bug for two days. One thing I learned is that moving bounding boxes away from the image edges reduced the chance of hitting it.
For example, bounding boxes of (0.001, 0.999, 0.001, 0.999) crash, while (0.2, 0.8, 0.2, 0.8) does not trigger the error.
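For anyone who wants to try that workaround, this is roughly what I mean; a sketch only, with an arbitrarily chosen margin, and it treats the symptom rather than the root cause:

# Sketch of the workaround: nudge normalized box coordinates away from the
# image edges before writing them out. The 0.01 margin is an arbitrary choice.
def clamp_box(ymin, xmin, ymax, xmax, margin=0.01):
  clamp = lambda v: min(max(v, margin), 1.0 - margin)
  return clamp(ymin), clamp(xmin), clamp(ymax), clamp(xmax)

# Example: clamp_box(0.001, 0.001, 0.999, 0.999) -> (0.01, 0.01, 0.99, 0.99)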
Hi there,
We are checking to see if you still need help on this, as this seems to be a considerably old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing.
If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing it.