Caused by op u'Loss/BoxClassifierLoss/Loss/sub', defined at:
File "object_detection/train.py", line 198, in
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "object_detection/train.py", line 194, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/home/wangxiaopeng/models-master/object_detection/trainer.py", line 192, in train
clones = model_deploy.create_clones(deploy_config, model_fn, [input_queue])
File "/home/wangxiaopeng/models-master/slim/deployment/model_deploy.py", line 193, in create_clones
outputs = model_fn(*args, **kwargs)
File "/home/wangxiaopeng/models-master/object_detection/trainer.py", line 133, in _create_losses
losses_dict = detection_model.loss(prediction_dict)
File "/home/wangxiaopeng/models-master/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 1173, in loss
groundtruth_classes_with_background_list))
File "/home/wangxiaopeng/models-master/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 1329, in _loss_box_classifier
batch_reg_targets, weights=batch_reg_weights) / normalizer
File "/home/wangxiaopeng/models-master/object_detection/core/losses.py", line 71, in __call__
return self._compute_loss(prediction_tensor, target_tensor, **params)
File "/home/wangxiaopeng/models-master/object_detection/core/losses.py", line 157, in _compute_loss
diff = prediction_tensor - target_tensor
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 846, in binary_op_wrapper
return func(x, y, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 2582, in _sub
result = _op_def_lib.apply_op("Sub", x=x, y=y, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2528, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1203, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): Incompatible shapes: [1,63,4] vs. [1,64,4]
[[Node: Loss/BoxClassifierLoss/Loss/sub = Sub[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](Loss/BoxClassifierLoss/Reshape_9, Loss/BoxClassifierLoss/stack_4)]]
[[Node: clone_loss/_1631 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_6487_clone_loss", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
I hit that same issue the other day. The problem was in my label map: I'd started my label id values at 0, while I should have started them at 1.
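For reference, this is roughly what a corrected label map looks like; a minimal sketch with a hypothetical class name and output path (id 0 is reserved for the implicit background class, so the first real class gets id 1):

# Minimal sketch: write a label map whose ids start at 1.
# The class name and the output path are hypothetical placeholders.
label_map_text = """
item {
  id: 1
  name: 'my_class'
}
"""

with open('label_map.pbtxt', 'w') as f:
  f.write(label_map_text)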
@lgutzwil Thank you for your answer; your method is correct.
close #24
@jch1 we should add a check to prevent these errors.
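Something along these lines could catch it early; a rough sketch only, not the actual check in the repo (it reuses the existing label_map_util loader, and the example path is hypothetical):

# Rough sketch of the kind of check that could be added: fail fast if any
# label map id is smaller than 1, since id 0 is reserved for the background class.
from object_detection.utils import label_map_util

def validate_label_map_ids(label_map_path):
  """Raises ValueError if any item id in the label map is smaller than 1."""
  label_map = label_map_util.load_labelmap(label_map_path)
  for item in label_map.item:
    if item.id < 1:
      raise ValueError(
          'Label map item "%s" has id %d; ids must start at 1 because id 0 '
          'is reserved for the background class.' % (item.name, item.id))

# Hypothetical usage:
# validate_label_map_ids('/path/to/label_map.pbtxt')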
I have the same error; unfortunately, changing the class ID does not work for me.
I am using the faster_rcnn_inception_resnet_v2 model with only one class.
When the class ID is set to 0, I get the following error:
InvalidArgumentError (see above for traceback): Incompatible shapes: [1,63,4] vs. [1,64,4]
[[Node: gradients/Loss/BoxClassifierLoss/Loss/sub_grad/BroadcastGradientArgs = BroadcastGradientArgs[T=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/Loss/BoxClassifierLoss/Loss/sub_grad/Shape, gradients/Loss/BoxClassifierLoss/Loss/sub_grad/Shape_1)]]
[[Node: gradients/SecondStageFeatureExtractor/InceptionResnetV2/Repeat/block8_9/Branch_1/Conv2d_0a_1x1/convolution_grad/tuple/control_dependency_1/_4867 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_12578_gradients/SecondStageFeatureExtractor/InceptionResnetV2/Repeat/block8_9/Branch_1/Conv2d_0a_1x1/convolution_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
When I instead set the class ID to 1, I get the following error:
InvalidArgumentError (see above for traceback): Incompatible shapes: [1,62,4] vs. [1,64,4]
[[Node: gradients/Loss/BoxClassifierLoss/Loss/sub_grad/BroadcastGradientArgs = BroadcastGradientArgs[T=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/Loss/BoxClassifierLoss/Loss/sub_grad/Shape, gradients/Loss/BoxClassifierLoss/Loss/sub_grad/Shape_1)]]
[[Node: gradients/SecondStageFeatureExtractor/InceptionResnetV2/Mixed_7a/Branch_1/Conv2d_0a_1x1/BatchNorm/batchnorm/sub_grad/tuple/control_dependency/_5039 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_16360_gradients/SecondStageFeatureExtractor/InceptionResnetV2/Mixed_7a/Branch_1/Conv2d_0a_1x1/BatchNorm/batchnorm/sub_grad/tuple/control_dependency", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
I should mention that I am generating my own tfrecords, and the label for the class in the examples is set to 0.
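For context, here is a simplified sketch of how one of my examples is built; the feature keys are the standard ones the API's decoder expects, the surrounding code is hypothetical, and 'image/object/class/label' is where the offending 0 ends up:

import tensorflow as tf

# Simplified sketch (hypothetical helper): builds one tf.train.Example.
# 'image/object/class/label' must hold one-based ids matching the label map.
def build_example(encoded_jpeg, xmins, xmaxs, ymins, ymaxs, class_ids, class_names):
  return tf.train.Example(features=tf.train.Features(feature={
      'image/encoded': tf.train.Feature(
          bytes_list=tf.train.BytesList(value=[encoded_jpeg])),
      'image/format': tf.train.Feature(
          bytes_list=tf.train.BytesList(value=[b'jpeg'])),
      'image/object/bbox/xmin': tf.train.Feature(
          float_list=tf.train.FloatList(value=xmins)),
      'image/object/bbox/xmax': tf.train.Feature(
          float_list=tf.train.FloatList(value=xmaxs)),
      'image/object/bbox/ymin': tf.train.Feature(
          float_list=tf.train.FloatList(value=ymins)),
      'image/object/bbox/ymax': tf.train.Feature(
          float_list=tf.train.FloatList(value=ymaxs)),
      'image/object/class/label': tf.train.Feature(
          int64_list=tf.train.Int64List(value=class_ids)),
      'image/object/class/text': tf.train.Feature(
          bytes_list=tf.train.BytesList(value=class_names)),
  }))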
Here is the config I use:
model {
  faster_rcnn {
    num_classes: 1
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 600
        max_dimension: 1024
      }
    }
    feature_extractor {
      type: 'faster_rcnn_inception_resnet_v2'
      first_stage_features_stride: 16
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 8
        width_stride: 8
      }
    }
    first_stage_atrous_rate: 2
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.7
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 17
    maxpool_kernel_size: 1
    maxpool_stride: 1
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}
train_config: {
  batch_size: 1
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.0003
          schedule {
            step: 0
            learning_rate: .0003
          }
          schedule {
            step: 900000
            learning_rate: .00003
          }
          schedule {
            step: 1200000
            learning_rate: .000003
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}
train_input_reader: {
  tf_record_input_reader {
    input_path: "/workspace/data/2/train_*.tfrecord"
  }
  label_map_path: "/workspace/data/2/label_map.pbtxt"
}
eval_config: {
  num_examples: 30000
}
eval_input_reader: {
  tf_record_input_reader {
    input_path: "/workspace/data/2/validation_*.tfrecord"
  }
  label_map_path: "/workspace/data/2/label_map.pbtxt"
  shuffle: false
  num_readers: 1
}
After regenerating the tfrecords with the label ids one-based, it's working.
So, the thing to keep in mind when using a custom dataset/config: label ids must start at 1, both in the label map and in the class labels written to the tfrecords. A quick sanity check is sketched below.
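This sketch scans a tfrecord shard and reports the smallest class label, which should be 1, never 0 (the shard path is hypothetical):

import tensorflow as tf

# Sketch: iterate over a tfrecord shard and return the smallest
# 'image/object/class/label' value found.
def min_class_id(tfrecord_path):
  smallest = None
  for record in tf.python_io.tf_record_iterator(tfrecord_path):
    example = tf.train.Example()
    example.ParseFromString(record)
    labels = example.features.feature['image/object/class/label'].int64_list.value
    for label in labels:
      smallest = label if smallest is None else min(smallest, label)
  return smallest

# Hypothetical shard name under the glob used in the config:
# print(min_class_id('/workspace/data/2/train_00000.tfrecord'))  # expect 1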
I've been fighting this bug for two days. One thing I learned is that moving bounding boxes away from the image edges reduced the chance of hitting it.
For example, bounding boxes of (0.001, 0.999, 0.001, 0.999) crash, while (0.2, 0.8, 0.2, 0.8) does not trigger the error.
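For anyone who wants to try that workaround, this is roughly what I mean; a sketch only, with an arbitrarily chosen margin, and it treats the symptom rather than the root cause:

# Sketch of the workaround: nudge normalized box coordinates away from the
# image edges before writing them out. The 0.01 margin is an arbitrary choice.
def clamp_box(ymin, xmin, ymax, xmax, margin=0.01):
  clamp = lambda v: min(max(v, margin), 1.0 - margin)
  return clamp(ymin), clamp(xmin), clamp(ymax), clamp(xmax)

# Example: clamp_box(0.001, 0.001, 0.999, 0.999) -> (0.01, 0.01, 0.99, 0.99)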
Hi there,
We are checking to see if you still need help on this, as this seems to be a considerably old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing.
If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing it.