Tensor had Inf values while training faster_rcnn_inception_resnet_v2_atrous_coco on my own dataset.
Error log tmux-history-crash.txt
config
model {
  faster_rcnn {
    num_classes: 5
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 1200
        max_dimension: 1200
      }
    }
    feature_extractor {
      type: 'faster_rcnn_inception_resnet_v2'
      first_stage_features_stride: 8
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 8
        width_stride: 8
      }
    }
    first_stage_atrous_rate: 2
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.7
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 17
    maxpool_kernel_size: 1
    maxpool_stride: 1
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}
train_config: {
  batch_size: 1
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.0003
          schedule {
            step: 900000
            learning_rate: .00003
          }
          schedule {
            step: 1200000
            learning_rate: .000003
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  fine_tune_checkpoint: "zoo/faster_rcnn_inception_resnet_v2_atrous_coco/model.ckpt"
  from_detection_checkpoint: true
  num_steps: 200000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}
train_input_reader: {
  tf_record_input_reader {
    input_path: "train.record"
  }
  label_map_path: "label_map.pbtxt"
}
eval_config: {
  num_examples: 1200
  max_evals: 10
}
eval_input_reader: {
  tf_record_input_reader {
    input_path: "eval.record"
  }
  label_map_path: "label_map.pbtxt"
  shuffle: false
  num_readers: 1
}
Error log
`
INFO:tensorflow:loss = 0.0033215743, step = 63717 (1.122 sec)
INFO:tensorflow:global_step/sec: 0.985096
INFO:tensorflow:loss = 0.3018198, step = 63718 (1.015 sec)
INFO:tensorflow:global_step/sec: 0.891359
INFO:tensorflow:loss = 0.07413207, step = 63719 (1.122 sec)
INFO:tensorflow:global_step/sec: 0.901555
INFO:tensorflow:loss = 0.010225122, step = 63720 (1.109 sec)
INFO:tensorflow:global_step/sec: 0.895415
INFO:tensorflow:loss = 0.05728304, step = 63721 (1.117 sec)
INFO:tensorflow:global_step/sec: 1.08617
INFO:tensorflow:loss = 0.006452726, step = 63722 (0.921 sec)
INFO:tensorflow:global_step/sec: 0.889418
INFO:tensorflow:loss = 0.014497861, step = 63723 (1.124 sec)
INFO:tensorflow:global_step/sec: 0.896101
INFO:tensorflow:loss = 0.02584396, step = 63724 (1.116 sec)
INFO:tensorflow:global_step/sec: 0.892287
INFO:tensorflow:loss = 0.015612617, step = 63725 (1.121 sec)
INFO:tensorflow:global_step/sec: 0.897164
INFO:tensorflow:loss = 0.011534013, step = 63726 (1.115 sec)
INFO:tensorflow:global_step/sec: 0.899451
INFO:tensorflow:loss = 0.024849497, step = 63727 (1.112 sec)
INFO:tensorflow:global_step/sec: 0.917954
INFO:tensorflow:loss = 0.23693836, step = 63728 (1.089 sec)
INFO:tensorflow:global_step/sec: 0.911387
INFO:tensorflow:loss = 0.09265502, step = 63729 (1.097 sec)
2019-02-09 18:43:08.167044: E tensorflow/core/kernels/check_numerics_op.cc:185] abnormal_detected_host @0x7f703ba43100 = {0, 1} Found Inf or NaN global norm.
Traceback (most recent call last):
File "/home/safetyml/venv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/home/safetyml/venv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/safetyml/venv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Found Inf or NaN global norm. : Tensor had Inf values
[[{{node VerifyFinite/CheckNumerics}} = CheckNumerics[T=DT_FLOAT, message="Found Inf or NaN global norm.", _device="/job:localhost/replica:0/task:0/device:GPU:0"](global_norm_1/global_norm)]]
[[{{node control_dependency/_12753}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_26797_control_dependency", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "object_detection/model_main.py", line 111, in <module>
tf.app.run()
File "/home/safetyml/venv/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "object_detection/model_main.py", line 107, in main
tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
File "/home/safetyml/venv/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 471, in train_and_evaluate
return executor.run()
File "/home/safetyml/venv/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 610, in run
return self.run_local()
File "/home/safetyml/venv/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 711, in run_local
saving_listeners=saving_listeners)
File "/home/safetyml/venv/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 354, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/safetyml/venv/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1207, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/home/safetyml/venv/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1241, in _train_model_default
saving_listeners)
File "/home/safetyml/venv/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1471, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/home/safetyml/venv/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 671, in run
run_metadata=run_metadata)
File "/home/safetyml/venv/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1156, in run
run_metadata=run_metadata)
File "/home/safetyml/venv/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
raise six.reraise(original_exc_info)
File "/home/safetyml/venv/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/home/safetyml/venv/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1240, in run
return self._sess.run(*args, **kwargs)
File "/home/safetyml/venv/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1312, in run
run_metadata=run_metadata)
File "/home/safetyml/venv/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1076, in run
return self._sess.run(*args, **kwargs)
File "/home/safetyml/venv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/home/safetyml/venv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/home/safetyml/venv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/home/safetyml/venv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Found Inf or NaN global norm. : Tensor had Inf values
[[node VerifyFinite/CheckNumerics (defined at /home/safetyml/venv/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/optimizers.py:306) = CheckNumerics[T=DT_FLOAT, message="Found Inf or NaN global norm.", _device="/job:localhost/replica:0/task:0/device:GPU:0"](global_norm_1/global_norm)]]
[[{{node control_dependency/_12753}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_26797_control_dependency", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Caused by op 'VerifyFinite/CheckNumerics', defined at:
File "object_detection/model_main.py", line 111, in <module>
tf.app.run()
File "/home/safetyml/venv/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "object_detection/model_main.py", line 107, in main
tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
File "/home/safetyml/venv/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 471, in train_and_evaluate
return executor.run()
File "/home/safetyml/venv/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 610, in run
return self.run_local()
File "/home/safetyml/venv/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 711, in run_local
saving_listeners=saving_listeners)
File "/home/safetyml/venv/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 354, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/safetyml/venv/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1207, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/home/safetyml/venv/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1237, in _train_model_default
features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
File "/home/safetyml/venv/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1195, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/home/safetyml/models/research/object_detection/model_lib.py", line 382, in model_fn
name='') # Preventing scope prefix on all variables.
File "/home/safetyml/venv/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/optimizers.py", line 260, in optimize_loss
gradients = _clip_gradients_by_norm(gradients, clip_gradients)
File "/home/safetyml/venv/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/optimizers.py", line 306, in _clip_gradients_by_norm
clipped_gradients, _ = clip_ops.clip_by_global_norm(gradients, clip_gradients)
File "/home/safetyml/venv/lib/python3.6/site-packages/tensorflow/python/ops/clip_ops.py", line 265, in clip_by_global_norm
"Found Inf or NaN global norm.")
File "/home/safetyml/venv/lib/python3.6/site-packages/tensorflow/python/ops/numerics.py", line 47, in verify_tensor_all_finite
verify_input = array_ops.check_numerics(t, message=msg)
File "/home/safetyml/venv/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 817, in check_numerics
"CheckNumerics", tensor=tensor, message=message, name=name)
File "/home/safetyml/venv/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/safetyml/venv/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/home/safetyml/venv/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/home/safetyml/venv/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()
InvalidArgumentError (see above for traceback): Found Inf or NaN global norm. : Tensor had Inf values
[[node VerifyFinite/CheckNumerics (defined at /home/safetyml/venv/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/optimizers.py:306) = CheckNumerics[T=DT_FLOAT, message="Found Inf or NaN global norm.", _device="/job:localhost/replica:0/task:0/device:GPU:0"](global_norm_1/global_norm)]]
[[{{node control_dependency/_12753}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_26797_control_dependency", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
`
Please consider closing this issue. I've found the root cause: a mismatch between label_map.pbtxt and the TFRecord.
Can you tell me what kind of mismatch this was? I'm facing the same issue.
In my case, I had 4 classes in label_map.pbtxt but I had 5 classes in TFRecord. Once I updated label_map.pbtxt, the problem was fixed.
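A mismatch like this can be caught before training starts. The following is only a minimal sketch, not part of the Object Detection API: `label_map_ids` and `find_unmapped_ids` are hypothetical helpers, the label map is parsed with a plain regex rather than the protobuf API, and the class IDs stored in the TFRecord (the `image/object/class/label` feature in the usual Object Detection API layout) are assumed to have been collected into a Python iterable already.

```python
import re


def label_map_ids(pbtxt_text):
    """Return the set of integer ids declared in a label_map.pbtxt string."""
    # Matches fields such as `id: 3` inside each `item { ... }` block.
    return {int(m) for m in re.findall(r"\bid\s*:\s*(\d+)", pbtxt_text)}


def find_unmapped_ids(record_ids, pbtxt_text):
    """Return sorted class ids used in the records but missing from the label map."""
    return sorted(set(record_ids) - label_map_ids(pbtxt_text))


# Hypothetical data: a 4-class label map, but annotations that use 5 classes.
example_label_map = """
item { id: 1 name: 'a' }
item { id: 2 name: 'b' }
item { id: 3 name: 'c' }
item { id: 4 name: 'd' }
"""
example_record_ids = [1, 2, 5, 3, 4, 5]

print(find_unmapped_ids(example_record_ids, example_label_map))  # prints [5]
```

An empty result means every class ID in the records has an entry in the label map; any ID that is printed needs either a new `item` in label_map.pbtxt or a fix in the records.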
I ran into this issue as well. I specified the correct number of classes in my label_map.pbtxt, but I did not update the "num_classes" field in my config file. Setting the correct number in the config file fixed the problem.
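The `num_classes` check can be scripted in the same spirit. This is a rough sketch under stated assumptions: `config_num_classes`, `label_map_item_count`, and `num_classes_matches` are hypothetical helpers, both files are read as plain text and matched with regexes instead of being parsed as protobufs, and the label map is assumed to contain one `item { ... }` block per class.

```python
import re


def config_num_classes(config_text):
    """Return the first num_classes value found in a pipeline config string."""
    match = re.search(r"\bnum_classes\s*:\s*(\d+)", config_text)
    if match is None:
        raise ValueError("no num_classes field found in config")
    return int(match.group(1))


def label_map_item_count(pbtxt_text):
    """Count the `item {` blocks in a label_map.pbtxt string."""
    return len(re.findall(r"\bitem\s*\{", pbtxt_text))


def num_classes_matches(config_text, pbtxt_text):
    """True when the config's num_classes equals the number of label map items."""
    return config_num_classes(config_text) == label_map_item_count(pbtxt_text)


# Hypothetical data: the config says 5 classes, the label map lists 4.
sample_config = "model { faster_rcnn { num_classes: 5 } }"
sample_label_map = """
item { id: 1 name: 'a' }
item { id: 2 name: 'b' }
item { id: 3 name: 'c' }
item { id: 4 name: 'd' }
"""

print(num_classes_matches(sample_config, sample_label_map))  # prints False
```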
I ran into this problem as well and fixed it. Sometimes the error may come from an inconsistency between the record file and the pbtxt file, but I checked mine and they were definitely fine. In the end I trained on 2 GPUs with 20 GB of memory in total, and it ran well. So it is worth checking whether your program is running out of memory. Hope this helps.
Closing this issue since it's resolved. Thanks!
I am trying to train a custom object detector using Google Colab. After fixing a few errors, I ran this command to start training:
!python model_main.py --logtostderr --model_dir=training/ --pipeline_config_path=training/faster_rcnn_inception_v2_pets.config
I am getting the following error:
InvalidArgumentError (see above for traceback): Found Inf or NaN global norm. : Tensor had Inf values
[[node VerifyFinite/CheckNumerics (defined at /usr/local/lib/python3.6/dist-packages/object_detection-0.1-py3.6.egg/object_detection/model_lib.py:382) ]]
[[node control_dependency (defined at /usr/local/lib/python3.6/dist-packages/object_detection-0.1-py3.6.egg/object_detection/model_lib.py:382) ]]
Please help me resolve this error.
I have checked for a mismatch between the .pbtxt file and the TFRecord, and compared the number of classes in the config and the .pbtxt file. Everything is fine.