When I start training on my own image dataset using ssd_mobilenet_v2_coco_2018_03_29, training throws the exception below after step 667:
INFO:tensorflow:Error reported to Coordinator: Nan in summary histogram for: ModelVars/BoxPredictor_5/ClassPredictor_depthwise/BatchNorm/moving_variance
[[Node: ModelVars/BoxPredictor_5/ClassPredictor_depthwise/BatchNorm/moving_variance = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](ModelVars/BoxPredictor_5/ClassPredictor_depthwise/BatchNorm/moving_variance/tag, BoxPredictor_5/ClassPredictor_depthwise/BatchNorm/moving_variance/read)]]
[[Node: FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/kernel/Regularizer/l2_regularizer/_119 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_1124_...egularizer", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
Caused by op 'ModelVars/BoxPredictor_5/ClassPredictor_depthwise/BatchNorm/moving_variance', defined at:
File "train.py", line 167, in <module>
tf.app.run()
File "/home/adil/tensorflow/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "train.py", line 163, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/home/adil/workspace/models/research/object_detection/trainer.py", line 338, in train
model_var.op.name, model_var))
File "/home/adil/tensorflow/lib/python3.5/site-packages/tensorflow/python/summary/summary.py", line 203, in histogram
tag=tag, values=values, name=scope)
File "/home/adil/tensorflow/lib/python3.5/site-packages/tensorflow/python/ops/gen_logging_ops.py", line 283, in histogram_summary
"HistogramSummary", tag=tag, values=values, name=name)
File "/home/adil/tensorflow/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/adil/tensorflow/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3392, in create_op
op_def=op_def)
File "/home/adil/tensorflow/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1718, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): Nan in summary histogram for: ModelVars/BoxPredictor_5/ClassPredictor_depthwise/BatchNorm/moving_variance
What could be the reason, and how can I fix it? Thanks!
I have the same problem. Did you find the reason? Thank you!
@timoonboru First check that every image is readable and is not an interrupted (incomplete) file. Some interrupted images can still be opened on Windows but not on Linux. The best way to check is to read every image with OpenCV and, if OpenCV can read it, write it back to disk in JPEG format. An image that OpenCV can read is OK; one that it cannot read will cause the problem above.
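As a quick pre-check before the OpenCV pass described in the comment above, here is a minimal sketch that flags JPEG files that look interrupted. It is not the OpenCV method itself: it only tests whether each file starts with the JPEG start-of-image marker (FF D8) and ends with the end-of-image marker (FF D9), so treat it as a heuristic. The function names and the directory layout are assumptions made for illustration.

```python
import os

JPEG_SOI = b"\xff\xd8"  # JPEG start-of-image marker
JPEG_EOI = b"\xff\xd9"  # JPEG end-of-image marker


def looks_like_complete_jpeg(path):
    """Heuristic: True if the file starts with SOI and ends with EOI."""
    with open(path, "rb") as f:
        data = f.read()
    # Some writers pad files with trailing zero bytes; ignore them.
    data = data.rstrip(b"\x00")
    return (
        len(data) >= 4
        and data.startswith(JPEG_SOI)
        and data.endswith(JPEG_EOI)
    )


def find_suspect_jpegs(image_dir):
    """Walk image_dir and return paths of .jpg/.jpeg files that fail the check."""
    suspects = []
    for root, _dirs, files in os.walk(image_dir):
        for name in sorted(files):
            if name.lower().endswith((".jpg", ".jpeg")):
                path = os.path.join(root, name)
                if not looks_like_complete_jpeg(path):
                    suspects.append(path)
    return suspects
```

Calling `find_suspect_jpegs("path/to/images")` (a placeholder path) returns the files worth re-checking or re-encoding with OpenCV. A file that passes this test can still be corrupt in other ways, so it does not replace the full read-and-rewrite check.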
Closing as this is resolved; feel free to reopen if the problem persists.
I ran into this problem too. When I changed batch_size from 1 to a larger value such as 64, it worked.
I solved a similar issue just by changing batch_size from 1 to 2.
@TheoPaput Thank you. I tried it and it solved my problem.
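For anyone applying the batch_size change reported above, here is a minimal sketch that rewrites the value in a pipeline config. It assumes the config is in the usual text format with a `train_config` block containing a `batch_size: N` line. The helper name `set_train_batch_size` is hypothetical and not part of the Object Detection API, and the sample text is made up for illustration.

```python
import re


def set_train_batch_size(config_text, new_batch_size):
    """Return config_text with the first batch_size after train_config replaced."""
    start = config_text.find("train_config")
    if start == -1:
        raise ValueError("no train_config block found")
    head, tail = config_text[:start], config_text[start:]
    # Only touch the first batch_size entry that follows train_config.
    new_tail, count = re.subn(
        r"batch_size\s*:\s*\d+",
        "batch_size: %d" % new_batch_size,
        tail,
        count=1,
    )
    if count == 0:
        raise ValueError("no batch_size entry found in train_config")
    return head + new_tail


# Illustrative config fragment, not a complete pipeline config.
sample = "model {\n  ssd {\n  }\n}\ntrain_config: {\n  batch_size: 1\n}\n"
updated = set_train_batch_size(sample, 2)
```

Here `updated` holds the same text with `batch_size: 2` inside `train_config`. Editing the config file by hand has the same effect; the value that fits depends on available GPU memory.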