Models: Error reported to Coordinator: Nan in summary histogram for: ModelVars/BoxPredictor_5/ClassPredictor_depthwise/BatchNorm/moving_variance

Created on 11 May 2018 · 6 comments · Source: tensorflow/models

System information

  • What is the top-level directory of the model you are using: object-detection
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): v1.8.0-0-g93bc2e2072 1.8.0
  • Bazel version (if compiling from source): --
  • CUDA/cuDNN version: 9.0, 7.1.3
  • GPU model and memory: Tesla P40, totalMemory: 22.38GiB freeMemory: 22.21GiB
  • Exact command to reproduce:

When I start training on my own image dataset using ssd_mobilenet_v2_coco_2018_03_29, it throws the exception below after step 667:

INFO:tensorflow:Error reported to Coordinator: Nan in summary histogram for: ModelVars/BoxPredictor_5/ClassPredictor_depthwise/BatchNorm/moving_variance
     [[Node: ModelVars/BoxPredictor_5/ClassPredictor_depthwise/BatchNorm/moving_variance = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](ModelVars/BoxPredictor_5/ClassPredictor_depthwise/BatchNorm/moving_variance/tag, BoxPredictor_5/ClassPredictor_depthwise/BatchNorm/moving_variance/read)]]
     [[Node: FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/kernel/Regularizer/l2_regularizer/_119 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_1124_...egularizer", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Caused by op 'ModelVars/BoxPredictor_5/ClassPredictor_depthwise/BatchNorm/moving_variance', defined at:
  File "train.py", line 167, in <module>
    tf.app.run()
  File "/home/adil/tensorflow/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "train.py", line 163, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/home/adil/workspace/models/research/object_detection/trainer.py", line 338, in train
    model_var.op.name, model_var))
  File "/home/adil/tensorflow/lib/python3.5/site-packages/tensorflow/python/summary/summary.py", line 203, in histogram
    tag=tag, values=values, name=scope)
  File "/home/adil/tensorflow/lib/python3.5/site-packages/tensorflow/python/ops/gen_logging_ops.py", line 283, in histogram_summary
    "HistogramSummary", tag=tag, values=values, name=name)
  File "/home/adil/tensorflow/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/adil/tensorflow/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3392, in create_op
    op_def=op_def)
  File "/home/adil/tensorflow/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1718, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Nan in summary histogram for: ModelVars/BoxPredictor_5/ClassPredictor_depthwise/BatchNorm/moving_variance

What could be the reason, and how can I fix it? Thanks!

Most helpful comment

I ran into this problem too. When I changed batch_size from 1 to a larger value such as 64, it worked.

All 6 comments

I have the same problem. Did you find the reason? Thank you!

@timoonboru You can first check that every image is readable and that none of them is truncated; some truncated images can still be opened on Windows but not on Linux. A reliable way to verify the dataset is to read every image with OpenCV and, if it loads, write it back to disk in JPEG format. If OpenCV can read an image, it is fine; if it cannot, that image can trigger the error above.
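A minimal sketch of that check, assuming the dataset is a single directory of JPEG/PNG files; the directory path and extensions below are placeholders:

    import glob
    import os

    import cv2  # pip install opencv-python

    IMAGE_DIR = "path/to/your/images"  # placeholder: point at your dataset
    EXTENSIONS = (".jpg", ".jpeg", ".png")

    bad_images = []
    for path in glob.glob(os.path.join(IMAGE_DIR, "*")):
        if not path.lower().endswith(EXTENSIONS):
            continue
        img = cv2.imread(path)  # returns None for unreadable or corrupt files
        if img is None or img.size == 0:
            bad_images.append(path)
            continue
        # Re-encode the image as JPEG so partially written files are replaced
        # by clean ones that the decoder can handle.
        cv2.imwrite(os.path.splitext(path)[0] + ".jpg", img)

    print("Corrupt or unreadable images:", bad_images)

Anything that ends up in bad_images should be dropped before regenerating the TFRecords and retraining.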

Closing as this is resolved; feel free to reopen if the problem persists.

I ran into this problem too. When I changed batch_size from 1 to a larger value such as 64, it worked.

I solved a similar issue just by changing batch_size from 1 to 2.

@TheoPaput Thank you. I tried it and it solved my problem.
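For reference, the batch size is set in the train_config block of the pipeline .config file passed to train.py; a minimal sketch of the change (the rest of the config is omitted, and the value shown is only an example):

    train_config: {
      batch_size: 2  # was 1; several users report the NaN disappears with a larger batch
      ...
    }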
