Models: Error reported to Coordinator: Nan in summary histogram for: ModelVars/BoxPredictor_5/ClassPredictor_depthwise/BatchNorm/moving_variance

Created on 11 May 2018 · 6 comments · Source: tensorflow/models

System information

  • What is the top-level directory of the model you are using: object-detection
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): v1.8.0-0-g93bc2e2072 1.8.0
  • Bazel version (if compiling from source): --
  • CUDA/cuDNN version: 9.0, 7.1.3
  • GPU model and memory: Tesla P40, totalMemory: 22.38GiB freeMemory: 22.21GiB
  • Exact command to reproduce:

When I start training on my own image dataset using ssd_mobilenet_v2_coco_2018_03_29, it throws the exception below after step 667:

INFO:tensorflow:Error reported to Coordinator: Nan in summary histogram for: ModelVars/BoxPredictor_5/ClassPredictor_depthwise/BatchNorm/moving_variance
     [[Node: ModelVars/BoxPredictor_5/ClassPredictor_depthwise/BatchNorm/moving_variance = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](ModelVars/BoxPredictor_5/ClassPredictor_depthwise/BatchNorm/moving_variance/tag, BoxPredictor_5/ClassPredictor_depthwise/BatchNorm/moving_variance/read)]]
     [[Node: FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/kernel/Regularizer/l2_regularizer/_119 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_1124_...egularizer", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Caused by op 'ModelVars/BoxPredictor_5/ClassPredictor_depthwise/BatchNorm/moving_variance', defined at:
  File "train.py", line 167, in <module>
    tf.app.run()
  File "/home/adil/tensorflow/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "train.py", line 163, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/home/adil/workspace/models/research/object_detection/trainer.py", line 338, in train
    model_var.op.name, model_var))
  File "/home/adil/tensorflow/lib/python3.5/site-packages/tensorflow/python/summary/summary.py", line 203, in histogram
    tag=tag, values=values, name=scope)
  File "/home/adil/tensorflow/lib/python3.5/site-packages/tensorflow/python/ops/gen_logging_ops.py", line 283, in histogram_summary
    "HistogramSummary", tag=tag, values=values, name=name)
  File "/home/adil/tensorflow/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/adil/tensorflow/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3392, in create_op
    op_def=op_def)
  File "/home/adil/tensorflow/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1718, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Nan in summary histogram for: ModelVars/BoxPredictor_5/ClassPredictor_depthwise/BatchNorm/moving_variance

What could be the reason, and how can I fix it? Thanks!

Most helpful comment

I ran into this problem too. When I changed batch_size from 1 to a larger value such as 64, it worked.

All 6 comments

I have the same problem. Did you find the reason? Thank you!

@timoonboru You can first check that every image is readable and that none of them is truncated; some truncated images can still be opened on Windows but not on Linux. A reliable way to verify the dataset is to read every image with OpenCV and, if it loads, write it back to disk in JPEG format. If OpenCV can read an image, it is fine; if it cannot, that image can trigger the error above.
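A minimal sketch of that check, assuming the dataset is a single directory of JPEG/PNG files; the directory path and extensions below are placeholders:

    import glob
    import os

    import cv2  # pip install opencv-python

    IMAGE_DIR = "path/to/your/images"  # placeholder: point at your dataset
    EXTENSIONS = (".jpg", ".jpeg", ".png")

    bad_images = []
    for path in glob.glob(os.path.join(IMAGE_DIR, "*")):
        if not path.lower().endswith(EXTENSIONS):
            continue
        img = cv2.imread(path)  # returns None for unreadable or corrupt files
        if img is None or img.size == 0:
            bad_images.append(path)
            continue
        # Re-encode the image as JPEG so partially written files are replaced
        # by clean ones that the decoder can handle.
        cv2.imwrite(os.path.splitext(path)[0] + ".jpg", img)

    print("Corrupt or unreadable images:", bad_images)

Anything that ends up in bad_images should be dropped before regenerating the TFRecords and retraining.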

Closing as this is resolved; feel free to reopen if the problem persists.

I ran into this problem too. When I changed batch_size from 1 to a larger value such as 64, it worked.

I solved a similar issue just by changing batch_size from 1 to 2.

@TheoPaput Thank you. I tried it and it solved my problem.
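For reference, the batch size is set in the train_config block of the pipeline .config file passed to train.py; a minimal sketch of the change (the rest of the config is omitted, and the value shown is only an example):

    train_config: {
      batch_size: 2  # was 1; several users report the NaN disappears with a larger batch
      ...
    }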
