Hi,
I'm trying to train DeepLab on the Cityscapes dataset. I successfully converted the dataset to TFRecord and started training, but a NaN error occurred around step 1400 and stopped the training. Could you please tell me how to solve this error? Thanks a lot!
Environment: nvidia-docker, tensorflow/tensorflow:latest-gpu (Ubuntu 16.04, TF 1.6, CUDA 8, NVIDIA GTX 1080 Ti)
python deeplab/train.py \
--logtostderr \
--training_number_of_steps=90000 \
--train_split="train" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--train_crop_size=769 \
--train_crop_size=769 \
--train_batch_size=1 \
--dataset="cityscapes" \
--tf_initial_checkpoint="deeplab/model/deeplabv3_cityscapes_train/model.ckpt.index" \
--train_logdir="deeplab/train_log" \
--dataset_dir="deeplab/datasets/cityscapes/tfrecord"
INFO:tensorflow:global step 1420: loss = 6.7608 (0.420 sec/step)
INFO:tensorflow:Error reported to Coordinator: Nan in summary histogram for: image_pooling/BatchNorm/moving_variance_1
[[Node: image_pooling/BatchNorm/moving_variance_1 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](image_pooling/BatchNorm/moving_variance_1/tag, image_pooling/BatchNorm/moving_variance/read)]]
[[Node: xception_65/middle_flow/block1/unit_16/xception_module/separable_conv3_depthwise/depthwise_weights/read/_877 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_2313_...ights/read", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
Caused by op u'image_pooling/BatchNorm/moving_variance_1', defined at:
File "deeplab/train.py", line 347, in
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "deeplab/train.py", line 268, in main
summaries.add(tf.summary.histogram(model_var.op.name, model_var))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/summary/summary.py", line 193, in histogram
tag=tag, values=values, name=scope)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_logging_ops.py", line 189, in _histogram_summary
"HistogramSummary", tag=tag, values=values, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3271, in create_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1650, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
As Chen said in https://stackoverflow.com/a/49260201/9498482, I set fine_tune_batch_norm to False (in train.py), and that works.
Closing as this is resolved.
When fine_tune_batch_norm=True, use a batch size of at least 12 (a batch size larger than 16 is better). Otherwise, use a smaller batch size and set fine_tune_batch_norm=False. The default value of fine_tune_batch_norm is True, so with the default setting the train batch size should be larger than 12.
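For anyone with limited GPU memory, the workaround can also be passed on the command line instead of editing train.py. Below is the command from the top of this issue with only --fine_tune_batch_norm=false added (a sketch assuming the flag name defined in deeplab/train.py; all other flags and paths are unchanged from the original report):
python deeplab/train.py \
--logtostderr \
--training_number_of_steps=90000 \
--train_split="train" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--train_crop_size=769 \
--train_crop_size=769 \
--train_batch_size=1 \
--fine_tune_batch_norm=false \
--dataset="cityscapes" \
--tf_initial_checkpoint="deeplab/model/deeplabv3_cityscapes_train/model.ckpt.index" \
--train_logdir="deeplab/train_log" \
--dataset_dir="deeplab/datasets/cityscapes/tfrecord"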
I have set fine_tune_batch_norm=False and am still getting the same issue. My GPU memory doesn't allow a batch size of 12 or larger, so I can't set it to True.
Setting the batch size to 2 and decreasing the number of steps worked for me.