Models: InvalidArgumentError: Nan in summary histogram

Created on 15 Mar 2018 · 5 Comments · Source: tensorflow/models

Hi,
I'm trying to train on the Cityscapes dataset with DeepLab. I successfully converted the dataset to TFRecord and started training, but a NaN error occurred around step 1400 and broke the training. Could you please tell me how to solve this error? Thanks a lot!

environment

nvidia-docker, tensorflow/tensorflow:latest-gpu (Ubuntu 16.04, TF 1.6, CUDA 8, NVIDIA 1080 Ti)

command

python deeplab/train.py \
--logtostderr \
--training_number_of_steps=90000 \
--train_split="train" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--train_crop_size=769 \
--train_crop_size=769 \
--train_batch_size=1 \
--dataset="cityscapes" \
--tf_initial_checkpoint="deeplab/model/deeplabv3_cityscapes_train/model.ckpt.index" \
--train_logdir="deeplab/train_log" \
--dataset_dir="deeplab/datasets/cityscapes/tfrecord"

logs

INFO:tensorflow:global step 1420: loss = 6.7608 (0.420 sec/step)
INFO:tensorflow:Error reported to Coordinator: Nan in summary histogram for: image_pooling/BatchNorm/moving_variance_1
[[Node: image_pooling/BatchNorm/moving_variance_1 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](image_pooling/BatchNorm/moving_variance_1/tag, image_pooling/BatchNorm/moving_variance/read)]]
[[Node: xception_65/middle_flow/block1/unit_16/xception_module/separable_conv3_depthwise/depthwise_weights/read/_877 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_2313_...ights/read", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
Caused by op u'image_pooling/BatchNorm/moving_variance_1', defined at:
File "deeplab/train.py", line 347, in
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "deeplab/train.py", line 268, in main
summaries.add(tf.summary.histogram(model_var.op.name, model_var))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/summary/summary.py", line 193, in histogram
tag=tag, values=values, name=scope)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_logging_ops.py", line 189, in _histogram_summary
"HistogramSummary", tag=tag, values=values, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3271, in create_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1650, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

All 5 comments

https://stackoverflow.com/a/49260201/9498482 — as Chen said, I set fine_tune_batch_norm to False (in train.py), and that works.
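
For anyone else hitting this, a minimal sketch of that fix applied to the command from the original report: the only change is adding --fine_tune_batch_norm=False (the flag defined in deeplab/train.py), which is equivalent to editing the flag's default in train.py; every other flag is copied exactly as posted above.

python deeplab/train.py \
--logtostderr \
--training_number_of_steps=90000 \
--train_split="train" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--train_crop_size=769 \
--train_crop_size=769 \
--train_batch_size=1 \
--fine_tune_batch_norm=False \
--dataset="cityscapes" \
--tf_initial_checkpoint="deeplab/model/deeplabv3_cityscapes_train/model.ckpt.index" \
--train_logdir="deeplab/train_log" \
--dataset_dir="deeplab/datasets/cityscapes/tfrecord"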

Closing as this is resolved

When fine_tune_batch_norm=True, use a batch size larger than 12 (a batch size of 16 or more is better). Otherwise, use a smaller batch size and set fine_tune_batch_norm=False. The default value of fine_tune_batch_norm is True, so with the default the train batch size must be larger than 12.
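
In flag terms, that guidance comes down to one of two combinations (a sketch only; the batch size of 16 assumes your GPU memory can hold it at the 769x769 crop size, and the remaining flags stay as in the original command):

# Option A: fine-tune the batch norm statistics -- needs a large batch
--train_batch_size=16 --fine_tune_batch_norm=True

# Option B: small batch -- keep the pretrained batch norm statistics frozen
--train_batch_size=1 --fine_tune_batch_norm=False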

I have set fine_tune_batch_norm=False and am still getting the same issue. My GPU memory doesn't allow a batch size of 12 or larger, so I can't set it to True.

Setting the batch size to 2 and decreasing the number of steps worked for me.
