Hi,
I'm trying to train DeepLab on the Cityscapes dataset. I successfully converted the dataset to TFRecord and started training, but a NaN error occurred around step 1400 and stopped the training. Could you please tell me how to solve this error? Thanks a lot!
Environment: nvidia-docker, tensorflow/tensorflow:latest-gpu (Ubuntu 16.04, TF 1.6, CUDA 8, NVIDIA GTX 1080 Ti)
python deeplab/train.py \
--logtostderr \
--training_number_of_steps=90000 \
--train_split="train" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--train_crop_size=769 \
--train_crop_size=769 \
--train_batch_size=1 \
--dataset="cityscapes" \
--tf_initial_checkpoint="deeplab/model/deeplabv3_cityscapes_train/model.ckpt.index" \
--train_logdir="deeplab/train_log" \
--dataset_dir="deeplab/datasets/cityscapes/tfrecord"
INFO:tensorflow:global step 1420: loss = 6.7608 (0.420 sec/step)
INFO:tensorflow:Error reported to Coordinator: Nan in summary histogram for: image_pooling/BatchNorm/moving_variance_1
[[Node: image_pooling/BatchNorm/moving_variance_1 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](image_pooling/BatchNorm/moving_variance_1/tag, image_pooling/BatchNorm/moving_variance/read)]]
[[Node: xception_65/middle_flow/block1/unit_16/xception_module/separable_conv3_depthwise/depthwise_weights/read/_877 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_2313_...ights/read", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
Caused by op u'image_pooling/BatchNorm/moving_variance_1', defined at:
File "deeplab/train.py", line 347, in
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "deeplab/train.py", line 268, in main
summaries.add(tf.summary.histogram(model_var.op.name, model_var))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/summary/summary.py", line 193, in histogram
tag=tag, values=values, name=scope)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_logging_ops.py", line 189, in _histogram_summary
"HistogramSummary", tag=tag, values=values, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3271, in create_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1650, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
As Chen said in https://stackoverflow.com/a/49260201/9498482, I set fine_tune_batch_norm to False (in train.py), and that works.
Closing as this is resolved.
When fine_tune_batch_norm=True, use a batch size of at least 12 (a batch size larger than 16 is better). Otherwise, use a smaller batch size and set fine_tune_batch_norm=False. The default value of fine_tune_batch_norm is True, so with the default setting the train batch size should be larger than 12.
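For anyone with limited GPU memory, the workaround can also be passed on the command line instead of editing train.py. Below is the command from the top of this issue with only --fine_tune_batch_norm=false added (a sketch assuming the flag name defined in deeplab/train.py; all other flags and paths are unchanged from the original report):
python deeplab/train.py \
--logtostderr \
--training_number_of_steps=90000 \
--train_split="train" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--train_crop_size=769 \
--train_crop_size=769 \
--train_batch_size=1 \
--fine_tune_batch_norm=false \
--dataset="cityscapes" \
--tf_initial_checkpoint="deeplab/model/deeplabv3_cityscapes_train/model.ckpt.index" \
--train_logdir="deeplab/train_log" \
--dataset_dir="deeplab/datasets/cityscapes/tfrecord"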
I have set fine_tune_batch_norm=False and am still getting the same issue. My GPU memory doesn't allow a batch size of 12 or larger, so I can't set it to True.
Setting the batch size to 2 and decreasing the number of steps worked for me.