Models: [DeepLab] Loss is inf or nan using Pascal VOC 2012 Dataset

Created on 30 Jan 2020 · 5 comments · Source: tensorflow/models

I am trying to train on the Pascal VOC dataset from the deeplabv3_pascal_train_aug checkpoint, but training always fails with this issue. I already followed all the steps from (http://www.programmersought(dot)com/article/4188126074/;jsessionid=D5E2DD60EB3053E982BEA4EF5FB2ADEA) and (https://www.analyticsvidhya(dot)com/blog/2019/02/tutorial-semantic-segmentation-google-deeplab/).


System information

  • What is the top-level directory of the model you are using: Deeplab
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
  • TensorFlow version (use command below): tensorflow-gpu 1.15
  • CUDA/cuDNN version: CUDA 10
  • GPU model and memory: GTX 1060 6GB
  • Exact command to reproduce:

This is the command I use to run training:
python train.py --logtostderr --training_number_of_steps=90000 --train_split="train" --model_variant="xception_65" --atrous_rates=6 --atrous_rates=12 --atrous_rates=18 --output_stride=16 --decoder_output_stride=4 --train_crop_size=513,513 --train_batch_size=1 --dataset="pascal_voc_seg" --dataset_dir="D:\FUNFMN\ta3\models\research\deeplab\datasets\Database\tf_record" --tf_initial_checkpoint="D:\FUNFMN\ta3\models\research\deeplab\checkpoint\deeplabv3_pascal_train_aug\model.ckpt" --train_logdir="D:\FUNFMN\ta3\models\research\deeplab\datasets\Database\train_logdir" --initialize_last_layer=False --last_layers_contain_logits_only=True --fine_tune_batch_norm=False

and this is my error:

Describe the problem

I0131 01:13:28.899782 24364 supervisor.py:1099] global_step/sec: 0
INFO:tensorflow:global step 10: loss = 0.6717 (0.323 sec/step)
I0131 01:13:33.996986 16004 learning.py:507] global step 10: loss = 0.6717 (0.323 sec/step)
INFO:tensorflow:global step 20: loss = 1.5686 (0.325 sec/step)
I0131 01:13:37.344420 16004 learning.py:507] global step 20: loss = 1.5686 (0.325 sec/step)
INFO:tensorflow:global step 30: loss = 0.1612 (0.312 sec/step)
I0131 01:13:40.560513 16004 learning.py:507] global step 30: loss = 0.1612 (0.312 sec/step)
INFO:tensorflow:global step 40: loss = 0.3794 (0.384 sec/step)
I0131 01:13:44.067640 16004 learning.py:507] global step 40: loss = 0.3794 (0.384 sec/step)
INFO:tensorflow:global step 50: loss = 0.1586 (0.329 sec/step)
I0131 01:13:47.300520 16004 learning.py:507] global step 50: loss = 0.1586 (0.329 sec/step)
INFO:tensorflow:global step 60: loss = 0.2120 (0.313 sec/step)
I0131 01:13:50.652518 16004 learning.py:507] global step 60: loss = 0.2120 (0.313 sec/step)
INFO:tensorflow:global step 70: loss = 0.1773 (0.328 sec/step)
I0131 01:13:53.870276 16004 learning.py:507] global step 70: loss = 0.1773 (0.328 sec/step)
INFO:tensorflow:global step 80: loss = 0.2059 (0.313 sec/step)
I0131 01:13:57.196119 16004 learning.py:507] global step 80: loss = 0.2059 (0.313 sec/step)
INFO:tensorflow:global step 90: loss = 0.4544 (0.311 sec/step)
I0131 01:14:00.384758 16004 learning.py:507] global step 90: loss = 0.4544 (0.311 sec/step)
INFO:tensorflow:global step 100: loss = 0.2439 (0.320 sec/step)
I0131 01:14:03.659304 16004 learning.py:507] global step 100: loss = 0.2439 (0.320 sec/step)
INFO:tensorflow:global step 110: loss = 2.0604 (0.328 sec/step)
I0131 01:14:06.906721 16004 learning.py:507] global step 110: loss = 2.0604 (0.328 sec/step)
INFO:tensorflow:global step 120: loss = 3.6568 (0.324 sec/step)
I0131 01:14:10.141874 16004 learning.py:507] global step 120: loss = 3.6568 (0.324 sec/step)
INFO:tensorflow:global step 130: loss = 0.1893 (0.328 sec/step)
I0131 01:14:13.533044 16004 learning.py:507] global step 130: loss = 0.1893 (0.328 sec/step)
INFO:tensorflow:global step 140: loss = 0.8074 (0.321 sec/step)
I0131 01:14:16.729441 16004 learning.py:507] global step 140: loss = 0.8074 (0.321 sec/step)
INFO:tensorflow:global step 150: loss = 0.1688 (0.311 sec/step)
I0131 01:14:19.898051 16004 learning.py:507] global step 150: loss = 0.1688 (0.311 sec/step)
INFO:tensorflow:global step 160: loss = 0.4113 (0.315 sec/step)
I0131 01:14:23.207805 16004 learning.py:507] global step 160: loss = 0.4113 (0.315 sec/step)
INFO:tensorflow:global step 170: loss = 0.4488 (0.316 sec/step)
I0131 01:14:26.409547 16004 learning.py:507] global step 170: loss = 0.4488 (0.316 sec/step)
INFO:tensorflow:global step 180: loss = 1.9576 (0.320 sec/step)
I0131 01:14:29.666006 16004 learning.py:507] global step 180: loss = 1.9576 (0.320 sec/step)
INFO:tensorflow:global step 190: loss = 0.1410 (0.329 sec/step)
I0131 01:14:32.875228 16004 learning.py:507] global step 190: loss = 0.1410 (0.329 sec/step)
INFO:tensorflow:global step 200: loss = 0.1631 (0.321 sec/step)
I0131 01:14:36.106254 16004 learning.py:507] global step 200: loss = 0.1631 (0.321 sec/step)
INFO:tensorflow:Error reported to Coordinator: , Loss is inf or nan. : Tensor had NaN values
[[node CheckNumerics (defined at D:\Anaconda3\envs\dlv3p2\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]]

Original stack trace for 'CheckNumerics':
File "train.py", line 464, in
tf.app.run()
File "D:\Anaconda3\envs\dlv3p2\lib\site-packages\tensorflow_core\python\platform\app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "D:\Anaconda3\envs\dlv3p2\lib\site-packages\absl\app.py", line 299, in run
_run_main(main, args)
File "D:\Anaconda3\envs\dlv3p2\lib\site-packages\absl\app.py", line 250, in _run_main
sys.exit(main(argv))
File "train.py", line 398, in main
total_loss = tf.check_numerics(total_loss, 'Loss is inf or nan.')
File "D:\Anaconda3\envs\dlv3p2\lib\site-packages\tensorflow_core\python\ops\gen_array_ops.py", line 1011, in check_numerics
"CheckNumerics", tensor=tensor, message=message, name=name)
File "D:\Anaconda3\envs\dlv3p2\lib\site-packages\tensorflow_core\python\framework\op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "D:\Anaconda3\envs\dlv3p2\lib\site-packages\tensorflow_core\python\util\deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "D:\Anaconda3\envs\dlv3p2\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "D:\Anaconda3\envs\dlv3p2\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "D:\Anaconda3\envs\dlv3p2\lib\site-packages\tensorflow_core\python\framework\ops.py", line 1748, in __init__
self._traceback = tf_stack.extract_stack()

I0131 01:14:38.084058 16004 coordinator.py:224] Error reported to Coordinator: , Loss is inf or nan. : Tensor had NaN values
[[node CheckNumerics (defined at D:\Anaconda3\envs\dlv3p2\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]]
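
For context on where this is raised: deeplab/train.py wraps the total loss in tf.check_numerics (train.py line 398 in the trace above), so the run aborts as soon as the loss tensor contains NaN or inf. A minimal sketch of that mechanism, assuming TF 1.x graph mode; the tensor here is just a stand-in for the real loss:

import numpy as np
import tensorflow as tf  # TF 1.x graph mode assumed, matching the report above

# Stand-in for the total training loss; it deliberately contains a NaN.
loss = tf.constant([1.0, np.nan])
checked = tf.check_numerics(loss, 'Loss is inf or nan.')

with tf.Session() as sess:
    try:
        sess.run(checked)
    except tf.errors.InvalidArgumentError as err:
        # Prints the same kind of message as in the log: '... : Tensor had NaN values'
        print(err.message)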

research support

Most helpful comment

I was able to resolve the issue. The problem was with my training dataset.

All 5 comments

I am training DeepLab on a custom dataset and it is causing the same issue.

INFO:tensorflow:Saving checkpoints for 0 into E:\mypython\Lib\site-packages\tensorflow\models\research\deeplab\datasets\train\model.ckpt.
Total loss is :[0.140642926]
Total loss is :[0.140642881]
INFO:tensorflow:global_step/sec: 1.11037
Total loss is :[0.140642926]
INFO:tensorflow:global_step/sec: 2.71813
Total loss is :[0.140643507]
INFO:tensorflow:global_step/sec: 2.70563
Total loss is :[0.140675873]
INFO:tensorflow:global_step/sec: 2.69324
Total loss is :[nan]
INFO:tensorflow:global_step/sec: 2.70379
Total loss is :[nan]
INFO:tensorflow:global_step/sec: 2.70343
Total loss is :[nan]
INFO:tensorflow:global_step/sec: 2.71518
Total loss is :[nan]
INFO:tensorflow:global_step/sec: 2.72926
Total loss is :[nan]

Traceback (most recent call last):
File "E:\mypython\lib\site-packages\tensorflow\python\client\session.py", line 1334, in _do_call
return fn(*args)
File "E:\mypython\lib\site-packages\tensorflow\python\client\session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "E:\mypython\lib\site-packages\tensorflow\python\client\session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: xception_65/middle_flow/block1/unit_8/xception_module/separable_conv1_depthwise/BatchNorm/beta_1
[[{{node xception_65/middle_flow/block1/unit_8/xception_module/separable_conv1_depthwise/BatchNorm/beta_1}}]]
[[{{node Mean_382}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "deeplab/train.py", line 513, in
tf.app.run()
File "E:\mypython\lib\site-packages\tensorflow\python\platform\app.py", line 125, in run
_sys.exit(main(argv))
File "deeplab/train.py", line 507, in main
sess.run([train_tensor])
File "E:\mypython\lib\site-packages\tensorflow\python\training\monitored_session.py", line 676, in run
run_metadata=run_metadata)
File "E:\mypython\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1171, in run
run_metadata=run_metadata)
File "E:\mypython\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1270, in run
raise six.reraise(original_exc_info)
File "E:\mypython\lib\site-packages\six.py", line 693, in reraise
raise value
File "E:\mypython\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1255, in run
return self._sess.run(
*args, **kwargs)
File "E:\mypython\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1327, in run
run_metadata=run_metadata)
File "E:\mypython\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1091, in run
return self._sess.run(
*args, **kwargs)
File "E:\mypython\lib\site-packages\tensorflow\python\client\session.py", line 929, in run
run_metadata_ptr)
File "E:\mypython\lib\site-packages\tensorflow\python\client\session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "E:\mypython\lib\site-packages\tensorflow\python\client\session.py", line 1328, in _do_run
run_metadata)
File "E:\mypython\lib\site-packages\tensorflow\python\client\session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: xception_65/middle_flow/block1/unit_8/xception_module/separable_conv1_depthwise/BatchNorm/beta_1
[[node xception_65/middle_flow/block1/unit_8/xception_module/separable_conv1_depthwise/BatchNorm/beta_1 (defined at deeplab/train.py:322) ]]
[[node Mean_382 (defined at deeplab/train.py:302) ]]

Caused by op 'xception_65/middle_flow/block1/unit_8/xception_module/separable_conv1_depthwise/BatchNorm/beta_1', defined at:
File "deeplab/train.py", line 513, in
tf.app.run()
File "E:\mypython\lib\site-packages\tensorflow\python\platform\app.py", line 125, in run
_sys.exit(main(argv))
File "deeplab/train.py", line 464, in main
dataset.ignore_label)
File "deeplab/train.py", line 373, in _train_deeplab_model
reuse_variable=(i != 0))
File "deeplab/train.py", line 269, in _tower_loss
_build_deeplab(iterator, {common.OUTPUT_TYPE: num_of_classes}, ignore_label)
File "deeplab/train.py", line 251, in _build_deeplab
output_type_dict[model.MERGED_LOGITS_SCOPE])
File "deeplab/train.py", line 322, in _log_summaries
tf.summary.histogram(model_var.op.name, model_var)
File "E:\mypython\lib\site-packages\tensorflow\python\summary\summary.py", line 177, in histogram
tag=tag, values=values, name=scope)
File "E:\mypython\lib\site-packages\tensorflow\python\ops\gen_logging_ops.py", line 339, in histogram_summary
"HistogramSummary", tag=tag, values=values, name=name)
File "E:\mypython\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "E:\mypython\lib\site-packages\tensorflow\python\util\deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "E:\mypython\lib\site-packages\tensorflow\python\framework\ops.py", line 3300, in create_op
op_def=op_def)
File "E:\mypython\lib\site-packages\tensorflow\python\framework\ops.py", line 1801, in __init__
self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Nan in summary histogram for: xception_65/middle_flow/block1/unit_8/xception_module/separable_conv1_depthwise/BatchNorm/beta_1
[[node xception_65/middle_flow/block1/unit_8/xception_module/separable_conv1_depthwise/BatchNorm/beta_1 (defined at deeplab/train.py:322) ]]
[[node Mean_382 (defined at deeplab/train.py:302) ]]

@RajatGarg45 I think it's a different issue; it looks like a problem with your GPU memory. Try setting fine_tune_batch_norm to False in train.py.

By the way, I found the cause of my issue. I don't know the details, but it comes from my tfrecords after running pascal_voc2012_data.py to convert the data. When I use the tfrecords produced by sh download_and_convert_voc2012.sh, it works normally.
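
For anyone debugging the same thing: one way to confirm that the converted tfrecords are the problem is to scan the encoded segmentation labels for out-of-range values, which is a common cause of a NaN loss. This is only a minimal sketch, assuming the feature key written by the DeepLab build scripts ('image/segmentation/class/encoded', PNG-encoded labels), 21 Pascal VOC classes, ignore_label 255, and a placeholder shard name:

import io
import numpy as np
import tensorflow as tf  # TF 1.x
from PIL import Image

NUM_CLASSES = 21    # Pascal VOC: background + 20 foreground classes
IGNORE_LABEL = 255  # label value ignored by the loss

def check_shard(tfrecord_path):
    bad = 0
    for i, record in enumerate(tf.python_io.tf_record_iterator(tfrecord_path)):
        example = tf.train.Example.FromString(record)
        encoded = example.features.feature[
            'image/segmentation/class/encoded'].bytes_list.value[0]
        label = np.array(Image.open(io.BytesIO(encoded)))
        values = np.unique(label)
        invalid = values[(values >= NUM_CLASSES) & (values != IGNORE_LABEL)]
        if invalid.size:
            bad += 1
            print('example %d has unexpected label values: %s' % (i, invalid))
    print('%s: %d suspicious examples' % (tfrecord_path, bad))

check_shard('train-00000-of-00004.tfrecord')  # placeholder shard name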

@fmanadeprasetyo I have already set it to False, and I am preparing my tfrecord files with the shell script only.

I was able to resolve the issue. The problem was with my training dataset.
