I trained on Cityscapes with fine_tune_batch_norm = false. The model is initialized from deeplabv3_cityscapes_train_2018_02_06, and training failed with a "loss is inf or nan" error.
INFO:tensorflow:global step 570: loss = 2.4862 (0.517 sec/step)
INFO:tensorflow:global step 580: loss = 1.9100 (0.539 sec/step)
INFO:tensorflow:global step 590: loss = 1.9793 (0.932 sec/step)
INFO:tensorflow:global step 600: loss = 3.4337 (0.525 sec/step)
INFO:tensorflow:global step 610: loss = 84.6659 (0.515 sec/step)
INFO:tensorflow:global step 620: loss = 20.1596 (0.948 sec/step)
INFO:tensorflow:global step 630: loss = 2.8936 (0.525 sec/step)
INFO:tensorflow:global step 640: loss = 1.9785 (0.529 sec/step)
INFO:tensorflow:global step 650: loss = 1.9451 (0.909 sec/step)
INFO:tensorflow:global step 660: loss = 3.2844 (0.532 sec/step)
INFO:tensorflow:global step 670: loss = 1.9610 (0.524 sec/step)
INFO:tensorflow:global step 680: loss = 3.4317 (0.991 sec/step)
INFO:tensorflow:global step 690: loss = 3.5062 (0.544 sec/step)
INFO:tensorflow:global step 700: loss = 2.9824 (0.633 sec/step)
INFO:tensorflow:global step 710: loss = 3.1381 (0.963 sec/step)
INFO:tensorflow:global step 720: loss = 12.0563 (0.521 sec/step)
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Loss is inf or nan. : Tensor had NaN values
[[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="Loss is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](total_loss)]]
Caused by op u'CheckNumerics', defined at:
File "train.py", line 347, in <module>
tf.app.run()
File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "train.py", line 291, in main
total_loss = tf.check_numerics(total_loss, 'Loss is inf or nan.')
File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 498, in check_numerics
"CheckNumerics", tensor=tensor, message=message, name=name)
File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3271, in create_op
op_def=op_def)
File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1650, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): Loss is inf or nan. : Tensor had NaN values
[[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="Loss is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](total_loss)]]
Traceback (most recent call last):
File "train.py", line 347, in <module>
tf.app.run()
File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "train.py", line 340, in main
save_interval_secs=FLAGS.save_interval_secs)
File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 768, in train
sess, train_op, global_step, train_step_kwargs)
File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
run_metadata=run_metadata)
File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 905, in run
run_metadata_ptr)
File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1137, in _run
feed_dict_tensor, options, run_metadata)
File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1355, in _do_run
options, run_metadata)
File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1374, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Loss is inf or nan. : Tensor had NaN values
[[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="Loss is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](total_loss)]]
Caused by op u'CheckNumerics', defined at:
File "train.py", line 347, in <module>
tf.app.run()
File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "train.py", line 291, in main
total_loss = tf.check_numerics(total_loss, 'Loss is inf or nan.')
File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 498, in check_numerics
"CheckNumerics", tensor=tensor, message=message, name=name)
File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3271, in create_op
op_def=op_def)
File "/home/master-grade3-1/.conda/envs/zhangj/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1650, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): Loss is inf or nan. : Tensor had NaN values
Probably because of limited GPU memory. See https://github.com/tensorflow/models/issues/3716 .
I've retrained with batch size = 1 on one Titan X GPU and got the same error at global step 610.
Here's my train log.
(tf1.6-py27) ➜ deeplab git:(master) ✗ bash run_cityscapes.sh
INFO:tensorflow:Training on train set
WARNING:tensorflow:From /home/kenzhang/miniconda2/envs/tf1.6-py27/lib/python2.7/site-packages/tensorflow/python/ops/losses/losses_impl.py:731: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.
See tf.nn.softmax_cross_entropy_with_logits_v2.
INFO:tensorflow:Summary name /clone_loss is illegal; using clone_loss instead.
INFO:tensorflow:Initializing model from path: /media/kenzhang/DATA/tensorflow-models/research/deeplab/datasets/cityscapes/init_models/deeplabv3_cityscapes_train/model.ckpt
WARNING:tensorflow:From /home/kenzhang/miniconda2/envs/tf1.6-py27/lib/python2.7/site-packages/tensorflow/contrib/slim/python/slim/learning.py:736: __init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2018-03-24 22:10:29.562639: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-03-24 22:10:29.563002: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 0 with properties:
name: GeForce GTX TITAN X major: 5 minor: 2 memoryClockRate(GHz): 1.076
pciBusID: 0000:01:00.0
totalMemory: 11.92GiB freeMemory: 10.72GiB
2018-03-24 22:10:29.563017: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1312] Adding visible gpu devices: 0
2018-03-24 22:10:29.749607: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10379 MB memory) -> physical GPU (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:01:00.0, compute capability: 5.2)
INFO:tensorflow:Restoring parameters from /media/kenzhang/DATA/tensorflow-models/research/deeplab/datasets/cityscapes/init_models/deeplabv3_cityscapes_train/model.ckpt
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path /media/kenzhang/DATA/tensorflow-models/research/deeplab/datasets/cityscapes/train/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 0.
INFO:tensorflow:global step 10: loss = 3.4971 (0.639 sec/step)
INFO:tensorflow:global step 20: loss = 1.3233 (0.620 sec/step)
INFO:tensorflow:global step 30: loss = 0.8094 (0.664 sec/step)
INFO:tensorflow:global step 40: loss = 1.1604 (0.601 sec/step)
INFO:tensorflow:global step 50: loss = 0.8207 (0.584 sec/step)
INFO:tensorflow:global step 60: loss = 1.2645 (0.593 sec/step)
INFO:tensorflow:global step 70: loss = 0.8217 (0.591 sec/step)
INFO:tensorflow:global step 80: loss = 0.6755 (0.584 sec/step)
INFO:tensorflow:global step 90: loss = 0.8059 (0.591 sec/step)
INFO:tensorflow:global step 100: loss = 0.9199 (0.599 sec/step)
INFO:tensorflow:global step 110: loss = 0.7578 (0.591 sec/step)
INFO:tensorflow:global step 120: loss = 0.9295 (0.608 sec/step)
INFO:tensorflow:global step 130: loss = 0.6265 (0.605 sec/step)
INFO:tensorflow:global step 140: loss = 0.3992 (0.599 sec/step)
INFO:tensorflow:global step 150: loss = 0.7431 (0.603 sec/step)
INFO:tensorflow:global step 160: loss = 0.6495 (0.596 sec/step)
INFO:tensorflow:global step 170: loss = 0.4857 (0.596 sec/step)
INFO:tensorflow:global step 180: loss = 4.6753 (0.640 sec/step)
INFO:tensorflow:global step 190: loss = 0.4989 (0.655 sec/step)
INFO:tensorflow:global step 200: loss = 1.0801 (0.647 sec/step)
INFO:tensorflow:global step 210: loss = 2.3884 (0.624 sec/step)
INFO:tensorflow:global step 220: loss = 0.6902 (0.640 sec/step)
INFO:tensorflow:global step 230: loss = 0.9207 (0.633 sec/step)
INFO:tensorflow:global step 240: loss = 0.5820 (0.646 sec/step)
INFO:tensorflow:global step 250: loss = 0.3848 (0.676 sec/step)
INFO:tensorflow:global step 260: loss = 0.4573 (0.637 sec/step)
INFO:tensorflow:global step 270: loss = 1.4292 (0.649 sec/step)
INFO:tensorflow:global step 280: loss = 0.7442 (0.631 sec/step)
INFO:tensorflow:global step 290: loss = 0.6673 (0.636 sec/step)
INFO:tensorflow:global step 300: loss = 0.6911 (0.616 sec/step)
INFO:tensorflow:global step 310: loss = 0.7100 (0.629 sec/step)
INFO:tensorflow:global step 320: loss = 0.6075 (0.623 sec/step)
INFO:tensorflow:global step 330: loss = 0.6944 (0.621 sec/step)
INFO:tensorflow:global step 340: loss = 0.6090 (0.620 sec/step)
INFO:tensorflow:global step 350: loss = 1.2337 (0.629 sec/step)
INFO:tensorflow:global step 360: loss = 0.8289 (0.625 sec/step)
INFO:tensorflow:global step 370: loss = 9.8227 (0.642 sec/step)
INFO:tensorflow:global step 380: loss = 10.2865 (0.652 sec/step)
INFO:tensorflow:global step 390: loss = 1.4786 (0.635 sec/step)
INFO:tensorflow:global step 400: loss = 0.7148 (0.632 sec/step)
INFO:tensorflow:global step 410: loss = 1.4818 (0.629 sec/step)
INFO:tensorflow:global step 420: loss = 0.7312 (0.623 sec/step)
INFO:tensorflow:global step 430: loss = 3.1183 (0.681 sec/step)
INFO:tensorflow:global step 440: loss = 0.4331 (0.675 sec/step)
INFO:tensorflow:global step 450: loss = 2.6423 (0.611 sec/step)
INFO:tensorflow:global step 460: loss = 1.2076 (0.655 sec/step)
INFO:tensorflow:global step 470: loss = 0.6822 (0.627 sec/step)
INFO:tensorflow:global step 480: loss = 0.8022 (0.613 sec/step)
INFO:tensorflow:global step 490: loss = 1.7316 (0.634 sec/step)
INFO:tensorflow:global step 500: loss = 0.3342 (0.618 sec/step)
INFO:tensorflow:global step 510: loss = 0.9020 (0.625 sec/step)
INFO:tensorflow:global step 520: loss = 0.5030 (0.636 sec/step)
INFO:tensorflow:global step 530: loss = 0.7095 (0.612 sec/step)
INFO:tensorflow:global step 540: loss = 1.2703 (0.626 sec/step)
INFO:tensorflow:global step 550: loss = 3.7665 (0.667 sec/step)
INFO:tensorflow:global step 560: loss = 48.5454 (0.661 sec/step)
INFO:tensorflow:global step 570: loss = 40.2562 (0.606 sec/step)
INFO:tensorflow:global step 580: loss = 0.8814 (0.677 sec/step)
INFO:tensorflow:global step 590: loss = 0.6106 (0.616 sec/step)
INFO:tensorflow:global step 600: loss = 1.8423 (0.676 sec/step)
INFO:tensorflow:global step 610: loss = 6.9696 (0.610 sec/step)
INFO:tensorflow:global step 620: loss = 10501.7734 (0.617 sec/step)
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Loss is inf or nan. : Tensor had NaN values
[[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="Loss is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](total_loss)]]
Caused by op u'CheckNumerics', defined at:
File "train.py", line 347, in <module>
tf.app.run()
File "/home/kenzhang/miniconda2/envs/tf1.6-py27/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "train.py", line 291, in main
total_loss = tf.check_numerics(total_loss, 'Loss is inf or nan.')
File "/home/kenzhang/miniconda2/envs/tf1.6-py27/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 498, in check_numerics
"CheckNumerics", tensor=tensor, message=message, name=name)
File "/home/kenzhang/miniconda2/envs/tf1.6-py27/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/kenzhang/miniconda2/envs/tf1.6-py27/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3271, in create_op
op_def=op_def)
File "/home/kenzhang/miniconda2/envs/tf1.6-py27/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1650, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): Loss is inf or nan. : Tensor had NaN values
[[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="Loss is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](total_loss)]]
Traceback (most recent call last):
File "train.py", line 347, in <module>
tf.app.run()
File "/home/kenzhang/miniconda2/envs/tf1.6-py27/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "train.py", line 340, in main
save_interval_secs=FLAGS.save_interval_secs)
File "/home/kenzhang/miniconda2/envs/tf1.6-py27/lib/python2.7/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 768, in train
sess, train_op, global_step, train_step_kwargs)
File "/home/kenzhang/miniconda2/envs/tf1.6-py27/lib/python2.7/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
run_metadata=run_metadata)
File "/home/kenzhang/miniconda2/envs/tf1.6-py27/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 905, in run
run_metadata_ptr)
File "/home/kenzhang/miniconda2/envs/tf1.6-py27/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1137, in _run
feed_dict_tensor, options, run_metadata)
File "/home/kenzhang/miniconda2/envs/tf1.6-py27/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1355, in _do_run
options, run_metadata)
File "/home/kenzhang/miniconda2/envs/tf1.6-py27/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1374, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Loss is inf or nan. : Tensor had NaN values
[[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="Loss is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](total_loss)]]
Caused by op u'CheckNumerics', defined at:
File "train.py", line 347, in <module>
tf.app.run()
File "/home/kenzhang/miniconda2/envs/tf1.6-py27/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "train.py", line 291, in main
total_loss = tf.check_numerics(total_loss, 'Loss is inf or nan.')
File "/home/kenzhang/miniconda2/envs/tf1.6-py27/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 498, in check_numerics
"CheckNumerics", tensor=tensor, message=message, name=name)
File "/home/kenzhang/miniconda2/envs/tf1.6-py27/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/kenzhang/miniconda2/envs/tf1.6-py27/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3271, in create_op
op_def=op_def)
File "/home/kenzhang/miniconda2/envs/tf1.6-py27/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1650, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): Loss is inf or nan. : Tensor had NaN values
[[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="Loss is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](total_loss)]]
@ksnzh Have you set --fine_tune_batch_norm=False?
@walkerlala Yes, I have done this.
Train script
python train.py \
--logtostderr \
--training_number_of_steps=90000 \
--fine_tune_batch_norm=False \
--train_split="train" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--train_crop_size=513 \
--train_crop_size=513 \
--train_batch_size=1 \
--dataset="cityscapes" \
--tf_initial_checkpoint="/media/kenzhang/DATA/tensorflow-models/research/deeplab/datasets/cityscapes/init_models/deeplabv3_cityscapes_train/model.ckpt" \
--train_logdir="/media/kenzhang/DATA/tensorflow-models/research/deeplab/datasets/cityscapes/train" \
--dataset_dir="/media/kenzhang/DATA/tensorflow-models/research/deeplab/datasets/cityscapes/tfrecord"
Maybe modifying 'False' to 'false' and decreasing training_number_of_steps helps.
Hello, I got the same problem.
Have you solved it?
Got the same error here. I am trying to train on Cityscapes using one GTX 1080 (8 GB).
I have already set fine_tune_batch_norm=False and train_batch_size=1.
System information
Linux Ubuntu 16.04
TensorFlow version 1.6
CUDA/cuDNN version: 9 / 7.0.4
GPU model and memory: 2 NVIDIA Volta GPUs, 12 GB (model variant: xception_65)
Exact command to reproduce:
python train.py \
--logtostderr \
--training_number_of_steps=30000 \
--fine_tune_batch_norm=False \
--train_split="train" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--train_crop_size=513 \
--train_crop_size=513 \
--train_batch_size=1 \
--dataset="pascal_voc_seg" \
--tf_initial_checkpoint="init_checkpoints/deeplabv3_pascal_trainval/model.ckpt" \
--train_logdir="trained_checkpoint" \
--dataset_dir="datasets/pascal_voc_seg/tfrecord"
Problem:
Same as @ksnzh: the loss goes NaN. Great appreciation for this wonderful work; I hope this problem can be solved soon.
@ksnzh Hello, I got the same problem. I also changed fine_tune_batch_norm from true to False, and then I got the error.
Have you solved the problem? Could you give me some advice? Thank you.
Hi, I had the same problem; it was related to the number of output classes.
You have to be sure how to initialize the flags '--initialize_last_layer' and '--last_layers_contain_logits_only'. Here is an explanation given by @aquariusjay:
'''
When you want to fine-tune DeepLab on other datasets, there are a few cases:
You want to re-use ALL the trained weights: set initialize_last_layer = True (last_layers_contain_logits_only does not matter in this case).
You want to re-use ONLY the network backbone (i.e., exclude ASPP, decoder and so on): set initialize_last_layer = False and last_layers_contain_logits_only = False.
You want to re-use ALL the trained weights EXCEPT the logits (since the num_classes may be different): set initialize_last_layer = False and last_layers_contain_logits_only = True.
'''
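For example, for the third case (fine-tuning on a dataset whose num_classes differs from the checkpoint), the two flags would be added to the train command roughly like this; the dataset name and paths below are placeholders, and the other flags (atrous rates, output stride, crop size, batch size, etc.) stay as in the train script above:
'''
python train.py \
--logtostderr \
--initialize_last_layer=False \
--last_layers_contain_logits_only=True \
--model_variant="xception_65" \
--dataset="my_dataset" \
--tf_initial_checkpoint="/path/to/model.ckpt" \
--train_logdir="/path/to/train_logdir" \
--dataset_dir="/path/to/tfrecord"
'''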
You also have to be sure that in segmentation_dataset.py you have the right number of classes, e.g.:
'''
_CITYSCAPES_INFORMATION = DatasetDescriptor(
splits_to_sizes={
'train': 2975,
'val': 500,
},
num_classes=19,
ignore_label=255,
)
'''
And you also have to remember that the label images must have just one channel, with pixel values from 0 to num_classes - 1 (plus the ignore_label value); a quick way to check this is sketched below.
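For example, a minimal check could look like this (the file path and num_classes below are placeholders for your own data):
'''
# Verify that a ground-truth label image is single-channel and holds class indices.
import numpy as np
from PIL import Image

num_classes = 19     # e.g. Cityscapes; replace with your own value
ignore_label = 255

label = np.array(Image.open('path/to/label.png'))  # placeholder path
print(label.shape)        # expect (height, width): one channel only
print(np.unique(label))   # expect values in 0 .. num_classes-1, plus ignore_label
assert label.ndim == 2, 'label image must be single-channel'
assert set(np.unique(label)) <= set(range(num_classes)) | {ignore_label}
'''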
Last, fine_tune_batch_norm=false is the setting to use if you don't have a GPU server (note: false, not False).
@gsaibro Hi, thanks for your advice. I followed all of these instructions but still run into this problem when I set the weights to more than 1, because I have an unbalanced dataset with only two classes and most of the region is background. As @aquariusjay suggested:
In your case, the data samples may be strongly biased to one of the classes. That is why the model only predicts one class in the end. To handle that, I would suggest using larger loss_weight for the under-sampled class (i.e., that class that has fewer data samples). You could modify the weights in line 72 by doing something like
weights = tf.to_float(tf.equal(scaled_labels, 0)) * label0_weight + tf.to_float(tf.equal(scaled_labels, 1)) * label1_weight + tf.to_float(tf.equal(scaled_labels, ignore_label)) * 0.0
where you need to tune the label0_weight and label1_weight (e.g., set label0_weight=1 and increase label1_weight).
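Written out with concrete numbers, that line looks like this (the weight values here are just placeholders that have to be tuned):
'''
# Sketch of the suggested per-class weighting (TF 1.x); weight values are placeholders.
import tensorflow as tf

ignore_label = 255
label0_weight = 1.0    # weight for the over-represented class (e.g. background)
label1_weight = 10.0   # weight for the under-sampled class; tune this value

# Dummy stand-in for the scaled_labels tensor used in the loss.
scaled_labels = tf.constant([[0, 0, 1, ignore_label]], dtype=tf.int32)

weights = tf.to_float(tf.equal(scaled_labels, 0)) * label0_weight + \
          tf.to_float(tf.equal(scaled_labels, 1)) * label1_weight + \
          tf.to_float(tf.equal(scaled_labels, ignore_label)) * 0.0

with tf.Session() as sess:
    print(sess.run(weights))   # [[ 1.  1. 10.  0.]]
'''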
But even if I replace 1 with 1.5 for the two categories, this error still comes up. I don't know why; I hope you can give me some suggestions.
@zheningy1
I didn't try to change the weights yet, so I can't help you with that. Are you sure that your labels are images with pixel values like [0,0,0],[1,1,1],[2,2,2]...?
I would suggest you try with a balanced dataset and standard weights, just to be sure that the error is not somewhere else. I trained it on my own dataset with just 300 images and the intersection over union was around 40%.
Another possibility is that your output images are not being converted to RGB images; then you would have an image with pixel values like [0,0,0],[1,1,1],[2,2,2]... Try to open your image with pyplot to check whether that is the case.
@gsaibro
Thanks, I found the reason: the one-channel label image is fine. The problem is that I had set the number of classes in segmentation_dataset.py to 2 because I thought the background was ignored. When I set the number to 3, training runs fine with the weighted losses.
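For reference, the corrected entry in segmentation_dataset.py looks roughly like this (the dataset name and split sizes are placeholders; num_classes counts the background plus the two foreground classes):
'''
_MY_DATASET_INFORMATION = DatasetDescriptor(
    splits_to_sizes={
        'train': 1000,   # placeholder: number of training images
        'val': 200,      # placeholder: number of validation images
    },
    num_classes=3,       # background + 2 foreground classes
    ignore_label=255,    # pixels with this value are excluded from the loss
)
'''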
Closing as this is resolved
@akhalili “--tf_initial_checkpoint” should be model.ckpt, not model.ckpt.index
@gsaibro Hi, thanks for your advice. I have a question: why set fine_tune_batch_norm=false rather than False?