Models: LossTensor is inf or nan while training ssd_mobilenet_v1_coco model in my own dataset

Created on 22 Mar 2018 · 11 Comments · Source: tensorflow/models

I am having issues similar to #1881 and #1907.

Using the Google object_detection API and the latest TensorFlow master repo built with CUDA 9.1 on Linux Mint 18.2 (based on Ubuntu Xenial).

Have I written custom code: No, but custom dataset
OS Platform and Distribution: Linux Mint 18.2 (based on Ubuntu 16.04)
TensorFlow installed from: Tensorflow built and installed from github master
TensorFlow version: 1.8.0-rc0-cp35-cp35m-linux_x86_64
Bazel version: 0.12.0
CUDA/cuDNN version: CUDA 9.1, cuDNN 7.1
GPU model and memory: GTX1080Ti 11GB
Exact command to reproduce: cd tensorflow/models/research && python3 object_detection/train.py --logtostderr --pipeline_config_path=/path/to/pipeline_config.pbtxt --train_dir=/path/to/train/folder

Describe the problem

I am trying to fine-tune the ssd_mobilenet_v1_coco model using my own dataset. I am using the default config file that is provided in the object detection repository.

From the very first global step of training I receive the "LossTensor is inf or nan. : Tensor had NaN values" error. Things I have tried:

  • Ensuring that all bounding box coordinates are inside of the image boundary.
  • Removing all bounding boxes that are smaller than 20 pixels in either width or height.
  • Removing all bounding boxes that are smaller than 15% of the image width or height.
  • Increasing batch size
  • Decreasing learning rate
  • Removing data augmentation from the config file

From what I can tell, these are all of the things that were suggested in #1881 and #1907, but none of them have worked for me.
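For anyone running the same checks, here is a minimal sketch of the kind of bounding-box validation described above, assuming the training TFRecord was produced with the standard Object Detection API feature keys (image/object/bbox/xmin etc., normalized to [0, 1]); the record path is a placeholder:

```python
import math
import tensorflow as tf  # TF 1.x API, matching the versions in this thread


def scan_record(tfrecord_path):
    """Prints any example whose boxes are NaN, outside [0, 1], or degenerate."""
    bad = 0
    for i, raw in enumerate(tf.python_io.tf_record_iterator(tfrecord_path)):
        example = tf.train.Example.FromString(raw)
        feat = example.features.feature
        xmins = feat['image/object/bbox/xmin'].float_list.value
        xmaxs = feat['image/object/bbox/xmax'].float_list.value
        ymins = feat['image/object/bbox/ymin'].float_list.value
        ymaxs = feat['image/object/bbox/ymax'].float_list.value
        for box in zip(xmins, ymins, xmaxs, ymaxs):
            xmin, ymin, xmax, ymax = box
            if (any(math.isnan(v) or v < 0.0 or v > 1.0 for v in box)
                    or xmin >= xmax or ymin >= ymax):
                print('example %d has a bad box: %s' % (i, box))
                bad += 1
    print('%d bad boxes found' % bad)


scan_record('/path/to/train.record')  # placeholder path
```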

Source code / logs

INFO:tensorflow:Restoring parameters from /media/bidski/Portable/imagetagger/tf_objapi/models/ssd_mobilenet_v1_coco_2017_11_17/model.ckpt
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path /media/bidski/Portable/imagetagger/tf_objapi/models/ssd_mobilenet/train/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 0.
INFO:tensorflow:global step 1: loss = 31.4505 (12.442 sec/step)
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, LossTensor is inf or nan. : Tensor had NaN values
     [[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](AddN/_4851)]]

Caused by op 'CheckNumerics', defined at:
  File "object_detection/train.py", line 167, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "object_detection/train.py", line 163, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/home/bidski/Projects/models/research/object_detection/trainer.py", line 288, in train
    total_loss = tf.check_numerics(total_loss, 'LossTensor is inf or nan.')
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 734, in check_numerics
    "CheckNumerics", tensor=tensor, message=message, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3303, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1669, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): LossTensor is inf or nan. : Tensor had NaN values
     [[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](AddN/_4851)]]

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1313, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1421, in _call_tf_sessionrun
    status, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: LossTensor is inf or nan. : Tensor had NaN values
     [[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](AddN/_4851)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "object_detection/train.py", line 167, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "object_detection/train.py", line 163, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/home/bidski/Projects/models/research/object_detection/trainer.py", line 360, in train
    saver=saver)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 769, in train
    sess, train_op, global_step, train_step_kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 906, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1141, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1322, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1341, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: LossTensor is inf or nan. : Tensor had NaN values
     [[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](AddN/_4851)]]

Caused by op 'CheckNumerics', defined at:
  File "object_detection/train.py", line 167, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "object_detection/train.py", line 163, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/home/bidski/Projects/models/research/object_detection/trainer.py", line 288, in train
    total_loss = tf.check_numerics(total_loss, 'LossTensor is inf or nan.')
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 734, in check_numerics
    "CheckNumerics", tensor=tensor, message=message, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3303, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1669, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): LossTensor is inf or nan. : Tensor had NaN values
     [[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](AddN/_4851)]]

All 11 comments

Thank you for your post. We noticed you have not filled out the following fields in the issue template. Could you update them if they are relevant in your case, or leave them as N/A? Thanks.

  • What is the top-level directory of the model you are using
  • Have I written custom code
  • OS Platform and Distribution
  • TensorFlow installed from
  • TensorFlow version
  • Bazel version
  • CUDA/cuDNN version
  • GPU model and memory
  • Exact command to reproduce

Updated to make requested details more obvious

I've run into this issue too, although I was using faster_rcnn_resnet101_pets to train on my own dataset. I've heard from someone that this might be caused by very small pictures in the training set, such as 15x30 pixels. I will try removing these samples from my dataset and training again. If there is any update, I will post it here.

UPDATE: After some investigation, I found that the small samples are not the root cause of the crash. (I even tried training with very small samples, such as 5x5 pixels, and at least within the first 200 steps no crash happened.)
What I actually found was that some coordinates in the annotation file were in the wrong order. For instance, the annotations mark the coordinates x1, y1, x2 and y2, where x1 should be less than x2, and likewise for y1 and y2. However, in my case some of the annotated samples had x1 > x2 or y1 > y2, which caused the crash. After I corrected the order of the coordinates, the crash was gone. Hope this information can help someone.
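A sketch of that kind of check, assuming Pascal VOC-style XML annotations with a <bndbox> element holding xmin/ymin/xmax/ymax in pixel coordinates (the annotation directory is a placeholder):

```python
import glob
import xml.etree.ElementTree as ET


def find_swapped_boxes(annotation_dir):
    """Reports every annotation where x1 > x2 or y1 > y2."""
    for xml_path in sorted(glob.glob(annotation_dir + '/*.xml')):
        root = ET.parse(xml_path).getroot()
        for obj in root.iter('object'):
            box = obj.find('bndbox')
            xmin = float(box.find('xmin').text)
            ymin = float(box.find('ymin').text)
            xmax = float(box.find('xmax').text)
            ymax = float(box.find('ymax').text)
            if xmin > xmax or ymin > ymax:
                # Swapped coordinates like these are the kind of error described above.
                print('%s: swapped box (%s, %s, %s, %s)'
                      % (xml_path, xmin, ymin, xmax, ymax))


find_swapped_boxes('/path/to/annotations')  # placeholder directory
```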

I just re-checked my dataset. I have no entries where x1 > x2 or y1 > y2. However, this error still persists for me.

@Bidski, when did you encounter this crash during training? At the very beginning, or only after running for a while? I suspect that whoever hits this crash, it is somehow related to the dataset (maybe caused by the wrong order of the coordinates, or by invalid coordinate values). Maybe you can split the dataset into pieces and train on each piece to check which part of the data makes the crash happen again. That way you can at least narrow down the crash-related data and might find the real cause of your crash.
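A rough sketch of that bisection idea, splitting one training TFRecord into two halves that can each be trained on separately to narrow down which examples trigger the NaN (all paths are placeholders):

```python
import tensorflow as tf  # TF 1.x API


def split_record(src, dst_a, dst_b):
    """Writes the first and second half of src into two separate TFRecords."""
    records = list(tf.python_io.tf_record_iterator(src))
    half = len(records) // 2
    for dst, chunk in [(dst_a, records[:half]), (dst_b, records[half:])]:
        with tf.python_io.TFRecordWriter(dst) as writer:
            for raw in chunk:
                writer.write(raw)
        print('wrote %d examples to %s' % (len(chunk), dst))


split_record('/path/to/train.record',
             '/path/to/train_first_half.record',
             '/path/to/train_second_half.record')
```

Pointing the train_input_reader's input_path at each half in turn should show which portion reproduces the crash.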

I have the same problem, and I tried

  • ssd_mobilenet_v1_coco
  • ssd_mobilenet_v1_ppn
  • ssd_mobilenet_v2_coco
  • ssdlite_mobilenet_v2_coco

None of them work, but if I use faster_rcnn_inception_v2_coco, it works well.

I have a similar problem.
Maybe related to: #4881

Exact same problem here, trying @myuanz's set of networks with TensorFlow r1.8 (GPU support) compiled on Windows 10 x64.

I had this problem. It was solved when I checked the following (see the sketch after this list):

  1. (xmin, xmax) < width ; (ymin, ymax) < height
  2. (xmin < xmax) ; (ymin < ymax)
  3. if object_area <= (width * height) / (16 * 16): raise Exception('object too small Error')
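For reference, a compact sketch of those three checks in plain Python, assuming pixel coordinates; the (width * height) / (16 * 16) minimum-area threshold is the one suggested in this comment, not an official value:

```python
def box_is_valid(xmin, ymin, xmax, ymax, width, height):
    """Returns True only if the box passes the three checks listed above."""
    inside = (0 <= xmin < width and 0 <= xmax < width
              and 0 <= ymin < height and 0 <= ymax < height)
    ordered = xmin < xmax and ymin < ymax
    big_enough = (xmax - xmin) * (ymax - ymin) > (width * height) / (16 * 16)
    return inside and ordered and big_enough


assert box_is_valid(10, 10, 200, 150, 640, 480)
assert not box_is_valid(200, 10, 10, 150, 640, 480)  # x1 > x2
assert not box_is_valid(10, 10, 20, 20, 640, 480)    # object too small
```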

Hi There,
We are checking to see if you still need help on this, as this seems to be a considerably old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing.
If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing this.

@carlosfab you are absolutely right. You saved the day. Thank you.
