Models: Training custom model crashes with "ERROR:tensorflow:Model diverged with loss = NaN."

Created on 24 Jul 2018 · 45 Comments · Source: tensorflow/models

System information

What is the top-level directory of the model you are using: Object detection
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): no
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
TensorFlow installed from (source or binary): binary
TensorFlow version (use command below): 1.9.0
Bazel version (if compiling from source): -
CUDA/cuDNN version: CUDA 9.0 / cuDNN 7.0.5
GPU model and memory: Nvidia GeForce GTX 1050 Ti
Exact command to reproduce:
python model_main.py --model_dir=training/ --pipeline_config_path=training/ssd_inception_v2_coco.config --logtostderr

Describe the problem

The model_main.py script crashes before completing even one training step. Normally I'd say it's because of my GPU, but the now-deprecated train.py script worked well. I'm training a custom model with the ssd_inception_v2_coco config file, using the corresponding model as the fine-tune checkpoint.

Source code / logs

`(tensorflow2) c:\tensorflow2\models\research\object_detection>python model_main.py --model_dir=training/ --pipeline_config_path=training/ssd_inception_v2_coco.config --eval_training_data --alsologtostderr
C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\importlib\_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\importlib\_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
C:\tensorflow2\models\research\object_detection\utils\visualization_utils.py:25: UserWarning:
This call to matplotlib.use() has no effect because the backend has already
been chosen; matplotlib.use() must be called before pylab, matplotlib.pyplot,
or matplotlib.backends is imported for the first time.

The backend was originally set to 'TkAgg' by the following code:
File "model_main.py", line 26, in <module>
from object_detection import model_lib
File "C:\tensorflow2\models\research\object_detection\model_lib.py", line 26, in <module>
from object_detection import eval_util
File "C:\tensorflow2\models\research\object_detection\eval_util.py", line 28, in <module>
from object_detection.metrics import coco_evaluation
File "C:\tensorflow2\models\research\object_detection\metrics\coco_evaluation.py", line 20, in <module>
from object_detection.metrics import coco_tools
File "C:\tensorflow2\models\research\object_detection\metrics\coco_tools.py", line 47, in <module>
from pycocotools import coco
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\pycocotools\coco.py", line 49, in <module>
import matplotlib.pyplot as plt
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\matplotlib\pyplot.py", line 71, in <module>
from matplotlib.backends import pylab_setup
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\matplotlib\backends\__init__.py", line 16, in <module>
line for line in traceback.format_stack()

import matplotlib; matplotlib.use('Agg') # pylint: disable=multiple-statements
WARNING:tensorflow:Estimator's model_fn (.model_fn at 0x0000013642613C80>) includes params argument, but params are not passed to Estimator.
WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
2018-07-24 16:28:36.695781: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2018-07-24 16:28:37.044293: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1392] Found device 0 with properties:
name: GeForce GTX 1050 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.4175
pciBusID: 0000:26:00.0
totalMemory: 4.00GiB freeMemory: 3.30GiB
2018-07-24 16:28:37.053468: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1471] Adding visible gpu devices: 0
2018-07-24 16:28:37.688722: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:952] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-24 16:28:37.692253: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:958] 0
2018-07-24 16:28:37.694498: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:971] 0: N
2018-07-24 16:28:37.696812: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3025 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050 Ti, pci bus id: 0000:26:00.0, compute capability: 6.1)
ERROR:tensorflow:Model diverged with loss = NaN.
Traceback (most recent call last):
File "model_main.py", line 101, in <module>
tf.app.run()
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\platform\app.py", line 125, in run
_sys.exit(main(argv))
File "model_main.py", line 97, in main
tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\estimator\training.py", line 447, in train_and_evaluate
return executor.run()
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\estimator\training.py", line 531, in run
return self.run_local()
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\estimator\training.py", line 669, in run_local
hooks=train_hooks)
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\estimator\estimator.py", line 366, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\estimator\estimator.py", line 1119, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\estimator\estimator.py", line 1135, in _train_model_default
saving_listeners)
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\estimator\estimator.py", line 1336, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\training\monitored_session.py", line 577, in run
run_metadata=run_metadata)
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1053, in run
run_metadata=run_metadata)
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1144, in run
raise six.reraise(*original_exc_info)
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\six.py", line 693, in reraise
raise value
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1129, in run
return self._sess.run(*args, **kwargs)
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1209, in run
run_metadata=run_metadata))
File "C:\Users\Luca\Anaconda3\envs\tensorflow2\lib\site-packages\tensorflow\python\training\basic_session_run_hooks.py", line 635, in after_run
raise NanLossDuringTrainingError
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.`

research support

Luca3424

👍26

Most helpful comment

Hi guys, I ended up using the old train.py from the legacy folder.
That is, I run it like this from models/research/object_detection:
python ./legacy/train.py --pipeline_config_path=pipeline_config/ssd_mobilenet_v2_coco.config --train_dir=training/ --logtostderr

jacano on 26 Jul 2018

👍11 🎉3

All 45 comments

Same problem for me on Windows. If I add the following flags

-num_train_steps=1 -num_eval_steps=1

it will stop working after Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8799 MB memory)

kingstarcraft on 25 Jul 2018

When I add the --num_train_steps=1 and --num_eval_steps=1 flags, it crashes with the following error:

tensorflow.python.framework.errors_impl.NotFoundError: Failed to create a directory: training/export\Servo\temp-b'1532522498'; No such file or directory

As soon as I increase the values of these flags, it throws the same error I mentioned above.

Any ideas? Thanks!

Luca3424 on 25 Jul 2018

Same problem here

jacano on 25 Jul 2018

I am also running into this issue. I was able to run the model_main.py script against the latest TensorFlow CPU package and have it run through a large number of steps, but when I try to use the TensorFlow GPU package I keep running into the error "Model diverged with loss = NaN". I tried varying my batch size, but that did not resolve the issue.

GuyTraveler on 25 Jul 2018

Same problem when using model_main.py to train.
@jacano do you see duplicated training steps when using the legacy train.py? I saw log output like this:

INFO:tensorflow:Restoring parameters from /ChinaRS/code/tensorflow/ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03/model.ckpt
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path models\model.ckpt
INFO:tensorflow:Saving checkpoint to path models\model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 0.
INFO:tensorflow:Recording summary at step 0.
INFO:tensorflow:global step 1: loss = 2.2247 (14.886 sec/step)
INFO:tensorflow:global step 1: loss = 2.2247 (14.886 sec/step)

yuezhilanyi on 31 Jul 2018

Yes, I got duplicated lines yesterday during a training session, same as you.
I guess it has to do with the --logtostderr flag. I didn't have time to investigate further, sorry.
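
The thread does not confirm what causes the doubled lines. One general way duplicates arise in Python logging is a logger that has its own handler while its records also propagate to a handler on the root logger. The following is a minimal sketch of that mechanism using only the standard library; the logger name demo_framework is hypothetical, and this is not a claim about how TensorFlow configures its logging.

```python
import io
import logging


def count_emitted_lines(propagate):
    """Log one message and return how many lines reached the shared stream."""
    stream = io.StringIO()
    root = logging.getLogger()
    child = logging.getLogger("demo_framework")  # hypothetical logger name

    root_handler = logging.StreamHandler(stream)
    child_handler = logging.StreamHandler(stream)
    root.addHandler(root_handler)
    child.addHandler(child_handler)
    child.setLevel(logging.INFO)
    child.propagate = propagate
    try:
        child.info("global step 1")
    finally:
        # Restore the logging state so the demo has no side effects.
        root.removeHandler(root_handler)
        child.removeHandler(child_handler)
        child.propagate = True
    return len(stream.getvalue().splitlines())


if __name__ == "__main__":
    print("with propagation:", count_emitted_lines(True))
    print("without propagation:", count_emitted_lines(False))
```

With propagation enabled the same record is written once by each handler, which is why the message appears twice.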

jacano on 31 Jul 2018

I have the same problem with a very similar setup and task. I'm training for 1 class, using a GTX 1060 6GB GPU.
Command: python model_main.py --num_eval_steps=2000 --num_train_steps=50000 --alsologtostderr --pipeline_config_path=training/ssdlite_mobilenet_v2_coco.config --model_dir=training

Last week I did the same task with the TensorFlow CPU version on the same system, and it worked perfectly. Yesterday I installed a GPU and ran into this problem.

I changed the flags to --num_eval_steps=1 --num_train_steps=1 and it didn't crash.

xtianhb on 5 Aug 2018

I've updated to the latest version, set initial_learning_rate: 0 in my pipeline config file, and checked the labels and bounding boxes again, and got the same result. With the CPU I don't see this behaviour.
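
The comment above does not say which bounding-box checks were run. As an illustration only, here is a minimal sketch of one possible sanity check, assuming boxes are given as (xmin, ymin, xmax, ymax) tuples in normalized [0, 1] coordinates. The function name find_bad_boxes is hypothetical and not part of the Object Detection API.

```python
def find_bad_boxes(boxes):
    """Return (index, reason) pairs for boxes that look invalid.

    Each box is assumed to be (xmin, ymin, xmax, ymax), normalized to [0, 1].
    """
    problems = []
    for i, (xmin, ymin, xmax, ymax) in enumerate(boxes):
        coords = (xmin, ymin, xmax, ymax)
        if any(c != c for c in coords):  # NaN is the only value not equal to itself
            problems.append((i, "NaN coordinate"))
        elif any(c < 0.0 or c > 1.0 for c in coords):
            problems.append((i, "coordinate outside [0, 1]"))
        elif xmin >= xmax or ymin >= ymax:
            problems.append((i, "zero or negative width/height"))
    return problems


if __name__ == "__main__":
    sample = [
        (0.1, 0.1, 0.5, 0.5),  # fine
        (0.6, 0.2, 0.4, 0.9),  # xmin > xmax
        (0.0, 0.0, 1.2, 1.0),  # xmax outside [0, 1]
    ]
    for index, reason in find_bad_boxes(sample):
        print(index, reason)
```

Passing such a check does not rule out other causes; in this thread the error persisted on the GPU even after the labels and boxes were re-checked.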

xtianhb on 7 Aug 2018

Same problem here. It works with the CPU, but with the GPU it prints the error.

daruai on 10 Aug 2018

I've switched to the legacy/train.py script for training, and legacy/eval.py for evaluation.
It works with GPU, no problems. Same setup as commented earlier.

xtianhb on 18 Aug 2018

Relying on the legacy scripts is a workaround for this problem, but the main issue still persists. We shouldn't have to switch back to the legacy scripts when we want to train our model with a GPU.

Running the non-legacy script with -num_train_steps=1 -num_eval_steps=1 works after manually adding the Servo directory to the model dir. But adding more steps will crash with the error in the title.
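
For anyone who wants to script the manual step, here is a minimal sketch of creating that directory. It assumes the layout <model_dir>/export/Servo, inferred from the "Failed to create a directory: training/export\Servo\..." error quoted earlier in the thread; the helper name ensure_export_dir is hypothetical.

```python
import os
import tempfile


def ensure_export_dir(model_dir):
    """Create <model_dir>/export/Servo if it is missing and return its path."""
    export_dir = os.path.join(model_dir, "export", "Servo")
    os.makedirs(export_dir, exist_ok=True)  # no error if it already exists
    return export_dir


if __name__ == "__main__":
    # Demonstrated on a throwaway directory; pass your own --model_dir in practice.
    demo_model_dir = tempfile.mkdtemp()
    print(ensure_export_dir(demo_model_dir))
```

As noted in the comment, this only gets the one-step run past the directory error; it does not address the NaN loss.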

This could be a CUDA-related issue, but I'm not sure about that.

Stukongeluk on 19 Aug 2018

👍2

@Stukongeluk Yes, I agree with you. I just wanted to isolate possible problems related to the dataset, framework setup, platform, pipeline config, etc., and meanwhile mention the workaround. I've seen similar problems reported in #4754 and #3688.
Yes, I've also seen that behaviour with -num_train_steps=1 -num_eval_steps=1.

xtianhb on 19 Aug 2018

Any news on this???

mathiasthejsen on 24 Aug 2018

@xtianhb - the problem exists for me even with the legacy script when batch size = 1. However, there are no NaN loss errors with other batch sizes.

zishanahmed08 on 13 Sep 2018

Hi all,
OS: Windows 10 64-bit
Python 3.6
TensorFlow 1.10
CUDA 9.0.x
cuDNN 7.0.x
I ran into the same issue with the pet dataset on a GTX 1050 Ti. Running the same dataset and config files on my i5 CPU works fine.

This looks like a bug in the Object Detection API with the pet dataset.
Please keep tracking this so that more developers know about the issue!

gloomyfish1998 on 16 Sep 2018

Update: I still get ERROR:tensorflow:Model diverged with loss = NaN.

gloomyfish1998 on 16 Sep 2018

Using the legacy train.py works, but I needed to change the import section of object_detection/utils/variables_helper.py. Change it from:

import logging
import re
import tensorflow as tf
slim = tf.contrib.slim

to:

import re
import tensorflow as tf
from tensorflow import logging as logging
slim = tf.contrib.slim

This resolves the issue of every log line being printed twice. Training now seems okay, but it still could not save JPG images with the log on Windows 10.

gloomyfish1998 on 16 Sep 2018

I have the same problem. -num_train_steps=1 -num_eval_steps=1 works, but when I increase num_train_steps and num_eval_steps, I get the same error.

cjr0106 on 20 Sep 2018

@gloomyfish1998 have you solved the problem?

cjr0106 on 20 Sep 2018

https://yq.aliyun.com/articles/641576

121649982 on 21 Sep 2018

This problem can be solved by using the legacy script:
python object_detection/legacy/train.py --pipeline_config_path=D:/tensorflow/my_train/models/ssd_mobilenet_v1_pets.config --train_dir=D:/tensorflow/my_train/models/train --alsologtostderr

121649982 on 21 Sep 2018

Windows: SET CUDA_VISIBLE_DEVICES=0

Linux: export CUDA_VISIBLE_DEVICES=0
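
The same variable can also be set from Python instead of the shell. The following is a minimal sketch, assuming it runs at the very top of the training script, before TensorFlow is imported, since the variable has to be in place before CUDA is initialized. The helper name select_gpu is hypothetical.

```python
import os


def select_gpu(device_ids="0"):
    """Expose only the listed GPU ids (comma-separated string) to CUDA."""
    os.environ["CUDA_VISIBLE_DEVICES"] = device_ids
    return os.environ["CUDA_VISIBLE_DEVICES"]


if __name__ == "__main__":
    print(select_gpu("0"))
    # Only import tensorflow after this point so the setting takes effect.
```

On a machine with a single GPU, the value 0 selects the only device that is already visible by default.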

121649982 on 21 Sep 2018

@121649982
Excuse me, could you tell me what the directory in "train_dir=D:/tensorflow/my_train/models/train" points to?

cjr0106 on 21 Sep 2018

@cjr0106 just use the legacy train.py. train_dir is the output directory where your custom trained model will be saved. You can contact me on WeChat: gloomy_fish

gloomyfish1998 on 22 Sep 2018

It refers to the path where the model files are saved after training.

121649982 on 22 Sep 2018

@cjr0106 It refers to the path where the model files are saved after training.

121649982 on 22 Sep 2018

Thanks so much, I solved it.
Do you see duplicated training steps when using the legacy train.py? I also saw log output like this:
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path models\model.ckpt
INFO:tensorflow:Saving checkpoint to path models\model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 0.
INFO:tensorflow:Recording summary at step 0.
INFO:tensorflow:global step 1: loss = 2.2247 (14.886 sec/step)
INFO:tensorflow:global step 1: loss = 2.2247 (14.886 sec/step)

cjr0106 on 25 Sep 2018

Check the pipeline config path to ensure it exists.
Check the paths inside pipeline.config; they must be changed to your own paths.

It is better to use a config file stored outside the fine-tune model directory.
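
The checks suggested above can be partly automated. The following is a minimal sketch that scans the text of a pipeline config for a few path-valued fields and reports the ones that do not resolve on disk. It uses a plain regular expression rather than the protobuf parser, and the helper names are hypothetical. Checkpoint prefixes are handled by also looking for a matching .index file, which is an assumption about how the checkpoint was saved.

```python
import glob
import os
import re
import tempfile

# Path-valued fields commonly found in an Object Detection API pipeline config.
PATH_FIELDS = ("fine_tune_checkpoint", "label_map_path", "input_path")
_FIELD_RE = re.compile(r'\b(%s)\s*:\s*"([^"]*)"' % "|".join(PATH_FIELDS))


def quoted_paths(config_text):
    """Return (field, path) pairs for every path-valued field in the text."""
    return [(m.group(1), m.group(2)) for m in _FIELD_RE.finditer(config_text)]


def path_resolves(path):
    """True if the path, a shard pattern, or a checkpoint prefix matches a file."""
    # A prefix such as ".../model.ckpt" is assumed to exist as "model.ckpt.index".
    return bool(glob.glob(path) or glob.glob(path + ".index"))


def missing_paths(config_text):
    """Return the (field, path) pairs that do not resolve on this machine."""
    return [(f, p) for f, p in quoted_paths(config_text) if not path_resolves(p)]


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    label_map = os.path.join(workdir, "label_map.pbtxt")
    open(label_map, "w").close()
    demo_config = (
        'fine_tune_checkpoint: "%s"\n'
        'label_map_path: "%s"\n'
        % (os.path.join(workdir, "model.ckpt"), label_map)
    )
    for field, path in missing_paths(demo_config):
        print("missing:", field, path)
```

To check a real file, read it with open(path).read() and pass the text to missing_paths.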

On 25 Sep 2018 at 20:16, rongrong notifications@github.com wrote:

I ran into another bug:
tensorflow.python.framework.errors_impl.NotFoundError: NewRandomAccessFile failed to Create/Open: D:/tensorflow/WAGE/test/models/model/checkpoints/ssd_mobilenet_v2_coc o_2018_03_29/pipeline.config : The system cannot find the path specified.
; No such process
--pipeline_config_path=D:/tensorflow/WAGE/test/models/model/checkpoints/ssd_mobilenet_v2_coc o_2018_03_29/pipeline.config
--train_dir=D:/tensorflow/WAGE/test/models/model/train
--alsologtostderr
i set the config:

yuezhilanyi on 25 Sep 2018

I have the same problem. It works only with the legacy train.py!

Victorsoukhov on 28 Sep 2018

@121649982 when you ran eval.py, did it work?
I ran into this problem:
NotFoundError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key lr not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
[[Node: save/RestoreV2/_59 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_84_save/RestoreV2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

The command is:
python eval.py --logtostderr
--pipeline_config_path=D:/tensorflow/WAGE/test/ssd_mobilenet_v2_coco.config
--checkpoint_dir=D:\tensorflow\WAGE\test\models\model\model_dir
--eval_dir=D:\tensorflow\WAGE\test\models\model\eval_dir

cjr0106 on 29 Sep 2018

Hello, I ran into the problem above (the "Key lr not found in checkpoint" error) when running eval.py. How can I solve it?

cjr0106 on 29 Sep 2018

@cjr0106
See eval.py, line 20:

1) A single pipeline_pb2.TrainEvalPipelineConfig file maybe specified instead.

I saw that your pipeline_config_path is a .config file, not a .pbtxt file.

You should use another file instead of the training config file.

yuezhilanyi on 29 Sep 2018

@robieta Any chance you can look at this issue? I think it is common enough to look into it.

Stukongeluk on 2 Oct 2018

👍1

Please update to TensorFlow 1.11.0. There is no problem with the optimizer in that version. My models now run OK.

Victorsoukhov on 2 Oct 2018

👀1

@yuezhilanyi
Yeah, pipeline_config_path should be a .pbtxt file. Do you mean that the .pbtxt should be graph.pbtxt when running train.py?

cjr0106 on 6 Oct 2018

Training and evaluation use different config files.

yuezhilanyi on 6 Oct 2018

Do I write the eval config file myself, or is it produced by running train.py?

cjr0106 on 7 Oct 2018

The same problem +1

lan2720 on 9 Oct 2018

Windows SET CUDA_VISIBLE_DEVICES=0

Linux export CUDA_VISIBLE_DEVICES=0

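How exactly do I do this? Also, how can I configure training to compute accuracy every so many steps, rather than only printing the loss? Do I add arguments after train.py, or change config.py? I'd appreciate any guidance.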

lfydegithub on 27 Feb 2019

Hi,
The error will appear if you forgot to set the num_classes variable in your pipeline.config.

jcRisch on 23 Sep 2019

INFO:tensorflow:loss = 1.2833372, step = 800 (149.845 sec)
INFO:tensorflow:loss = 1.2833372, step = 800 (149.845 sec)
ERROR:tensorflow:Model diverged with loss = NaN.
ERROR:tensorflow:Model diverged with loss = NaN.
Traceback (most recent call last):
  File "model_main.py", line 111, in <module>
    tf.app.run()
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "model_main.py", line 107, in main
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 471, in train_and_evaluate
    return executor.run()
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 611, in run
    return self.run_local()
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 712, in run_local
    saving_listeners=saving_listeners)
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1124, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model_default
    saving_listeners)
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1407, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 676, in run
    run_metadata=run_metadata)
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1171, in run
    run_metadata=run_metadata)
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1270, in run
    raise six.reraise(*original_exc_info)
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
    return self._sess.run(*args, **kwargs)
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1335, in run
    run_metadata=run_metadata))
  File "/data2/CZY/software/anconda2/envs/python36/lib/python3.6/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 753, in after_run
    raise NanLossDuringTrainingError

Can anyone help me?

zychen2016 on 5 Dec 2019

tensorflow/models is a pit. The trained results are not good, there are all kinds of GPU-memory and RAM shortages, and the gradients explode. I have given up on it; the models from the authors' original code don't have these messy problems.

121649982 on 6 Dec 2019

Haha, is there another implementation of SSD_mobilenet_v2?

zychen2016 on 6 Dec 2019

In my case, num_classes was different from the number of classes in the .pbtxt file.
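
A mismatch like this can be caught before training starts. The following is a minimal sketch that compares the num_classes value in the pipeline config text with the number of id entries in the label map text. It relies on simple regular expressions instead of the protobuf parser, and the function names are hypothetical.

```python
import re


def declared_num_classes(config_text):
    """Return the first num_classes value in the config text, or None."""
    match = re.search(r"\bnum_classes\s*:\s*(\d+)", config_text)
    return int(match.group(1)) if match else None


def label_map_ids(label_map_text):
    """Return every integer id declared in the label map text."""
    return [int(x) for x in re.findall(r"\bid\s*:\s*(\d+)", label_map_text)]


def check_num_classes(config_text, label_map_text):
    """Return (declared, found, matches) for a config and label map pair."""
    declared = declared_num_classes(config_text)
    found = len(label_map_ids(label_map_text))
    return declared, found, declared == found


if __name__ == "__main__":
    # Illustrative inputs: a config left at the COCO default and a 2-class label map.
    demo_config = "model { ssd { num_classes: 90 } }"
    demo_label_map = (
        "item { id: 1 name: 'cat' }\n"
        "item { id: 2 name: 'dog' }\n"
    )
    print(check_num_classes(demo_config, demo_label_map))
```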