Models: No messages like 'INFO:tensorflow:global step .. loss...' while training model

Created on 21 Sep 2018 · 28 Comments · Source: tensorflow/models

Hi all!

I'm stuck while trying various object-detection examples. I tried the 'coco' net with my own images (I only have 30 pictures to train the model so far), and I also tried the Faster-RCNN-Inception-V2 example with its own sample images and config. But neither of these models trains! I ran the first example (train) and waited more than 10 hours, and it did not stop. The second example has the same problem.

I'm running all of these examples on macOS on a MacBook Pro 2017, and this is all I ever get in the logs:

WARNING:tensorflow:Ignoring detection with image id 1241141149 since it was previously added

WARNING:tensorflow:Ignoring ground truth with image id 558212937 since it was previously added

WARNING:tensorflow:Ignoring detection with image id 558212937 since it was previously added

WARNING:tensorflow:Ignoring ground truth with image id 1493033516 since it was previously added

WARNING:tensorflow:Ignoring detection with image id 1493033516 since it was previously added

creating index...

index created!

creating index...

index created!

Running per image evaluation...

Evaluate annotation type *bbox*

DONE (t=0.11s).

Accumulating evaluation results...

DONE (t=0.12s).

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.130

 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.325

 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.113

 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000

 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.130

 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.170

 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.344

 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.421

 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.421

I have NO logs like this

INFO:tensorflow:global step 11788: loss = 0.6717 (0.398 sec/step)
INFO:tensorflow:global step 11789: loss = 0.5310 (0.436 sec/step)

I tried to find something about this problem, but found no information.
I tried changing config values such as 'num_examples' (to make it equal to the number of images) and the 'batch_size' param, but neither helped.

Can someone tell me why I get these messages instead of the "normal" log with loss steps?

awaiting response


SheptunovaAA

Most helpful comment

@SheptunovaAA I dug into this model_main.py problem. model_main.py combines the train.py and eval.py of the old version, and I found that:
1. Add tf.logging.set_verbosity(tf.logging.INFO) after the import section of the model_main.py script. It will then display a summary after every 100th step. Maybe the official model_main.py forgot it.
2. As for warnings like WARNING:tensorflow:Ignoring ground truth with image id 283094706 since it was previously added: they show up if you set either the --num_eval_steps=200 arg or num_examples in the .config file
eval_config: { num_examples: 200 max_evals: 10 }
larger than the number of test images in your dataset. So just set them to the number of test images and it will be OK.
3. When running it with python3, I got another problem: can't pickle dict_values. Wrapping category_index.values() in list() in model_lib.py at about line 390, i.e. list(category_index.values()), works for me.
It works for me now.

buzzf on 21 Sep 2018

👍20 🎉8 ❤2
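Editor's note on point 3 above: a minimal sketch of the same class of error, using only the standard library. The thread does not show the traceback, so this is not the code path in model_lib.py; it only shows that a Python 3 dict_values view cannot be pickled while a list built from it can. The contents of category_index here are made up for illustration.

```python
import pickle

# Hypothetical stand-in for the label map used by the object detection code.
category_index = {
    1: {"id": 1, "name": "cat"},
    2: {"id": 2, "name": "dog"},
}


def try_pickle(obj):
    """Return True if obj can be pickled, False if pickle raises TypeError."""
    try:
        pickle.dumps(obj)
        return True
    except TypeError:
        return False


# In Python 3, dict.values() returns a view object that pickle rejects.
view_ok = try_pickle(category_index.values())
# Copying the view into a list gives an ordinary, picklable container.
list_ok = try_pickle(list(category_index.values()))

print("dict_values picklable:", view_ok)
print("list picklable:", list_ok)
```

In Python 2, values() already returns a list, which would explain why the problem only shows up under python3.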

All 28 comments

I got the same problem: no loss messages are shown step by step in the terminal, but it's running normally. I can see the loss changing in TensorBoard, and I got the model checkpoint files.

buzzf on 21 Sep 2018

@buzzf I don't have any output files at all. How did you get them if all these 'wrong' logs seem to go on forever? And do the resulting model files actually work (detect what you trained them on)?

SheptunovaAA on 21 Sep 2018

@SheptunovaAA Yeah, it works, but not very accurately, because I haven't trained it for long. I'm now trying a second time. So from my side it doesn't seem to be a 'wrong' log; it's more like the print is missing or it stopped somewhere in the code. I'm trying to find it.

buzzf on 21 Sep 2018

@buzzf Do you train with the legacy/train.py file or with model_main.py? My colleague ran an object detection example with model_main.py and had all the problems I've described here. But when he ran it all with legacy/train.py, the logs started printing 'loss' steps and some output files were created.
But with 'model_main.py' it does not work at all.

SheptunovaAA on 21 Sep 2018

👍2 🎉1

@SheptunovaAA Now I've got the same problem as you. I always train with model_main.py / python 3.5. After only about ten minutes, it stopped. I don't know why. Did you try it with python2?

buzzf on 21 Sep 2018

@buzzf I run with python2 only. With python3 I have many issues and a big headache running all the steps of creating and training the model.

SheptunovaAA on 21 Sep 2018

@SheptunovaAA Well, maybe something is wrong with model_main.py. I didn't get the problem the first time, maybe because I trained only for a short time. But according to TensorBoard it truly ran before the error occurred. And I tried /legacy/train.py, and it works well.

buzzf on 21 Sep 2018

👍20 🎉8 ❤2

@buzzf

ok
Oh, I've tried a lot of variants, and the last was to set num_examples to the number of pictures in my train directory, and it doesn't work. The warnings did not go away.
This trick also doesn't work for me; I tried it before I wrote this issue.

SheptunovaAA on 24 Sep 2018

@SheptunovaAA 2. Not the number of train images; it is the number of eval images.

buzzf on 24 Sep 2018
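Editor's note: a small simulation of buzzf's explanation, assuming (as that comment implies, and not verified against the evaluator code) that the eval input starts over when more examples are requested than the eval set contains. The function simulate_eval is hypothetical; the image ids are taken from the warnings in the original post.

```python
from itertools import cycle, islice


def simulate_eval(image_ids, num_examples):
    """Walk num_examples items of a repeating eval set.

    Returns the ids that come up a second time, i.e. the ones that would
    trigger a "previously added" warning under the assumption above.
    """
    seen = set()
    duplicates = []
    for image_id in islice(cycle(image_ids), num_examples):
        if image_id in seen:
            duplicates.append(image_id)
        else:
            seen.add(image_id)
    return duplicates


eval_ids = [1241141149, 558212937, 1493033516]  # three eval images

exact = simulate_eval(eval_ids, num_examples=3)     # matches the eval set size
too_many = simulate_eval(eval_ids, num_examples=5)  # larger than the eval set

print(exact)
print(too_many)
```

With num_examples equal to the number of eval images nothing repeats; with a larger value the first ids come around again, which is consistent with the advice to count eval images rather than train images.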

Facing the same issue: no message is printed while it's training and saving checkpoints.
If anybody has fixed it, please share here.

BalajiB3663 on 26 Sep 2018

👀1

@BalajiB3663 In my situation with the 'model_main.py' script, it's also not saving any checkpoints, not only 'no logs'.

SheptunovaAA on 26 Sep 2018

@SheptunovaAA Check your architecture. You might be missing a path reference for storing the checkpoints.
reference: args for model_main.py
model_dir="{project_directory}/models/model/"

BalajiB3663 on 26 Sep 2018

Same issue here.
The model is training and the checkpoints are continuously stored, but I can't see any message during training regarding loss, AP or anything on the terminal.

agemagician on 30 Sep 2018

👀1

@SheptunovaAA @BalajiB3663 @agemagician After adding the line tf.logging.set_verbosity(tf.logging.INFO) to the beginning of the program, are the logs still not printing?

k-w-w on 1 Oct 2018

👍4
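Editor's note: a minimal sketch of why raising the verbosity makes the loss lines appear. It uses Python's standard logging module as a stand-in, on the assumption that tf.logging in TF 1.x is a thin wrapper around it with an effective default level of WARNING; the logger name and messages are made up.

```python
import io
import logging

# Capture output in memory so the effect is easy to inspect.
stream = io.StringIO()
logger = logging.getLogger("demo_training")  # hypothetical logger name
logger.propagate = False
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s:%(message)s"))
logger.addHandler(handler)

# At WARNING level, INFO records are dropped and only warnings get through.
logger.setLevel(logging.WARNING)
logger.info("global step 1: loss = 0.67")
logger.warning("Ignoring detection with image id 1")

# After lowering the threshold to INFO, the loss line is emitted as well.
logger.setLevel(logging.INFO)
logger.info("global step 2: loss = 0.53")

captured = stream.getvalue().splitlines()
print(captured)
```

This matches what the original post shows: WARNING lines appear, INFO lines do not, until the level is set to INFO.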

The above resolves it for me. Also worth noting that the updates now come every hundred steps.
Example from Ubuntu 16.04:
2018-10-29 20:01:13.961075: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9711 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-10-29 20:01:14.043073: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10409 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:04:00.0, compute capability: 6.1)
INFO:tensorflow:Running local_init_op.
I1029 20:01:16.723612 139792596526848 tf_logging.py:115] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I1029 20:01:16.986080 139792596526848 tf_logging.py:115] Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into training/model.ckpt.
I1029 20:01:29.347278 139792596526848 tf_logging.py:115] Saving checkpoints for 0 into training/model.ckpt.
INFO:tensorflow:loss = 21.577475, step = 0
I1029 20:01:50.256881 139792596526848 tf_logging.py:115] loss = 21.577475, step = 0
INFO:tensorflow:global_step/sec: 0.977434
I1029 20:03:32.565272 139792596526848 tf_logging.py:115] global_step/sec: 0.977434
INFO:tensorflow:loss = 0.6787091, step = 100 (102.309 sec)
I1029 20:03:32.565837 139792596526848 tf_logging.py:115] loss = 0.6787091, step = 100 (102.309 sec)
INFO:tensorflow:global_step/sec: 1.08775
I1029 20:05:04.498113 139792596526848 tf_logging.py:115] global_step/sec: 1.08775
INFO:tensorflow:loss = 0.43647617, step = 200 (91.933 sec)
I1029 20:05:04.498731 139792596526848 tf_logging.py:115] loss = 0.43647617, step = 200 (91.933 sec)
INFO:tensorflow:global_step/sec: 1.09907
I1029 20:06:35.484275 139792596526848 tf_logging.py:115] global_step/sec: 1.09907
INFO:tensorflow:loss = 0.5910977, step = 300 (90.986 sec)
I1029 20:06:35.484774 139792596526848 tf_logging.py:115] loss = 0.5910977, step = 300 (90.986 sec)

IamSierraCharlie on 29 Oct 2018
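Editor's note: since the new-style output only reports the loss every hundred steps, it can be handy to pull the (step, loss) pairs out of a saved log. The sketch below is a hypothetical helper, not part of the object detection API; it assumes one message per line in the "INFO:tensorflow:loss = ..., step = ..." form quoted in the comment above.

```python
import re

# Matches lines such as "INFO:tensorflow:loss = 0.6787091, step = 100 (102.309 sec)".
LOSS_LINE = re.compile(
    r"^INFO:tensorflow:loss = (?P<loss>[-+0-9.eE]+), step = (?P<step>\d+)"
)


def parse_losses(log_text):
    """Return a list of (step, loss) pairs found in the log text."""
    pairs = []
    for line in log_text.splitlines():
        match = LOSS_LINE.match(line.strip())
        if match:
            pairs.append((int(match.group("step")), float(match.group("loss"))))
    return pairs


sample = """INFO:tensorflow:loss = 21.577475, step = 0
INFO:tensorflow:global_step/sec: 0.977434
INFO:tensorflow:loss = 0.6787091, step = 100 (102.309 sec)
INFO:tensorflow:loss = 0.43647617, step = 200 (91.933 sec)"""

losses = parse_losses(sample)
print(losses)
```

The global_step/sec lines are skipped on purpose; only lines that carry a loss value are returned.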

Why was it changed to every 100 steps? Is there any reason?
How can I get logs for every step?

santosh898 on 31 Oct 2018

Hi, I am able to view the steps by adding tf.logging.set_verbosity(tf.logging.INFO), but I am not able to see the accuracy and total_loss metrics in TensorBoard. I am getting loss1, loss2, gradient_norm, and learning rate. Were you guys able to see the accuracy and total loss metrics? I am running the faster_rcnn model on the pets dataset.

Anubhav2017 on 10 Dec 2018

👍2

Hi, where is the code corresponding to this output text? Thanks.
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.130
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.325

IrisLimiao on 19 Dec 2018

👍1

Quoting @santosh898: "why was it changed to every 100 steps? Is there any reason? How to get logs every step?"

The old version shows the step logs.
You can run this command:
python ~/model/research/object_detection/legacy/train.py --logtostderr --train_dir=training/ --pipeline_config_path=/home/sy/data/work/StandardCVSXImages/ssd_mobilenet_v1_coco.config
and you will get the loss for each step.

shenyingying on 22 Jan 2019

@SheptunovaAA I suggest you clone the latest version of the repo from tensorflow. There have been good fixes recently. Then try running the model_main script with the reference args below.
python model_main.py \
--pipeline_config_path=/object_detection/data/ssd_resnet50_v1_fpn.config \
--model_dir=/object_detection/models/model \
--num_train_steps=2000000 \
--sample_1_of_n_eval_examples=10083 \
--alsologtostderr

BalajiB197 on 22 Jan 2019

Where are the checkpoint files stored?

Or where do I change the output path in the model_main file?

schauppi on 23 Jan 2019

@schauppi Checkpoints will be stored in the model directory.
You can also refer to the official doc: https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_locally.md

BalajiB197 on 24 Jan 2019

@BalajiB3663
Sorry for the stupid question: but where exactly in the model directory?
can i change the directory in the model_main file?

schauppi on 24 Jan 2019

@schauppi
The model directory is wherever you want to save the checkpoints. So you can create your own directory and pass that path as the model_dir argument. In my case I created it inside models: /object_detection/models/model, where model is my own directory.

BalajiB197 on 24 Jan 2019

👍1
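Editor's note: a minimal sketch for checking which checkpoints have landed in the directory passed as model_dir. It assumes the usual TF 1.x file layout, where each checkpoint leaves a model.ckpt-<step>.index file; the model.ckpt prefix appears in the log quoted earlier in the thread, while the -<step>.index suffix is an assumption. The helper name checkpoint_steps and the fake files are hypothetical.

```python
import os
import re
import tempfile

# Assumed naming: one "model.ckpt-<step>.index" file per saved checkpoint.
CKPT_INDEX = re.compile(r"^model\.ckpt-(\d+)\.index$")


def checkpoint_steps(model_dir):
    """Return the sorted global steps for which an index file exists."""
    steps = []
    for name in os.listdir(model_dir):
        match = CKPT_INDEX.match(name)
        if match:
            steps.append(int(match.group(1)))
    return sorted(steps)


# Demo on a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as demo_dir:
    for fake in ("model.ckpt-0.index", "model.ckpt-200.index",
                 "model.ckpt-200.meta", "checkpoint"):
        open(os.path.join(demo_dir, fake), "w").close()
    found = checkpoint_steps(demo_dir)

print(found)
```

An empty result after a long run would suggest that the model_dir path is not the one the training job is actually writing to.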

@BalajiB3663 It worked, thank you.

schauppi on 24 Jan 2019

@IrisLimiao
Inside research/object_detection/utils/metrics.py there is a compute_precision_recall function.
Any idea how to use this function?

@SheptunovaAA
Which variables in eval.py to pass into the compute_precision_recall function?

Thank you.

krustiv on 15 Feb 2019
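Editor's note: the thread never answers how compute_precision_recall is meant to be called, and its exact signature has not been checked here. As a rough guide to the inputs such a function typically needs, below is a plain-Python sketch of the standard computation. The function precision_recall and its argument names are hypothetical, not the library's API.

```python
def precision_recall(scores, is_true_positive, num_gt):
    """Cumulative precision and recall over detections sorted by score.

    scores:           confidence of each detection
    is_true_positive: True if the detection matched a ground-truth box
    num_gt:           number of ground-truth boxes for this class (must be > 0)
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    # Visit detections from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = 0
    fp = 0
    precision = []
    recall = []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        precision.append(tp / (tp + fp))
        recall.append(tp / num_gt)
    return precision, recall


prec, rec = precision_recall(
    scores=[0.9, 0.8, 0.7],
    is_true_positive=[True, False, True],
    num_gt=2,
)
print(prec)
print(rec)
```

In other words, the per-detection scores, the matched or unmatched flag for each detection, and the ground-truth count for the class are the three pieces of information one would expect to collect from the evaluation code before calling such a function.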

Hi There,
We are checking to see if you still need help on this, as this seems to be a considerably old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing.
If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing this.