Hi all!
I'm stuck while trying various object-detection examples. I tried a 'coco' net with my own images (I only have 30 pictures to train the model for now), and I also tried this example with the Faster-RCNN-Inception-V2 model, using its example images and config. But none of these models train! I ran the first example (train) and waited more than 10 hours, and it still had not stopped. The second example has the same problem.
I'm running all of these examples on macOS on a MacBook Pro 2017, and all I ever get in the logs is this:
WARNING:tensorflow:Ignoring detection with image id 1241141149 since it was previously added
WARNING:tensorflow:Ignoring ground truth with image id 558212937 since it was previously added
WARNING:tensorflow:Ignoring detection with image id 558212937 since it was previously added
WARNING:tensorflow:Ignoring ground truth with image id 1493033516 since it was previously added
WARNING:tensorflow:Ignoring detection with image id 1493033516 since it was previously added
creating index...
index created!
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=0.11s).
Accumulating evaluation results...
DONE (t=0.12s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.130
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.325
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.113
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.130
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.170
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.344
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.421
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.421
I have NO logs like this:
INFO:tensorflow:global step 11788: loss = 0.6717 (0.398 sec/step)
INFO:tensorflow:global step 11789: loss = 0.5310 (0.436 sec/step)
I tried to find something about this problem, but found no information.
I tried changing config values such as 'num_examples' (to make it equal to the number of images) and the 'batch_size' param, but neither helped.
Can someone tell me why I get these messages instead of the "normal" log with loss steps?
I got the same problem: no loss messages are shown step by step in the terminal, but it is running normally. I can see the loss changing in TensorBoard, and I got the model checkpoint files.
@buzzf I don't have any output files at all. How do you get them if all these 'wrong' logs seem to go on forever? And do the resulting model files actually work (detect what you trained them on)?
@SheptunovaAA Yeah, it can work, but not very accurately, because I haven't trained it for long. I'm now trying a second time. So from my side it does not look like a 'wrong' log; it is more like nothing is printed, or it stopped somewhere in the code. I'm trying to find out where.
@buzzf Do you train with legacy/train.py or with model_main.py? My colleague ran an object detection example with model_main.py and had all the problems I've described here. But when he ran the same thing with legacy/train.py, the logs started printing 'loss' steps and some output files were created.
With model_main.py it does not work at all.
@SheptunovaAA Now I have the same problem as you. I always train with model_main.py and python 3.5. After only about ten minutes, it stopped. I don't know why. Did you try it with python2?
@buzzf I run with python2 only. With python3 I had many issues and a big headache getting through all the steps of creating and training the model.
@SheptunovaAA Well, maybe something is wrong with model_main.py. I did not hit the problem the first time, maybe because I trained only for a short time, but according to TensorBoard it really was running before the error occurred. I also tried legacy/train.py, and it works well.
@SheptunovaAA I dug into this model_main.py problem. model_main.py combines the train.py and eval.py of the old version, and I found that:
1. Add tf.logging.set_verbosity(tf.logging.INFO) after the import section of the model_main.py script. It will then display a summary after every 100th step. Maybe the official model_main.py just forgot it.
2. As for warnings like WARNING:tensorflow:Ignoring ground truth with image id 283094706 since it was previously added: they appear if you set either the --num_eval_steps=200 argument or num_examples in the .config file
eval_config: {
  num_examples: 200
  max_evals: 10
}
larger than the number of test images in your dataset. So just set them to the number of test images and it will be OK.
3. When I ran it with python3, I got another problem: can't pickle dict_values. Adding list() to category_index.values() in model_lib.py, around line 390, so that it reads list(category_index.values()), works for me.
It works for me now.
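The Python 3 failure in point 3 can be reproduced without TensorFlow. A minimal sketch, assuming a made-up `category_index` dictionary (the real one is built inside model_lib.py) and a hypothetical helper `can_pickle`: a `dict_values` view cannot be pickled, while a list built from it can.

```python
import pickle

# Hypothetical stand-in for the category_index built in model_lib.py.
category_index = {1: {"id": 1, "name": "cat"}, 2: {"id": 2, "name": "dog"}}

def can_pickle(obj):
    """Return True if obj survives a pickle round trip, False otherwise."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except (TypeError, pickle.PicklingError):
        return False

view_ok = can_pickle(category_index.values())        # dict_values view
list_ok = can_pickle(list(category_index.values()))  # plain list copy
print(view_ok, list_ok)
```

Wrapping the view in `list()` makes a concrete copy, which is why the one-line change is enough.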
@buzzf 2. I set num_examples to the number of pictures in my train directory, and it does not work. The warnings did not go away.
@SheptunovaAA 2. Not the number of train images, the number of eval images.
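Why the eval count matters can be shown with a toy model. This is only a sketch, assuming the evaluator keeps reading from a repeating eval set until `num_examples` images have been consumed and skips any image id it has already seen; `simulate_eval` and its arguments are hypothetical names, not part of the Object Detection API.

```python
from itertools import cycle, islice

def simulate_eval(eval_image_ids, num_examples):
    """Toy model: read num_examples ids from a repeating eval set and
    count how many are skipped because they were already added."""
    seen = set()
    ignored = 0
    for image_id in islice(cycle(eval_image_ids), num_examples):
        if image_id in seen:
            ignored += 1  # stands for one "previously added" warning
        else:
            seen.add(image_id)
    return ignored

# Image ids copied from the warnings at the top of the thread.
eval_ids = [1241141149, 558212937, 1493033516]
too_many = simulate_eval(eval_ids, num_examples=200)
matched = simulate_eval(eval_ids, num_examples=len(eval_ids))
print(too_many, matched)
```

Under this assumption, the warnings disappear only when `num_examples` does not exceed the number of distinct eval images.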
Facing the same issue: no messages are printed while it's training and saving checkpoints.
If anybody has fixed it, please share here.
@BalajiB3663 In my situation, with the 'model_main.py' script it's not saving any checkpoints either, not only 'no logs'.
@SheptunovaAA Check your architecture. You might be missing a path reference for storing the checkpoints.
Reference: args for model_main.py
model_dir="{project_directory}/models/model/"
Same issue here.
The model is training and the checkpoints are continuously stored, but I can't see any message during training regarding loss, AP or anything on the terminal.
@SheptunovaAA @BalajiB3663 @agemagician After adding the line tf.logging.set_verbosity(tf.logging.INFO) to the beginning of the program, are the logs still not printing?
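How a verbosity threshold can hide the loss lines can be illustrated with Python's standard `logging` module. This is an analogy, not TensorFlow code: it assumes tf.logging behaves like an ordinary logger whose threshold sits at WARNING until it is raised to INFO, and the logger name `demo` is made up.

```python
import io
import logging

stream = io.StringIO()
logger = logging.getLogger("demo")   # hypothetical logger name
logger.addHandler(logging.StreamHandler(stream))
logger.propagate = False             # keep output confined to our stream

logger.setLevel(logging.WARNING)     # threshold that hides INFO messages
logger.info("loss = 21.577475, step = 0")
logger.warning("Ignoring detection with image id 558212937")
before = stream.getvalue()

logger.setLevel(logging.INFO)        # analogous to set_verbosity(INFO)
logger.info("loss = 0.6787091, step = 100")
after = stream.getvalue()

print(before)
print(after)
```

With the threshold at WARNING only the warning line reaches the stream, which matches the symptom described at the top of the thread.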
The above resolves it for me. Also worth noting that the updates now come every hundred steps.
Example from Ubuntu 16.04:
2018-10-29 20:01:13.961075: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9711 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-10-29 20:01:14.043073: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10409 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:04:00.0, compute capability: 6.1)
INFO:tensorflow:Running local_init_op.
I1029 20:01:16.723612 139792596526848 tf_logging.py:115] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I1029 20:01:16.986080 139792596526848 tf_logging.py:115] Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into training/model.ckpt.
I1029 20:01:29.347278 139792596526848 tf_logging.py:115] Saving checkpoints for 0 into training/model.ckpt.
INFO:tensorflow:loss = 21.577475, step = 0
I1029 20:01:50.256881 139792596526848 tf_logging.py:115] loss = 21.577475, step = 0
INFO:tensorflow:global_step/sec: 0.977434
I1029 20:03:32.565272 139792596526848 tf_logging.py:115] global_step/sec: 0.977434
INFO:tensorflow:loss = 0.6787091, step = 100 (102.309 sec)
I1029 20:03:32.565837 139792596526848 tf_logging.py:115] loss = 0.6787091, step = 100 (102.309 sec)
INFO:tensorflow:global_step/sec: 1.08775
I1029 20:05:04.498113 139792596526848 tf_logging.py:115] global_step/sec: 1.08775
INFO:tensorflow:loss = 0.43647617, step = 200 (91.933 sec)
I1029 20:05:04.498731 139792596526848 tf_logging.py:115] loss = 0.43647617, step = 200 (91.933 sec)
INFO:tensorflow:global_step/sec: 1.09907
I1029 20:06:35.484275 139792596526848 tf_logging.py:115] global_step/sec: 1.09907
INFO:tensorflow:loss = 0.5910977, step = 300 (90.986 sec)
I1029 20:06:35.484774 139792596526848 tf_logging.py:115] loss = 0.5910977, step = 300 (90.986 sec)
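Since the loss now appears only every 100 steps, it can be handy to pull the values out of a saved log. A minimal sketch, assuming the `INFO:tensorflow:loss = ..., step = ...` line format shown above; `parse_loss_lines` is a hypothetical helper, not part of TensorFlow.

```python
import re

# Matches lines such as "INFO:tensorflow:loss = 0.6787091, step = 100 (102.309 sec)".
LOSS_RE = re.compile(r"INFO:tensorflow:loss = ([0-9.eE+-]+), step = (\d+)")

def parse_loss_lines(log_text):
    """Return a list of (step, loss) tuples found in the log text."""
    return [(int(step), float(loss)) for loss, step in LOSS_RE.findall(log_text)]

sample = """INFO:tensorflow:loss = 21.577475, step = 0
INFO:tensorflow:global_step/sec: 0.977434
INFO:tensorflow:loss = 0.6787091, step = 100 (102.309 sec)
INFO:tensorflow:loss = 0.43647617, step = 200 (91.933 sec)
"""
points = parse_loss_lines(sample)
for step, loss in points:
    print(step, loss)
```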
Why was it changed to every 100 steps? Is there any reason?
How do I get logs for every step?
Hi, I am able to view the steps by adding tf.logging.set_verbosity(tf.logging.INFO), but I am not able to see the accuracy and total_loss metrics in TensorBoard. I only get loss1, loss2, gradient_norm and learning rate. Were you able to see the accuracy and total loss metrics? I am running a faster_rcnn model on the pets dataset.
Hi, where is the code that produces this output text? Thanks.
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.130
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.325
Why was it changed to every 100 steps? Is there any reason?
How do I get logs for every step?
The old version shows the per-step log.
You can run this command:
python ~/model/research/object_detection/legacy/train.py --logtostderr --train_dir=training/ --pipeline_config_path=/home/sy/data/work/StandardCVSXImages/ssd_mobilenet_v1_coco.config
and you will get the loss for every step.
@SheptunovaAA I suggest you clone the latest version of the training repo from tensorflow. There have been good fixes recently. Then try running the model_main script with the reference args below.
python model_main.py \
--pipeline_config_path=/object_detection/data/ssd_resnet50_v1_fpn.config \
--model_dir=/object_detection/models/model \
--num_train_steps=2000000 \
--sample_1_of_n_eval_examples=10083 \
--alsologtostderr
Where are the checkpoint files stored?
Or where do I change the output path in the model_main file?
@schauppi Checkpoints will be stored in the model directory.
You can also refer to the official doc: https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_locally.md
@BalajiB3663
Sorry for the stupid question, but where exactly in the model directory?
Can I change the directory in the model_main file?
@schauppi
The model directory is wherever you want to save the checkpoints. So you can create your own directory and pass its path as the --model_dir argument. In my case I created it inside models: /object_detection/models/model, where model is my own directory.
@BalajiB3663 It worked, thank you.
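To check which checkpoints ended up in the model directory, the files can be listed by step. A rough sketch, assuming TensorFlow's usual `model.ckpt-<step>.index` file naming; the directory used here is a temporary one created only for the demonstration, and `list_checkpoint_steps` is a hypothetical helper.

```python
import re
import tempfile
from pathlib import Path

def list_checkpoint_steps(model_dir):
    """Return the sorted step numbers of model.ckpt-<step>.index files."""
    steps = []
    for path in Path(model_dir).glob("model.ckpt-*.index"):
        match = re.fullmatch(r"model\.ckpt-(\d+)\.index", path.name)
        if match:
            steps.append(int(match.group(1)))
    return sorted(steps)

# Demonstration on a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as model_dir:
    for step in (0, 1000, 200):
        (Path(model_dir) / f"model.ckpt-{step}.index").touch()
    found = list_checkpoint_steps(model_dir)
print(found)
```

In practice the function would be pointed at whatever path was passed as --model_dir.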
@IrisLimiao
Inside research/object_detection/utils/metrics.py there is a compute_precision_recall function.
Any idea how to use this function?
@SheptunovaAA
Which variables in eval.py should be passed into the compute_precision_recall function?
Thank you.
Hi There,
We are checking to see if you still need help on this, as this seems to be considerably old issue. Please update this issue with the latest information, code snippet to reproduce your issue and error you are seeing.
If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing this.