Models: model_main.py does not save checkpoint?

Created on 25 Nov 2018 · 10 Comments · Source: tensorflow/models

System information

What is the top-level directory of the model you are using:
https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md
Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
No
OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
Windows 10 64 bit
TensorFlow installed from (source or binary):
pip installed
TensorFlow version (use command below):
1.12
Bazel version (if compiling from source):
CUDA/cuDNN version:
CUDA 9.0
GPU model and memory:
GTX 1060 6 Gb
Exact command to reproduce:
python E:\Documents\Projects\tensorflow\models\research\object_detection\model_main.py --alsologtostderr --pipeline_config_path=experiments/training/ssdlite_mobilenet_v2_coco.config --model_dir=/experiments/training/ --num_train_steps=50000 --NUM_EVAL_STEPS=2000

Describe the problem

model_main.py doesn't save training checkpoints. I see the status message (see below), but I don't see any checkpoints being saved during training. What's going on?

Source code / logs

I see this, but it's not saving any checkpoint to this directory:

tensorflow:Saving 'checkpoint_path' summary for global step 40469: /experiments/training/model.ckpt-40469
I1125 05:27:49.819430 7500 tf_logging.py:115] Saving 'checkpoint_path' summary for global step 40469: /experiments/training/model.ckpt-40469
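A quick way to confirm whether any checkpoint files actually reached the directory is to list them. The sketch below assumes the usual TensorFlow 1.x naming, where each saved checkpoint leaves a model.ckpt-<step>.index file next to its data and meta files. The helper name list_checkpoint_steps is made up for illustration and is not part of model_main.py.

```python
import re
from pathlib import Path


def list_checkpoint_steps(model_dir):
    """Return the sorted global steps that have a model.ckpt-<step>.index file."""
    steps = []
    for index_file in Path(model_dir).glob("model.ckpt-*.index"):
        # Keep only files whose name matches the assumed checkpoint pattern.
        match = re.fullmatch(r"model\.ckpt-(\d+)\.index", index_file.name)
        if match:
            steps.append(int(match.group(1)))
    return sorted(steps)
```

If list_checkpoint_steps(...) returns an empty list for the directory passed as --model_dir, no checkpoint was written there, even though the log mentions model.ckpt-40469.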

Source

zubairahmed-ai

Most helpful comment

@saleem-hadad Thanks, Saleem. After I did some tests, the training has actually started. It turns out I needed to create an export folder, and within that a servo folder, to let TensorFlow save the checkpoint. Don't ask me why, it's just too weird. I tested it with a smaller number of steps and it started saving checkpoints too; see my comment: https://github.com/tensorflow/models/issues/2984#issuecomment-441422918
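The workaround described in this comment can be sketched as follows. The folder names export and servo are taken from the comment itself; the helper name ensure_export_dirs is hypothetical, and the thread does not confirm why the folders were needed, so treat this as a reproduction of the reported workaround rather than an explanation of the root cause.

```python
from pathlib import Path


def ensure_export_dirs(model_dir):
    """Create <model_dir>/export/servo if it does not exist and return its path."""
    servo_dir = Path(model_dir) / "export" / "servo"
    # parents=True also creates model_dir and export when they are missing;
    # exist_ok=True makes the call safe to repeat.
    servo_dir.mkdir(parents=True, exist_ok=True)
    return servo_dir
```

It would be called once before launching model_main.py, with the same directory that is passed as --model_dir.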

zubairahmed-ai on 26 Nov 2018

🎉3

All 10 comments

I think the problem is with the model_dir=/experiments/training/ flag: because you put / before experiments, you're pointing it at the root directory and not at the project's directory.
Try model_dir=experiments/training/ instead of model_dir=/experiments/training/.
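To illustrate the difference between the two spellings, here is a small sketch using Python's pathlib with Windows path rules. The working directory E:\Documents\Projects\tensorflow is an assumption made for the example (the issue does not say where the command was run from), and resolve_model_dir is a hypothetical helper, not something TensorFlow provides.

```python
from pathlib import PureWindowsPath

# Assumed working directory (hypothetical; the issue does not state it).
ASSUMED_CWD = PureWindowsPath(r"E:\Documents\Projects\tensorflow")


def resolve_model_dir(model_dir, cwd=ASSUMED_CWD):
    """Show where a --model_dir value points when joined onto cwd using Windows rules."""
    # A leading slash makes the value rooted, so the join keeps only the drive of cwd.
    return cwd / model_dir


print(resolve_model_dir("/experiments/training/"))  # E:\experiments\training
print(resolve_model_dir("experiments/training/"))   # E:\Documents\Projects\tensorflow\experiments\training
```

With the leading slash, checkpoints would land under the root of the drive rather than next to the pipeline config, which is one place to look before concluding that nothing was saved.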

saleem-hadad on 25 Nov 2018

I can't remember exactly, but it was giving me a different issue and someone suggested that I use this.
How do you use it?
Shouldn't it at least complain about it?
Is there any other flag that I am unaware of?

zubairahmed-ai on 25 Nov 2018

I'll check my last run command when I come back home

saleem-hadad on 25 Nov 2018

Thanks, please do.

zubairahmed-ai on 25 Nov 2018

@zubairahmed-ai I think you're missing the flag --checkpoint_dir=yourDir; when I compared my command with yours, it was the only one missing.

saleem-hadad on 25 Nov 2018

zubairahmed-ai on 26 Nov 2018

🎉3

hhh that's great 🤣 All the best dude

saleem-hadad on 26 Nov 2018

Thanks :)

zubairahmed-ai on 26 Nov 2018

Please, how can I calculate the mAP for these results?

INFO:tensorflow:Restoring parameters from test_image1/model.ckpt-50000
INFO:tensorflow:Restoring parameters from test_image1/model.ckpt-50000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Performing evaluation on 4 images.
INFO:tensorflow:Performing evaluation on 4 images.
creating index...
index created!
INFO:tensorflow:Loading and preparing annotation results...
INFO:tensorflow:Loading and preparing annotation results...
INFO:tensorflow:DONE (t=0.00s)
INFO:tensorflow:DONE (t=0.00s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type bbox
DONE (t=0.06s).
Accumulating evaluation results...
DONE (t=0.01s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.211
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.380
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.187
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.003
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.233
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.485
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.067
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.241
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.261
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.020
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.283
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.570
INFO:tensorflow:Finished evaluation at 2019-10-28-14:06:58
INFO:tensorflow:Finished evaluation at 2019-10-28-14:06:58
INFO:tensorflow:Saving dict for global step 50000: DetectionBoxes_Precision/mAP = 0.21072435, DetectionBoxes_Precision/mAP (large) = 0.485231, DetectionBoxes_Precision/mAP (medium) = 0.23299623, DetectionBoxes_Precision/mAP (small) = 0.0032204273, DetectionBoxes_Precision/mAP@.50IOU = 0.38036388, DetectionBoxes_Precision/mAP@.75IOU = 0.1866721, DetectionBoxes_Recall/AR@1 = 0.06658986, DetectionBoxes_Recall/AR@10 = 0.24055299, DetectionBoxes_Recall/AR@100 = 0.26059908, DetectionBoxes_Recall/AR@100 (large) = 0.57045454, DetectionBoxes_Recall/AR@100 (medium) = 0.2825, DetectionBoxes_Recall/AR@100 (small) = 0.02, Loss/BoxClassifierLoss/classification_loss = 0.5046802, Loss/BoxClassifierLoss/localization_loss = 0.3415953, Loss/RPNLoss/localization_loss = 0.54075974, Loss/RPNLoss/objectness_loss = 0.46926486, Loss/total_loss = 1.8563001, global_step = 50000, learning_rate = 0.0002, loss = 1.8563001
INFO:tensorflow:Saving dict for global step 50000: DetectionBoxes_Precision/mAP = 0.21072435, DetectionBoxes_Precision/mAP (large) = 0.485231, DetectionBoxes_Precision/mAP (medium) = 0.23299623, DetectionBoxes_Precision/mAP (small) = 0.0032204273, DetectionBoxes_Precision/mAP@.50IOU = 0.38036388, DetectionBoxes_Precision/mAP@.75IOU = 0.1866721, DetectionBoxes_Recall/AR@1 = 0.06658986, DetectionBoxes_Recall/AR@10 = 0.24055299, DetectionBoxes_Recall/AR@100 = 0.26059908, DetectionBoxes_Recall/AR@100 (large) = 0.57045454, DetectionBoxes_Recall/AR@100 (medium) = 0.2825, DetectionBoxes_Recall/AR@100 (small) = 0.02, Loss/BoxClassifierLoss/classification_loss = 0.5046802, Loss/BoxClassifierLoss/localization_loss = 0.3415953, Loss/RPNLoss/localization_loss = 0.54075974, Loss/RPNLoss/objectness_loss = 0.46926486, Loss/total_loss = 1.8563001, global_step = 50000, learning_rate = 0.0002, loss = 1.8563001
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 50000: test_image1/model.ckpt-50000
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 50000: test_image1/model.ckpt-50000
INFO:tensorflow:Performing the final export in the end of training.
INFO:tensorflow:Performing the final export in the end of training.
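The log above already appears to contain the mAP: the first Average Precision line (IoU=0.50:0.95, all areas) reads 0.211, and the "Saving dict" line reports the same value as DetectionBoxes_Precision/mAP = 0.21072435, computed on the 4 evaluation images. A minimal sketch for pulling that number out of such a line is shown below. The function name parse_eval_line is hypothetical, and SAMPLE_LINE is a shortened excerpt of the log line above rather than the full line.

```python
import re

# Shortened excerpt of the "Saving dict" line from the log above.
SAMPLE_LINE = (
    "INFO:tensorflow:Saving dict for global step 50000: "
    "DetectionBoxes_Precision/mAP = 0.21072435, "
    "DetectionBoxes_Precision/mAP (large) = 0.485231, "
    "Loss/total_loss = 1.8563001, global_step = 50000"
)


def parse_eval_line(line):
    """Return (global_step, {metric_name: value}) parsed from a 'Saving dict' log line."""
    match = re.search(r"Saving dict for global step (\d+): (.*)", line)
    if match is None:
        raise ValueError("not an evaluation summary line")
    metrics = {}
    for pair in match.group(2).split(", "):
        # Metric names may contain spaces, so split on the last " = ".
        name, _, value = pair.rpartition(" = ")
        metrics[name] = float(value)
    return int(match.group(1)), metrics


step, metrics = parse_eval_line(SAMPLE_LINE)
print(step, metrics["DetectionBoxes_Precision/mAP"])
```

The same values are also written as evaluation summaries, so they can be viewed in TensorBoard instead of being parsed from the console output.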