Models: TF2 Object Detection API training script model_main_tf2 not working - stuck on "Waiting for new checkpoint", then "Timed-out waiting for a checkpoint"

Created on 16 Jul 2020 · 7 Comments · Source: tensorflow/models

Prerequisites

Please answer the following questions for yourself before submitting an issue.

[Y] I am using the latest TensorFlow Model Garden release and TensorFlow 2.
[Y] I am reporting the issue to the correct repository. (Model Garden official or research directory)
[Y] I checked to make sure that this issue has not been filed already.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/tree/master/research/object_detection/model_main_tf2.py

2. Describe the bug

After running for a while, model_main_tf2 gets stuck on "Waiting for new checkpoint". It then ends with the message "Timed-out waiting for a checkpoint".

3. Steps to reproduce

https://github.com/IvanBrasilico/ajna_bbox
The TF2 installation steps are in the project README. They are basically the steps described in the documentation: generate TFRecords for training, download a model definition and checkpoint, edit pipeline.config with the TFRecord paths, and run model_main_tf2.

4. Expected behavior

The expected behavior was for training to run, or at least for an error message to be shown.

5. Additional context

The complete model_main_tf2.py console output is at the end of this report.

6. System information

Note that the example Colab notebook eager_few_shot_od_training_tf2.ipynb runs and trains the same model in the same virtualenv on the same machine.

Linux Ubuntu 16.04
Python 3.6 venv
TensorFlow 2.2 installed by pip
CUDA/cuDNN version: CUDA 11 installed by apt
GPU model and memory: GTX 1050 Ti 4GB
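
A side observation on the environment, offered as a hedged sketch rather than a diagnosis: the console output later in this report shows TensorFlow failing to load several CUDA 10.x libraries while LD_LIBRARY_PATH points at a cuda-9.0 directory, and then printing "Skipping registering GPU devices...". That suggests this run fell back to the CPU. It does not appear to be related to the checkpoint timeout. The helper names below (`missing_libraries`, `library_search_paths`) are hypothetical; the log lines are excerpts from this report with the timestamps removed.

```python
import re

# Excerpts from the console output in this report (timestamps removed).
LOG_EXCERPT = """\
W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-9.0/lib64:/usr/local/cuda/extras/CUPTI/lib64
W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10'; dlerror: libcublas.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-9.0/lib64:/usr/local/cuda/extras/CUPTI/lib64
I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10'; dlerror: libcusparse.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-9.0/lib64:/usr/local/cuda/extras/CUPTI/lib64
"""


def missing_libraries(log_text):
    """Return the library names TensorFlow reported it could not load."""
    return re.findall(r"Could not load dynamic library '([^']+)'", log_text)


def library_search_paths(log_text):
    """Return the non-empty LD_LIBRARY_PATH entries from the first warning."""
    match = re.search(r"LD_LIBRARY_PATH: (\S+)", log_text)
    if match is None:
        return []
    return [entry for entry in match.group(1).split(":") if entry]


print(missing_libraries(LOG_EXCERPT))
print(library_search_paths(LOG_EXCERPT))
```

The missing names all carry a 10.x suffix, while the report lists CUDA 11 as installed and the search path mentions cuda-9.0, so a version mismatch looks plausible.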

Complete environment information:

https://github.com/IvanBrasilico/ajna_bbox/blob/master/tf_env.txt

Complete model_main_tf2 output:

(venv) ivan@ivan-G7-7588:~/PycharmProjects/ajna_bbox$ python models/research/object_detection/model_main_tf2.py --model_dir=/home/ivan/PycharmProjects/ajna_bbox/bases/models/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/ --checkpoint_dir=/home/ivan/PycharmProjects/ajna_bbox/bases/models/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/checkpoint --alsologtostderr --pipeline_config_path=bases/models/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/pipeline.config --use-tpu=true
WARNING:tensorflow:Forced number of epochs for all eval validations to be 1.
W0715 23:32:23.856509 140079432734464 model_lib_v2.py:905] Forced number of epochs for all eval validations to be 1.
INFO:tensorflow:Maybe overwriting sample_1_of_n_eval_examples: None
I0715 23:32:23.856632 140079432734464 config_util.py:552] Maybe overwriting sample_1_of_n_eval_examples: None
INFO:tensorflow:Maybe overwriting use_bfloat16: False
I0715 23:32:23.856686 140079432734464 config_util.py:552] Maybe overwriting use_bfloat16: False
INFO:tensorflow:Maybe overwriting eval_num_epochs: 1
I0715 23:32:23.856735 140079432734464 config_util.py:552] Maybe overwriting eval_num_epochs: 1
WARNING:tensorflow:Expected number of evaluation epochs is 1, but instead encountered eval_on_train_input_config.num_epochs = 0. Overwriting num_epochs to 1.
W0715 23:32:23.856801 140079432734464 model_lib_v2.py:920] Expected number of evaluation epochs is 1, but instead encountered eval_on_train_input_config.num_epochs = 0. Overwriting num_epochs to 1.
2020-07-15 23:32:23.881471: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-07-15 23:32:23.923686: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-15 23:32:23.924041: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1050 Ti computeCapability: 6.1
coreClock: 1.62GHz coreCount: 6 deviceMemorySize: 3.95GiB deviceMemoryBandwidth: 104.43GiB/s
2020-07-15 23:32:23.924195: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-9.0/lib64:/usr/local/cuda/extras/CUPTI/lib64
2020-07-15 23:32:23.924305: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10'; dlerror: libcublas.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-9.0/lib64:/usr/local/cuda/extras/CUPTI/lib64
2020-07-15 23:32:23.925568: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-07-15 23:32:23.925901: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-07-15 23:32:23.928778: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-07-15 23:32:23.928903: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10'; dlerror: libcusparse.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-9.0/lib64:/usr/local/cuda/extras/CUPTI/lib64
2020-07-15 23:32:23.932572: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-07-15 23:32:23.932610: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1598] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2020-07-15 23:32:23.932881: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-07-15 23:32:23.939320: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2299965000 Hz
2020-07-15 23:32:23.939775: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x657f610 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-07-15 23:32:23.939791: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-07-15 23:32:23.941028: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-15 23:32:23.941041: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]
WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
W0715 23:32:23.947229 140079432734464 dataset_builder.py:83] num_readers has been reduced to 1 to match input file shards.
WARNING:tensorflow:From /home/ivan/PycharmProjects/ajna_bbox/venv/lib/python3.6/site-packages/object_detection/builders/dataset_builder.py:100: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE) instead. If sloppy execution is desired, use tf.data.Options.experimental_deterministic.
W0715 23:32:23.949348 140079432734464 deprecation.py:323] From /home/ivan/PycharmProjects/ajna_bbox/venv/lib/python3.6/site-packages/object_detection/builders/dataset_builder.py:100: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE) instead. If sloppy execution is desired, use tf.data.Options.experimental_deterministic.
WARNING:tensorflow:From /home/ivan/PycharmProjects/ajna_bbox/venv/lib/python3.6/site-packages/object_detection/builders/dataset_builder.py:175: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.data.Dataset.map()
W0715 23:32:23.965300 140079432734464 deprecation.py:323] From /home/ivan/PycharmProjects/ajna_bbox/venv/lib/python3.6/site-packages/object_detection/builders/dataset_builder.py:175: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.data.Dataset.map()
WARNING:tensorflow:From /home/ivan/PycharmProjects/ajna_bbox/venv/lib/python3.6/site-packages/object_detection/inputs.py:79: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a tf.sparse.SparseTensor and use tf.sparse.to_dense instead.
W0715 23:32:29.178085 140079432734464 deprecation.py:323] From /home/ivan/PycharmProjects/ajna_bbox/venv/lib/python3.6/site-packages/object_detection/inputs.py:79: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a tf.sparse.SparseTensor and use tf.sparse.to_dense instead.
WARNING:tensorflow:From /home/ivan/PycharmProjects/ajna_bbox/venv/lib/python3.6/site-packages/object_detection/inputs.py:259: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
W0715 23:32:30.630500 140079432734464 deprecation.py:323] From /home/ivan/PycharmProjects/ajna_bbox/venv/lib/python3.6/site-packages/object_detection/inputs.py:259: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
INFO:tensorflow:Waiting for new checkpoint at /home/ivan/PycharmProjects/ajna_bbox/bases/models/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/checkpoint
I0715 23:32:33.767113 140079432734464 checkpoint_utils.py:125] Waiting for new checkpoint at /home/ivan/PycharmProjects/ajna_bbox/bases/models/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/checkpoint
INFO:tensorflow:Found new checkpoint at /home/ivan/PycharmProjects/ajna_bbox/bases/models/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/checkpoint/ckpt-0
I0715 23:32:33.767870 140079432734464 checkpoint_utils.py:134] Found new checkpoint at /home/ivan/PycharmProjects/ajna_bbox/bases/models/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/checkpoint/ckpt-0
WARNING:tensorflow:From /home/ivan/PycharmProjects/ajna_bbox/venv/lib/python3.6/site-packages/object_detection/eval_util.py:854: to_int64 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
W0715 23:33:02.120177 140079432734464 deprecation.py:323] From /home/ivan/PycharmProjects/ajna_bbox/venv/lib/python3.6/site-packages/object_detection/eval_util.py:854: to_int64 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
INFO:tensorflow:Finished eval step 0
I0715 23:33:11.245014 140079432734464 model_lib_v2.py:782] Finished eval step 0
WARNING:tensorflow:From /home/ivan/PycharmProjects/ajna_bbox/venv/lib/python3.6/site-packages/object_detection/utils/visualization_utils.py:618: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version.
Instructions for updating:
tf.py_func is deprecated in TF V2. Instead, there are two
options available in V2.
- tf.py_function takes a python function which manipulates tf eager
tensors instead of numpy arrays. It's easy to convert a tf eager tensor to
an ndarray (just call tensor.numpy()) but having access to eager tensors
means tf.py_functions can use accelerators such as GPUs as well as
being differentiable using a gradient tape.
- tf.numpy_function maintains the semantics of the deprecated tf.py_func
(it is not differentiable, and manipulates numpy arrays). It drops the
stateful argument making all functions stateful.

W0715 23:33:11.261951 140079432734464 deprecation.py:323] From /home/ivan/PycharmProjects/ajna_bbox/venv/lib/python3.6/site-packages/object_detection/utils/visualization_utils.py:618: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version.
Instructions for updating:
tf.py_func is deprecated in TF V2. Instead, there are two
options available in V2.
- tf.py_function takes a python function which manipulates tf eager
tensors instead of numpy arrays. It's easy to convert a tf eager tensor to
an ndarray (just call tensor.numpy()) but having access to eager tensors
means tf.py_functions can use accelerators such as GPUs as well as
being differentiable using a gradient tape.
- tf.numpy_function maintains the semantics of the deprecated tf.py_func
(it is not differentiable, and manipulates numpy arrays). It drops the
stateful argument making all functions stateful.

INFO:tensorflow:Performing evaluation on 21 images.
I0715 23:33:30.778897 140079432734464 coco_evaluation.py:237] Performing evaluation on 21 images.
creating index...
index created!
INFO:tensorflow:Loading and preparing annotation results...
I0715 23:33:30.779220 140079432734464 coco_tools.py:116] Loading and preparing annotation results...
INFO:tensorflow:DONE (t=0.00s)
I0715 23:33:30.780228 140079432734464 coco_tools.py:138] DONE (t=0.00s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type bbox
DONE (t=0.03s).
Accumulating evaluation results...
DONE (t=0.00s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.024
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.024
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.024
INFO:tensorflow:Eval metrics at step 0
I0715 23:33:30.815683 140079432734464 model_lib_v2.py:836] Eval metrics at step 0
INFO:tensorflow: + DetectionBoxes_Precision/mAP: 0.000143
I0715 23:33:30.818211 140079432734464 model_lib_v2.py:839] + DetectionBoxes_Precision/mAP: 0.000143
INFO:tensorflow: + DetectionBoxes_Precision/mAP@.50IOU: 0.000286
I0715 23:33:30.818874 140079432734464 model_lib_v2.py:839] + DetectionBoxes_Precision/mAP@.50IOU: 0.000286
INFO:tensorflow: + DetectionBoxes_Precision/mAP@.75IOU: 0.000000
I0715 23:33:30.819247 140079432734464 model_lib_v2.py:839] + DetectionBoxes_Precision/mAP@.75IOU: 0.000000
INFO:tensorflow: + DetectionBoxes_Precision/mAP (small): -1.000000
I0715 23:33:30.819588 140079432734464 model_lib_v2.py:839] + DetectionBoxes_Precision/mAP (small): -1.000000
INFO:tensorflow: + DetectionBoxes_Precision/mAP (medium): -1.000000
I0715 23:33:30.819919 140079432734464 model_lib_v2.py:839] + DetectionBoxes_Precision/mAP (medium): -1.000000
INFO:tensorflow: + DetectionBoxes_Precision/mAP (large): 0.000215
I0715 23:33:30.820254 140079432734464 model_lib_v2.py:839] + DetectionBoxes_Precision/mAP (large): 0.000215
INFO:tensorflow: + DetectionBoxes_Recall/AR@1: 0.000000
I0715 23:33:30.820581 140079432734464 model_lib_v2.py:839] + DetectionBoxes_Recall/AR@1: 0.000000
INFO:tensorflow: + DetectionBoxes_Recall/AR@10: 0.023810
I0715 23:33:30.820914 140079432734464 model_lib_v2.py:839] + DetectionBoxes_Recall/AR@10: 0.023810
INFO:tensorflow: + DetectionBoxes_Recall/AR@100: 0.023810
I0715 23:33:30.821241 140079432734464 model_lib_v2.py:839] + DetectionBoxes_Recall/AR@100: 0.023810
INFO:tensorflow: + DetectionBoxes_Recall/AR@100 (small): -1.000000
I0715 23:33:30.821578 140079432734464 model_lib_v2.py:839] + DetectionBoxes_Recall/AR@100 (small): -1.000000
INFO:tensorflow: + DetectionBoxes_Recall/AR@100 (medium): -1.000000
I0715 23:33:30.821907 140079432734464 model_lib_v2.py:839] + DetectionBoxes_Recall/AR@100 (medium): -1.000000
INFO:tensorflow: + DetectionBoxes_Recall/AR@100 (large): 0.023810
I0715 23:33:30.822265 140079432734464 model_lib_v2.py:839] + DetectionBoxes_Recall/AR@100 (large): 0.023810
INFO:tensorflow: + Loss/localization_loss: 0.189787
I0715 23:33:30.822557 140079432734464 model_lib_v2.py:839] + Loss/localization_loss: 0.189787
INFO:tensorflow: + Loss/classification_loss: 1.298645
I0715 23:33:30.822857 140079432734464 model_lib_v2.py:839] + Loss/classification_loss: 1.298645
INFO:tensorflow: + Loss/regularization_loss: 0.176113
I0715 23:33:30.823152 140079432734464 model_lib_v2.py:839] + Loss/regularization_loss: 0.176113
INFO:tensorflow: + Loss/total_loss: 1.664544
I0715 23:33:30.823446 140079432734464 model_lib_v2.py:839] + Loss/total_loss: 1.664544
INFO:tensorflow:Waiting for new checkpoint at /home/ivan/PycharmProjects/ajna_bbox/bases/models/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/checkpoint
I0715 23:37:33.829480 140079432734464 checkpoint_utils.py:125] Waiting for new checkpoint at /home/ivan/PycharmProjects/ajna_bbox/bases/models/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/checkpoint
INFO:tensorflow:Timed-out waiting for a checkpoint.
I0716 00:37:33.181403 140079432734464 checkpoint_utils.py:188] Timed-out waiting for a checkpoint.
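
The last two timestamped lines of the log can be compared to see how long the script waited before giving up. The sketch below parses the glog-style prefix (level letter, month and day, clock time) of those two lines. The function name `parse_glog_time` is hypothetical, the year is an assumption taken from the issue date because the log prefix does not include one, and the checkpoint path is replaced by a placeholder.

```python
from datetime import datetime


def parse_glog_time(line, year=2020):
    """Parse the 'I0715 23:37:33.829480' prefix of a log line.

    The prefix has no year, so one is supplied (2020 is an assumption
    based on the date this issue was created).
    """
    level_and_date, clock = line.split()[:2]
    month_day = level_and_date[1:]  # drop the leading level letter (I/W/E)
    return datetime.strptime(f"{year}{month_day} {clock}", "%Y%m%d %H:%M:%S.%f")


# The two lines from the end of the log; <path> stands in for the checkpoint directory.
waiting = "I0715 23:37:33.829480 140079432734464 checkpoint_utils.py:125] Waiting for new checkpoint at <path>"
timed_out = "I0716 00:37:33.181403 140079432734464 checkpoint_utils.py:188] Timed-out waiting for a checkpoint."

elapsed = parse_glog_time(timed_out) - parse_glog_time(waiting)
print(elapsed.total_seconds())
```

The gap comes out at just under 3600 seconds, which would be consistent with a one-hour wait limit. The log itself does not state the configured timeout, so that reading is an inference.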

research bug

Source

IvanBrasilico

Most helpful comment

Thanks very much!! It was my fault. I am closing the issue.

As a comment, the TensorFlow ecosystem is great, but the Object Detection API needs better documentation. Even the code needs some cleaning. Because the documentation is poor, I went through a lot of trial and error with pipeline.config and the command line. The error messages are odd, so I tried to read the code. The code is complex and makes some assumptions: filenames have to follow certain patterns (like ckpt-0) and directories have to follow certain patterns, otherwise the code breaks.

Now I am training the network, but there are no messages showing what is really going on (such as "found NN examples", "training XXX", loss, etc.). We only get some messages when evaluation runs.

I am not a beginner. I have fine-tuned a lot of Keras networks for computer vision tasks before, and even built some of my own from scratch. I have taken all the courses from deeplearning.ai and others, and I am having a very hard time simply trying to use the Object Detection API.

The Keras/TF2 API is great and very well documented. It would be great if this package achieved the same quality. At the very least, the Object Detection API needs a complete working example (generate train/test sets, train and evaluate, save and export the model to production). The Colab is incomplete (no model saving or evaluation) and doesn't use the same patterns as the scripts (TFRecords, TensorFlow Serving export, etc.). The scripts and the pipeline config file take a lot of trial and error to learn how to use. I will try a little more, because I am using TensorFlow Serving in production for other models and would like to stay with it, but if I fail more times I will end up moving to Matterport Mask-RCNN or even PyTorch/fastai.

IvanBrasilico on 17 Jul 2020

👍12

All 7 comments

@IvanBrasilico I hope this can help you!

I wrote a tutorial to train EfficientDet in Google Colab with the TensorFlow 2 Object Detection API.

You can run this tutorial by changing just one line for your custom dataset import. I hope this tutorial allows newcomers to the repository to quickly get up and running with TensorFlow 2 for object detection!

In the tutorial, I show how to:

Acquire Labeled Object Detection Data
Install TensorFlow 2 Object Detection Dependencies
Download Custom TensorFlow 2 Object Detection Dataset
Write Custom TensorFlow 2 Object Detection Training Configuration
Train Custom TensorFlow 2 Object Detection Model
Export Custom TensorFlow 2 Object Detection Weights
Use Trained TensorFlow 2 Object Detection For Inference on Test Images

Jacobsolawetz on 16 Jul 2020

👎2

@IvanBrasilico Please run your command without the --checkpoint_dir option. Adding that option switches the script to evaluation-only mode instead of training. Hope that solves it.

The command to start training is:
python models/research/object_detection/model_main_tf2.py --model_dir=/home/ivan/PycharmProjects/ajna_bbox/bases/models/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/ --alsologtostderr --pipeline_config_path=bases/models/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/pipeline.config --use-tpu=true
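
To make the difference between the two invocations explicit, here is a minimal sketch that assembles the argument list for either mode. It relies only on what this comment states: the presence of `--checkpoint_dir` selects evaluation, and its absence selects training. The helper names `build_command` and `mode_of` are hypothetical, and the command is trimmed to the flags relevant to mode selection.

```python
def build_command(model_dir, pipeline_config_path, checkpoint_dir=None):
    """Build an argv list for model_main_tf2.py.

    Per the comment above, passing --checkpoint_dir puts the script in
    evaluation-only mode; omitting it starts training.
    """
    cmd = [
        "python",
        "models/research/object_detection/model_main_tf2.py",
        f"--model_dir={model_dir}",
        f"--pipeline_config_path={pipeline_config_path}",
        "--alsologtostderr",
    ]
    if checkpoint_dir is not None:
        cmd.append(f"--checkpoint_dir={checkpoint_dir}")
    return cmd


def mode_of(cmd):
    """Classify a command as 'eval' or 'train' by the --checkpoint_dir flag."""
    has_checkpoint_dir = any(arg.startswith("--checkpoint_dir=") for arg in cmd)
    return "eval" if has_checkpoint_dir else "train"


base = "bases/models/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8"
train_cmd = build_command(base, f"{base}/pipeline.config")
eval_cmd = build_command(base, f"{base}/pipeline.config", checkpoint_dir=f"{base}/checkpoint")

print(mode_of(train_cmd), mode_of(eval_cmd))
```

The command in the original report matches the second shape, which is why it evaluated ckpt-0 once and then waited instead of training.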

sambhusuryamohan on 17 Jul 2020

Hi @IvanBrasilico, are you able to run a proper evaluation job anyway, as described here and as you were trying to do by setting checkpoint_dir? I have trained my models and am now trying to get their performance in terms of mAP, but I get the error "Timed-out waiting for a new checkpoint".

tazu786 on 30 Jul 2020

I am having the same problem. I trained my network, but I can't evaluate afterwards. It just says "Waiting for a new checkpoint". There has to be a way to first run training and then run evaluation afterwards based on a saved checkpoint?
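
The log in the original report hints at what the evaluation mode does: it found ckpt-0, evaluated it, printed metrics, and only then went back to waiting for a newer checkpoint until it timed out. The sketch below is a simplified illustration of that polling pattern, not the library's actual implementation. The function names are hypothetical, and detecting checkpoints by files named `ckpt-N.index` is an assumption based on the `ckpt-0` name in the log.

```python
import re
import tempfile
import time
from pathlib import Path


def latest_step(checkpoint_dir):
    """Return the highest N among files named ckpt-N.index, or None if there are none."""
    steps = []
    for path in Path(checkpoint_dir).glob("ckpt-*.index"):
        match = re.fullmatch(r"ckpt-(\d+)\.index", path.name)
        if match:
            steps.append(int(match.group(1)))
    return max(steps, default=None)


def wait_for_new_checkpoint(checkpoint_dir, last_step, timeout, poll_interval=0.05):
    """Poll until a checkpoint newer than last_step appears, or return None on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        step = latest_step(checkpoint_dir)
        if step is not None and (last_step is None or step > last_step):
            return step
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval)


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "ckpt-0.index").touch()
    # First call: ckpt-0 exists and nothing has been evaluated yet, so it is returned.
    first = wait_for_new_checkpoint(tmp, last_step=None, timeout=0.3)
    # Second call: nothing newer than ckpt-0 ever shows up, so the wait times out.
    second = wait_for_new_checkpoint(tmp, last_step=first, timeout=0.3)

print(first, second)
```

Under this reading, a timeout after a successful evaluation pass means no further checkpoints were written, whereas a timeout with no evaluation at all would suggest the directory passed as checkpoint_dir contains no checkpoint the script recognizes.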

nilskk on 14 Oct 2020

Hello @IvanBrasilico, were you able to find a solution for this?
I am facing the same 'INFO:tensorflow:Timed-out waiting for a checkpoint.' error while trying to evaluate my model. I am attaching a screenshot of my config file as reference if that helps.

config file screenshot

Any help would be appreciated, thank you!

radhikam01 on 19 Oct 2020

Hello everybody!
I am having the same issue. Has anyone found a solution?
@IvanBrasilico were you able to solve this problem? Could you help me, please?