Models: tensorflow:Waiting for new checkpoint at /home/chowkam/deeplabv3_cityscapes_train? Infinite loop

Created on 2 Jul 2018 · 19 Comments · Source: tensorflow/models

System information

What is the top-level directory of the model you are using:
Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
TensorFlow installed from (source or binary):
TensorFlow version (use command below):
Bazel version (if compiling from source):
CUDA/cuDNN version:
GPU model and memory:
Exact command to reproduce:

Toplevel_Directory : tensorflow:

Generated tfRecord using sh convert_cityscapes.sh
/home/chowkam/cityscapes
----gtFine
----leftImg8bit
----tfRecord

Downloaded the model pretrained on Cityscapes from
https://download.tensorflow.org/models/deeplabv3_cityscapes_train_2018_02_06.tar.gz
Model files: /home/chowkam/deeplabv3_cityscapes_train
----frozen_inference_graph.pb
----model.ckpt.data-00000-of-00001
----model.ckpt.index

python deeplab/eval.py --logtostderr --eval_split="val" --model_variant="xception_65" --atrous_rates=6 --atrous_rates=12 --atrous_rates=18 --output_stride=16 --decoder_output_stride=4 --eval_crop_size=1025 --eval_crop_size=2049 --dataset="cityscapes" --checkpoint_dir="/home/chowkam/deeplabv3_cityscapes_train" --vis_logdir="/home/chowkam/cityscapes/leftImg8bit/val" --dataset_dir="/home/chowkam/cityscapes/tfrecord"

I am getting into an infinite loop with tensorflow:Waiting for new checkpoint

Kindly help us resolve this problem, please.

chowkamlee81

Most helpful comment

The original code expects training and evaluation to run simultaneously. This is different from Keras, where we can train and evaluate at the same time in the same function. In this TensorFlow code that is not possible, so we need two scripts: one is train.py, and the other is eval.py or vis.py.

So if you want to evaluate or test an image on its own, you may need to revise the code.

In the deeplab directory we have train.py, eval.py and vis.py: train.py is used to train the model, eval.py is used to calculate the mIoU, and vis.py is used to test the images.
So we revise eval.py at line 162:

```
slim.evaluation.evaluation_loop(
    master=FLAGS.master,
    checkpoint_dir=FLAGS.checkpoint_dir,
    logdir=FLAGS.eval_logdir,
    num_evals=num_batches,
    eval_op=list(metrics_to_updates.values()),
    max_number_of_evaluations=num_eval_iters,
    eval_interval_secs=FLAGS.eval_interval_secs)
```

Replace this call with:

```
slim.evaluation.evaluate_once(
    master=FLAGS.master,
    # Must be a checkpoint prefix such as .../model.ckpt, not a directory.
    checkpoint_path=FLAGS.checkpoint_dir,
    logdir=FLAGS.eval_logdir,
    num_evals=num_batches,
    # Keep eval_op so the metric update ops actually run.
    eval_op=list(metrics_to_updates.values()))
```

Then you can run eval.py at any time after you have trained your model.

In vis.py at line 280:

```
last_checkpoint = slim.evaluation.wait_for_new_checkpoint(
    FLAGS.checkpoint_dir, last_checkpoint)
```

you just need to change it to:

```
last_checkpoint = FLAGS.checkpoint_dir
```

When you run vis.py, you have to set FLAGS.checkpoint_dir to the specific model name, such as ./datasets/cityscapes/exp/train_on_train_set/train/model.ckpt, and then you will get the segmented images.
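
To illustrate what "the specific model name" means here, the following is a minimal sketch, not part of deeplab. It assumes the usual on-disk layout where each checkpoint has a `<prefix>.index` file; the helper name `find_checkpoint_prefixes` is made up for this example. It lists the prefixes in a directory so you can pass one of them as the checkpoint path.

```python
import glob
import os


def find_checkpoint_prefixes(checkpoint_dir):
    """Return sorted checkpoint prefixes (paths without the .index suffix).

    Hypothetical helper: it only looks at file names and never loads a model.
    """
    pattern = os.path.join(glob.escape(checkpoint_dir), "*.index")
    return sorted(path[:-len(".index")] for path in glob.glob(pattern))
```

For a directory holding model.ckpt.index and model.ckpt.data-00000-of-00001, this returns a single entry ending in model.ckpt, which is the value to use instead of the directory.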

heidongxianhau on 5 Jul 2018

👍16 ❤6 🎉6

All 19 comments

Used TensorFlow CPU version 1.8.0

chowkamlee81 on 2 Jul 2018

Yep, it's an evaluation loop, and it is waiting for a new checkpoint to evaluate. It's not a bug, but a feature. You can run train.py and eval.py simultaneously: train.py will produce the newest model file, and eval.py will monitor that directory and evaluate the newest model on the val set.
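
The waiting behaviour can be illustrated with a simplified sketch. This is not TensorFlow's implementation: `read_latest_checkpoint` and `wait_for_new_checkpoint` below are hypothetical stand-ins that only mimic the idea, and the `poll_secs` and `timeout_secs` parameters are assumptions of this example. The loop reads the `checkpoint` state file in the directory and keeps sleeping until that file names a checkpoint different from the last one seen. If nothing writes new checkpoints and there is no timeout, the wait never ends.

```python
import os
import re
import time


def read_latest_checkpoint(checkpoint_dir):
    """Return the path named by model_checkpoint_path in the state file, or None."""
    state_file = os.path.join(checkpoint_dir, "checkpoint")
    if not os.path.isfile(state_file):
        return None
    with open(state_file) as f:
        for line in f:
            match = re.match(r'\s*model_checkpoint_path:\s*"(.*)"\s*$', line)
            if match:
                return match.group(1)
    return None


def wait_for_new_checkpoint(checkpoint_dir, last_checkpoint=None,
                            poll_secs=1.0, timeout_secs=None):
    """Poll until the state file names a checkpoint other than last_checkpoint.

    Returns the new checkpoint path, or None if timeout_secs elapses first.
    With timeout_secs=None this polls forever, like the loop in the issue.
    """
    deadline = None if timeout_secs is None else time.time() + timeout_secs
    while True:
        latest = read_latest_checkpoint(checkpoint_dir)
        if latest is not None and latest != last_checkpoint:
            return latest
        if deadline is not None and time.time() >= deadline:
            return None
        time.sleep(poll_secs)
```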

tt-leader on 3 Jul 2018

👍3

Hi, thanks for your kind reply.
I don't have enough GPUs to train the model.
Hence I downloaded https://download.tensorflow.org/models/deeplabv3_cityscapes_train_2018_02_06.tar.gz,
which contains:
----frozen_inference_graph.pb
----model.ckpt.data-00000-of-00001
----model.ckpt.index
I guess these are pretrained models. Please correct me if I am mistaken.

Now I want to run inference on each image to see the results. Kindly let me know how to go ahead.
Regards
Chowkam

chowkamlee81 on 3 Jul 2018

eval.py evaluates your model on the val data split. Use vis.py to run inference on images. I recommend having a glance at the crash course.

tt-leader on 3 Jul 2018

👎4

I am just a newbie to TensorFlow. Thanks for pointing me to the course you mentioned; I will definitely go through it.
From the model files in the download linked above,
frozen_inference_graph.pb, model.ckpt.data-00000-of-00001, model.ckpt.index,
I tried to run the evaluation using eval.py, but it goes into an infinite loop: "Waiting for new checkpoint at /home/chowkam/deeplabv3_cityscapes_train"...
As I mentioned earlier, I don't have enough GPUs to train; I just want to evaluate using the pretrained models. Do you think I am going in the right direction? Kindly suggest.

chowkamlee81 on 3 Jul 2018

If you want to use the official checkpoint for evaluation, you should add a file named checkpoint, without any file extension, to the model directory.
The content of the checkpoint file is as follows:

model_checkpoint_path: "/foo/bar/absolute/path/to/deeplabv3_cityscapes_train/model.ckpt"
all_model_checkpoint_paths: "/foo/bar/absolute/path/to/deeplabv3_cityscapes_train/model.ckpt"
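
If you prefer not to create the file by hand, here is a minimal sketch in plain Python that writes the same two lines. The function name `write_checkpoint_state` and the default prefix name `model.ckpt` are assumptions of this example, not part of TensorFlow or deeplab.

```python
import os


def write_checkpoint_state(checkpoint_dir, prefix_name="model.ckpt"):
    """Write a `checkpoint` state file pointing at checkpoint_dir/prefix_name.

    The path written is absolute, as in the example above.
    """
    prefix = os.path.join(os.path.abspath(checkpoint_dir), prefix_name)
    state_file = os.path.join(checkpoint_dir, "checkpoint")
    with open(state_file, "w") as f:
        f.write('model_checkpoint_path: "{}"\n'.format(prefix))
        f.write('all_model_checkpoint_paths: "{}"\n'.format(prefix))
    return state_file
```

For the directory in this issue, the call would be `write_checkpoint_state("/home/chowkam/deeplabv3_cityscapes_train")`.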

tt-leader on 3 Jul 2018

👍10 ❤5 🚀4

I got model.ckpt.data-00000-of-00001 from the link https://download.tensorflow.org/models/deeplabv3_cityscapes_train_2018_02_06.tar.gz and renamed model.ckpt.data-00000-of-00001 to just model.ckpt,
and executed:
python deeplab/eval.py --logtostderr --eval_split="val" --model_variant="xception_65" --atrous_rates=6 --atrous_rates=12 --atrous_rates=18 --output_stride=16 --decoder_output_stride=4 --eval_crop_size=1025 --eval_crop_size=2049 --dataset="cityscapes" --checkpoint_dir="/home/chowkam/deeplabv3_cityscapes_train/model.ckpt" --eval_logdir="/home/chowkam/cityscapes/leftImg8bit/val" --dataset_dir="/home/chowkam/cityscapes/tfrecord".

After executing this, I got into an infinite loop with "Waiting for new checkpoint at /home/chowkam/deeplabv3_cityscapes_train". Kindly note that I am not training, only evaluating the pretrained model downloaded from the link https://download.tensorflow.org/models/deeplabv3_cityscapes_train_2018_02_06.tar.gz.

Kindly suggest, please.

chowkamlee81 on 3 Jul 2018

DO NOT rename the model file. Just leave the files as model.ckpt.data-00000-of-00001 and model.ckpt.index. But in the checkpoint file, you should point to model.ckpt, not to model.ckpt.index or model.ckpt.data-00000-of-00001.
Read Save and Restore for more detail.
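
To make the point concrete: model.ckpt is a prefix shared by the .index and .data-* files, not a file on disk. Below is a minimal sketch that checks whether a prefix has both parts; the helper name `checkpoint_prefix_is_complete` is made up for illustration and is not a TensorFlow function.

```python
import glob
import os


def checkpoint_prefix_is_complete(prefix):
    """True if prefix.index and at least one prefix.data-* shard exist."""
    has_index = os.path.isfile(prefix + ".index")
    has_data = bool(glob.glob(glob.escape(prefix) + ".data-*"))
    return has_index and has_data
```

With the untouched download, the prefix ending in model.ckpt passes this check. After renaming the data file to model.ckpt, it no longer does, because the .data-* shard is gone.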

tt-leader on 3 Jul 2018

👍4

Of the 3 files
frozen_inference_graph.pb
model.ckpt.data-00000-of-00001
model.ckpt.index
none is named model.ckpt. Is it possible to run the evaluation on the Cityscapes dataset using deeplabv3 with the xception_65 model? Kindly let me know how to go ahead, please.

chowkamlee81 on 3 Jul 2018

@tt-leader, can you kindly help, please?

chowkamlee81 on 3 Jul 2018

I created a file called checkpoint and added model_checkpoint_path and all_model_checkpoint_paths.

I am still getting the same infinite loop with "Waiting for new checkpoint at /home/chidanand/TensorFlow/models-master/research/deeplab/datasets/cityscapes/exp/train_on_train_set/train/model.ckpt".

Kindly help, please.

chowkamlee81 on 3 Jul 2018

Closing as this is resolved

wt-huang on 3 Nov 2018

@tt-leader @chowkamlee81 @heidongxianhau

I used the official model deeplabv3_cityscapes_train/model.ckpt and my fine-tuned model train/model.ckpt-40000 to evaluate on the Cityscapes val data, but I get a very low mIoU and I don't know why.
Is there anything I have overlooked? Has anyone seen a similar result?
Please help me, thanks!

INFO:tensorflow:Restoring parameters from /home/rjw/tf-models/research/deeplab/datasets/cityscapes/exp/train_on_train_set/train/model.ckpt-40000
miou_1.0[0.478293568]
INFO:tensorflow:Restoring parameters from /home/rjw/tf-models/research/deeplab/pretrain_models/deeplabv3_cityscapes_train/model.ckpt
miou_1.0[0.496331513]
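
For readers unsure what the miou value measures, here is a sketch of the usual mean intersection-over-union definition, computed from a confusion matrix in plain Python. It is an illustration only, not the code eval.py runs, and the function name `mean_iou` and the `confusion[true][pred]` layout are assumptions of this example.

```python
def mean_iou(confusion):
    """Mean IoU from a square confusion matrix indexed as confusion[true][pred].

    Per class, IoU = TP / (TP + FP + FN). Classes that never occur in either
    the labels or the predictions are skipped when averaging.
    """
    num_classes = len(confusion)
    ious = []
    for c in range(num_classes):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(num_classes)) - tp
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0
```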

ranjiewwen on 22 Dec 2018

Just add:
--max_number_of_iterations=1 when invoking vis.py
and
--max_number_of_evaluations=1 when invoking eval.py

No other changes needed.

manutom on 19 Feb 2019

👎7 👍5

Trying to train with a new dataset:

In vis.py, based on the above comment, we need to change this line to visualize results at any time:

last_checkpoint = slim.evaluation.wait_for_new_checkpoint(FLAGS.checkpoint_dir, last_checkpoint)

Has this been changed to a checkpoint iterator now?

Do we need to change it to something like this?

checkpoints_iterator = [FLAGS.checkpoint_dir]

SSaishruthi on 14 Mar 2019

heidongxianhau

The new code of eval.py and vis.py is different. Can you have a look and tell us how to resolve this? last_checkpoint no longer exists.

RajatGarg45 on 4 Jun 2019

👍5

@chowkamlee81
Hello, I ran into the same problem as you. Have you solved it yet? If yes, can you share what I should do?
Thank you so much.

kiki911 on 7 May 2020

I'm using TF2, and running both train and eval (passing the --checkpoint_dir flag to model_main_tf2.py) simultaneously works. I know it adds memory overhead, but it is the only solution I could find without changing the source code.

However, notice that by default training saves a checkpoint every 1000 steps, and the evaluation process sleeps for 300 seconds after each evaluation of a new checkpoint it finds. In other words, the evaluating process (the one you called with the --checkpoint_dir flag) makes predictions using the newest checkpoint it can find, and after reporting the results it waits for 5 minutes.

Neither default seems useful to me, and you can change them. For training, pass the --checkpoint_every_n=5 flag so that it saves a checkpoint every 5 steps. For the eval part you have to change the source code: go to the models/research/model_main_tf2.py file and change the wait_interval=300 parameter to however many seconds you want.

Remember that you have to uninstall the object-detection module and reinstall it to use the new code.

#!/bin/bash

# Remove the previously installed package.
pip uninstall object-detection
# Recompile the protobuf definitions.
protoc object_detection/protos/*.proto --python_out=.
# Install TensorFlow Object Detection API.
cp object_detection/packages/tf2/setup.py .
python -m pip install --use-feature=2020-resolver .