Models: Stuck Waiting for new Checkpoint

Created on 13 Jun 2018 · 8 Comments · Source: tensorflow/models

System information

  • What is the top-level directory of the model you are using: deeplab
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux 3.10.0-693.11.6.el7.x86_64
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 1.8.0
  • Bazel version (if compiling from source):
  • CUDA/cuDNN version: cuda/9.0.176 cudnn/7.0
  • GPU model and memory: (not sure, running on university cluster)
  • Exact command to reproduce:

DATASET="./datasets/tfrecord"

TRAIN_LOGDIR="./datasets/exp/train"
TRAIN_LOGDIR="./datasets/exp/eval"

CKPT="./xception/model.ckpt"

python ./eval.py \
--logtostderr \
--eval_split="val" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--eval_crop_size=256 \
--eval_crop_size=256 \
--checkpoint_dir="${TRAIN_LOGDIR}" \
--eval_logdir="${EVAL_LOGDIR}" \
--dataset_dir="${DATASET}" \
--max_number_of_evaluations=1

Describe the problem

I am using a custom dataset from a project. We have 960 JPEG training images with corresponding PNG masks, plus 180 validation image/mask pairs. There are two classes, and all the PNG masks have been converted to binary label images (verified in MATLAB that every mask contains only 0s and 1s).
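
For reference, the same check can be done in Python before export (a minimal sketch; the mask directory path is a placeholder):

import glob
import numpy as np
from PIL import Image

# Hypothetical location of the PNG label masks; adjust to your layout.
for path in glob.glob('./datasets/masks/*.png'):
    values = np.unique(np.array(Image.open(path)))
    # For two classes the masks should contain only 0 and 1
    # (plus 255 if an ignore region is used).
    assert set(values.tolist()) <= {0, 1, 255}, (path, values)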

I exported these images to TFRecord using the scripts in the datasets folder, although I had to hardcode the following to get the expected output:

FLAGS.image_format = "jpg"
FLAGS.label_format = "png"

I then trained the model using:

DATASET="./datasets/tfrecord"

TRAIN_LOGDIR="./datasets/exp/train"

CKPT="./xception/model.ckpt"

NUM_ITERATIONS=20
python ./train.py \
--logtostderr \
--train_split="train" \
--model_variant="xception_65" \
--output_stride=16 \
--train_crop_size=256 \
--train_crop_size=256 \
--train_batch_size=4 \
--training_number_of_steps="${NUM_ITERATIONS}" \
--tf_initial_checkpoint="${CKPT}" \
--fine_tune_batch_norm=true \
--train_logdir="${TRAIN_LOGDIR}" \
--dataset_dir="${DATASET}"

The training output is what I expected:

INFO:tensorflow:Restoring parameters from ./xception/model.ckpt
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path ./datasets/exp/train/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 0.
INFO:tensorflow:global step 10: loss = 3.3084 (6.827 sec/step)
INFO:tensorflow:global step 20: loss = 3.2449 (6.885 sec/step)
INFO:tensorflow:Stopping Training.
INFO:tensorflow:Finished training! Saving model to disk.

The issue arises when I try to evaluate the model.

I added the following lines to the segmentation_dataset.py file:

_SOYBEAN_INFORMATION = DatasetDescriptor(
    splits_to_sizes={
        'train': 960,
        'val': 180,
    },
    num_classes=2,
    ignore_label=255,
)

_DATASETS_INFORMATION = {
    'cityscapes': _CITYSCAPES_INFORMATION,
    'pascal_voc_seg': _PASCAL_VOC_SEG_INFORMATION,
    'ade20k': _ADE20K_INFORMATION,
    'soybean': _SOYBEAN_INFORMATION,
}

And I changed the eval.py dataset setting here:

flags.DEFINE_string('dataset', 'soybean',
                    'Name of the segmentation dataset.')
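
Note that train.py defines the same dataset flag (defaulting to 'pascal_voc_seg'). For the trained checkpoint and the evaluation to agree on num_classes, it should also point at the new descriptor, either by passing --dataset="soybean" on the command line or by the equivalent edit sketched here:

# In train.py, mirroring the eval.py change above:
flags.DEFINE_string('dataset', 'soybean',
                    'Name of the segmentation dataset.')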

When I run the evaluation script, it just hangs waiting for a checkpoint:

DATASET="./datasets/tfrecord"

TRAIN_LOGDIR="./datasets/exp/train"
TRAIN_LOGDIR="./datasets/exp/eval"

CKPT="./xception/model.ckpt"

python ./eval.py \
--logtostderr \
--eval_split="val" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--eval_crop_size=256 \
--eval_crop_size=256 \
--checkpoint_dir="${TRAIN_LOGDIR}" \
--eval_logdir="${EVAL_LOGDIR}" \
--dataset_dir="${DATASET}" \
--max_number_of_evaluations=1

Error

I get the following terminal output when I run the script above, and it simply hangs there waiting.

INFO:tensorflow:Evaluating on val set
INFO:tensorflow:Performing single-scale test.
INFO:tensorflow:Eval num images 180
INFO:tensorflow:Eval batch size 1 and num batch 180
INFO:tensorflow:Waiting for new checkpoint at ./datasets/exp/eval
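
The last log line shows where the evaluator is looking for a checkpoint. A quick way to confirm whether the directory passed as --checkpoint_dir actually contains one (a minimal sketch for TF 1.x):

import tensorflow as tf

# The path printed in the log above; because TRAIN_LOGDIR was reassigned to
# the eval directory in the script, there is no checkpoint here.
print(tf.train.latest_checkpoint('./datasets/exp/eval'))   # None -> eval.py keeps waiting

# The directory the training run actually wrote to.
print(tf.train.latest_checkpoint('./datasets/exp/train'))  # e.g. .../model.ckpt-20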

Question

What is not set up correctly for the evaluation to run, given that training (seemingly) completes correctly with a fairly low loss value?

I took most of the script from the example files in the deeplab directory.

Most helpful comment

--checkpoint_dir expects a directory, not a file.

All 8 comments

Try lowering num_examples in your config file; it's probably not frozen, just taking far too long to run. If you're evaluating on a GPU, you can check nvidia-smi to see that it's still doing something.

The issue was a typo in the evaluation script: the training checkpoint directory was not set correctly (the second assignment overwrote TRAIN_LOGDIR with the eval path instead of defining EVAL_LOGDIR, so --checkpoint_dir pointed at an empty directory).

@kekeller
Hello, I have three files in checkpoints_dir: frozen_inference_graph.pb, model.ckpt.data-00000-of-00001, and model.ckpt.index, and I ran into the same problem. What should I do to load the checkpoint? Any help would be much appreciated.
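
One possible cause in this situation, not confirmed in the thread: if the directory holds only model.ckpt.data-*/model.ckpt.index but no "checkpoint" state file, the evaluator's directory scan finds nothing. A minimal sketch of writing that state file manually:

import tensorflow as tf

# Hypothetical directory holding model.ckpt.data-00000-of-00001 and
# model.ckpt.index but no "checkpoint" state file.
ckpt_dir = './checkpoints_dir'

# Register "model.ckpt" as the latest checkpoint so tf.train.latest_checkpoint
# (and therefore eval.py) can find it.
tf.train.update_checkpoint_state(ckpt_dir, ckpt_dir + '/model.ckpt')
print(tf.train.latest_checkpoint(ckpt_dir))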

@lizleo My issue was caused by a typo. Are you sure your evaluation script is looking at the correct directories? Check that it's reading from the training directory and writing to the eval directory. See local_test.sh for an example.

@lizleo,

I am stuck with the same problem, since I am also working with the provided pretrained models. Please help.

@chowkamlee81

Did you add the --max_number_of_evaluations=1 parameter to your eval command?

https://github.com/tensorflow/models/issues/6275
I switched to the https://github.com/tensorflow/models/tree/r1.12.0 branch and it worked.
If you get "No module named deeplab" on Windows, try adding

import sys
sys.path.append("D:/Deeplab12/research/")  # replace with the path to your own research directory

at the top of train.py, eval.py and vis.py.

--checkpoint_dir expects a directory, not a file.

