Models: Stuck Waiting for new Checkpoint

Created on 13 Jun 2018 · 8 Comments · Source: tensorflow/models

System information

  • What is the top-level directory of the model you are using: deeplab
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux 3.10.0-693.11.6.el7.x86_64
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 1.8.0
  • Bazel version (if compiling from source):
  • CUDA/cuDNN version: cuda/9.0.176 cudnn/7.0
  • GPU model and memory: (not sure, running on university cluster)
  • Exact command to reproduce:

DATASET="./datasets/tfrecord"

TRAIN_LOGDIR="./datasets/exp/train"
TRAIN_LOGDIR="./datasets/exp/eval"

CKPT="./xception/model.ckpt"

python ./eval.py \
--logtostderr \
--eval_split="val" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--eval_crop_size=256 \
--eval_crop_size=256 \
--checkpoint_dir="${TRAIN_LOGDIR}" \
--eval_logdir="${EVAL_LOGDIR}" \
--dataset_dir="${DATASET}" \
--max_number_of_evaluations=1

Describe the problem

I am using a custom dataset from a project. We have 960 JPEG training images with corresponding PNG masks, plus 180 validation image/mask pairs. There are two classes, and all the PNG masks have been converted to binary label images (verified in MATLAB that every mask contains only 0s and 1s).
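
For reference, the same check can be done in Python before export (a minimal sketch; the mask directory path is a placeholder):

import glob
import numpy as np
from PIL import Image

# Hypothetical location of the PNG label masks; adjust to your layout.
for path in glob.glob('./datasets/masks/*.png'):
    values = np.unique(np.array(Image.open(path)))
    # For two classes the masks should contain only 0 and 1
    # (plus 255 if an ignore region is used).
    assert set(values.tolist()) <= {0, 1, 255}, (path, values)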

I exported these images to TFRecord using the scripts in the datasets folder, although I had to hardcode the following to get the expected output:

FLAGS.image_format = "jpg"
FLAGS.label_format = "png"

I then trained the model using:

DATASET="./datasets/tfrecord"

TRAIN_LOGDIR="./datasets/exp/train"

CKPT="./xception/model.ckpt"

NUM_ITERATIONS=20
python ./train.py \
--logtostderr \
--train_split="train" \
--model_variant="xception_65" \
--output_stride=16 \
--train_crop_size=256 \
--train_crop_size=256 \
--train_batch_size=4 \
--training_number_of_steps="${NUM_ITERATIONS}" \
--tf_initial_checkpoint="${CKPT}" \
--fine_tune_batch_norm=true \
--train_logdir="${TRAIN_LOGDIR}" \
--dataset_dir="${DATASET}"

The training output is what I expected:

INFO:tensorflow:Restoring parameters from ./xception/model.ckpt
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path ./datasets/exp/train/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 0.
INFO:tensorflow:global step 10: loss = 3.3084 (6.827 sec/step)
INFO:tensorflow:global step 20: loss = 3.2449 (6.885 sec/step)
INFO:tensorflow:Stopping Training.
INFO:tensorflow:Finished training! Saving model to disk.

The issue arises when I try to evaluate the model.

I added the following lines to the segmentation_dataset.py file:

_SOYBEAN_INFORMATION = DatasetDescriptor(
    splits_to_sizes={
        'train': 960,
        'val': 180,
    },
    num_classes=2,
    ignore_label=255,
)

_DATASETS_INFORMATION = {
    'cityscapes': _CITYSCAPES_INFORMATION,
    'pascal_voc_seg': _PASCAL_VOC_SEG_INFORMATION,
    'ade20k': _ADE20K_INFORMATION,
    'soybean': _SOYBEAN_INFORMATION,
}

And I changed the eval.py dataset setting here:

flags.DEFINE_string('dataset', 'soybean',
                    'Name of the segmentation dataset.')
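
Note that train.py defines the same dataset flag (defaulting to 'pascal_voc_seg'). For the trained checkpoint and the evaluation to agree on num_classes, it should also point at the new descriptor, either by passing --dataset="soybean" on the command line or by the equivalent edit sketched here:

# In train.py, mirroring the eval.py change above:
flags.DEFINE_string('dataset', 'soybean',
                    'Name of the segmentation dataset.')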

When I run the evaluation script, it just hangs waiting for a checkpoint:

DATASET="./datasets/tfrecord"

TRAIN_LOGDIR="./datasets/exp/train"
TRAIN_LOGDIR="./datasets/exp/eval"

CKPT="./xception/model.ckpt"

python ./eval.py \
--logtostderr \
--eval_split="val" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--eval_crop_size=256 \
--eval_crop_size=256 \
--checkpoint_dir="${TRAIN_LOGDIR}" \
--eval_logdir="${EVAL_LOGDIR}" \
--dataset_dir="${DATASET}" \
--max_number_of_evaluations=1

Error

I get the following terminal output when I run the script above, and it simply hangs there waiting.

INFO:tensorflow:Evaluating on val set
INFO:tensorflow:Performing single-scale test.
INFO:tensorflow:Eval num images 180
INFO:tensorflow:Eval batch size 1 and num batch 180
INFO:tensorflow:Waiting for new checkpoint at ./datasets/exp/eval
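
The last log line shows where the evaluator is looking for a checkpoint. A quick way to confirm whether the directory passed as --checkpoint_dir actually contains one (a minimal sketch for TF 1.x):

import tensorflow as tf

# The path printed in the log above; because TRAIN_LOGDIR was reassigned to
# the eval directory in the script, there is no checkpoint here.
print(tf.train.latest_checkpoint('./datasets/exp/eval'))   # None -> eval.py keeps waiting

# The directory the training run actually wrote to.
print(tf.train.latest_checkpoint('./datasets/exp/train'))  # e.g. .../model.ckpt-20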

Question

What is not set up correctly for the evaluation to run, given that training (seemingly) completes correctly with a fairly low loss value?

I took most of the script from the example files in the deeplab directory.

Most helpful comment

--checkpoint_dir expects a directory, not a file.

All 8 comments

Try lowering num_examples in your config file; it's probably not frozen, just taking far too long to run. If you're evaluating on a GPU, you can check nvidia-smi to see that it's still doing something.

The issue was a typo in the evaluation script: the training checkpoint directory was not set correctly (the second assignment overwrote TRAIN_LOGDIR with the eval path instead of defining EVAL_LOGDIR, so --checkpoint_dir pointed at an empty directory).

@kekeller
Hello, I have three files in checkpoints_dir: frozen_inference_graph.pb, model.ckpt.data-00000-of-00001, and model.ckpt.index, and I ran into the same problem. What should I do to load the checkpoint? Any help would be much appreciated.
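
One possible cause in this situation, not confirmed in the thread: if the directory holds only model.ckpt.data-*/model.ckpt.index but no "checkpoint" state file, the evaluator's directory scan finds nothing. A minimal sketch of writing that state file manually:

import tensorflow as tf

# Hypothetical directory holding model.ckpt.data-00000-of-00001 and
# model.ckpt.index but no "checkpoint" state file.
ckpt_dir = './checkpoints_dir'

# Register "model.ckpt" as the latest checkpoint so tf.train.latest_checkpoint
# (and therefore eval.py) can find it.
tf.train.update_checkpoint_state(ckpt_dir, ckpt_dir + '/model.ckpt')
print(tf.train.latest_checkpoint(ckpt_dir))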

@lizleo My issue was caused by a typo. Are you sure your evaluation script is looking at the correct directories? Check that it's reading from the training directory and writing to the eval directory. See local_test.sh for an example.

@lizleo,

I am stuck with the same problem, since I am also working with the provided pretrained models. Please help.

@chowkamlee81

Did you add the --max_number_of_evaluations=1 parameter to your eval command?

https://github.com/tensorflow/models/issues/6275
I switched to the https://github.com/tensorflow/models/tree/r1.12.0 branch and it worked.
If you get "No module named deeplab" on Windows, try adding

import sys
sys.path.append("D:/Deeplab12/research/")  # replace with the path to your own research directory

at the top of train.py, eval.py and vis.py.

--checkpoint_dir expects a directory, not a file.

