Models: Stuck Waiting for new Checkpoint in deeplab PASCAL VOC 2012

Created on 27 Feb 2019 · 16 comments · Source: tensorflow/models

System information

  • What is the top-level directory of the model you are using:
    deeplab
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
    in local_test.sh
    python "${WORK_DIR}"/train.py \
    --logtostderr \
    --train_split="trainval" \
    --model_variant="xception_65" \
    --atrous_rates=6 \
    --atrous_rates=12 \
    --atrous_rates=18 \
    --output_stride=16 \
    --decoder_output_stride=4 \
    --train_crop_size=513 \
    --train_crop_size=513 \
    --train_batch_size=4 \
    .
    .
    .
    I changed --train_batch_size=4 to --train_batch_size=1 because of a ResourceExhaustedError;
    my GPU simply doesn't have enough memory.
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
    Ubuntu 18.04
  • TensorFlow installed from (source or binary):
    pip
  • TensorFlow version (use command below):
    tensorflow-gpu 1.11.0, python 2.7.15rc1
  • Bazel version (if compiling from source):
  • CUDA/cuDNN version:
    CUDA: 9.0, cuDNN: 7.3.1
  • GPU model and memory:
    GeForce GTX 1060, 6 GB
  • Exact command to reproduce:

Describe the problem

.
.
.
INFO:tensorflow:Visualizing batch 1448
INFO:tensorflow:Visualizing batch 1449
INFO:tensorflow:Visualizing batch 1450
INFO:tensorflow:Finished visualization at 2019-02-27-05:00:19
INFO:tensorflow:Waiting for new checkpoint at /home/jimmyjim/GitHub/models/research/deeplab/datasets/pascal_voc_seg/exp/train_on_trainval_set/train
^CTraceback (most recent call last):
File "/home/jimmyjim/GitHub/models/research/deeplab/vis.py", line 312, in
tf.app.run()
File "/home/jimmyjim/.local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/home/jimmyjim/GitHub/models/research/deeplab/vis.py", line 267, in main
for checkpoint_path in checkpoints_iterator:
File "/home/jimmyjim/.local/lib/python2.7/site-packages/tensorflow/contrib/training/python/training/evaluation.py", line 247, in checkpoints_iterator
checkpoint_dir, checkpoint_path, timeout=timeout)
File "/home/jimmyjim/.local/lib/python2.7/site-packages/tensorflow/contrib/training/python/training/evaluation.py", line 196, in wait_for_new_checkpoint
time.sleep(seconds_to_sleep)
KeyboardInterrupt

When local_test.sh (just the example PASCAL VOC 2012 pipeline) runs vis.py, it gets stuck after this line:
INFO:tensorflow:Waiting for new checkpoint at /home/jimmyjim/GitHub/models/research/deeplab/datasets/pascal_voc_seg/exp/train_on_trainval_set/train

How can I fix this?

Labels: research, awaiting model gardener

Most helpful comment

This behaviour, i.e. "INFO:tensorflow:Waiting for new checkpoint at ...", occurs when there is no
"checkpoint" file in the directory specified by --checkpoint_dir for vis.py.
Adding a file named "checkpoint" with the following contents:

model_checkpoint_path: "model.ckpt"
all_model_checkpoint_paths: "model.ckpt"

fixes the issue.
When downloading the xception65_coco_voc_trainval model from https://github.com/tensorflow/models/blob/master/research/deeplab/g3doc/model_zoo.md, one should add the checkpoint file to the extracted directory. The model_checkpoint_path and all_model_checkpoint_paths values are the ckpt filenames without the extension.

All 16 comments

I am having the same issue on a multi-GPU machine with 4 Tesla V100s,
specifically SageMaker's ml.p3.8xlarge.

System information

  • What is the top-level directory of the model you are using:
    deeplab
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
    in deeplab/train.py, for Python 3 compatibility:
    xrange = range
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
    Amazon Linux 4.14.77-70.82.amzn1.x86_64 GNU/Linux
  • TensorFlow installed from (source or binary):
    amazon's default conda env on python36
  • TensorFlow version (use command below):
    tensorflow-gpu v1.10.0-4-g0e53c66f33 1.10.0, Python 3.6.5
  • Bazel version (if compiling from source):
  • CUDA/cuDNN version:
    Cuda compilation tools, release 9.0, V9.0.176, cuDNN:7.1.4
  • GPU model and memory:
    4x Tesla V100-SXM2
  • Exact command to reproduce:

Describe the problem

.
.
.
INFO:tensorflow:Visualizing batch 1383
INFO:tensorflow:Visualizing batch 1384
INFO:tensorflow:Visualizing batch 1385
.
.
.
INFO:tensorflow:Visualizing batch 1448
INFO:tensorflow:Visualizing batch 1449
INFO:tensorflow:Visualizing batch 1450
INFO:tensorflow:Finished visualization at 2019-02-27-12:11:55
INFO:tensorflow:Waiting for new checkpoint at /home/ec2-user/SageMaker/models/research/deeplab/datasets/pascal_voc_seg/exp/train_on_trainval_set/train
^CTraceback (most recent call last):
File "/home/ec2-user/SageMaker/models/research/deeplab/vis.py", line 312, in
tf.app.run()
File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/home/ec2-user/SageMaker/models/research/deeplab/vis.py", line 267, in main
for checkpoint_path in checkpoints_iterator:
File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/contrib/training/python/training/evaluation.py", line 247, in checkpoints_iterator
checkpoint_dir, checkpoint_path, timeout=timeout)
File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/contrib/training/python/training/evaluation.py", line 196, in wait_for_new_checkpoint
time.sleep(seconds_to_sleep)
KeyboardInterrupt


Hello, it's not a bug, it's a feature!
For vis.py , add param --max_number_of_iterations=1 .
You can find the description at this line.

Hello, it's not a bug, it's a feature!
For vis.py , add param --max_number_of_iterations=1 .
You can find the description at this line.

I am running this from research/deeplab/local_test.sh, and --max_number_of_iterations=1 is already set in there. The error is still there.

I went to check the folder for a checkpoint and it looks like this:
sh-4.2$ ls -alh /home/ec2-user/SageMaker/models/research/deeplab/datasets/pascal_voc_seg/exp/train_on_trainval_set/train

total 1.4G
drwxrwxr-x 2 ec2-user ec2-user 4.0K Feb 27 16:07 .
drwxrwxr-x 6 ec2-user ec2-user 4.0K Feb 27 12:06 ..
-rw-rw-r-- 1 ec2-user ec2-user 739 Feb 27 16:07 checkpoint
-rw-rw-r-- 1 ec2-user ec2-user 36M Feb 27 12:08 events.out.tfevents.1551269221.ip-172-16-5-96
-rw-rw-r-- 1 ec2-user ec2-user 33M Feb 27 13:10 events.out.tfevents.1551272969.ip-172-16-5-96
-rw-rw-r-- 1 ec2-user ec2-user 33M Feb 27 16:07 events.out.tfevents.1551283556.ip-172-16-5-96
-rw-rw-r-- 1 ec2-user ec2-user 16M Feb 27 16:07 graph.pbtxt
-rw-rw-r-- 1 ec2-user ec2-user 8 Feb 27 12:07 model.ckpt-0.data-00000-of-00002
-rw-rw-r-- 1 ec2-user ec2-user 315M Feb 27 12:07 model.ckpt-0.data-00001-of-00002
-rw-rw-r-- 1 ec2-user ec2-user 55K Feb 27 12:07 model.ckpt-0.index
-rw-rw-r-- 1 ec2-user ec2-user 10M Feb 27 12:07 model.ckpt-0.meta
-rw-rw-r-- 1 ec2-user ec2-user 8 Feb 27 13:09 model.ckpt-10.data-00000-of-00002
-rw-rw-r-- 1 ec2-user ec2-user 315M Feb 27 13:09 model.ckpt-10.data-00001-of-00002
-rw-rw-r-- 1 ec2-user ec2-user 55K Feb 27 13:09 model.ckpt-10.index
-rw-rw-r-- 1 ec2-user ec2-user 9.3M Feb 27 13:09 model.ckpt-10.meta
-rw-rw-r-- 1 ec2-user ec2-user 8 Feb 27 16:07 model.ckpt-20.data-00000-of-00002
-rw-rw-r-- 1 ec2-user ec2-user 315M Feb 27 16:07 model.ckpt-20.data-00001-of-00002
-rw-rw-r-- 1 ec2-user ec2-user 55K Feb 27 16:07 model.ckpt-20.index
-rw-rw-r-- 1 ec2-user ec2-user 9.3M Feb 27 16:07 model.ckpt-20.meta
-rw-rw-r-- 1 ec2-user ec2-user 8 Feb 27 16:07 model.ckpt-30.data-00000-of-00002
-rw-rw-r-- 1 ec2-user ec2-user 315M Feb 27 16:07 model.ckpt-30.data-00001-of-00002
-rw-rw-r-- 1 ec2-user ec2-user 55K Feb 27 16:07 model.ckpt-30.index
-rw-rw-r-- 1 ec2-user ec2-user 9.3M Feb 27 16:07 model.ckpt-30.meta

sh-4.2$ cat /home/ec2-user/SageMaker/models/research/deeplab/datasets/pascal_voc_seg/exp/train_on_trainval_set/train/checkpoint

model_checkpoint_path: "/home/ec2-user/SageMaker/models/research/deeplab/datasets/pascal_voc_seg/exp/train_on_trainval_set/train/model.ckpt-30"
all_model_checkpoint_paths: "/home/ec2-user/SageMaker/models/research/deeplab/datasets/pascal_voc_seg/exp/train_on_trainval_set/train/model.ckpt-0"
all_model_checkpoint_paths: "/home/ec2-user/SageMaker/models/research/deeplab/datasets/pascal_voc_seg/exp/train_on_trainval_set/train/model.ckpt-10"
all_model_checkpoint_paths: "/home/ec2-user/SageMaker/models/research/deeplab/datasets/pascal_voc_seg/exp/train_on_trainval_set/train/model.ckpt-20"
all_model_checkpoint_paths: "/home/ec2-user/SageMaker/models/research/deeplab/datasets/pascal_voc_seg/exp/train_on_trainval_set/train/model.ckpt-30"
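
For what it's worth, a quick diagnostic sketch for checking which checkpoint the TF tooling will actually pick up from that "checkpoint" file (same train directory as in the listing above; not part of the original scripts):

    import tensorflow as tf

    # Directory containing the "checkpoint" state file and the model.ckpt-* files.
    train_dir = ('/home/ec2-user/SageMaker/models/research/deeplab/datasets/'
                 'pascal_voc_seg/exp/train_on_trainval_set/train')

    # Should print the most recent checkpoint prefix, e.g. .../model.ckpt-30
    print(tf.train.latest_checkpoint(train_dir))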

@aspenstarss
I also run from local_test.sh.
In local_test.sh, the flag --max_number_of_iterations=1 is already passed to vis.py.
But it still got stuck.

@JimHeo
I have tried the https://github.com/tensorflow/models/tree/r1.12.0 branch and it worked.
Try r.11

@sulydeni
Got it!
The r11 branch doesn't have a research directory,
so I also tried r1.12.0 and it worked for me.

I still don't know why this issue occurs when I run from the master branch,
but it's not important now...

By the way, thanks!

I think there might be a mistake in vis.py, around line 268:
    if max_num_iteration > 0 and num_iteration > max_num_iteration:
      break
    num_iteration += 1

Notice that num_iteration starts at 0, so when we set max_num_iteration = 1 the loop actually runs for 2 iterations before it can break.
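
To make the counting concrete, here is a tiny standalone sketch of just that loop-counting logic (the checkpoint list is a made-up stand-in for checkpoints_iterator, not the real API):

    # Standalone sketch of the master-branch counting logic.
    def visualization_runs(max_num_iteration, checkpoint_paths):
        num_iteration = 0
        runs = 0
        for _checkpoint_path in checkpoint_paths:
            if max_num_iteration > 0 and num_iteration > max_num_iteration:
                break
            num_iteration += 1
            runs += 1  # one full visualization pass happens here
        return runs

    # With --max_number_of_iterations=1 and several checkpoints available,
    # two passes run instead of one:
    print(visualization_runs(1, ["model.ckpt-0", "model.ckpt-10", "model.ckpt-20"]))  # -> 2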

@summer-wing
I think you're right.
And the vis.py code in the r1.12.0 branch is written like this:

# Loop to visualize the results when new checkpoint is created.
    num_iters = 0
    while (FLAGS.max_number_of_iterations <= 0 or
           num_iters < FLAGS.max_number_of_iterations):
      num_iters += 1
      last_checkpoint = slim.evaluation.wait_for_new_checkpoint(
          FLAGS.checkpoint_dir, last_checkpoint)
      start = time.time()
      tf.logging.info(
          'Starting visualization at ' + time.strftime('%Y-%m-%d-%H:%M:%S',
                                                       time.gmtime()))
      tf.logging.info('Visualizing with model %s', last_checkpoint)

      with sv.managed_session(FLAGS.master,
                              start_standard_services=False) as sess:
        sv.start_queue_runners(sess)
        sv.saver.restore(sess, last_checkpoint)

        image_id_offset = 0
        for batch in range(num_batches):
          tf.logging.info('Visualizing batch %d / %d', batch + 1, num_batches)
          _process_batch(sess=sess,
                         original_images=samples[common.ORIGINAL_IMAGE],
                         semantic_predictions=predictions,
                         image_names=samples[common.IMAGE_NAME],
                         image_heights=samples[common.HEIGHT],
                         image_widths=samples[common.WIDTH],
                         image_id_offset=image_id_offset,
                         save_dir=save_dir,
                         raw_save_dir=raw_save_dir,
                         train_id_to_eval_id=train_id_to_eval_id)
          image_id_offset += FLAGS.vis_batch_size

      tf.logging.info(
          'Finished visualization at ' + time.strftime('%Y-%m-%d-%H:%M:%S',
                                                       time.gmtime()))
      time_to_next_eval = start + FLAGS.eval_interval_secs - time.time()
      if time_to_next_eval > 0:
        time.sleep(time_to_next_eval)

and the vis.py code in the master branch is written like this:

    num_iteration = 0
    max_num_iteration = FLAGS.max_number_of_iterations

    checkpoints_iterator = tf.contrib.training.checkpoints_iterator(
        FLAGS.checkpoint_dir, min_interval_secs=FLAGS.eval_interval_secs)
    for checkpoint_path in checkpoints_iterator:
      if max_num_iteration > 0 and num_iteration > max_num_iteration:
        break
      num_iteration += 1

      tf.logging.info(
          'Starting visualization at ' + time.strftime('%Y-%m-%d-%H:%M:%S',
                                                       time.gmtime()))
      tf.logging.info('Visualizing with model %s', checkpoint_path)

      scaffold = tf.train.Scaffold(init_op=tf.global_variables_initializer())
      session_creator = tf.train.ChiefSessionCreator(
          scaffold=scaffold,
          master=FLAGS.master,
          checkpoint_filename_with_path=checkpoint_path)
      with tf.train.MonitoredSession(
          session_creator=session_creator, hooks=None) as sess:
        batch = 0
        image_id_offset = 0

        while not sess.should_stop():
          tf.logging.info('Visualizing batch %d', batch + 1)
          _process_batch(sess=sess,
                         original_images=samples[common.ORIGINAL_IMAGE],
                         semantic_predictions=predictions,
                         image_names=samples[common.IMAGE_NAME],
                         image_heights=samples[common.HEIGHT],
                         image_widths=samples[common.WIDTH],
                         image_id_offset=image_id_offset,
                         save_dir=save_dir,
                         raw_save_dir=raw_save_dir,
                         train_id_to_eval_id=train_id_to_eval_id)
          image_id_offset += FLAGS.vis_batch_size
          batch += 1

      tf.logging.info(
          'Finished visualization at ' + time.strftime('%Y-%m-%d-%H:%M:%S',
                                                       time.gmtime()))

If we set --max_number_of_iterations=1 then, because of these lines:

    num_iteration = 0
    for checkpoint_path in checkpoints_iterator:
      if max_num_iteration > 0 and num_iteration > max_num_iteration:
        break
      num_iteration += 1

there will be 2 iterations.

Also, the for-loop iterates over checkpoint_path yielded by checkpoints_iterator, not over num_iteration,
so after the last visualization pass the iterator blocks in the wait state ("INFO:tensorflow:Waiting for new checkpoint at ...") before the break condition can ever be checked again.

So, I modified this part of vis.py slightly in my local copy:

    num_iteration = 0
    max_num_iteration = FLAGS.max_number_of_iterations

    checkpoints_iterator = tf.contrib.training.checkpoints_iterator(
        FLAGS.checkpoint_dir, min_interval_secs=FLAGS.eval_interval_secs)
    for checkpoint_path in checkpoints_iterator:
      num_iteration += 1
      tf.logging.info(
          'Starting visualization at ' + time.strftime('%Y-%m-%d-%H:%M:%S',
                                                       time.gmtime()))
      tf.logging.info('Visualizing with model %s', checkpoint_path)

      scaffold = tf.train.Scaffold(init_op=tf.global_variables_initializer())
      session_creator = tf.train.ChiefSessionCreator(
          scaffold=scaffold,
          master=FLAGS.master,
          checkpoint_filename_with_path=checkpoint_path)
      with tf.train.MonitoredSession(
          session_creator=session_creator, hooks=None) as sess:
        batch = 0
        image_id_offset = 0

        while not sess.should_stop():
          tf.logging.info('Visualizing batch %d', batch + 1)
          _process_batch(sess=sess,
                         original_images=samples[common.ORIGINAL_IMAGE],
                         semantic_predictions=predictions,
                         image_names=samples[common.IMAGE_NAME],
                         image_heights=samples[common.HEIGHT],
                         image_widths=samples[common.WIDTH],
                         image_id_offset=image_id_offset,
                         save_dir=save_dir,
                         raw_save_dir=raw_save_dir,
                         train_id_to_eval_id=train_id_to_eval_id)
          image_id_offset += FLAGS.vis_batch_size
          batch += 1

      tf.logging.info(
          'Finished visualization at ' + time.strftime('%Y-%m-%d-%H:%M:%S',
                                                       time.gmtime()))
      if max_num_iteration > 0 and num_iteration >= max_num_iteration:
        break

Then it worked.
I think the code needs to be updated.
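
As an aside: another way to avoid the indefinite wait, without touching the loop body, might be to give checkpoints_iterator a timeout so it stops yielding instead of sleeping forever. This is only a rough sketch against the master-branch code quoted above (tf and FLAGS come from vis.py itself; the 300-second value is an arbitrary example), not what the pull request does and not something tested against this exact script:

    # Hypothetical variant of the iterator construction in vis.py:
    # if no new checkpoint appears within `timeout` seconds, the iterator
    # returns and the for-loop ends on its own.
    checkpoints_iterator = tf.contrib.training.checkpoints_iterator(
        FLAGS.checkpoint_dir,
        min_interval_secs=FLAGS.eval_interval_secs,
        timeout=300)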

Thank you, JimHeo and summer-wing, for reporting the issue.
@JimHeo Good job on fixing it. Would you like to open a pull request for this? I really appreciate your help.

Thanks!

@aquariusjay My Pleasure :)

I am facing a similar issue with eval.py

INFO:tensorflow:Waiting for new checkpoint at ./deeplab/datasets/pascal_voc_seg/VOC2012/ImageSets/Segmentation\model.ckpt

How do I resolve this?

The problem is still unsolved after @JimHeo's pull request.
I managed to get around it this way, by commenting out the checkpoint iterator and pointing checkpoint_path at the flag directly:

    # checkpoints_iterator = tf.contrib.training.checkpoints_iterator(
    #     FLAGS.checkpoint_dir, min_interval_secs=FLAGS.eval_interval_secs)
    # for checkpoint_path in checkpoints_iterator:
    checkpoint_path = FLAGS.checkpoint_dir
    num_iteration += 1
    tf.logging.info(
        'Starting visualization at ' + time.strftime('%Y-%m-%d-%H:%M:%S',
                                                     time.gmtime()))
    tf.logging.info('Visualizing with model %s', checkpoint_path)

    scaffold = tf.train.Scaffold(init_op=tf.global_variables_initializer())
    session_creator = tf.train.ChiefSessionCreator(
        scaffold=scaffold,
        master=FLAGS.master,
        checkpoint_filename_with_path=checkpoint_path)
    with tf.train.MonitoredSession(
        session_creator=session_creator, hooks=None) as sess:
      batch = 0
      image_id_offset = 0

      while not sess.should_stop():
        tf.logging.info('Visualizing batch %d', batch + 1)
        _process_batch(sess=sess,
                       original_images=samples[common.ORIGINAL_IMAGE],
                       semantic_predictions=predictions,
                       image_names=samples[common.IMAGE_NAME],
                       image_heights=samples[common.HEIGHT],
                       image_widths=samples[common.WIDTH],
                       image_id_offset=image_id_offset,
                       save_dir=save_dir,
                       raw_save_dir=raw_save_dir,
                       train_id_to_eval_id=train_id_to_eval_id)
        image_id_offset += FLAGS.vis_batch_size
        batch += 1

    tf.logging.info(
        'Finished visualization at ' + time.strftime('%Y-%m-%d-%H:%M:%S',
                                                     time.gmtime()))
    # if max_num_iteration > 0 and num_iteration >= max_num_iteration:
    #   break

This behaviour, i.e. "INFO:tensorflow:Waiting for new checkpoint at ...", occurs when there is no
"checkpoint" file in the directory specified by --checkpoint_dir for vis.py.
Adding a file named "checkpoint" with the following contents:

model_checkpoint_path: "model.ckpt"
all_model_checkpoint_paths: "model.ckpt"

fixes the issue.
When downloading the xception65_coco_voc_trainval model from https://github.com/tensorflow/models/blob/master/research/deeplab/g3doc/model_zoo.md, one should add the checkpoint file to the extracted directory. The model_checkpoint_path and all_model_checkpoint_paths values are the ckpt filenames without the extension.
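
For anyone who prefers to create that file programmatically, a minimal sketch (the directory name below is just an example; point it at wherever you extracted the pretrained model):

    import os

    # Example paths -- adjust to your extracted model directory.
    model_dir = "deeplabv3_pascal_trainval"
    ckpt_prefix = "model.ckpt"  # prefix shared by model.ckpt.index / model.ckpt.data-*

    # Write the CheckpointState text file that the checkpoint tooling looks for.
    with open(os.path.join(model_dir, "checkpoint"), "w") as f:
        f.write('model_checkpoint_path: "%s"\n' % ckpt_prefix)
        f.write('all_model_checkpoint_paths: "%s"\n' % ckpt_prefix)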

@m-nez Where can I get, or how can I generate, the checkpoint file model.ckpt from the downloaded file? Thanks.

@pkuCactus You have to extract the deeplabv3_pascal_trainval_2018_01_04.tar.gz file.
The model.ckpt in the contents of "checkpoint" refers to the model.ckpt.data-00000-of-00001 and model.ckpt.index files, which are present in the extracted deeplabv3_pascal_trainval directory.

@m-nez Thanks very much, I've resolved it according to the description.
