python "${WORK_DIR}"/train.py \--logtostderr \--train_split="trainval" \--model_variant="xception_65" \--atrous_rates=6 \--atrous_rates=12 \--atrous_rates=18 \--output_stride=16 \--decoder_output_stride=4 \--train_crop_size=513 \--train_crop_size=513 \--train_batch_size=4 \...--train_batch_size=4 to --train_batch_size=1 because of ResourceExhausted Error..
.
.
INFO:tensorflow:Visualizing batch 1448
INFO:tensorflow:Visualizing batch 1449
INFO:tensorflow:Visualizing batch 1450
INFO:tensorflow:Finished visualization at 2019-02-27-05:00:19
INFO:tensorflow:Waiting for new checkpoint at /home/jimmyjim/GitHub/models/research/deeplab/datasets/pascal_voc_seg/exp/train_on_trainval_set/train
^CTraceback (most recent call last):
File "/home/jimmyjim/GitHub/models/research/deeplab/vis.py", line 312, in
tf.app.run()
File "/home/jimmyjim/.local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/home/jimmyjim/GitHub/models/research/deeplab/vis.py", line 267, in main
for checkpoint_path in checkpoints_iterator:
File "/home/jimmyjim/.local/lib/python2.7/site-packages/tensorflow/contrib/training/python/training/evaluation.py", line 247, in checkpoints_iterator
checkpoint_dir, checkpoint_path, timeout=timeout)
File "/home/jimmyjim/.local/lib/python2.7/site-packages/tensorflow/contrib/training/python/training/evaluation.py", line 196, in wait_for_new_checkpoint
time.sleep(seconds_to_sleep)
KeyboardInterrupt
When local_test.sh (just the PASCAL VOC 2012 example) runs vis.py, it gets stuck after this:
INFO:tensorflow:Waiting for new checkpoint at /home/jimmyjim/GitHub/models/research/deeplab/datasets/pascal_voc_seg/exp/train_on_trainval_set/train
How can I fix this?
I am having the same issue on a 4x Tesla V100 multi-GPU machine, specifically SageMaker's ml.p3.8xlarge.
(I set xrange = range for Python 3.)
.
.
INFO:tensorflow:Visualizing batch 1383
INFO:tensorflow:Visualizing batch 1384
INFO:tensorflow:Visualizing batch 1385
.
.
.
INFO:tensorflow:Visualizing batch 1448
INFO:tensorflow:Visualizing batch 1449
INFO:tensorflow:Visualizing batch 1450
INFO:tensorflow:Finished visualization at 2019-02-27-12:11:55
INFO:tensorflow:Waiting for new checkpoint at /home/ec2-user/SageMaker/models/research/deeplab/datasets/pascal_voc_seg/exp/train_on_trainval_set/train
^CTraceback (most recent call last):
File "/home/ec2-user/SageMaker/models/research/deeplab/vis.py", line 312, in
tf.app.run()
File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/home/ec2-user/SageMaker/models/research/deeplab/vis.py", line 267, in main
for checkpoint_path in checkpoints_iterator:
File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/contrib/training/python/training/evaluation.py", line 247, in checkpoints_iterator
checkpoint_dir, checkpoint_path, timeout=timeout)
File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/contrib/training/python/training/evaluation.py", line 196, in wait_for_new_checkpoint
time.sleep(seconds_to_sleep)
KeyboardInterrupt
Hello, it's not a bug, it's a feature!
For vis.py, add the parameter --max_number_of_iterations=1.
You can find the description at this line.
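For reference, the flag is defined in vis.py roughly like this (paraphrased; check the file for the exact help text):

# Paraphrased from vis.py: the default value 0 is nonpositive, so by default
# the loop never stops and vis.py keeps waiting for new checkpoints after the
# last available one has been visualized.
flags.DEFINE_integer('max_number_of_iterations', 0,
                     'Maximum number of visualization iterations. Will loop '
                     'indefinitely upon nonpositive values.')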
I am running this from research/deeplab/local_test.sh and --max_number_of_iterations=1 is already set there. The error is still there.
I went to check the folder for a checkpoint and it looks like this:
sh-4.2$ ls -alh /home/ec2-user/SageMaker/models/research/deeplab/datasets/pascal_voc_seg/exp/train_on_trainval_set/train
total 1.4G
drwxrwxr-x 2 ec2-user ec2-user 4.0K Feb 27 16:07 .
drwxrwxr-x 6 ec2-user ec2-user 4.0K Feb 27 12:06 ..
-rw-rw-r-- 1 ec2-user ec2-user 739 Feb 27 16:07 checkpoint
-rw-rw-r-- 1 ec2-user ec2-user 36M Feb 27 12:08 events.out.tfevents.1551269221.ip-172-16-5-96
-rw-rw-r-- 1 ec2-user ec2-user 33M Feb 27 13:10 events.out.tfevents.1551272969.ip-172-16-5-96
-rw-rw-r-- 1 ec2-user ec2-user 33M Feb 27 16:07 events.out.tfevents.1551283556.ip-172-16-5-96
-rw-rw-r-- 1 ec2-user ec2-user 16M Feb 27 16:07 graph.pbtxt
-rw-rw-r-- 1 ec2-user ec2-user 8 Feb 27 12:07 model.ckpt-0.data-00000-of-00002
-rw-rw-r-- 1 ec2-user ec2-user 315M Feb 27 12:07 model.ckpt-0.data-00001-of-00002
-rw-rw-r-- 1 ec2-user ec2-user 55K Feb 27 12:07 model.ckpt-0.index
-rw-rw-r-- 1 ec2-user ec2-user 10M Feb 27 12:07 model.ckpt-0.meta
-rw-rw-r-- 1 ec2-user ec2-user 8 Feb 27 13:09 model.ckpt-10.data-00000-of-00002
-rw-rw-r-- 1 ec2-user ec2-user 315M Feb 27 13:09 model.ckpt-10.data-00001-of-00002
-rw-rw-r-- 1 ec2-user ec2-user 55K Feb 27 13:09 model.ckpt-10.index
-rw-rw-r-- 1 ec2-user ec2-user 9.3M Feb 27 13:09 model.ckpt-10.meta
-rw-rw-r-- 1 ec2-user ec2-user 8 Feb 27 16:07 model.ckpt-20.data-00000-of-00002
-rw-rw-r-- 1 ec2-user ec2-user 315M Feb 27 16:07 model.ckpt-20.data-00001-of-00002
-rw-rw-r-- 1 ec2-user ec2-user 55K Feb 27 16:07 model.ckpt-20.index
-rw-rw-r-- 1 ec2-user ec2-user 9.3M Feb 27 16:07 model.ckpt-20.meta
-rw-rw-r-- 1 ec2-user ec2-user 8 Feb 27 16:07 model.ckpt-30.data-00000-of-00002
-rw-rw-r-- 1 ec2-user ec2-user 315M Feb 27 16:07 model.ckpt-30.data-00001-of-00002
-rw-rw-r-- 1 ec2-user ec2-user 55K Feb 27 16:07 model.ckpt-30.index
-rw-rw-r-- 1 ec2-user ec2-user 9.3M Feb 27 16:07 model.ckpt-30.meta
sh-4.2$ cat /home/ec2-user/SageMaker/models/research/deeplab/datasets/pascal_voc_seg/exp/train_on_trainval_set/train/checkpoint
model_checkpoint_path: "/home/ec2-user/SageMaker/models/research/deeplab/datasets/pascal_voc_seg/exp/train_on_trainval_set/train/model.ckpt-30"
all_model_checkpoint_paths: "/home/ec2-user/SageMaker/models/research/deeplab/datasets/pascal_voc_seg/exp/train_on_trainval_set/train/model.ckpt-0"
all_model_checkpoint_paths: "/home/ec2-user/SageMaker/models/research/deeplab/datasets/pascal_voc_seg/exp/train_on_trainval_set/train/model.ckpt-10"
all_model_checkpoint_paths: "/home/ec2-user/SageMaker/models/research/deeplab/datasets/pascal_voc_seg/exp/train_on_trainval_set/train/model.ckpt-20"
all_model_checkpoint_paths: "/home/ec2-user/SageMaker/models/research/deeplab/datasets/pascal_voc_seg/exp/train_on_trainval_set/train/model.ckpt-30"
@aspenstarss
I also run it from local_test.sh.
In local_test.sh, the parameter max_number_of_iterations=1 is already passed to vis.py.
But it still gets stuck.
@JimHeo
I have tried the https://github.com/tensorflow/models/tree/r1.12.0 branch and it worked.
Try r.11
@sulydeni
Got it!
r11 doesn't have a research directory, so I also tried r1.12.0, and it worked for me.
I still don't know why this issue happens when running from the master branch, but it's not important now...
By the way, thanks!
I think there might be a mistake here in vis.py, around line 268:
if max_num_iteration > 0 and num_iteration > max_num_iteration:
  break
num_iteration += 1
Notice that num_iteration starts at 0, so when we set max_num_iteration = 1, the loop will actually run for 2 iterations.
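To make the off-by-one concrete, here is a toy simulation of that loop logic (not the real vis.py; the fake generator below stands in for tf.contrib.training.checkpoints_iterator and assumes two checkpoints are already on disk):

def fake_checkpoints_iterator():
  # Stand-in for checkpoints_iterator: yields the checkpoints that already
  # exist, then blocks waiting for a new one (simulated by the print below).
  yield 'model.ckpt-0'
  yield 'model.ckpt-10'
  print('...now blocking, waiting for a new checkpoint...')

max_num_iteration = 1
num_iteration = 0
for checkpoint_path in fake_checkpoints_iterator():
  if max_num_iteration > 0 and num_iteration > max_num_iteration:
    break
  num_iteration += 1
  print('visualizing', checkpoint_path)

# Prints:
#   visualizing model.ckpt-0
#   visualizing model.ckpt-10
#   ...now blocking, waiting for a new checkpoint...
# i.e. two passes run even though max_number_of_iterations=1, and the break is
# never reached because the iterator has to yield a third checkpoint first.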
@summer-wing
I think you're right.
And the code of vis.py in the r1.12.0 branch is written like this:
# Loop to visualize the results when new checkpoint is created.
num_iters = 0
while (FLAGS.max_number_of_iterations <= 0 or
       num_iters < FLAGS.max_number_of_iterations):
  num_iters += 1
  last_checkpoint = slim.evaluation.wait_for_new_checkpoint(
      FLAGS.checkpoint_dir, last_checkpoint)
  start = time.time()
  tf.logging.info(
      'Starting visualization at ' + time.strftime('%Y-%m-%d-%H:%M:%S',
                                                   time.gmtime()))
  tf.logging.info('Visualizing with model %s', last_checkpoint)
  with sv.managed_session(FLAGS.master,
                          start_standard_services=False) as sess:
    sv.start_queue_runners(sess)
    sv.saver.restore(sess, last_checkpoint)
    image_id_offset = 0
    for batch in range(num_batches):
      tf.logging.info('Visualizing batch %d / %d', batch + 1, num_batches)
      _process_batch(sess=sess,
                     original_images=samples[common.ORIGINAL_IMAGE],
                     semantic_predictions=predictions,
                     image_names=samples[common.IMAGE_NAME],
                     image_heights=samples[common.HEIGHT],
                     image_widths=samples[common.WIDTH],
                     image_id_offset=image_id_offset,
                     save_dir=save_dir,
                     raw_save_dir=raw_save_dir,
                     train_id_to_eval_id=train_id_to_eval_id)
      image_id_offset += FLAGS.vis_batch_size
  tf.logging.info(
      'Finished visualization at ' + time.strftime('%Y-%m-%d-%H:%M:%S',
                                                   time.gmtime()))
  time_to_next_eval = start + FLAGS.eval_interval_secs - time.time()
  if time_to_next_eval > 0:
    time.sleep(time_to_next_eval)
and the code of vis.py in the master branch is written like this:
num_iteration = 0
max_num_iteration = FLAGS.max_number_of_iterations
checkpoints_iterator = tf.contrib.training.checkpoints_iterator(
    FLAGS.checkpoint_dir, min_interval_secs=FLAGS.eval_interval_secs)
for checkpoint_path in checkpoints_iterator:
  if max_num_iteration > 0 and num_iteration > max_num_iteration:
    break
  num_iteration += 1
  tf.logging.info(
      'Starting visualization at ' + time.strftime('%Y-%m-%d-%H:%M:%S',
                                                   time.gmtime()))
  tf.logging.info('Visualizing with model %s', checkpoint_path)
  scaffold = tf.train.Scaffold(init_op=tf.global_variables_initializer())
  session_creator = tf.train.ChiefSessionCreator(
      scaffold=scaffold,
      master=FLAGS.master,
      checkpoint_filename_with_path=checkpoint_path)
  with tf.train.MonitoredSession(
      session_creator=session_creator, hooks=None) as sess:
    batch = 0
    image_id_offset = 0
    while not sess.should_stop():
      tf.logging.info('Visualizing batch %d', batch + 1)
      _process_batch(sess=sess,
                     original_images=samples[common.ORIGINAL_IMAGE],
                     semantic_predictions=predictions,
                     image_names=samples[common.IMAGE_NAME],
                     image_heights=samples[common.HEIGHT],
                     image_widths=samples[common.WIDTH],
                     image_id_offset=image_id_offset,
                     save_dir=save_dir,
                     raw_save_dir=raw_save_dir,
                     train_id_to_eval_id=train_id_to_eval_id)
      image_id_offset += FLAGS.vis_batch_size
      batch += 1
  tf.logging.info(
      'Finished visualization at ' + time.strftime('%Y-%m-%d-%H:%M:%S',
                                                   time.gmtime()))
If we set the parameter --max_number_of_iterations=1 then, because of these lines:
num_iteration = 0
for checkpoint_path in checkpoints_iterator:
  if max_num_iteration > 0 and num_iteration > max_num_iteration:
    break
  num_iteration += 1
there will actually be 2 iterations.
Also, the for-loop iterates over checkpoint_path from checkpoints_iterator, not over num_iteration, so before the break condition is even evaluated, checkpoints_iterator has to yield another checkpoint.
That is what puts the process into the wait state "INFO:tensorflow:Waiting for new checkpoint at ...".
So, I slightly modified this part of vis.py locally:
num_iteration = 0
max_num_iteration = FLAGS.max_number_of_iterations
checkpoints_iterator = tf.contrib.training.checkpoints_iterator(
    FLAGS.checkpoint_dir, min_interval_secs=FLAGS.eval_interval_secs)
for checkpoint_path in checkpoints_iterator:
  num_iteration += 1
  tf.logging.info(
      'Starting visualization at ' + time.strftime('%Y-%m-%d-%H:%M:%S',
                                                   time.gmtime()))
  tf.logging.info('Visualizing with model %s', checkpoint_path)
  scaffold = tf.train.Scaffold(init_op=tf.global_variables_initializer())
  session_creator = tf.train.ChiefSessionCreator(
      scaffold=scaffold,
      master=FLAGS.master,
      checkpoint_filename_with_path=checkpoint_path)
  with tf.train.MonitoredSession(
      session_creator=session_creator, hooks=None) as sess:
    batch = 0
    image_id_offset = 0
    while not sess.should_stop():
      tf.logging.info('Visualizing batch %d', batch + 1)
      _process_batch(sess=sess,
                     original_images=samples[common.ORIGINAL_IMAGE],
                     semantic_predictions=predictions,
                     image_names=samples[common.IMAGE_NAME],
                     image_heights=samples[common.HEIGHT],
                     image_widths=samples[common.WIDTH],
                     image_id_offset=image_id_offset,
                     save_dir=save_dir,
                     raw_save_dir=raw_save_dir,
                     train_id_to_eval_id=train_id_to_eval_id)
      image_id_offset += FLAGS.vis_batch_size
      batch += 1
  tf.logging.info(
      'Finished visualization at ' + time.strftime('%Y-%m-%d-%H:%M:%S',
                                                   time.gmtime()))
  if max_num_iteration > 0 and num_iteration >= max_num_iteration:
    break
then it worked.
I think the code needs to be updated.
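As a sanity check, the same kind of toy simulation (again, not the real vis.py) shows that with the break moved to the end of the loop body, exactly one pass runs and the loop exits without asking the iterator for another checkpoint:

def fake_checkpoints_iterator():
  # Stand-in for checkpoints_iterator: yields one existing checkpoint, then
  # would block waiting for a new one (simulated by the print below).
  yield 'model.ckpt-0'
  print('...would block here, waiting for a new checkpoint...')
  yield 'model.ckpt-10'

max_num_iteration = 1
num_iteration = 0
for checkpoint_path in fake_checkpoints_iterator():
  num_iteration += 1
  print('visualizing', checkpoint_path)
  if max_num_iteration > 0 and num_iteration >= max_num_iteration:
    break

# Prints only:
#   visualizing model.ckpt-0
# The generator is never resumed, so the simulated wait never happens.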
Thank you, JimHeo and summer-wing, for reporting the issue.
@JimHeo Good job on fixing it. Would you like to open a pull request for this? I very much appreciate your help.
Thanks!
@aquariusjay My Pleasure :)
I am facing a similar issue with eval.py
INFO:tensorflow:Waiting for new checkpoint at ./deeplab/datasets/pascal_voc_seg/VOC2012/ImageSets/Segmentation\model.ckpt
How do I resolve this?
The problem is still unsolved after @JimHeo's pull request.
I managed it in this way:
##checkpoints_iterator = tf.contrib.training.checkpoints_iterator(
##    FLAGS.checkpoint_dir, min_interval_secs=FLAGS.eval_interval_secs)
##for checkpoint_path in checkpoints_iterator:
checkpoint_path = FLAGS.checkpoint_dir  ##
num_iteration += 1
tf.logging.info(
    'Starting visualization at ' + time.strftime('%Y-%m-%d-%H:%M:%S',
                                                 time.gmtime()))
tf.logging.info('Visualizing with model %s', checkpoint_path)
scaffold = tf.train.Scaffold(init_op=tf.global_variables_initializer())
session_creator = tf.train.ChiefSessionCreator(
    scaffold=scaffold,
    master=FLAGS.master,
    checkpoint_filename_with_path=checkpoint_path)
with tf.train.MonitoredSession(
    session_creator=session_creator, hooks=None) as sess:
  batch = 0
  image_id_offset = 0
  while not sess.should_stop():
    tf.logging.info('Visualizing batch %d', batch + 1)
    _process_batch(sess=sess,
                   original_images=samples[common.ORIGINAL_IMAGE],
                   semantic_predictions=predictions,
                   image_names=samples[common.IMAGE_NAME],
                   image_heights=samples[common.HEIGHT],
                   image_widths=samples[common.WIDTH],
                   image_id_offset=image_id_offset,
                   save_dir=save_dir,
                   raw_save_dir=raw_save_dir,
                   train_id_to_eval_id=train_id_to_eval_id)
    image_id_offset += FLAGS.vis_batch_size
    batch += 1
tf.logging.info(
    'Finished visualization at ' + time.strftime('%Y-%m-%d-%H:%M:%S',
                                                 time.gmtime()))
##if max_num_iteration > 0 and num_iteration >= max_num_iteration:
##  break
This behaviour, i.e. "INFO:tensorflow:Waiting for new checkpoint at ...", will occur when there is no
checkpoint file in the directory specified by --checkpoint_dir for vis.py.
Adding a file named "checkpoint" with the following contents:
model_checkpoint_path: "model.ckpt"
all_model_checkpoint_paths: "model.ckpt"
fixes the issue.
When downloading the xception65_coco_voc_trainval model from https://github.com/tensorflow/models/blob/master/research/deeplab/g3doc/model_zoo.md, you should add the checkpoint file to the extracted directory. The model_checkpoint_path and all_model_checkpoint_paths values are based on the ckpt filenames without the extension.
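If you would rather generate that file than write it by hand, here is a minimal TF 1.x sketch (assuming the tarball was extracted to a directory named deeplabv3_pascal_trainval containing model.ckpt.index and model.ckpt.data-00000-of-00001):

import tensorflow as tf  # TF 1.x

ckpt_dir = 'deeplabv3_pascal_trainval'  # assumed extraction directory

# Writes the "checkpoint" file in ckpt_dir, setting model_checkpoint_path and
# all_model_checkpoint_paths to the "model.ckpt" prefix (no extension), which
# is what checkpoints_iterator in vis.py looks for.
tf.train.update_checkpoint_state(ckpt_dir, 'model.ckpt')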
@m-nez Where can I get, or how can I generate, the checkpoint file model.ckpt from the downloaded file? Thanks.
@pkuCactus You have to extract the deeplabv3_pascal_trainval_2018_01_04.tar.gz file.
The model.ckpt in the contents of "checkpoint" refers to the model.ckpt.data-00000-of-00001 and model.ckpt.index files, which are present in the extracted deeplabv3_pascal_trainval directory.
@m-nez Thanks very much, I've resolved it according to the description.