Models: [DeepLab] Error when starting cityscapes training with pre-trained ImageNet checkpoint

Created on 18 Jun 2018 · 2 Comments · Source: tensorflow/models

System information

What is the top-level directory of the model you are using: deeplab
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
TensorFlow installed from (source or binary): binary
TensorFlow version (use command below): 1.8.0
Bazel version (if compiling from source):
CUDA/cuDNN version:
GPU model and memory:
Exact command to reproduce:

* tf_env.txt *

Problem

I am using this commit for tensorflow/models:

commit 2310bc34cc372122a61dd49eaea52e2684e74ae0
Merge: 1f82c22 e2e820c
Author: Yukun Zhu <[email protected]>
Date:   Thu Jun 14 22:24:53 2018 -0700

    Merge pull request #4534 from huihui-personal/master

    PiperOrigin-RevId: 200493322

I was able to run model_test.py and local_test.sh without problems, following the linked instructions.

However, when I tried to train on Cityscapes starting from the ImageNet pre-trained weights, as described in the linked documentation, I got an error message.

The ImageNet pre-trained checkpoint is xception_65.
Link: http://download.tensorflow.org/models/deeplabv3_xception_2018_01_04.tar.gz

I ran sh convert_cityscapes.sh already.
Perhaps this problem is related to #4464?
Which hash of this repo is used to generate the pre-trained checkpoint?

Source code / logs

PATH_TO_INITIAL_CHECKPOINT=/notebooks/deeplab_checkpoints/imagenet_pretrain_xception_65_deeplabv3_xception_2018_01_04/xception/
PATH_TO_TRAIN_DIR=/notebooks/models/research/deeplab/datasets/cityscapes/exp/train_on_train_set/train/
PATH_TO_DATASET=/notebooks/models/research/deeplab/datasets/cityscapes/tfrecord

root@ac31b3bca4bf:/notebooks/models/research# python deeplab/train.py \
>     --logtostderr \
>     --training_number_of_steps=90000 \
>     --train_split="train" \
>     --model_variant="xception_65" \
>     --atrous_rates=6 \
>     --atrous_rates=12 \
>     --atrous_rates=18 \
>     --output_stride=16 \
>     --decoder_output_stride=4 \
>     --train_crop_size=769 \
>     --train_crop_size=769 \
>     --train_batch_size=1 \
>     --dataset="cityscapes" \
>     --tf_initial_checkpoint=${PATH_TO_INITIAL_CHECKPOINT} \
>     --train_logdir=${PATH_TO_TRAIN_DIR} \
>     --dataset_dir=${PATH_TO_DATASET}
INFO:tensorflow:Training on train set
INFO:tensorflow:Initializing model from path: /notebooks/deeplab_checkpoints/imagenet_pretrain_xception_65_deeplabv3_xception_2018_01_04/xception/
Traceback (most recent call last):
  File "deeplab/train.py", line 394, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "deeplab/train.py", line 384, in main
    ignore_missing_vars=True),
  File "/notebooks/models/research/deeplab/utils/train_utils.py", line 118, in get_model_init_fn
    ignore_missing_vars=ignore_missing_vars)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/framework/python/ops/variables.py", line 674, in assign_from_checkpoint_fn
    reader = pywrap_tensorflow.NewCheckpointReader(model_path)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 290, in NewCheckpointReader
    return CheckpointReader(compat.as_bytes(filepattern), status)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 519, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for /notebooks/deeplab_checkpoints/imagenet_pretrain_xception_65_deeplabv3_xception_2018_01_04/xception/

root@ac31b3bca4bf:/notebooks# ls /notebooks/deeplab_checkpoints/imagenet_pretrain_xception_65_deeplabv3_xception_2018_01_04/xception/
model.ckpt.data-00000-of-00001  model.ckpt.index

Source

rlan

👍4

Most helpful comment

The problem here is the value passed to --tf_initial_checkpoint. Calling it PATH_TO_INITIAL_CHECKPOINT in the doc is misleading. @aquariusjay @YknZhu @gpapan
The flag expects the checkpoint file prefix, not the containing folder and not the full path of an individual checkpoint file.

The ImageNet pre-trained checkpoint:

/imagenet_pretrain_xception_65_deeplabv3_xception_2018_01_04/xception$ ls
model.ckpt.data-00000-of-00001  model.ckpt.index

The correct --tf_initial_checkpoint is .../xception/model.ckpt
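
To make the folder / prefix / full-file distinction concrete, here is a minimal sketch of a helper that turns any of the three into the prefix form. The name resolve_checkpoint_prefix is hypothetical (it is not part of DeepLab or TensorFlow), and the sketch assumes the checkpoint is stored as a "<prefix>.index" file plus "<prefix>.data-*" shards, as in the listing above. It only inspects file names and does not load the checkpoint.

```python
import glob
import os


def resolve_checkpoint_prefix(path):
    """Return a checkpoint prefix such as '/dir/model.ckpt' for `path`.

    `path` may already be a prefix, a directory holding one checkpoint,
    or the full name of a '.index' / '.data-*' file. Hypothetical helper.
    """
    # Case 1: a directory. Expect exactly one '<prefix>.index' file in it.
    if os.path.isdir(path):
        index_files = sorted(glob.glob(os.path.join(path, "*.index")))
        if len(index_files) != 1:
            raise ValueError(
                "expected exactly one .index file in %r, found %d"
                % (path, len(index_files)))
        path = index_files[0]

    # Case 2: a full file name. Strip the suffix to recover the prefix.
    if path.endswith(".index"):
        path = path[:-len(".index")]
    elif ".data-" in os.path.basename(path):
        base = os.path.basename(path).split(".data-")[0]
        path = os.path.join(os.path.dirname(path), base)

    # Final check: a usable prefix must have a matching index file.
    if not os.path.isfile(path + ".index"):
        raise ValueError("no checkpoint found for prefix %r" % path)
    return path
```

Called on the .../xception/ folder from the failing command, this would return .../xception/model.ckpt, which is the value that worked for --tf_initial_checkpoint in the run below.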

Run log:

+ pwd
+ pwd
+ export PYTHONPATH=:/notebooks/models/research:/notebooks/models/research/slim
+ export CUDA_VISIBLE_DEVICES=3
+ export PATH_TO_INITIAL_CHECKPOINT=/notebooks/deeplab_checkpoints/imagenet_pretrain_xception_65_deeplabv3_xception_2018_01_04/xception/model.ckpt
+ export PATH_TO_TRAIN_DIR=/notebooks/models/research/deeplab/datasets/cityscapes/exp/train_on_train_set/train/
+ export PATH_TO_DATASET=/notebooks/models/research/deeplab/datasets/cityscapes/tfrecord
+ python deeplab/train.py --logtostderr --training_number_of_steps=90000 --train_split=train --model_variant=xception_65 --atrous_rates=6 --atrous_rates=12 --atrous_rates=18 --output_stride=16 --decoder_output_stride=4 --train_crop_size=769 --train_crop_size=769 --train_batch_size=1 --dataset=cityscapes --tf_initial_checkpoint=/notebooks/deeplab_checkpoints/imagenet_pretrain_xception_65_deeplabv3_xception_2018_01_04/xception/model.ckpt --train_logdir=/notebooks/models/research/deeplab/datasets/cityscapes/exp/train_on_train_set/train/ --dataset_dir=/notebooks/models/research/deeplab/datasets/cityscapes/tfrecord
INFO:tensorflow:Training on train set
INFO:tensorflow:Ignoring initialization; other checkpoint exists
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py:736: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2018-06-19 07:33:10.282785: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-06-19 07:33:13.974927: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:83:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-06-19 07:33:13.974990: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-06-19 07:33:14.307660: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-06-19 07:33:14.307737: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0
2018-06-19 07:33:14.307746: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N
2018-06-19 07:33:14.308155: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10413 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:83:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from /notebooks/models/research/deeplab/datasets/cityscapes/exp/train_on_train_set/train/model.ckpt-0
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path /notebooks/models/research/deeplab/datasets/cityscapes/exp/train_on_train_set/train/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 0.
INFO:tensorflow:global step 10: loss = 3.3251 (0.516 sec/step)
INFO:tensorflow:global step 20: loss = 3.2935 (0.507 sec/step)
INFO:tensorflow:global step 30: loss = 3.2400 (0.519 sec/step)

rlan on 19 Jun 2018

👍9


Thanks for the detailed issue; it helped me a lot.