I am using this commit for tensorflow/models:
commit 2310bc34cc372122a61dd49eaea52e2684e74ae0
Merge: 1f82c22 e2e820c
Author: Yukun Zhu <[email protected]>
Date: Thu Jun 14 22:24:53 2018 -0700
Merge pull request #4534 from huihui-personal/master
PiperOrigin-RevId: 200493322
I was able to run model_test.py and local_test.sh without problems, as described here.
However, when I tried to train on Cityscapes using the ImageNet-pretrained weights, as described here, I got an error message.
The ImageNet-pretrained checkpoint is xception_65. Link: http://download.tensorflow.org/models/deeplabv3_xception_2018_01_04.tar.gz
I have already run sh convert_cityscapes.sh.
Perhaps this problem is related to #4464?
Which hash of this repo is used to generate the pre-trained checkpoint?
PATH_TO_INITIAL_CHECKPOINT=/notebooks/deeplab_checkpoints/imagenet_pretrain_xception_65_deeplabv3_xception_2018_01_04/xception/
PATH_TO_TRAIN_DIR=/notebooks/models/research/deeplab/datasets/cityscapes/exp/train_on_train_set/train/
PATH_TO_DATASET=/notebooks/models/research/deeplab/datasets/cityscapes/tfrecord
root@ac31b3bca4bf:/notebooks/models/research# python deeplab/train.py \
> --logtostderr \
> --training_number_of_steps=90000 \
> --train_split="train" \
> --model_variant="xception_65" \
> --atrous_rates=6 \
> --atrous_rates=12 \
> --atrous_rates=18 \
> --output_stride=16 \
> --decoder_output_stride=4 \
> --train_crop_size=769 \
> --train_crop_size=769 \
> --train_batch_size=1 \
> --dataset="cityscapes" \
> --tf_initial_checkpoint=${PATH_TO_INITIAL_CHECKPOINT} \
> --train_logdir=${PATH_TO_TRAIN_DIR} \
> --dataset_dir=${PATH_TO_DATASET}
INFO:tensorflow:Training on train set
INFO:tensorflow:Initializing model from path: /notebooks/deeplab_checkpoints/imagenet_pretrain_xception_65_deeplabv3_xception_2018_01_04/xception/
Traceback (most recent call last):
File "deeplab/train.py", line 394, in <module>
tf.app.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "deeplab/train.py", line 384, in main
ignore_missing_vars=True),
File "/notebooks/models/research/deeplab/utils/train_utils.py", line 118, in get_model_init_fn
ignore_missing_vars=ignore_missing_vars)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/framework/python/ops/variables.py", line 674, in assign_from_checkpoint_fn
reader = pywrap_tensorflow.NewCheckpointReader(model_path)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 290, in NewCheckpointReader
return CheckpointReader(compat.as_bytes(filepattern), status)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 519, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for /notebooks/deeplab_checkpoints/imagenet_pretrain_xception_65_deeplabv3_xception_2018_01_04/xception/
root@ac31b3bca4bf:/notebooks# ls /notebooks/deeplab_checkpoints/imagenet_pretrain_xception_65_deeplabv3_xception_2018_01_04/xception/
model.ckpt.data-00000-of-00001 model.ckpt.index
The problem here is the value passed to --tf_initial_checkpoint. Calling it PATH_TO_INITIAL_CHECKPOINT in the doc is misleading. @aquariusjay @YknZhu @gpapan
It expects the checkpoint file prefix, not the folder and not the full path to one of the checkpoint files.
The ImageNet pre-trained checkpoint:
/imagenet_pretrain_xception_65_deeplabv3_xception_2018_01_04/xception$ ls
model.ckpt.data-00000-of-00001 model.ckpt.index
The correct --tf_initial_checkpoint is .../xception/model.ckpt
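To avoid passing the folder by mistake, the prefix can be derived from the directory contents. The following is a minimal sketch, not part of deeplab: `resolve_checkpoint_prefix` is a hypothetical helper that relies only on the `<prefix>.index` naming visible in the listing above.

```python
import glob
import os


def resolve_checkpoint_prefix(path):
    """Return a checkpoint prefix for `path`, which may be a folder or a prefix.

    Hypothetical helper (an assumption, not a deeplab or TensorFlow API).
    It only relies on the `<prefix>.index` file that sits next to the
    `<prefix>.data-*` shards.
    """
    # Case 1: `path` is already a prefix, so `<path>.index` exists on disk.
    if os.path.isfile(path + ".index"):
        return path
    # Case 2: `path` is a folder; accept it only if it holds one index file.
    if os.path.isdir(path):
        index_files = sorted(glob.glob(os.path.join(path, "*.index")))
        if len(index_files) == 1:
            # Strip the ".index" suffix to get the prefix.
            return index_files[0][: -len(".index")]
        raise ValueError(
            "expected exactly one *.index file in %r, found %d"
            % (path, len(index_files)))
    raise ValueError("%r is neither a checkpoint prefix nor a folder" % path)
```

Called on the `.../xception/` folder, this should return `.../xception/model.ckpt`, which is the value to pass to --tf_initial_checkpoint.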
Run log:
+ pwd
+ pwd
+ export PYTHONPATH=:/notebooks/models/research:/notebooks/models/research/slim
+ export CUDA_VISIBLE_DEVICES=3
+ export PATH_TO_INITIAL_CHECKPOINT=/notebooks/deeplab_checkpoints/imagenet_pretrain_xception_65_deeplabv3_xception_2018_01_04/xception/model.ckpt
+ export PATH_TO_TRAIN_DIR=/notebooks/models/research/deeplab/datasets/cityscapes/exp/train_on_train_set/train/
+ export PATH_TO_DATASET=/notebooks/models/research/deeplab/datasets/cityscapes/tfrecord
+ python deeplab/train.py --logtostderr --training_number_of_steps=90000 --train_split=train --model_variant=xception_65 --atrous_rates=6 --atrous_rates=12 --atrous_rates=18 --output_stride=16 --decoder_output_stride=4 --train_crop_size=769 --train_crop_size=769 --train_batch_size=1 --dataset=cityscapes --tf_initial_checkpoint=/notebooks/deeplab_checkpoints/imagenet_pretrain_xception_65_deeplabv3_xception_2018_01_04/xception/model.ckpt --train_logdir=/notebooks/models/research/deeplab/datasets/cityscapes/exp/train_on_train_set/train/ --dataset_dir=/notebooks/models/research/deeplab/datasets/cityscapes/tfrecord
INFO:tensorflow:Training on train set
INFO:tensorflow:Ignoring initialization; other checkpoint exists
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py:736: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2018-06-19 07:33:10.282785: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-06-19 07:33:13.974927: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:83:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-06-19 07:33:13.974990: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-06-19 07:33:14.307660: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-06-19 07:33:14.307737: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-06-19 07:33:14.307746: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-06-19 07:33:14.308155: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10413 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:83:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from /notebooks/models/research/deeplab/datasets/cityscapes/exp/train_on_train_set/train/model.ckpt-0
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path /notebooks/models/research/deeplab/datasets/cityscapes/exp/train_on_train_set/train/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 0.
INFO:tensorflow:global step 10: loss = 3.3251 (0.516 sec/step)
INFO:tensorflow:global step 20: loss = 3.2935 (0.507 sec/step)
INFO:tensorflow:global step 30: loss = 3.2400 (0.519 sec/step)
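One thing to note about this log: the line "Ignoring initialization; other checkpoint exists", followed by "Restoring parameters from .../train/model.ckpt-0", suggests this run resumed from a checkpoint already present in the train directory rather than loading the ImageNet weights. To check that the initial checkpoint is actually picked up, start from an empty --train_logdir. Below is a rough sketch of such a check. `has_existing_checkpoint` is a hypothetical helper, and it assumes the decision depends on the `checkpoint` state file that TensorFlow writes into the log directory.

```python
import os


def has_existing_checkpoint(train_logdir):
    """Rough check for a previous run's checkpoint state in `train_logdir`.

    Hypothetical helper (an assumption, not a deeplab API). It only tests
    for the `checkpoint` state file, which is what
    tf.train.latest_checkpoint reads to find the newest checkpoint.
    """
    return os.path.isfile(os.path.join(train_logdir, "checkpoint"))
```

If this returns True for PATH_TO_TRAIN_DIR, the --tf_initial_checkpoint value is likely being skipped, as in the log above.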
Thanks for the detailed issue; it helped me a lot.