Models: [DeepLab] eval.py returns "Data loss: not an sstable (bad magic number)" when loading from pre-trained cityscapes checkpoint

Created on 18 Jun 2018 · 2 comments · Source: tensorflow/models

System information

  • What is the top-level directory of the model you are using: deeplab
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 1.8.0
  • Bazel version (if compiling from source):
  • CUDA/cuDNN version:
  • GPU model and memory:
  • Exact command to reproduce:

You can collect some of this information using our environment capture script:

https://github.com/tensorflow/tensorflow/tree/master/tools/tf_env_collect.sh

* tf_env.txt *

Problem

Running the same commit as in #4565, starting from the pre-trained ImageNet checkpoint fails. So I tried to run eval.py from the pre-trained Cityscapes checkpoint instead, following the instructions here. At first the script would not get past:

INFO:tensorflow:Waiting for new checkpoint at 

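For context, the evaluation loop discovers checkpoints through the 'checkpoint' index file in the directory; as far as I can tell it polls something roughly equivalent to the minimal sketch below (assuming TF 1.x and the checkpoint directory used later in this report), which returns None, and therefore keeps waiting, until that file exists:

import tensorflow as tf

# Minimal sketch, assuming TF 1.x: tf.train.latest_checkpoint reads the
# 'checkpoint' index file and returns the newest prefix, or None when no such
# file exists -- hence the "Waiting for new checkpoint" message above.
ckpt_dir = "/notebooks/deeplab_checkpoints/xception_cityscapes_trainfine/deeplabv3_cityscapes_train"
print(tf.train.latest_checkpoint(ckpt_dir))  # None until a 'checkpoint' file is written
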
Then I created the following checkpoint file, modeled on the one produced by sh local_test.sh here:

cat checkpoint
model_checkpoint_path: "/notebooks/deeplab_checkpoints/xception_cityscapes_trainfine/deeplabv3_cityscapes_train/model.ckpt.data-00000-of-00001"
all_model_checkpoint_paths: "/notebooks/deeplab_checkpoints/xception_cityscapes_trainfine/deeplabv3_cityscapes_train/model.ckpt.data-00000-of-00001"

I already ran sh convert_cityscapes.sh.
The pretrained checkpoint is xception_cityscapes_trainfine. Link: http://download.tensorflow.org/models/deeplabv3_cityscapes_train_2018_02_06.tar.gz
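
For reference, a minimal sketch (assuming TF 1.x and the extraction path above) of how to check whether the downloaded checkpoint itself is readable, independent of eval.py:

import tensorflow as tf

# Assumed prefix path; the reader expects the prefix, not an individual shard.
ckpt_prefix = "/notebooks/deeplab_checkpoints/xception_cityscapes_trainfine/deeplabv3_cityscapes_train/model.ckpt"
reader = tf.train.NewCheckpointReader(ckpt_prefix)
print(len(reader.get_variable_to_shape_map()), "variables found")

# Passing the .data-00000-of-00001 shard here instead of the prefix raises the
# same "not an sstable (bad magic number)" DataLossError shown below.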

With that file in place, eval.py gets past the waiting step but fails with the following error:

Source code / logs

PATH_TO_DATASET=/notebooks/models/research/deeplab/datasets/cityscapes/tfrecord
PATH_TO_EVAL_DIR=/notebooks/logs/eval-0618-1/
PATH_TO_CHECKPOINT=/notebooks/deeplab_checkpoints/xception_cityscapes_trainfine/deeplabv3_cityscapes_train/
root@ac31b3bca4bf:/notebooks/models/research# python deeplab/eval.py \
>     --logtostderr \
>     --eval_split="val" \
>     --model_variant="xception_65" \
>     --atrous_rates=6 \
>     --atrous_rates=12 \
>     --atrous_rates=18 \
>     --output_stride=16 \
>     --decoder_output_stride=4 \
>     --eval_crop_size=1025 \
>     --eval_crop_size=2049 \
>     --dataset="cityscapes" \
>     --checkpoint_dir=${PATH_TO_CHECKPOINT} \
>     --eval_logdir=${PATH_TO_EVAL_DIR} \
>     --dataset_dir=${PATH_TO_DATASET}

INFO:tensorflow:Evaluating on val set
INFO:tensorflow:Performing single-scale test.
INFO:tensorflow:Eval num images 500
INFO:tensorflow:Eval batch size 1 and num batch 500
INFO:tensorflow:Waiting for new checkpoint at /notebooks/deeplab_checkpoints/xception_cityscapes_trainfine/deeplabv3_cityscapes_train/
INFO:tensorflow:Found new checkpoint at /notebooks/deeplab_checkpoints/xception_cityscapes_trainfine/deeplabv3_cityscapes_train/model.ckpt.data-00000-of-00001
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/training/python/training/evaluation.py:301: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
INFO:tensorflow:Graph was finalized.
2018-06-18 04:15:33.559165: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-06-18 04:15:37.096650: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:83:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-06-18 04:15:37.096709: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-06-18 04:15:37.453972: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-06-18 04:15:37.454046: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0
2018-06-18 04:15:37.454056: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N
2018-06-18 04:15:37.454427: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10413 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:83:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from /notebooks/deeplab_checkpoints/xception_cityscapes_trainfine/deeplabv3_cityscapes_train/model.ckpt.data-00000-of-00001
2018-06-18 04:15:38.311862: W tensorflow/core/util/tensor_slice_reader.cc:95] Could not open /notebooks/deeplab_checkpoints/xception_cityscapes_trainfine/deeplabv3_cityscapes_train/model.ckpt.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
2018-06-18 04:15:38.314369: W tensorflow/core/util/tensor_slice_reader.cc:95] Could not open /notebooks/deeplab_checkpoints/xception_cityscapes_trainfine/deeplabv3_cityscapes_train/model.ckpt.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
2018-06-18 04:15:38.314834: W tensorflow/core/framework/op_kernel.cc:1318] OP_REQUIRES failed at save_restore_tensor.cc:170 : Data loss: Unable to open table file /notebooks/deeplab_checkpoints/xception_cityscapes_trainfine/deeplabv3_cityscapes_train/model.ckpt.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1322, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.DataLossError: Unable to open table file /notebooks/deeplab_checkpoints/xception_cityscapes_trainfine/deeplabv3_cityscapes_train/model.ckpt.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
         [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
         [[Node: save/RestoreV2/_301 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_306_save/RestoreV2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "deeplab/eval.py", line 176, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "deeplab/eval.py", line 169, in main
    eval_interval_secs=FLAGS.eval_interval_secs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/evaluation.py", line 301, in evaluation_loop
    timeout=timeout)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/training/python/training/evaluation.py", line 445, in evaluate_repeatedly
    session_creator=session_creator, hooks=hooks) as session:
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 816, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 539, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1002, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1007, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 696, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 467, in create_session
    init_fn=self._scaffold.init_fn)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/session_manager.py", line 279, in prepare_session
    config=config)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/session_manager.py", line 191, in _restore_checkpoint
    saver.restore(sess, checkpoint_filename_with_path)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1802, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.DataLossError: Unable to open table file /notebooks/deeplab_checkpoints/xception_cityscapes_trainfine/deeplabv3_cityscapes_train/model.ckpt.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
         [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
         [[Node: save/RestoreV2/_301 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_306_save/RestoreV2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Caused by op 'save/RestoreV2', defined at:
  File "deeplab/eval.py", line 176, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "deeplab/eval.py", line 169, in main                                                                                                           [0/1812]
    eval_interval_secs=FLAGS.eval_interval_secs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/evaluation.py", line 301, in evaluation_loop
    timeout=timeout)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/training/python/training/evaluation.py", line 445, in evaluate_repeatedly
    session_creator=session_creator, hooks=hooks) as session:
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 816, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 539, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1002, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1007, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 696, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 458, in create_session
    self._scaffold.finalize()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 212, in finalize
    self._saver = training_saver._get_saver_or_default()  # pylint: disable=protected-access
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 910, in _get_saver_or_default
    saver = Saver(sharded=True, allow_empty=True)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1338, in __init__
    self.build()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1347, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1384, in _build
    build_save=build_save, build_restore=build_restore)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 829, in _build_internal
    restore_sequentially, reshape)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 525, in _AddShardedRestoreOps
    name="restore_shard"))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 472, in _AddRestoreOps
    restore_sequentially)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 886, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1463, in restore_v2
    shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3392, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1718, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

DataLossError (see above for traceback): Unable to open table file /notebooks/deeplab_checkpoints/xception_cityscapes_trainfine/deeplabv3_cityscapes_train/model.ckpt.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
         [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
         [[Node: save/RestoreV2/_301 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_306_save/RestoreV2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

All 2 comments

I took a second look at the checkpoint file created by local_test.sh here:

/models/research/deeplab/datasets/pascal_voc_seg/exp/train_on_trainval_set/train$ ls
checkpoint                                   graph.pbtxt                       model.ckpt-0.index  model.ckpt-10.data-00000-of-00001  model.ckpt-10.meta
events.out.tfevents.1529290948.ac31b3bca4bf  model.ckpt-0.data-00000-of-00001  model.ckpt-0.meta   model.ckpt-10.index

and

$ cat checkpoint
model_checkpoint_path: "/notebooks/models/research/deeplab/datasets/pascal_voc_seg/exp/train_on_trainval_set/train/model.ckpt-10"
all_model_checkpoint_paths: "/notebooks/models/research/deeplab/datasets/pascal_voc_seg/exp/train_on_trainval_set/train/model.ckpt-0"
all_model_checkpoint_paths: "/notebooks/models/research/deeplab/datasets/pascal_voc_seg/exp/train_on_trainval_set/train/model.ckpt-10"

It looks like the checkpoint file refers to the file prefix, i.e. model.ckpt-10, rather than the actual files, i.e. model.ckpt-10.*. So I corrected my hand-created checkpoint:

$ cat checkpoint
model_checkpoint_path: "/notebooks/deeplab_checkpoints/xception_cityscapes_trainfine/deeplabv3_cityscapes_train/model.ckpt"
all_model_checkpoint_paths: "/notebooks/deeplab_checkpoints/xception_cityscapes_trainfine/deeplabv3_cityscapes_train/model.ckpt"

and was able to finish eval.py:

root@1c305347f325:/notebooks/models/research# ./eval_cityscapes.sh
+ pwd
+ pwd
+ export PYTHONPATH=:/notebooks/models/research:/notebooks/models/research/slim
+ export CUDA_VISIBLE_DEVICES=3
+ export PATH_TO_DATASET=/notebooks/models/research/deeplab/datasets/cityscapes/tfrecord
+ export PATH_TO_EVAL_DIR=/notebooks/logs/eval-0619-2/
+ export PATH_TO_CHECKPOINT=/notebooks/deeplab_checkpoints/xception_cityscapes_trainfine/deeplabv3_cityscapes_train/
+ python deeplab/eval.py --logtostderr --eval_split=val --model_variant=xception_65 --atrous_rates=6 --atrous_rates=12 --atrous_rates=18 --output_stride=16 --decoder_output_stride=4 --eval_crop_size=1025 --eval_crop_size=2049 --dataset=cityscapes --checkpoint_dir=/notebooks/deeplab_checkpoints/xception_cityscapes_trainfine/deeplabv3_cityscapes_train/ --eval_logdir=/notebooks/logs/eval-0619-2/ --dataset_dir=/notebooks/models/research/deeplab/datasets/cityscapes/tfrecord
INFO:tensorflow:Evaluating on val set
INFO:tensorflow:Performing single-scale test.
INFO:tensorflow:Eval num images 500
INFO:tensorflow:Eval batch size 1 and num batch 500
INFO:tensorflow:Waiting for new checkpoint at /notebooks/deeplab_checkpoints/xception_cityscapes_trainfine/deeplabv3_cityscapes_train/
INFO:tensorflow:Found new checkpoint at /notebooks/deeplab_checkpoints/xception_cityscapes_trainfine/deeplabv3_cityscapes_train/model.ckpt
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/training/python/training/evaluation.py:301: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
INFO:tensorflow:Graph was finalized.
2018-06-19 06:44:01.244513: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-06-19 06:44:04.786552: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:83:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-06-19 06:44:04.786613: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-06-19 06:44:05.145763: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-06-19 06:44:05.145832: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0
2018-06-19 06:44:05.145843: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N
2018-06-19 06:44:05.146234: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10413 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:83:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from /notebooks/deeplab_checkpoints/xception_cityscapes_trainfine/deeplabv3_cityscapes_train/model.ckpt
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting evaluation at 2018-06-19-06:44:12
INFO:tensorflow:Evaluation [50/500]
INFO:tensorflow:Evaluation [100/500]
INFO:tensorflow:Evaluation [150/500]
INFO:tensorflow:Evaluation [200/500]
INFO:tensorflow:Evaluation [250/500]
INFO:tensorflow:Evaluation [300/500]
INFO:tensorflow:Evaluation [350/500]
INFO:tensorflow:Evaluation [400/500]
INFO:tensorflow:Evaluation [450/500]
INFO:tensorflow:Evaluation [500/500]
INFO:tensorflow:Finished evaluation at 2018-06-19-06:48:07
miou_1.0[0.787332237]

Yep, the checkpoint has to be specified without the .data-00000-of-00001 suffix, which is appended to all checkpoint shards created by the V2 TensorFlow save format.
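
For anyone hitting the same thing, the hand-written checkpoint file above can also be generated programmatically; a minimal sketch, assuming TF 1.x and the directory layout used in this thread:

import tensorflow as tf

# Assumed paths from this thread; the prefix deliberately omits the
# .data-00000-of-00001 suffix.
ckpt_dir = "/notebooks/deeplab_checkpoints/xception_cityscapes_trainfine/deeplabv3_cityscapes_train"
ckpt_prefix = ckpt_dir + "/model.ckpt"

# Writes the 'checkpoint' index file (model_checkpoint_path /
# all_model_checkpoint_paths entries) into ckpt_dir.
tf.train.update_checkpoint_state(ckpt_dir, ckpt_prefix)

# eval.py's checkpoint discovery then resolves the prefix, roughly equivalent to:
print(tf.train.latest_checkpoint(ckpt_dir))  # -> .../model.ckpt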
