Models: [deeplab] eval.py issue while evaluating on a custom dataset

Created on 16 Apr 2018 · 15Comments · Source: tensorflow/models

System information

What is the top-level directory of the model you are using: deeplab
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
TensorFlow installed from (source or binary): binary
TensorFlow version (use command below): 1.7.0
Bazel version (if compiling from source):
CUDA/cuDNN version: 9.0/7.0.4
GPU model and memory: GTX980M, 8GB
Exact command to reproduce:

Problem description

TL;DR

I am trying to fine-tune deeplabv3+ to work with my own dataset (5 classes, 862 training examples, 216 val examples). I have modified the code to do so, and I think I got all the main changes that are necessary to make it run. When I try to evaluate, however, the script breaks because my checkpoint is apparently missing a parameter (aspp1_depthwise/BatchNorm/beta):

NotFoundError: Key aspp1_depthwise/BatchNorm/beta not found in checkpoint
     [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
     [[Node: save/RestoreV2/_67 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_84_save/RestoreV2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Anyone knows how to solve this?

Full Description

I am trying to fine-tune deeplab on my own segmentation dataset, which has 5 classes, 862 train examples and 216 val examples. To do this, I have completed the following steps:

I have transformed my dataset to tfrecord format by generating a new .py file, 'build_gdi_data.py', closely following 'build_ade20k_data.py'. This file generated 4 .tfrecord files, train-00000-of-00004.tfrecord to train-00003-of-00004.tfrecord (around 12MB each), and they are sitting in the folders 'gd_train_tfrecord' and 'gd_val_tfrecord', respectively.
In segmentation_dataset.py, I have added information about my dataset as follows:

_GRAPHIC_DESIGNS_INFORMATION = DatasetDescriptor(
    splits_to_sizes={
        'train': 862,
        'val': 216,
    },
    num_classes=5,
    ignore_label=255,  
)

and

_DATASETS_INFORMATION = {
    'cityscapes': _CITYSCAPES_INFORMATION,
    'pascal_voc_seg': _PASCAL_VOC_SEG_INFORMATION,
    'ade20k': _ADE20K_INFORMATION,
    'graphic_designs':_GRAPHIC_DESIGNS_INFORMATION
}

In train.py, I have modified the flags to not initialize the last layer, use only logits as last layer, not fine tune batch norm and link to my dataset. Here is how I did it:

flags.DEFINE_boolean('initialize_last_layer', False,
                     'Initialize the last layer.')

flags.DEFINE_boolean('last_layers_contain_logits_only', True,
                     'Only consider logits as last layers or not.')

flags.DEFINE_boolean('fine_tune_batch_norm', False,
                     'Fine tune the batch norm parameters or not.')

flags.DEFINE_string('dataset', 'graphic_designs',
                    'Name of the segmentation dataset.')

In eval.py, I point to my custom dataset:

flags.DEFINE_string('dataset', 'graphic_designs',
                    'Name of the segmentation dataset.')

In train_utils.py, I modify the code so as to only exclude _LOGITS_SCOPE_NAME, that I import from deeplab.model:

from deeplab.model import _LOGITS_SCOPE_NAME

# Variables that will not be restored.
  exclude_list = [_LOGITS_SCOPE_NAME]
  if not initialize_last_layer:
    exclude_list.extend(last_layers)

I downloaded the xception checkpoint 'deeplabv3_xception_pascal_trainval_2018_01_04.tar.gz' and I am calling the train script with the model.ckpt.index file inside that tar file.
I run the training script with the xception architecture, output_stride=16, and just 30 training steps to debug. This is my full call (from inside a jupyter notebook):

PATH_TO_INITIAL_CHECKPOINT = "./deeplab_pretrained/deeplabv3_xception_pascal_trainval/model.ckpt.index"
PATH_TO_TRAIN_DIR = './train_logdir'
PATH_TO_DATASET = './gd_train_tfrecord'

%run ./models/research/deeplab/train.py \
    --tf_initial_checkpoint=${PATH_TO_INITIAL_CHECKPOINT} \
    --train_logdir=${PATH_TO_TRAIN_DIR} \
    --dataset_dir=${PATH_TO_DATASET} \
    --logtostderr \
    --training_number_of_steps=30 \
    --train_split="train" \
    --atrous_rates=6 \
    --atrous_rates=12 \
    --atrous_rates=18 \
    --output_stride=16 \
    --decoder_output_stride=4 \
    --train_crop_size=513 \
    --train_crop_size=513 \
    --train_batch_size=1 \
    --dataset="graphic_designs" \
    --model_variant="xception_65"

The training seems to work, although I get an enormous amount of WARNING: Variable missing in checkpoint, like so:

WARNING:tensorflow:Variable decoder/decoder_conv1_pointwise/weights/Momentum missing in checkpoint ./deeplab_pretrained/deeplabv3_xception_pascal_trainval/model.ckpt.index

It seems like all the momentums, moving_mean and gammas are missing. This may be normal. After all those warnings, I finally get some training:

INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path ./train_logdir\model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 0.
INFO:tensorflow:global step 10: loss = 4.7966 (0.860 sec/step)
INFO:tensorflow:global step 20: loss = 4.7966 (0.841 sec/step)
INFO:tensorflow:global step 30: loss = 4.7966 (0.870 sec/step)
INFO:tensorflow:Stopping Training.
INFO:tensorflow:Finished training! Saving model to disk.

No decrease in loss, but that's a problem for later I guess. After that, I try to run the eval.py script by doing:

PATH_TO_CHECKPOINT = "./train_logdir" 
PATH_TO_EVAL_DIR = './eval_logdir'
PATH_TO_DATASET = './gd_val_tfrecord'

%run ./models/research/deeplab/eval.py \
    --logtostderr \
    --eval_split="val" \
    --model_variant="xception_65" \
    --atrous_rates=6 \
    --atrous_rates=12 \
    --atrous_rates=18 \
    --output_stride=16 \
    --decoder_output_stride=4 \
    --eval_crop_size=609 \
    --eval_crop_size=609 \
    --dataset="graphic_designs" \
    --checkpoint_dir=${PATH_TO_CHECKPOINT} \
    --eval_logdir=${PATH_TO_EVAL_DIR} \
    --dataset_dir=${PATH_TO_DATASET}

Now this breaks because of a NotFoundError, and I don't understand why. Anyone has thoughts? Here is the error I get:

NotFoundError: Key aspp1_depthwise/BatchNorm/beta not found in checkpoint
     [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
     [[Node: save/RestoreV2/_67 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_84_save/RestoreV2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

And below is the full error traceback:

Error Traceback

WARNING:tensorflow:From C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\contrib\learn\python\learn\datasets\base.py:198: retry (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Use the retry module or similar alternatives.
WARNING:tensorflow:From C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\contrib\learn\python\learn\datasets\base.py:198: retry (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Use the retry module or similar alternatives.
INFO:tensorflow:Evaluating on val set
INFO:tensorflow:Evaluating on val set
Original_image, processed_image, label shapes: (?, ?, 3) (?, ?, 3) (?, ?, 1)
Label: Tensor("case_1/cond/Merge:0", shape=(?, ?, 1), dtype=uint8)
INPUT_PREPROCESS.py: original_image.shape, processed_image.shape, label.shape: (?, ?, 3) (609, 609, 3) (609, 609, 1)
INFO:tensorflow:Performing single-scale test.
INFO:tensorflow:Performing single-scale test.
INFO:tensorflow:Eval num images 216
INFO:tensorflow:Eval num images 216
INFO:tensorflow:Eval batch size 1 and num batch 216
INFO:tensorflow:Eval batch size 1 and num batch 216
INFO:tensorflow:Waiting for new checkpoint at ./train_logdir
INFO:tensorflow:Waiting for new checkpoint at ./train_logdir
INFO:tensorflow:Found new checkpoint at ./train_logdir\model.ckpt-50
INFO:tensorflow:Found new checkpoint at ./train_logdir\model.ckpt-50
WARNING:tensorflow:From C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\contrib\training\python\training\evaluation.py:303: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
WARNING:tensorflow:From C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\contrib\training\python\training\evaluation.py:303: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from ./train_logdir\model.ckpt-50
INFO:tensorflow:Restoring parameters from ./train_logdir\model.ckpt-50
---------------------------------------------------------------------------
NotFoundError                             Traceback (most recent call last)
C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\client\session.py in _do_call(self, fn, *args)
   1326     try:
-> 1327       return fn(*args)
   1328     except errors.OpError as e:

C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\client\session.py in _run_fn(feed_dict, fetch_list, target_list, options, run_metadata)
   1311       return self._call_tf_sessionrun(
-> 1312           options, feed_dict, fetch_list, target_list, run_metadata)
   1313 

C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\client\session.py in _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list, run_metadata)
   1419             self._session, options, feed_dict, fetch_list, target_list,
-> 1420             status, run_metadata)
   1421 

C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\framework\errors_impl.py in __exit__(self, type_arg, value_arg, traceback_arg)
    515             compat.as_text(c_api.TF_Message(self.status.status)),
--> 516             c_api.TF_GetCode(self.status.status))
    517     # Delete the underlying status object from memory otherwise it stays alive

NotFoundError: Key aspp1_depthwise/BatchNorm/beta not found in checkpoint
     [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
     [[Node: save/RestoreV2/_67 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_84_save/RestoreV2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

During handling of the above exception, another exception occurred:

NotFoundError                             Traceback (most recent call last)
C:\Users\Camilo\Dropbox\Graduate Studies\Harvard\AC299r Independent Research\deeplab_playground\models\research\deeplab\eval.py in <module>()
    178   flags.mark_flag_as_required('eval_logdir')
    179   flags.mark_flag_as_required('dataset_dir')
--> 180   tf.app.run()

C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\platform\app.py in run(main, argv)
    124   # Call the main function, passing through any arguments
    125   # to the final program.
--> 126   _sys.exit(main(argv))
    127 
    128 

C:\Users\Camilo\Dropbox\Graduate Studies\Harvard\AC299r Independent Research\deeplab_playground\models\research\deeplab\eval.py in main(unused_argv)
    171         eval_op=list(metrics_to_updates.values()),
    172         max_number_of_evaluations=num_eval_iters,
--> 173         eval_interval_secs=FLAGS.eval_interval_secs)
    174 
    175 

C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\contrib\slim\python\slim\evaluation.py in evaluation_loop(master, checkpoint_dir, logdir, num_evals, initial_op, initial_op_feed_dict, init_fn, eval_op, eval_op_feed_dict, final_op, final_op_feed_dict, summary_op, summary_op_feed_dict, variables_to_restore, eval_interval_secs, max_number_of_evaluations, session_config, timeout, hooks)
    299       config=session_config,
    300       max_number_of_evaluations=max_number_of_evaluations,
--> 301       timeout=timeout)

C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\contrib\training\python\training\evaluation.py in evaluate_repeatedly(checkpoint_dir, master, scaffold, eval_ops, feed_dict, final_ops, final_ops_feed_dict, eval_interval_secs, hooks, config, max_number_of_evaluations, timeout, timeout_fn)
    445 
    446     with monitored_session.MonitoredSession(
--> 447         session_creator=session_creator, hooks=hooks) as session:
    448       logging.info('Starting evaluation at ' + time.strftime(
    449           '%Y-%m-%d-%H:%M:%S', time.gmtime()))

C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py in __init__(self, session_creator, hooks, stop_grace_period_secs)
    793     super(MonitoredSession, self).__init__(
    794         session_creator, hooks, should_recover=True,
--> 795         stop_grace_period_secs=stop_grace_period_secs)
    796 
    797 

C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py in __init__(self, session_creator, hooks, should_recover, stop_grace_period_secs)
    516         stop_grace_period_secs=stop_grace_period_secs)
    517     if should_recover:
--> 518       self._sess = _RecoverableSession(self._coordinated_creator)
    519     else:
    520       self._sess = self._coordinated_creator.create_session()

C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py in __init__(self, sess_creator)
    979     """
    980     self._sess_creator = sess_creator
--> 981     _WrappedSession.__init__(self, self._create_session())
    982 
    983   def _create_session(self):

C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py in _create_session(self)
    984     while True:
    985       try:
--> 986         return self._sess_creator.create_session()
    987       except _PREEMPTION_ERRORS as e:
    988         logging.info('An error was raised while a session was being created. '

C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py in create_session(self)
    673       """Creates a coordinated session."""
    674       # Keep the tf_sess for unit testing.
--> 675       self.tf_sess = self._session_creator.create_session()
    676       # We don't want coordinator to suppress any exception.
    677       self.coord = coordinator.Coordinator(clean_stop_exception_types=[])

C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py in create_session(self)
    444         init_op=self._scaffold.init_op,
    445         init_feed_dict=self._scaffold.init_feed_dict,
--> 446         init_fn=self._scaffold.init_fn)
    447 
    448 

C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\session_manager.py in prepare_session(self, master, init_op, saver, checkpoint_dir, checkpoint_filename_with_path, wait_for_checkpoint, max_wait_secs, config, init_feed_dict, init_fn)
    273         wait_for_checkpoint=wait_for_checkpoint,
    274         max_wait_secs=max_wait_secs,
--> 275         config=config)
    276     if not is_loaded_from_checkpoint:
    277       if init_op is None and not init_fn and self._local_init_op is None:

C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\session_manager.py in _restore_checkpoint(self, master, saver, checkpoint_dir, checkpoint_filename_with_path, wait_for_checkpoint, max_wait_secs, config)
    189 
    190     if checkpoint_filename_with_path:
--> 191       saver.restore(sess, checkpoint_filename_with_path)
    192       return sess, True
    193 

C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\saver.py in restore(self, sess, save_path)
   1773     else:
   1774       sess.run(self.saver_def.restore_op_name,
-> 1775                {self.saver_def.filename_tensor_name: save_path})
   1776 
   1777   @staticmethod

C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\client\session.py in run(self, fetches, feed_dict, options, run_metadata)
    903     try:
    904       result = self._run(None, fetches, feed_dict, options_ptr,
--> 905                          run_metadata_ptr)
    906       if run_metadata:
    907         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\client\session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
   1138     if final_fetches or final_targets or (handle and feed_dict_tensor):
   1139       results = self._do_run(handle, final_targets, final_fetches,
-> 1140                              feed_dict_tensor, options, run_metadata)
   1141     else:
   1142       results = []

C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\client\session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
   1319     if handle is None:
   1320       return self._do_call(_run_fn, feeds, fetches, targets, options,
-> 1321                            run_metadata)
   1322     else:
   1323       return self._do_call(_prun_fn, handle, feeds, fetches)

C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\client\session.py in _do_call(self, fn, *args)
   1338         except KeyError:
   1339           pass
-> 1340       raise type(e)(node_def, op, message)
   1341 
   1342   def _extend_graph(self):

NotFoundError: Key aspp1_depthwise/BatchNorm/beta not found in checkpoint
     [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
     [[Node: save/RestoreV2/_67 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_84_save/RestoreV2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Caused by op 'save/RestoreV2', defined at:
  File "C:\Users\Camilo\Anaconda3\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\Camilo\Anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\traitlets\config\application.py", line 658, in launch_instance
    app.start()
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\ipykernel\kernelapp.py", line 477, in start
    ioloop.IOLoop.instance().start()
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\zmq\eventloop\ioloop.py", line 177, in start
    super(ZMQIOLoop, self).start()
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\tornado\ioloop.py", line 888, in start
    handler_func(fd_obj, events)
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\tornado\stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\zmq\eventloop\zmqstream.py", line 440, in _handle_events
    self._handle_recv()
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\zmq\eventloop\zmqstream.py", line 472, in _handle_recv
    self._run_callback(callback, msg)
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\zmq\eventloop\zmqstream.py", line 414, in _run_callback
    callback(*args, **kwargs)
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\tornado\stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\ipykernel\kernelbase.py", line 283, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\ipykernel\kernelbase.py", line 235, in dispatch_shell
    handler(stream, idents, msg)
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\ipykernel\kernelbase.py", line 399, in execute_request
    user_expressions, allow_stdin)
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\ipykernel\ipkernel.py", line 196, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\ipykernel\zmqshell.py", line 533, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2717, in run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2827, in run_ast_nodes
    if self.run_code(code, result):
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2881, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-3-17b437d1a5ce>", line 8, in <module>
    get_ipython().magic('run {WORK_DIR}/eval.py     --logtostderr     --eval_split="val"     --model_variant="xception_65"     --atrous_rates=6     --atrous_rates=12     --atrous_rates=18     --output_stride=16     --decoder_output_stride=4     --eval_crop_size=609     --eval_crop_size=609     --dataset="graphic_designs"     --fine_tune_batch_norm=False     --checkpoint_dir=${PATH_TO_CHECKPOINT}     --eval_logdir=${PATH_TO_EVAL_DIR}     --dataset_dir=${PATH_TO_DATASET}')
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2158, in magic
    return self.run_line_magic(magic_name, magic_arg_s)
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2079, in run_line_magic
    result = fn(*args,**kwargs)
  File "<decorator-gen-58>", line 2, in run
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\IPython\core\magic.py", line 188, in <lambda>
    call = lambda f, *a, **k: f(*a, **k)
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\IPython\core\magics\execution.py", line 742, in run
    run()
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\IPython\core\magics\execution.py", line 728, in run
    exit_ignore=exit_ignore)
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2481, in safe_execfile
    self.compile if kw['shell_futures'] else None)
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\IPython\utils\py3compat.py", line 186, in execfile
    exec(compiler(f.read(), fname, 'exec'), glob, loc)
  File "C:\Users\Camilo\Dropbox\Graduate Studies\Harvard\AC299r Independent Research\deeplab_playground\models\research\deeplab\eval.py", line 180, in <module>
    tf.app.run()
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\platform\app.py", line 126, in run
    _sys.exit(main(argv))
  File "C:\Users\Camilo\Dropbox\Graduate Studies\Harvard\AC299r Independent Research\deeplab_playground\models\research\deeplab\eval.py", line 173, in main
    eval_interval_secs=FLAGS.eval_interval_secs)
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\contrib\slim\python\slim\evaluation.py", line 301, in evaluation_loop
    timeout=timeout)
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\contrib\training\python\training\evaluation.py", line 447, in evaluate_repeatedly
    session_creator=session_creator, hooks=hooks) as session:
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py", line 795, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py", line 518, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py", line 981, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py", line 986, in _create_session
    return self._sess_creator.create_session()
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py", line 675, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py", line 437, in create_session
    self._scaffold.finalize()
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py", line 212, in finalize
    self._saver = training_saver._get_saver_or_default()  # pylint: disable=protected-access
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\saver.py", line 884, in _get_saver_or_default
    saver = Saver(sharded=True, allow_empty=True)
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\saver.py", line 1311, in __init__
    self.build()
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\saver.py", line 1320, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\saver.py", line 1357, in _build
    build_save=build_save, build_restore=build_restore)
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\saver.py", line 803, in _build_internal
    restore_sequentially, reshape)
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\saver.py", line 501, in _AddShardedRestoreOps
    name="restore_shard"))
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\saver.py", line 448, in _AddRestoreOps
    restore_sequentially)
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\saver.py", line 860, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\ops\gen_io_ops.py", line 1541, in restore_v2
    shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 3290, in create_op
    op_def=op_def)
  File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 1654, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

NotFoundError (see above for traceback): Key aspp1_depthwise/BatchNorm/beta not found in checkpoint
     [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
     [[Node: save/RestoreV2/_67 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_84_save/RestoreV2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Source

cfosco

All 15 comments

May I ask you a question? When I modify the train_utils.py as what you do above, the model doesn't train and just model save.
The information when training is:
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 30000.
INFO:tensorflow:Stopping Training.
INFO:tensorflow:Finished training! Saving model to disk.
So I still use this: exclude_list = ['global_step'] to train, will this cause problem? My project is 2 classes.
Thanks a lot.

qmy612 on 17 Apr 2018

This is interesting. I don't have a good answer for this off the top of my head.

I'm adding @yknzhu who might be able help.

Feel free to also seek help on StackOverflow with the tags "tensorflow" and "deeplab".

bitfort on 18 Apr 2018

@cfosco that is a weird error :( I think most warning messages are expected, and it is odd that batch norm beta is missing from checkpoint. One thing I notice is eval tries to load ./train_logdir\model.ckpt-50 but you are only training it for 30 steps, maybe double checking that is the right checkpoint?

@qmy612 that is due to you are restoring global step as well. just dropping global step variable should do.

YknZhu on 19 Apr 2018

Hi @YknZhu, thanks for your answer. Unfortunately that is not the error, I used to train for 50 epochs but I recently changed to 30, the output you are seeing is from an old traceback. I'll update it now.
If anyone manages to find the solution to this, or if anyone managed to train with their own dataset, I would love to see any available code to compare.

cfosco on 20 Apr 2018

Today I got a very similar error as you do. I've trained with my own pictures and configuration as you did except for the default settings in train.py. Got this message a few minutes ago:
```
2018-04-23 18:08:36.506034: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from ./model/model.ckpt-21200
2018-04-23 18:08:40.428483: W tensorflow/core/framework/op_kernel.cc:1198] Not found: Key aspp1_depthwise/BatchNorm/beta not found in checkpoint
2018-04-23 18:08:40.428527: W tensorflow/core/framework/op_kernel.cc:1198] Not found: Key aspp1_depthwise/BatchNorm/gamma not found in checkpoint
2018-04-23 18:08:40.428541: W tensorflow/core/framework/op_kernel.cc:1198] Not found: Key aspp1_depthwise/BatchNorm/moving_mean not found in checkpoint
2018-04-23 18:08:40.429882: W tensorflow/core/framework/op_kernel.cc:1198] Not found: Key aspp1_depthwise/BatchNorm/moving_variance not found in checkpoint
2018-04-23 18:08:40.430292: W tensorflow/core/framework/op_kernel.cc:1198] Not found: Key aspp1_pointwise/BatchNorm/beta not found in checkpoint
2018-04-23 18:08:40.430379: W tensorflow/core/framework/op_kernel.cc:1198] Not found: Key aspp1_depthwise/depthwise_weights not found in checkpoint
2018-04-23 18:08:40.430672: W tensorflow/core/framework/op_kernel.cc:1198] Not found: Key aspp1_pointwise/BatchNorm/gamma not found in checkpoint
2018-04-23 18:08:40.431735: W tensorflow/core/framework/op_kernel.cc:1198] Not found: Key aspp1_pointwise/BatchNorm/moving_mean not found in checkpoint
2018-04-23 18:08:40.431815: W tensorflow/core/framework/op_kernel.cc:1198] Not found: Key aspp1_pointwise/weights not found in checkpoint
2018-04-23 18:08:40.432072: W tensorflow/core/framework/op_kernel.cc:1198] Not found: Key aspp1_pointwise/BatchNorm/moving_variance not found in checkpoint
INFO:tensorflow:Error reported to Coordinator: , Key aspp1_depthwise/BatchNorm/beta not found in checkpoint
[[Node: save/RestoreV2_5 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_5/tensor_names, save/RestoreV2_5/shape_and_slices)]]
[[Node: save/RestoreV2_174/_1769 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_1502_save/RestoreV2_174", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Caused by op u'save/RestoreV2_5', defined at:
File "vis.py", line 320, in
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 124, in run
_sys.exit(main(argv))
File "vis.py", line 263, in main
saver = tf.train.Saver(slim.get_variables_to_restore())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1239, in __init__
self.build()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1248, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1284, in _build
build_save=build_save, build_restore=build_restore)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 765, in _build_internal
restore_sequentially, reshape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 428, in _AddRestoreOps
tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 268, in restore_op
[spec.tensor.dtype])[0])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1031, in restore_v2
shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3160, in create_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1625, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

NotFoundError (see above for traceback): Key aspp1_depthwise/BatchNorm/beta not found in checkpoint
[[Node: save/RestoreV2_5 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_5/tensor_names, save/RestoreV2_5/shape_and_slices)]]
[[Node: save/RestoreV2_174/_1769 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_1502_save/RestoreV2_174", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Traceback (most recent call last):
File "vis.py", line 320, in
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 124, in run
_sys.exit(main(argv))
File "vis.py", line 291, in main
sv.saver.restore(sess, last_checkpoint)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1686, in restore
{self.saver_def.filename_tensor_name: save_path})
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1128, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1344, in _do_run
options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1363, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key aspp1_depthwise/BatchNorm/beta not found in checkpoint
[[Node: save/RestoreV2_5 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_5/tensor_names, save/RestoreV2_5/shape_and_slices)]]
[[Node: save/RestoreV2_174/_1769 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_1502_save/RestoreV2_174", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Finalrykku on 23 Apr 2018

Okay, got the solution for my problem: I just had to specify each parameter as in the instructions. This added the missing tensors. Don't know the exact details, but it works. :/

Finalrykku on 27 Apr 2018

@Finalrykku I have exactly the same error, with checkpoint not detecting the ~/beta key. Could you please explain in more detail what you modified to solve this issue? It would be a great help =).

That's the end of my error:

`NotFoundError (see above for traceback): Key decoder/decoder_conv0_depthwise/BatchNorm/beta not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

Zitzo on 28 Apr 2018

It seems that a lot of people are having that same error. An in depth
tutorial on how to build a custom dataset would be incredibly helpful. If
anyone managed to solve it, please let us know. When this issue is solved,
I'll try to build a step by step tutorial. For now, I switched to the keras
version: https://github.com/bonlime/keras-deeplab-v3-plus

On Fri, Apr 27, 2018 at 7:04 PM, Zitzo notifications@github.com wrote:

@Finalrykku https://github.com/Finalrykku I have exactly the same
error, with checkpoint not detecting the ~/beta key. Could you please
explain in more detail what you modified to solve this issue? It would be a
great help =).

That's the end of my error

`NotFoundError (see above for traceback): Key decoder/decoder_conv0_depthwise/BatchNorm/beta
not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT,
DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT],
_device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0,
save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

`

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/tensorflow/models/issues/3992#issuecomment-385115154,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AMXKLqyE0Dn39-lnDbOTLuBFZq8AdscTks5ts6QVgaJpZM4TXCPV
.

cfosco on 1 May 2018

👎3

Sorry to hear this is a common issue for training on customize dataset :( Some hints:

could you remove / rename those checkpoint / train / eval dirs and run local_test.sh / local_test_mobilenet.sh and see everything passes?

If problem persists, could you provide a link to tensorboard train / eval graph or checkpoint files? We can debug from there. Thanks!

YknZhu on 1 May 2018

👎1

I have same problem with my own dataset.
@cfosco Do you got results with Keras?

marcosfelipp on 26 May 2018

Here is how I faced a similar error (Key aspp4_depthwise/BatchNorm/gamma not found in checkpoint), but seems for a different reason. I trained using atrous_rates=[6, 12, 18]. By mistake, I evaluated on atrous_rates=[6, 12, 18, 6, 12, 18]. Why the FLAGS list were duplicated? I accessed the FLAGS in the global section so params were evaluated twice (so be careful).

Side note for others in case helps: the code place where the key issue mistake happens is in: model.py - extract_features method - See line with text: scope = ASPP_SCOPE + str(i)

This scope asked for more entries (aspp4_*) that doesn't exist in the train model.

@YknZhu
It is written in the scripts that: "Note one could use different # atrous_rates/output_stride during training/evaluation."
Does this mean using the same list size but different values or even different list length are ok? If 2nd choice, then you know now one way to generate such issue on any dataset.

mostafa-saad on 17 Jul 2018

I'm facing the same issue. I have trained the model using my own dataset. It has 2 classes.

INFO:tensorflow:Waiting for new checkpoint at ./logs/
INFO:tensorflow:Found new checkpoint at ./logs/model.ckpt-1201
WARNING:tensorflow:From /home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/training/python/training/evaluation.py:301: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
INFO:tensorflow:Graph was finalized.
2018-08-23 01:03:22.494642: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
INFO:tensorflow:Restoring parameters from ./logs/model.ckpt-1201
2018-08-23 01:03:22.596262: W tensorflow/core/framework/op_kernel.cc:1318] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key MobilenetV2/Conv/BatchNorm/beta not found in checkpoint
Traceback (most recent call last):
  File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1322, in _do_call
    return fn(*args)
  File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: Key MobilenetV2/Conv/BatchNorm/beta not found in checkpoint
     [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "deeplab/eval.py", line 176, in <module>
    tf.app.run()
  File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "deeplab/eval.py", line 169, in main
    eval_interval_secs=FLAGS.eval_interval_secs)
  File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/evaluation.py", line 301, in evaluation_loop
    timeout=timeout)
  File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/training/python/training/evaluation.py", line 445, in evaluate_repeatedly
    session_creator=session_creator, hooks=hooks) as session:
  File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 816, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 539, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1002, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1007, in _create_session
    return self._sess_creator.create_session()
  File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 696, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 467, in create_session
    init_fn=self._scaffold.init_fn)
  File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/session_manager.py", line 279, in prepare_session
    config=config)
  File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/session_manager.py", line 191, in _restore_checkpoint
    saver.restore(sess, checkpoint_filename_with_path)
  File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1802, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key MobilenetV2/Conv/BatchNorm/beta not found in checkpoint
     [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

Caused by op 'save/RestoreV2', defined at:
  File "deeplab/eval.py", line 176, in <module>
    tf.app.run()
  File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "deeplab/eval.py", line 169, in main
    eval_interval_secs=FLAGS.eval_interval_secs)
  File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/evaluation.py", line 301, in evaluation_loop
    timeout=timeout)
  File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/training/python/training/evaluation.py", line 445, in evaluate_repeatedly
    session_creator=session_creator, hooks=hooks) as session:
  File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 816, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 539, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1002, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1007, in _create_session
    return self._sess_creator.create_session()
  File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 696, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 458, in create_session
    self._scaffold.finalize()
  File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 212, in finalize
    self._saver = training_saver._get_saver_or_default()  # pylint: disable=protected-access
  File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 910, in _get_saver_or_default
    saver = Saver(sharded=True, allow_empty=True)
  File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1338, in __init__
    self.build()
  File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1347, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1384, in _build
    build_save=build_save, build_restore=build_restore)
  File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 829, in _build_internal
    restore_sequentially, reshape)
  File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 525, in _AddShardedRestoreOps
    name="restore_shard"))
  File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 472, in _AddRestoreOps
    restore_sequentially)
  File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 886, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1463, in restore_v2
    shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
  File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3392, in create_op
    op_def=op_def)
  File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1718, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

NotFoundError (see above for traceback): Key MobilenetV2/Conv/BatchNorm/beta not found in checkpoint
     [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

I used Xception 41 pre trained weights and tuned the model to fir my dataset. The checkpoints and graph are saved in ./logs folder

arjunpmm on 23 Aug 2018

i have the same problem,
WARNING:tensorflow:From D:\tensorflow\venv\lib\site-packages\tensorflow\python\platform\app.py:125: main (from __main__) is deprecated and will be removed in
a future version.
Instructions for updating:
Use object_detection/model_main.py.
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
2018-09-29 11:31:51.594799: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow
binary was not compiled to use: AVX2
2018-09-29 11:31:51.854634: E T:\src\github\tensorflow\tensorflow\stream_executor\cuda\cuda_driver.cc:397] failed call to cuInit: CUDA_ERROR_NO_DEVICE
2018-09-29 11:31:51.862587: I T:\src\github\tensorflow\tensorflow\stream_executor\cuda\cuda_diagnostics.cc:163] retrieving CUDA diagnostic information for hos
t: DELL-Sea
2018-09-29 11:31:51.866719: I T:\src\github\tensorflow\tensorflow\stream_executor\cuda\cuda_diagnostics.cc:170] hostname: DELL-Sea
INFO:tensorflow:Restoring parameters from D:\tensorflow\WAGE\test\models\model\model_dir\model.ckpt-2000
INFO:tensorflow:Restoring parameters from D:\tensorflow\WAGE\test\models\model\model_dir\model.ckpt-2000
2018-09-29 11:31:52.905278: W T:\src\github\tensorflow\tensorflow\core\framework\op_kernel.cc:1275] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not fou
nd: Key lr not found in checkpoint
Traceback (most recent call last):
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\client\session.py", line 1278, in _do_call
return fn(*args)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\client\session.py", line 1263, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\client\session.py", line 1350, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: Key lr not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64, DT_FLOAT],
_device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\training\saver.py", line 1725, in restore
{self.saver_def.filename_tensor_name: save_path})
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\client\session.py", line 877, in run
run_metadata_ptr)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\client\session.py", line 1100, in _run
feed_dict_tensor, options, run_metadata)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\client\session.py", line 1272, in _do_run
run_metadata)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\client\session.py", line 1291, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key lr not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64, DT_FLOAT],
_device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

Caused by op 'save/RestoreV2', defined at:
File "eval.py", line 147, in
tf.app.run()
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\platform\app.py", line 125, in run
_sys.exit(main(argv))
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\util\deprecation.py", line 272, in new_func
return func(args, *kwargs)
File "eval.py", line 143, in main
graph_hook_fn=graph_rewriter_fn)
File "D:\tensorflow\tf.models\models\research\object_detection\legacy\evaluator.py", line 251, in evaluate
saver = tf.train.Saver(variables_to_restore)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\training\saver.py", line 1281, in __init__
self.build()
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\training\saver.py", line 1293, in build
self._build(self._filename, build_save=True, build_restore=True)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\training\saver.py", line 1330, in _build
build_save=build_save, build_restore=build_restore)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\training\saver.py", line 778, in _build_internal
restore_sequentially, reshape)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\training\saver.py", line 397, in _AddRestoreOps
restore_sequentially)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\training\saver.py", line 829, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\ops\gen_io_ops.py", line 1546, in restore_v2
shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\util\deprecation.py", line 454, in new_func
return func(args, *kwargs)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\framework\ops.py", line 3155, in create_op
op_def=op_def)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\framework\ops.py", line 1717, in __init__
self._traceback = tf_stack.extract_stack()

NotFoundError (see above for traceback): Key lr not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64, DT_FLOAT],
_device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\training\saver.py", line 1737, in restore
checkpointable.OBJECT_GRAPH_PROTO_KEY)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\pywrap_tensorflow_internal.py", line 348, in get_tensor
status)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 519, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: Key _CHECKPOINTABLE_OBJECT_GRAPH not found in checkpoint

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "eval.py", line 147, in
tf.app.run()
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\platform\app.py", line 125, in run
_sys.exit(main(argv))
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\util\deprecation.py", line 272, in new_func
return func(args, *kwargs)
File "eval.py", line 143, in main
graph_hook_fn=graph_rewriter_fn)
File "D:\tensorflow\tf.models\models\research\object_detection\legacy\evaluator.py", line 276, in evaluate
losses_dict=losses_dict)
File "D:\tensorflow\tf.models\models\research\object_detection\eval_util.py", line 438, in repeated_checkpoint_run
losses_dict=losses_dict)
File "D:\tensorflow\tf.models\models\research\object_detection\eval_util.py", line 278, in _run_checkpoint_once
restore_fn(sess)
File "D:\tensorflow\tf.models\models\research\object_detection\legacy\evaluator.py", line 255, in _restore_latest_checkpoint
saver.restore(sess, latest_checkpoint)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\training\saver.py", line 1743, in restore
err, "a Variable name or other graph key that is missing")
tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is
missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key lr not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64, DT_FLOAT],
_device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

NotFoundError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from
the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

cjr0106 on 29 Sep 2018

InvalidArgumentError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Assign requires shapes of both tensors to match. lhs shape= [2] rhs shape= [21]

I did the 2-classifier and change the initialize_last_layer to false and set the exclude_list to empty but still got the error.How can I fix it ?

Lijiaying201812 on 18 Mar 2019

Okay, got the solution for my problem: I just had to specify each parameter as in the instructions. This added the missing tensors. Don't know the exact details, but it works. :/

Could you give a detailed example of what you did, like what parameters ???