I am trying to fine-tune DeepLabv3+ on my own dataset (5 classes, 862 training examples, 216 val examples). I have modified the code to do so, and I think I have made all the main changes necessary to make it run. When I try to evaluate, however, the script breaks because my checkpoint is apparently missing a parameter (aspp1_depthwise/BatchNorm/beta):
NotFoundError: Key aspp1_depthwise/BatchNorm/beta not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
[[Node: save/RestoreV2/_67 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_84_save/RestoreV2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
Does anyone know how to solve this?
I am trying to fine-tune deeplab on my own segmentation dataset, which has 5 classes, 862 train examples and 216 val examples. To do this, I have completed the following steps:
I converted my dataset to TFRecord format with a new script, 'build_gdi_data.py', closely following 'build_ade20k_data.py'. It generated 4 .tfrecord shards for the train split, train-00000-of-00004.tfrecord to train-00003-of-00004.tfrecord (around 12 MB each), and the train and val records sit in the folders 'gd_train_tfrecord' and 'gd_val_tfrecord', respectively.
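For reference, each serialized example roughly contains the following. This is a simplified sketch: the feature keys follow deeplab's build_data.py as far as I can tell, and the helper names and paths below are placeholders, not my actual script.
```
import tensorflow as tf

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def write_example(writer, image_path, label_path, height, width):
    # image: JPEG-encoded RGB; label: single-channel PNG with class ids 0..4 (255 = ignore).
    with tf.gfile.GFile(image_path, 'rb') as f:
        image_data = f.read()
    with tf.gfile.GFile(label_path, 'rb') as f:
        seg_data = f.read()
    example = tf.train.Example(features=tf.train.Features(feature={
        'image/encoded': _bytes_feature(image_data),
        'image/filename': _bytes_feature(image_path.encode()),
        'image/format': _bytes_feature(b'jpeg'),
        'image/height': _int64_feature(height),
        'image/width': _int64_feature(width),
        'image/segmentation/class/encoded': _bytes_feature(seg_data),
        'image/segmentation/class/format': _bytes_feature(b'png'),
    }))
    writer.write(example.SerializeToString())

# Usage, one shard of the train split:
# writer = tf.python_io.TFRecordWriter('gd_train_tfrecord/train-00000-of-00004.tfrecord')
```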
In segmentation_dataset.py, I have added information about my dataset as follows:
_GRAPHIC_DESIGNS_INFORMATION = DatasetDescriptor(
    splits_to_sizes={
        'train': 862,
        'val': 216,
    },
    num_classes=5,
    ignore_label=255,
)
and
_DATASETS_INFORMATION = {
    'cityscapes': _CITYSCAPES_INFORMATION,
    'pascal_voc_seg': _PASCAL_VOC_SEG_INFORMATION,
    'ade20k': _ADE20K_INFORMATION,
    'graphic_designs': _GRAPHIC_DESIGNS_INFORMATION,
}
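If I read segmentation_dataset.py correctly, train.py and eval.py then look this entry up through the --dataset flag, roughly like this (a sketch of the call site, not the exact code):
```
from deeplab.datasets import segmentation_dataset

# '--dataset=graphic_designs' selects the DatasetDescriptor registered above;
# the split name and dataset_dir point at the TFRecord shards written earlier.
dataset = segmentation_dataset.get_dataset(
    'graphic_designs', 'train', dataset_dir='./gd_train_tfrecord')
```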
I also changed the following flag defaults:
flags.DEFINE_boolean('initialize_last_layer', False,
                     'Initialize the last layer.')
flags.DEFINE_boolean('last_layers_contain_logits_only', True,
                     'Only consider logits as last layers or not.')
flags.DEFINE_boolean('fine_tune_batch_norm', False,
                     'Fine tune the batch norm parameters or not.')
flags.DEFINE_string('dataset', 'graphic_designs',
                    'Name of the segmentation dataset.')
In train_utils.py, the restore exclusion now looks like this:
from deeplab.model import _LOGITS_SCOPE_NAME

# Variables that will not be restored.
exclude_list = [_LOGITS_SCOPE_NAME]
if not initialize_last_layer:
    exclude_list.extend(last_layers)
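For context, my understanding is that this list simply filters which variables slim restores from the initial checkpoint; a sketch of the surrounding train_utils.py logic (not the exact code, and the checkpoint path is a placeholder) would be:
```
import tensorflow.contrib.slim as slim

# Path normally passed via --tf_initial_checkpoint (placeholder here).
tf_initial_checkpoint = './deeplab_pretrained/deeplabv3_xception_pascal_trainval/model.ckpt'

# Variables matching anything in exclude_list are skipped, so the logits
# (and, with initialize_last_layer=False, all last layers) are freshly
# initialized instead of being restored from the 21-class PASCAL checkpoint.
variables_to_restore = slim.get_variables_to_restore(exclude=exclude_list)
init_fn = slim.assign_from_checkpoint_fn(
    tf_initial_checkpoint, variables_to_restore, ignore_missing_vars=True)
```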
I downloaded the Xception checkpoint 'deeplabv3_xception_pascal_trainval_2018_01_04.tar.gz' and I am pointing the train script at the model.ckpt.index file inside that archive.
I run the training script with the Xception architecture, output_stride=16, and just 30 training steps for debugging. This is my full call (from inside a Jupyter notebook):
PATH_TO_INITIAL_CHECKPOINT = "./deeplab_pretrained/deeplabv3_xception_pascal_trainval/model.ckpt.index"
PATH_TO_TRAIN_DIR = './train_logdir'
PATH_TO_DATASET = './gd_train_tfrecord'
%run ./models/research/deeplab/train.py \
--tf_initial_checkpoint=${PATH_TO_INITIAL_CHECKPOINT} \
--train_logdir=${PATH_TO_TRAIN_DIR} \
--dataset_dir=${PATH_TO_DATASET} \
--logtostderr \
--training_number_of_steps=30 \
--train_split="train" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--train_crop_size=513 \
--train_crop_size=513 \
--train_batch_size=1 \
--dataset="graphic_designs" \
--model_variant="xception_65"
The training seems to work, although I get an enormous number of "Variable ... missing in checkpoint" warnings, like so:
WARNING:tensorflow:Variable decoder/decoder_conv1_pointwise/weights/Momentum missing in checkpoint ./deeplab_pretrained/deeplabv3_xception_pascal_trainval/model.ckpt.index
It seems like all the Momentum, moving_mean and gamma variables are missing, which may be normal. After all those warnings, I finally get some training:
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path ./train_logdir\model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 0.
INFO:tensorflow:global step 10: loss = 4.7966 (0.860 sec/step)
INFO:tensorflow:global step 20: loss = 4.7966 (0.841 sec/step)
INFO:tensorflow:global step 30: loss = 4.7966 (0.870 sec/step)
INFO:tensorflow:Stopping Training.
INFO:tensorflow:Finished training! Saving model to disk.
No decrease in loss, but that's a problem for later, I guess. After that, I try to run the eval.py script like this:
PATH_TO_CHECKPOINT = "./train_logdir"
PATH_TO_EVAL_DIR = './eval_logdir'
PATH_TO_DATASET = './gd_val_tfrecord'
%run ./models/research/deeplab/eval.py \
--logtostderr \
--eval_split="val" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--eval_crop_size=609 \
--eval_crop_size=609 \
--dataset="graphic_designs" \
--checkpoint_dir=${PATH_TO_CHECKPOINT} \
--eval_logdir=${PATH_TO_EVAL_DIR} \
--dataset_dir=${PATH_TO_DATASET}
Now this breaks with a NotFoundError, and I don't understand why. Does anyone have thoughts? Here is the error I get:
NotFoundError: Key aspp1_depthwise/BatchNorm/beta not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
[[Node: save/RestoreV2/_67 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_84_save/RestoreV2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
And below is the full error traceback:
WARNING:tensorflow:From C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\contrib\learn\python\learn\datasets\base.py:198: retry (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Use the retry module or similar alternatives.
WARNING:tensorflow:From C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\contrib\learn\python\learn\datasets\base.py:198: retry (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Use the retry module or similar alternatives.
INFO:tensorflow:Evaluating on val set
INFO:tensorflow:Evaluating on val set
Original_image, processed_image, label shapes: (?, ?, 3) (?, ?, 3) (?, ?, 1)
Label: Tensor("case_1/cond/Merge:0", shape=(?, ?, 1), dtype=uint8)
INPUT_PREPROCESS.py: original_image.shape, processed_image.shape, label.shape: (?, ?, 3) (609, 609, 3) (609, 609, 1)
INFO:tensorflow:Performing single-scale test.
INFO:tensorflow:Performing single-scale test.
INFO:tensorflow:Eval num images 216
INFO:tensorflow:Eval num images 216
INFO:tensorflow:Eval batch size 1 and num batch 216
INFO:tensorflow:Eval batch size 1 and num batch 216
INFO:tensorflow:Waiting for new checkpoint at ./train_logdir
INFO:tensorflow:Waiting for new checkpoint at ./train_logdir
INFO:tensorflow:Found new checkpoint at ./train_logdir\model.ckpt-50
INFO:tensorflow:Found new checkpoint at ./train_logdir\model.ckpt-50
WARNING:tensorflow:From C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\contrib\training\python\training\evaluation.py:303: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
WARNING:tensorflow:From C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\contrib\training\python\training\evaluation.py:303: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from ./train_logdir\model.ckpt-50
INFO:tensorflow:Restoring parameters from ./train_logdir\model.ckpt-50
---------------------------------------------------------------------------
NotFoundError Traceback (most recent call last)
C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\client\session.py in _do_call(self, fn, *args)
1326 try:
-> 1327 return fn(*args)
1328 except errors.OpError as e:
C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\client\session.py in _run_fn(feed_dict, fetch_list, target_list, options, run_metadata)
1311 return self._call_tf_sessionrun(
-> 1312 options, feed_dict, fetch_list, target_list, run_metadata)
1313
C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\client\session.py in _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list, run_metadata)
1419 self._session, options, feed_dict, fetch_list, target_list,
-> 1420 status, run_metadata)
1421
C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\framework\errors_impl.py in __exit__(self, type_arg, value_arg, traceback_arg)
515 compat.as_text(c_api.TF_Message(self.status.status)),
--> 516 c_api.TF_GetCode(self.status.status))
517 # Delete the underlying status object from memory otherwise it stays alive
NotFoundError: Key aspp1_depthwise/BatchNorm/beta not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
[[Node: save/RestoreV2/_67 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_84_save/RestoreV2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
During handling of the above exception, another exception occurred:
NotFoundError Traceback (most recent call last)
C:\Users\Camilo\Dropbox\Graduate Studies\Harvard\AC299r Independent Research\deeplab_playground\models\research\deeplab\eval.py in <module>()
178 flags.mark_flag_as_required('eval_logdir')
179 flags.mark_flag_as_required('dataset_dir')
--> 180 tf.app.run()
C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\platform\app.py in run(main, argv)
124 # Call the main function, passing through any arguments
125 # to the final program.
--> 126 _sys.exit(main(argv))
127
128
C:\Users\Camilo\Dropbox\Graduate Studies\Harvard\AC299r Independent Research\deeplab_playground\models\research\deeplab\eval.py in main(unused_argv)
171 eval_op=list(metrics_to_updates.values()),
172 max_number_of_evaluations=num_eval_iters,
--> 173 eval_interval_secs=FLAGS.eval_interval_secs)
174
175
C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\contrib\slim\python\slim\evaluation.py in evaluation_loop(master, checkpoint_dir, logdir, num_evals, initial_op, initial_op_feed_dict, init_fn, eval_op, eval_op_feed_dict, final_op, final_op_feed_dict, summary_op, summary_op_feed_dict, variables_to_restore, eval_interval_secs, max_number_of_evaluations, session_config, timeout, hooks)
299 config=session_config,
300 max_number_of_evaluations=max_number_of_evaluations,
--> 301 timeout=timeout)
C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\contrib\training\python\training\evaluation.py in evaluate_repeatedly(checkpoint_dir, master, scaffold, eval_ops, feed_dict, final_ops, final_ops_feed_dict, eval_interval_secs, hooks, config, max_number_of_evaluations, timeout, timeout_fn)
445
446 with monitored_session.MonitoredSession(
--> 447 session_creator=session_creator, hooks=hooks) as session:
448 logging.info('Starting evaluation at ' + time.strftime(
449 '%Y-%m-%d-%H:%M:%S', time.gmtime()))
C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py in __init__(self, session_creator, hooks, stop_grace_period_secs)
793 super(MonitoredSession, self).__init__(
794 session_creator, hooks, should_recover=True,
--> 795 stop_grace_period_secs=stop_grace_period_secs)
796
797
C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py in __init__(self, session_creator, hooks, should_recover, stop_grace_period_secs)
516 stop_grace_period_secs=stop_grace_period_secs)
517 if should_recover:
--> 518 self._sess = _RecoverableSession(self._coordinated_creator)
519 else:
520 self._sess = self._coordinated_creator.create_session()
C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py in __init__(self, sess_creator)
979 """
980 self._sess_creator = sess_creator
--> 981 _WrappedSession.__init__(self, self._create_session())
982
983 def _create_session(self):
C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py in _create_session(self)
984 while True:
985 try:
--> 986 return self._sess_creator.create_session()
987 except _PREEMPTION_ERRORS as e:
988 logging.info('An error was raised while a session was being created. '
C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py in create_session(self)
673 """Creates a coordinated session."""
674 # Keep the tf_sess for unit testing.
--> 675 self.tf_sess = self._session_creator.create_session()
676 # We don't want coordinator to suppress any exception.
677 self.coord = coordinator.Coordinator(clean_stop_exception_types=[])
C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py in create_session(self)
444 init_op=self._scaffold.init_op,
445 init_feed_dict=self._scaffold.init_feed_dict,
--> 446 init_fn=self._scaffold.init_fn)
447
448
C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\session_manager.py in prepare_session(self, master, init_op, saver, checkpoint_dir, checkpoint_filename_with_path, wait_for_checkpoint, max_wait_secs, config, init_feed_dict, init_fn)
273 wait_for_checkpoint=wait_for_checkpoint,
274 max_wait_secs=max_wait_secs,
--> 275 config=config)
276 if not is_loaded_from_checkpoint:
277 if init_op is None and not init_fn and self._local_init_op is None:
C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\session_manager.py in _restore_checkpoint(self, master, saver, checkpoint_dir, checkpoint_filename_with_path, wait_for_checkpoint, max_wait_secs, config)
189
190 if checkpoint_filename_with_path:
--> 191 saver.restore(sess, checkpoint_filename_with_path)
192 return sess, True
193
C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\saver.py in restore(self, sess, save_path)
1773 else:
1774 sess.run(self.saver_def.restore_op_name,
-> 1775 {self.saver_def.filename_tensor_name: save_path})
1776
1777 @staticmethod
C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\client\session.py in run(self, fetches, feed_dict, options, run_metadata)
903 try:
904 result = self._run(None, fetches, feed_dict, options_ptr,
--> 905 run_metadata_ptr)
906 if run_metadata:
907 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)
C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\client\session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
1138 if final_fetches or final_targets or (handle and feed_dict_tensor):
1139 results = self._do_run(handle, final_targets, final_fetches,
-> 1140 feed_dict_tensor, options, run_metadata)
1141 else:
1142 results = []
C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\client\session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
1319 if handle is None:
1320 return self._do_call(_run_fn, feeds, fetches, targets, options,
-> 1321 run_metadata)
1322 else:
1323 return self._do_call(_prun_fn, handle, feeds, fetches)
C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\client\session.py in _do_call(self, fn, *args)
1338 except KeyError:
1339 pass
-> 1340 raise type(e)(node_def, op, message)
1341
1342 def _extend_graph(self):
NotFoundError: Key aspp1_depthwise/BatchNorm/beta not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
[[Node: save/RestoreV2/_67 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_84_save/RestoreV2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
Caused by op 'save/RestoreV2', defined at:
File "C:\Users\Camilo\Anaconda3\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "C:\Users\Camilo\Anaconda3\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\Camilo\Anaconda3\lib\site-packages\ipykernel_launcher.py", line 16, in <module>
app.launch_new_instance()
File "C:\Users\Camilo\Anaconda3\lib\site-packages\traitlets\config\application.py", line 658, in launch_instance
app.start()
File "C:\Users\Camilo\Anaconda3\lib\site-packages\ipykernel\kernelapp.py", line 477, in start
ioloop.IOLoop.instance().start()
File "C:\Users\Camilo\Anaconda3\lib\site-packages\zmq\eventloop\ioloop.py", line 177, in start
super(ZMQIOLoop, self).start()
File "C:\Users\Camilo\Anaconda3\lib\site-packages\tornado\ioloop.py", line 888, in start
handler_func(fd_obj, events)
File "C:\Users\Camilo\Anaconda3\lib\site-packages\tornado\stack_context.py", line 277, in null_wrapper
return fn(*args, **kwargs)
File "C:\Users\Camilo\Anaconda3\lib\site-packages\zmq\eventloop\zmqstream.py", line 440, in _handle_events
self._handle_recv()
File "C:\Users\Camilo\Anaconda3\lib\site-packages\zmq\eventloop\zmqstream.py", line 472, in _handle_recv
self._run_callback(callback, msg)
File "C:\Users\Camilo\Anaconda3\lib\site-packages\zmq\eventloop\zmqstream.py", line 414, in _run_callback
callback(*args, **kwargs)
File "C:\Users\Camilo\Anaconda3\lib\site-packages\tornado\stack_context.py", line 277, in null_wrapper
return fn(*args, **kwargs)
File "C:\Users\Camilo\Anaconda3\lib\site-packages\ipykernel\kernelbase.py", line 283, in dispatcher
return self.dispatch_shell(stream, msg)
File "C:\Users\Camilo\Anaconda3\lib\site-packages\ipykernel\kernelbase.py", line 235, in dispatch_shell
handler(stream, idents, msg)
File "C:\Users\Camilo\Anaconda3\lib\site-packages\ipykernel\kernelbase.py", line 399, in execute_request
user_expressions, allow_stdin)
File "C:\Users\Camilo\Anaconda3\lib\site-packages\ipykernel\ipkernel.py", line 196, in do_execute
res = shell.run_cell(code, store_history=store_history, silent=silent)
File "C:\Users\Camilo\Anaconda3\lib\site-packages\ipykernel\zmqshell.py", line 533, in run_cell
return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
File "C:\Users\Camilo\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2717, in run_cell
interactivity=interactivity, compiler=compiler, result=result)
File "C:\Users\Camilo\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2827, in run_ast_nodes
if self.run_code(code, result):
File "C:\Users\Camilo\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2881, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-3-17b437d1a5ce>", line 8, in <module>
get_ipython().magic('run {WORK_DIR}/eval.py --logtostderr --eval_split="val" --model_variant="xception_65" --atrous_rates=6 --atrous_rates=12 --atrous_rates=18 --output_stride=16 --decoder_output_stride=4 --eval_crop_size=609 --eval_crop_size=609 --dataset="graphic_designs" --fine_tune_batch_norm=False --checkpoint_dir=${PATH_TO_CHECKPOINT} --eval_logdir=${PATH_TO_EVAL_DIR} --dataset_dir=${PATH_TO_DATASET}')
File "C:\Users\Camilo\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2158, in magic
return self.run_line_magic(magic_name, magic_arg_s)
File "C:\Users\Camilo\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2079, in run_line_magic
result = fn(*args,**kwargs)
File "<decorator-gen-58>", line 2, in run
File "C:\Users\Camilo\Anaconda3\lib\site-packages\IPython\core\magic.py", line 188, in <lambda>
call = lambda f, *a, **k: f(*a, **k)
File "C:\Users\Camilo\Anaconda3\lib\site-packages\IPython\core\magics\execution.py", line 742, in run
run()
File "C:\Users\Camilo\Anaconda3\lib\site-packages\IPython\core\magics\execution.py", line 728, in run
exit_ignore=exit_ignore)
File "C:\Users\Camilo\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2481, in safe_execfile
self.compile if kw['shell_futures'] else None)
File "C:\Users\Camilo\Anaconda3\lib\site-packages\IPython\utils\py3compat.py", line 186, in execfile
exec(compiler(f.read(), fname, 'exec'), glob, loc)
File "C:\Users\Camilo\Dropbox\Graduate Studies\Harvard\AC299r Independent Research\deeplab_playground\models\research\deeplab\eval.py", line 180, in <module>
tf.app.run()
File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\platform\app.py", line 126, in run
_sys.exit(main(argv))
File "C:\Users\Camilo\Dropbox\Graduate Studies\Harvard\AC299r Independent Research\deeplab_playground\models\research\deeplab\eval.py", line 173, in main
eval_interval_secs=FLAGS.eval_interval_secs)
File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\contrib\slim\python\slim\evaluation.py", line 301, in evaluation_loop
timeout=timeout)
File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\contrib\training\python\training\evaluation.py", line 447, in evaluate_repeatedly
session_creator=session_creator, hooks=hooks) as session:
File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py", line 795, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py", line 518, in __init__
self._sess = _RecoverableSession(self._coordinated_creator)
File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py", line 981, in __init__
_WrappedSession.__init__(self, self._create_session())
File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py", line 986, in _create_session
return self._sess_creator.create_session()
File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py", line 675, in create_session
self.tf_sess = self._session_creator.create_session()
File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py", line 437, in create_session
self._scaffold.finalize()
File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py", line 212, in finalize
self._saver = training_saver._get_saver_or_default() # pylint: disable=protected-access
File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\saver.py", line 884, in _get_saver_or_default
saver = Saver(sharded=True, allow_empty=True)
File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\saver.py", line 1311, in __init__
self.build()
File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\saver.py", line 1320, in build
self._build(self._filename, build_save=True, build_restore=True)
File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\saver.py", line 1357, in _build
build_save=build_save, build_restore=build_restore)
File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\saver.py", line 803, in _build_internal
restore_sequentially, reshape)
File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\saver.py", line 501, in _AddShardedRestoreOps
name="restore_shard"))
File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\saver.py", line 448, in _AddRestoreOps
restore_sequentially)
File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\training\saver.py", line 860, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\ops\gen_io_ops.py", line 1541, in restore_v2
shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 3290, in create_op
op_def=op_def)
File "C:\Users\Camilo\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 1654, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
NotFoundError (see above for traceback): Key aspp1_depthwise/BatchNorm/beta not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
[[Node: save/RestoreV2/_67 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_84_save/RestoreV2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
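In case it helps with debugging, this is how I'm listing which keys actually ended up in the trained checkpoint (plain TF 1.x API; the path is just the checkpoint in my train_logdir):
```
import tensorflow as tf

# Print every variable name and shape stored in the checkpoint, so I can
# check directly whether keys like aspp1_depthwise/BatchNorm/beta exist.
for name, shape in tf.train.list_variables('./train_logdir/model.ckpt-30'):
    print(name, shape)
```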
May I ask a question? When I modify train_utils.py as you do above, the model doesn't train; it just saves the model.
The training output is:
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 30000.
INFO:tensorflow:Stopping Training.
INFO:tensorflow:Finished training! Saving model to disk.
So for now I still train with exclude_list = ['global_step']. Will this cause a problem? My project has 2 classes.
Thanks a lot.
This is interesting. I don't have a good answer for this off the top of my head.
I'm adding @yknzhu, who might be able to help.
Feel free to also seek help on StackOverflow with the tags "tensorflow" and "deeplab".
@cfosco that is a weird error :( I think most of the warning messages are expected, but it is odd that the batch norm beta is missing from the checkpoint. One thing I notice is that eval tries to load ./train_logdir\model.ckpt-50, but you are only training for 30 steps; maybe double-check that this is the right checkpoint?
@qmy612 that is because you are restoring the global step as well. Just dropping the global step variable should do it.
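In other words, keep 'global_step' in the exclusion and still extend it with the last layers when the number of classes changes; roughly (following the snippet quoted above, not verbatim from the repo):
```
# Exclude the global step so training creates a fresh one instead of restoring it,
# and exclude the last layers when fine-tuning on a different number of classes.
exclude_list = ['global_step']
if not initialize_last_layer:
    exclude_list.extend(last_layers)
```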
Hi @YknZhu, thanks for your answer. Unfortunately that is not the issue: I used to train for 50 steps but recently changed to 30, and the output you are seeing is from an old traceback. I'll update it now.
If anyone manages to find a solution to this, or has managed to train on their own dataset, I would love to see any available code to compare against.
Today I got a very similar error to yours. I trained with my own pictures and the same configuration as you, except for the default settings in train.py. I got this message a few minutes ago:
```
2018-04-23 18:08:36.506034: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from ./model/model.ckpt-21200
2018-04-23 18:08:40.428483: W tensorflow/core/framework/op_kernel.cc:1198] Not found: Key aspp1_depthwise/BatchNorm/beta not found in checkpoint
2018-04-23 18:08:40.428527: W tensorflow/core/framework/op_kernel.cc:1198] Not found: Key aspp1_depthwise/BatchNorm/gamma not found in checkpoint
2018-04-23 18:08:40.428541: W tensorflow/core/framework/op_kernel.cc:1198] Not found: Key aspp1_depthwise/BatchNorm/moving_mean not found in checkpoint
2018-04-23 18:08:40.429882: W tensorflow/core/framework/op_kernel.cc:1198] Not found: Key aspp1_depthwise/BatchNorm/moving_variance not found in checkpoint
2018-04-23 18:08:40.430292: W tensorflow/core/framework/op_kernel.cc:1198] Not found: Key aspp1_pointwise/BatchNorm/beta not found in checkpoint
2018-04-23 18:08:40.430379: W tensorflow/core/framework/op_kernel.cc:1198] Not found: Key aspp1_depthwise/depthwise_weights not found in checkpoint
2018-04-23 18:08:40.430672: W tensorflow/core/framework/op_kernel.cc:1198] Not found: Key aspp1_pointwise/BatchNorm/gamma not found in checkpoint
2018-04-23 18:08:40.431735: W tensorflow/core/framework/op_kernel.cc:1198] Not found: Key aspp1_pointwise/BatchNorm/moving_mean not found in checkpoint
2018-04-23 18:08:40.431815: W tensorflow/core/framework/op_kernel.cc:1198] Not found: Key aspp1_pointwise/weights not found in checkpoint
2018-04-23 18:08:40.432072: W tensorflow/core/framework/op_kernel.cc:1198] Not found: Key aspp1_pointwise/BatchNorm/moving_variance not found in checkpoint
INFO:tensorflow:Error reported to Coordinator:
[[Node: save/RestoreV2_5 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_5/tensor_names, save/RestoreV2_5/shape_and_slices)]]
[[Node: save/RestoreV2_174/_1769 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_1502_save/RestoreV2_174", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
Caused by op u'save/RestoreV2_5', defined at:
File "vis.py", line 320, in
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 124, in run
_sys.exit(main(argv))
File "vis.py", line 263, in main
saver = tf.train.Saver(slim.get_variables_to_restore())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1239, in __init__
self.build()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1248, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1284, in _build
build_save=build_save, build_restore=build_restore)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 765, in _build_internal
restore_sequentially, reshape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 428, in _AddRestoreOps
tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 268, in restore_op
[spec.tensor.dtype])[0])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1031, in restore_v2
shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3160, in create_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1625, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
NotFoundError (see above for traceback): Key aspp1_depthwise/BatchNorm/beta not found in checkpoint
[[Node: save/RestoreV2_5 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_5/tensor_names, save/RestoreV2_5/shape_and_slices)]]
[[Node: save/RestoreV2_174/_1769 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_1502_save/RestoreV2_174", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
Traceback (most recent call last):
File "vis.py", line 320, in
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 124, in run
_sys.exit(main(argv))
File "vis.py", line 291, in main
sv.saver.restore(sess, last_checkpoint)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1686, in restore
{self.saver_def.filename_tensor_name: save_path})
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1128, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1344, in _do_run
options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1363, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key aspp1_depthwise/BatchNorm/beta not found in checkpoint
[[Node: save/RestoreV2_5 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_5/tensor_names, save/RestoreV2_5/shape_and_slices)]]
[[Node: save/RestoreV2_174/_1769 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_1502_save/RestoreV2_174", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
Caused by op u'save/RestoreV2_5', defined at:
File "vis.py", line 320, in
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 124, in run
_sys.exit(main(argv))
File "vis.py", line 263, in main
saver = tf.train.Saver(slim.get_variables_to_restore())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1239, in __init__
self.build()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1248, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1284, in _build
build_save=build_save, build_restore=build_restore)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 765, in _build_internal
restore_sequentially, reshape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 428, in _AddRestoreOps
tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 268, in restore_op
[spec.tensor.dtype])[0])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1031, in restore_v2
shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3160, in create_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1625, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
NotFoundError (see above for traceback): Key aspp1_depthwise/BatchNorm/beta not found in checkpoint
[[Node: save/RestoreV2_5 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_5/tensor_names, save/RestoreV2_5/shape_and_slices)]]
[[Node: save/RestoreV2_174/_1769 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_1502_save/RestoreV2_174", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
```
Okay, I found the solution for my problem: I just had to specify each parameter explicitly, as in the instructions. This added the missing tensors. I don't know the exact details, but it works. :/
@Finalrykku I have exactly the same error, with the checkpoint not containing the .../beta key. Could you please explain in more detail what you modified to solve this issue? It would be a great help =).
That's the end of my error:
```
NotFoundError (see above for traceback): Key decoder/decoder_conv0_depthwise/BatchNorm/beta not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
```
It seems that a lot of people are having that same error. An in-depth tutorial on how to build a custom dataset would be incredibly helpful. If anyone manages to solve it, please let us know. When this issue is solved, I'll try to put together a step-by-step tutorial. For now, I switched to the Keras version: https://github.com/bonlime/keras-deeplab-v3-plus
Sorry to hear this is a common issue for training on a custom dataset :( Some hints:
Could you remove / rename those checkpoint / train / eval dirs, run local_test.sh / local_test_mobilenet.sh, and see whether everything passes?
If the problem persists, could you provide a link to the TensorBoard train / eval graph or the checkpoint files? We can debug from there. Thanks!
I have the same problem with my own dataset.
@cfosco Did you get results with Keras?
Here is how I hit a similar error (Key aspp4_depthwise/BatchNorm/gamma not found in checkpoint), though apparently for a different reason. I trained with atrous_rates=[6, 12, 18]. By mistake, I evaluated with atrous_rates=[6, 12, 18, 6, 12, 18]. Why was the FLAGS list duplicated? I accessed FLAGS in the global section, so the params were evaluated twice (so be careful).
Side note for others, in case it helps: the place in the code where the key mismatch happens is model.py, in the extract_features method; see the line with scope = ASPP_SCOPE + str(i).
This scope asks for more entries (aspp4_*) than exist in the trained model.
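A minimal sketch of that loop (paraphrasing extract_features, not the exact code) shows why a longer atrous_rates list at eval time asks for aspp4_* keys that were never trained:
```
ASPP_SCOPE = 'aspp'

# Each atrous rate gets its own numbered scope: aspp1, aspp2, aspp3, ...
# Training with [6, 12, 18] only ever creates (and saves) aspp1..aspp3, so an
# accidental 6-element list at eval time builds aspp4..aspp6 branches whose
# variables cannot be found in the checkpoint.
atrous_rates = [6, 12, 18, 6, 12, 18]  # the duplicated flag value from my mistake
for i, rate in enumerate(atrous_rates, 1):
    scope = ASPP_SCOPE + str(i)
    print(scope, 'uses atrous rate', rate)
```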
==
@YknZhu
It is written in the scripts that: "Note one could use different # atrous_rates/output_stride during training/evaluation."
Does this mean the same list length with different values, or are different list lengths OK as well? If the latter, then you now know one way to produce this issue on any dataset.
I'm facing the same issue. I have trained the model using my own dataset. It has 2 classes.
INFO:tensorflow:Waiting for new checkpoint at ./logs/
INFO:tensorflow:Found new checkpoint at ./logs/model.ckpt-1201
WARNING:tensorflow:From /home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/training/python/training/evaluation.py:301: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
INFO:tensorflow:Graph was finalized.
2018-08-23 01:03:22.494642: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
INFO:tensorflow:Restoring parameters from ./logs/model.ckpt-1201
2018-08-23 01:03:22.596262: W tensorflow/core/framework/op_kernel.cc:1318] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key MobilenetV2/Conv/BatchNorm/beta not found in checkpoint
Traceback (most recent call last):
File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1322, in _do_call
return fn(*args)
File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: Key MobilenetV2/Conv/BatchNorm/beta not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "deeplab/eval.py", line 176, in <module>
tf.app.run()
File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "deeplab/eval.py", line 169, in main
eval_interval_secs=FLAGS.eval_interval_secs)
File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/evaluation.py", line 301, in evaluation_loop
timeout=timeout)
File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/training/python/training/evaluation.py", line 445, in evaluate_repeatedly
session_creator=session_creator, hooks=hooks) as session:
File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 816, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 539, in __init__
self._sess = _RecoverableSession(self._coordinated_creator)
File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1002, in __init__
_WrappedSession.__init__(self, self._create_session())
File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1007, in _create_session
return self._sess_creator.create_session()
File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 696, in create_session
self.tf_sess = self._session_creator.create_session()
File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 467, in create_session
init_fn=self._scaffold.init_fn)
File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/session_manager.py", line 279, in prepare_session
config=config)
File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/session_manager.py", line 191, in _restore_checkpoint
saver.restore(sess, checkpoint_filename_with_path)
File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1802, in restore
{self.saver_def.filename_tensor_name: save_path})
File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 900, in run
run_metadata_ptr)
File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1135, in _run
feed_dict_tensor, options, run_metadata)
File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1316, in _do_run
run_metadata)
File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1335, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key MobilenetV2/Conv/BatchNorm/beta not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
Caused by op 'save/RestoreV2', defined at:
File "deeplab/eval.py", line 176, in <module>
tf.app.run()
File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "deeplab/eval.py", line 169, in main
eval_interval_secs=FLAGS.eval_interval_secs)
File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/evaluation.py", line 301, in evaluation_loop
timeout=timeout)
File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/training/python/training/evaluation.py", line 445, in evaluate_repeatedly
session_creator=session_creator, hooks=hooks) as session:
File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 816, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 539, in __init__
self._sess = _RecoverableSession(self._coordinated_creator)
File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1002, in __init__
_WrappedSession.__init__(self, self._create_session())
File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1007, in _create_session
return self._sess_creator.create_session()
File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 696, in create_session
self.tf_sess = self._session_creator.create_session()
File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 458, in create_session
self._scaffold.finalize()
File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 212, in finalize
self._saver = training_saver._get_saver_or_default() # pylint: disable=protected-access
File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 910, in _get_saver_or_default
saver = Saver(sharded=True, allow_empty=True)
File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1338, in __init__
self.build()
File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1347, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1384, in _build
build_save=build_save, build_restore=build_restore)
File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 829, in _build_internal
restore_sequentially, reshape)
File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 525, in _AddShardedRestoreOps
name="restore_shard"))
File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 472, in _AddRestoreOps
restore_sequentially)
File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 886, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1463, in restore_v2
shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3392, in create_op
op_def=op_def)
File "/home/mallik/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1718, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
NotFoundError (see above for traceback): Key MobilenetV2/Conv/BatchNorm/beta not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
I used Xception-41 pre-trained weights and tuned the model to fit my dataset. The checkpoints and graph are saved in the ./logs folder.
I have the same problem:
WARNING:tensorflow:From D:\tensorflow\venv\lib\site-packages\tensorflow\python\platform\app.py:125: main (from __main__) is deprecated and will be removed in
a future version.
Instructions for updating:
Use object_detection/model_main.py.
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
2018-09-29 11:31:51.594799: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow
binary was not compiled to use: AVX2
2018-09-29 11:31:51.854634: E T:\src\github\tensorflow\tensorflow\stream_executor\cuda\cuda_driver.cc:397] failed call to cuInit: CUDA_ERROR_NO_DEVICE
2018-09-29 11:31:51.862587: I T:\src\github\tensorflow\tensorflow\stream_executor\cuda\cuda_diagnostics.cc:163] retrieving CUDA diagnostic information for hos
t: DELL-Sea
2018-09-29 11:31:51.866719: I T:\src\github\tensorflow\tensorflow\stream_executor\cuda\cuda_diagnostics.cc:170] hostname: DELL-Sea
INFO:tensorflow:Restoring parameters from D:\tensorflow\WAGE\test\models\model\model_dir\model.ckpt-2000
INFO:tensorflow:Restoring parameters from D:\tensorflow\WAGE\test\models\model\model_dir\model.ckpt-2000
2018-09-29 11:31:52.905278: W T:\src\github\tensorflow\tensorflow\core\framework\op_kernel.cc:1275] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not fou
nd: Key lr not found in checkpoint
Traceback (most recent call last):
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\client\session.py", line 1278, in _do_call
return fn(*args)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\client\session.py", line 1263, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\client\session.py", line 1350, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: Key lr not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64, DT_FLOAT],
_device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\training\saver.py", line 1725, in restore
{self.saver_def.filename_tensor_name: save_path})
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\client\session.py", line 877, in run
run_metadata_ptr)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\client\session.py", line 1100, in _run
feed_dict_tensor, options, run_metadata)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\client\session.py", line 1272, in _do_run
run_metadata)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\client\session.py", line 1291, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key lr not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64, DT_FLOAT],
_device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
Caused by op 'save/RestoreV2', defined at:
File "eval.py", line 147, in
tf.app.run()
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\platform\app.py", line 125, in run
_sys.exit(main(argv))
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\util\deprecation.py", line 272, in new_func
return func(*args, **kwargs)
File "eval.py", line 143, in main
graph_hook_fn=graph_rewriter_fn)
File "D:\tensorflow\tf.models\models\research\object_detection\legacy\evaluator.py", line 251, in evaluate
saver = tf.train.Saver(variables_to_restore)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\training\saver.py", line 1281, in __init__
self.build()
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\training\saver.py", line 1293, in build
self._build(self._filename, build_save=True, build_restore=True)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\training\saver.py", line 1330, in _build
build_save=build_save, build_restore=build_restore)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\training\saver.py", line 778, in _build_internal
restore_sequentially, reshape)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\training\saver.py", line 397, in _AddRestoreOps
restore_sequentially)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\training\saver.py", line 829, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\ops\gen_io_ops.py", line 1546, in restore_v2
shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\util\deprecation.py", line 454, in new_func
return func(*args, **kwargs)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\framework\ops.py", line 3155, in create_op
op_def=op_def)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\framework\ops.py", line 1717, in __init__
self._traceback = tf_stack.extract_stack()
NotFoundError (see above for traceback): Key lr not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64, DT_FLOAT],
_device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\training\saver.py", line 1737, in restore
checkpointable.OBJECT_GRAPH_PROTO_KEY)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\pywrap_tensorflow_internal.py", line 348, in get_tensor
status)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 519, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: Key _CHECKPOINTABLE_OBJECT_GRAPH not found in checkpoint
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "eval.py", line 147, in
tf.app.run()
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\platform\app.py", line 125, in run
_sys.exit(main(argv))
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\util\deprecation.py", line 272, in new_func
return func(*args, **kwargs)
File "eval.py", line 143, in main
graph_hook_fn=graph_rewriter_fn)
File "D:\tensorflow\tf.models\models\research\object_detection\legacy\evaluator.py", line 276, in evaluate
losses_dict=losses_dict)
File "D:\tensorflow\tf.models\models\research\object_detection\eval_util.py", line 438, in repeated_checkpoint_run
losses_dict=losses_dict)
File "D:\tensorflow\tf.models\models\research\object_detection\eval_util.py", line 278, in _run_checkpoint_once
restore_fn(sess)
File "D:\tensorflow\tf.models\models\research\object_detection\legacy\evaluator.py", line 255, in _restore_latest_checkpoint
saver.restore(sess, latest_checkpoint)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\training\saver.py", line 1743, in restore
err, "a Variable name or other graph key that is missing")
tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is
missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
Key lr not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64, DT_FLOAT],
_device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
Caused by op 'save/RestoreV2', defined at:
File "eval.py", line 147, in
tf.app.run()
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\platform\app.py", line 125, in run
_sys.exit(main(argv))
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\util\deprecation.py", line 272, in new_func
return func(*args, **kwargs)
File "eval.py", line 143, in main
graph_hook_fn=graph_rewriter_fn)
File "D:\tensorflow\tf.models\models\research\object_detection\legacy\evaluator.py", line 251, in evaluate
saver = tf.train.Saver(variables_to_restore)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\training\saver.py", line 1281, in __init__
self.build()
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\training\saver.py", line 1293, in build
self._build(self._filename, build_save=True, build_restore=True)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\training\saver.py", line 1330, in _build
build_save=build_save, build_restore=build_restore)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\training\saver.py", line 778, in _build_internal
restore_sequentially, reshape)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\training\saver.py", line 397, in _AddRestoreOps
restore_sequentially)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\training\saver.py", line 829, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\ops\gen_io_ops.py", line 1546, in restore_v2
shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\util\deprecation.py", line 454, in new_func
return func(*args, **kwargs)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\framework\ops.py", line 3155, in create_op
op_def=op_def)
File "D:\tensorflow\venv\lib\site-packages\tensorflow\python\framework\ops.py", line 1717, in __init__
self._traceback = tf_stack.extract_stack()
NotFoundError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from
the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
Key lr not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64, DT_FLOAT],
_device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
InvalidArgumentError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
Assign requires shapes of both tensors to match. lhs shape= [2] rhs shape= [21]
I built the 2-class model, changed initialize_last_layer to false, and set the exclude_list to empty, but I still get the error. How can I fix it?
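If it helps, the [21] on the right-hand side looks like the 21-class PASCAL logits from the initial checkpoint, while [2] is the new 2-class head, so with an empty exclude_list the restore tries to load the old logits into the new layer. A small snippet (the checkpoint path is a placeholder) to check which logits shapes a checkpoint actually stores:
```
import tensorflow as tf

# List every variable whose name mentions 'logits' together with its stored shape;
# a 21-channel logits tensor restored into a 2-class graph produces exactly the
# "lhs shape= [2] rhs shape= [21]" assign error above.
reader = tf.train.NewCheckpointReader('./deeplab_pretrained/model.ckpt')  # placeholder path
shapes = reader.get_variable_to_shape_map()
for name in sorted(shapes):
    if 'logits' in name:
        print(name, shapes[name])
```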
Okay, got the solution for my problem: I just had to specify each parameter as in the instructions. This added the missing tensors. Don't know the exact details, but it works. :/
Could you give a detailed example of what you did, e.g., which parameters?