Models: [deeplab] Key aspp1_depthwise/BatchNorm/beta not found in checkpoint

Created on 13 Jun 2018 · 12 comments · Source: tensorflow/models

I am running into an issue when running eval on a custom dataset. When the checkpoint is loaded into the model, I get an error about parameters (Momentum, etc.) missing from the checkpoint file.

System information

  • What is the top-level directory of the model you are using: deeplab
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): YES
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux 3.10.0-693.11.6.el7.x86_64
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 1.8.0
  • Bazel version (if compiling from source):
  • CUDA/cuDNN version: cuda/9.0.176 cudnn/7.0
  • GPU model and memory: (not sure, running on university cluster)
  • Exact command to reproduce:

DATASET="./datasets/tfrecord"

TRAIN_LOGDIR="./datasets/exp/train"
EVAL_LOGDIR="./datasets/exp/val"

python ./eval.py \
--logtostderr \
--eval_split="val" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--eval_crop_size=256 \
--eval_crop_size=256 \
--checkpoint_dir="${TRAIN_LOGDIR}" \
--eval_logdir="${EVAL_LOGDIR}" \
--dataset_dir="${DATASET}" \
--max_number_of_evaluations=1

Describe the problem

I am using a custom dataset from a project. We have 960 jpg training images with corresponding png masks, plus 180 validation image/mask pairs. We have two classes, and all the png masks are converted to binary label images (checked in MATLAB; every mask contains only 0s and 1s).

I exported these images to TFRecord using the scripts in the datasets folder, although I had to hardcode the following to get the expected output:

FLAGS.image_format = "jpg"
FLAGS.label_format = "png"
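(Side note: the stock conversion script, build_voc2012_data.py, exposes these settings as command-line flags, so if a custom script keeps those flag definitions the same effect should be achievable without editing the code — a rough sketch, with placeholder paths:)

python ./datasets/build_voc2012_data.py \
--image_folder="PATH/TO/JPEG_IMAGES" \
--semantic_segmentation_folder="PATH/TO/PNG_MASKS" \
--list_folder="PATH/TO/SPLIT_LISTS" \
--image_format="jpg" \
--label_format="png" \
--output_dir="./datasets/tfrecord"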

I then trained the model using:

DATASET="./datasets/tfrecord"

TRAIN_LOGDIR="./datasets/exp/train"

CKPT="./xception/model.ckpt"

NUM_ITERATIONS=20
python ./train.py \
--logtostderr \
--train_split="train" \
--model_variant="xception_65" \
--output_stride=16 \
--train_crop_size=256 \
--train_crop_size=256 \
--train_batch_size=4 \
--training_number_of_steps="${NUM_ITERATIONS}" \
--tf_initial_checkpoint="${CKPT}" \
--fine_tune_batch_norm=true \
--train_logdir="${TRAIN_LOGDIR}" \
--dataset_dir="${DATASET}"

The training output is what I expected:

INFO:tensorflow:Training on train set
WARNING:tensorflow:From /cluster/home/kellerke/.local/lib64/python3.6/site-packages/tensorflow/python/ops/losses/losses_impl.py:731: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See tf.nn.softmax_cross_entropy_with_logits_v2.

INFO:tensorflow:Initializing model from path: ./xception_65/model.ckpt
WARNING:tensorflow:Variable image_pooling/weights missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable image_pooling/BatchNorm/gamma missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable image_pooling/BatchNorm/beta missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable image_pooling/BatchNorm/moving_mean missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable image_pooling/BatchNorm/moving_variance missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable aspp0/weights missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable aspp0/BatchNorm/gamma missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable aspp0/BatchNorm/beta missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable aspp0/BatchNorm/moving_mean missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable aspp0/BatchNorm/moving_variance missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable concat_projection/weights missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable concat_projection/BatchNorm/gamma missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable concat_projection/BatchNorm/beta missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable concat_projection/BatchNorm/moving_mean missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable concat_projection/BatchNorm/moving_variance missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable logits/semantic/weights missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable logits/semantic/biases missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable image_pooling/weights/Momentum missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable image_pooling/BatchNorm/gamma/Momentum missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable image_pooling/BatchNorm/beta/Momentum missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable aspp0/weights/Momentum missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable aspp0/BatchNorm/gamma/Momentum missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable aspp0/BatchNorm/beta/Momentum missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable concat_projection/weights/Momentum missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable concat_projection/BatchNorm/gamma/Momentum missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable concat_projection/BatchNorm/beta/Momentum missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable logits/semantic/weights/Momentum missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable logits/semantic/biases/Momentum missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:From /cluster/home/kellerke/.local/lib64/python3.6/site-packages/tensorflow/contrib/slim/python/slim/learning.py:736: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2018-06-13 22:48:58.842532: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
INFO:tensorflow:Restoring parameters from ./xception_65/model.ckpt
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path ./datasets/exp/train/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 0.
INFO:tensorflow:global step 10: loss = 3.2193 (4.939 sec/step)
INFO:tensorflow:global step 20: loss = 3.1794 (4.912 sec/step)
INFO:tensorflow:Stopping Training.
INFO:tensorflow:Finished training! Saving model to disk.

The issue arises when I try to evaluate the model.

I added the following lines to the segmentation_dataset.py file:

_SOYBEAN_INFORMATION = DatasetDescriptor(
    splits_to_sizes={
        'train': 960,
        'val': 180,
    },
    num_classes=2,
    ignore_label=255,
)

_DATASETS_INFORMATION = {
    'cityscapes': _CITYSCAPES_INFORMATION,
    'pascal_voc_seg': _PASCAL_VOC_SEG_INFORMATION,
    'ade20k': _ADE20K_INFORMATION,
    'soybean': _SOYBEAN_INFORMATION,
}

And I changed the eval.py dataset setting here:

flags.DEFINE_string('dataset', 'soybean',
                    'Name of the segmentation dataset.')
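(An alternative that avoids editing the source: 'dataset' is an ordinary flag, so — assuming the stock flag definition shown above — it can simply be passed on the command line to eval.py and train.py. Abbreviated sketch; the remaining flags are the same as in the eval command below:)

python ./eval.py \
--dataset="soybean" \
--eval_split="val" \
--checkpoint_dir="${TRAIN_LOGDIR}" \
--eval_logdir="${EVAL_LOGDIR}" \
--dataset_dir="${DATASET}"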

When I run the evaluation code, the system first waits for a checkpoint and then fails once one is found:

DATASET="./datasets/tfrecord"

TRAIN_LOGDIR="./datasets/exp/train"
EVAL_LOGDIR="./datasets/exp/val"

CKPT="./xception/model.ckpt"

python ./eval.py \
--logtostderr \
--eval_split="val" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--eval_crop_size=256 \
--eval_crop_size=256 \
--checkpoint_dir="${TRAIN_LOGDIR}" \
--eval_logdir="${EVAL_LOGDIR}" \
--dataset_dir="${DATASET}" \
--max_number_of_evaluations=1

Issue

INFO:tensorflow:Evaluating on val set
INFO:tensorflow:Performing single-scale test.
INFO:tensorflow:Eval num images 180
INFO:tensorflow:Eval batch size 1 and num batch 180
INFO:tensorflow:Waiting for new checkpoint at ./datasets/exp/train
INFO:tensorflow:Found new checkpoint at ./datasets/exp/train/model.ckpt-21
WARNING:tensorflow:From /cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/contrib/training/python/training/evaluation.py:301: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
INFO:tensorflow:Graph was finalized.
2018-06-13 21:47:51.795975: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
INFO:tensorflow:Restoring parameters from ./datasets/exp/train/model.ckpt-21
2018-06-13 21:47:52.012550: W tensorflow/core/framework/op_kernel.cc:1318] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key aspp1_depthwise/BatchNorm/beta not found in checkpoint
Traceback (most recent call last):
File "./eval.py", line 176, in
tf.app.run()
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "./eval.py", line 169, in main
eval_interval_secs=FLAGS.eval_interval_secs)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/contrib/slim/python/slim/evaluation.py", line 301, in evaluation_loop
timeout=timeout)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/contrib/training/python/training/evaluation.py", line 445, in evaluate_repeatedly
session_creator=session_creator, hooks=hooks) as session:
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 816, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 539, in __init__
self._sess = _RecoverableSession(self._coordinated_creator)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1002, in __init__
_WrappedSession.__init__(self, self._create_session())
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1007, in _create_session
return self._sess_creator.create_session()
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 696, in create_session
self.tf_sess = self._session_creator.create_session()
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 467, in create_session
init_fn=self._scaffold.init_fn)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/session_manager.py", line 279, in prepare_session
config=config)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/session_manager.py", line 191, in _restore_checkpoint
saver.restore(sess, checkpoint_filename_with_path)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1802, in restore
{self.saver_def.filename_tensor_name: save_path})
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 900, in run
run_metadata_ptr)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1135, in _run
feed_dict_tensor, options, run_metadata)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1316, in _do_run
run_metadata)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1335, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key aspp1_depthwise/BatchNorm/beta not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

Caused by op u'save/RestoreV2', defined at:
File "./eval.py", line 176, in
tf.app.run()
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "./eval.py", line 169, in main
eval_interval_secs=FLAGS.eval_interval_secs)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/contrib/slim/python/slim/evaluation.py", line 301, in evaluation_loop
timeout=timeout)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/contrib/training/python/training/evaluation.py", line 445, in evaluate_repeatedly
session_creator=session_creator, hooks=hooks) as session:
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 816, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 539, in __init__
self._sess = _RecoverableSession(self._coordinated_creator)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1002, in __init__
_WrappedSession.__init__(self, self._create_session())
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1007, in _create_session
return self._sess_creator.create_session()
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 696, in create_session
self.tf_sess = self._session_creator.create_session()
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 458, in create_session
self._scaffold.finalize()
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 212, in finalize
self._saver = training_saver._get_saver_or_default() # pylint: disable=protected-access
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 910, in _get_saver_or_default
saver = Saver(sharded=True, allow_empty=True)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1338, in __init__
self.build()
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1347, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1384, in _build
build_save=build_save, build_restore=build_restore)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 829, in _build_internal
restore_sequentially, reshape)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 525, in _AddShardedRestoreOps
name="restore_shard"))
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 472, in _AddRestoreOps
restore_sequentially)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 886, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1463, in restore_v2
shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3392, in create_op
op_def=op_def)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1718, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

NotFoundError (see above for traceback): Key aspp1_depthwise/BatchNorm/beta not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]



All 12 comments

This is the same issue as #3992, and no final solution was decided upon there.

I exported the tensor list from the original checkpoint (xception_65) as well as from the checkpoint created after 20 iterations of training that used it as the initial checkpoint.

Neither of them mentions "aspp1_depthwise/BatchNorm/beta" or contains any reference to "aspp1".

If you look through the 'deeplabv3_pascal_train_aug_2018_01_04.tar.gz' trained model that is used during the local_test.sh setup, there is a pbtxt file downloaded as an initial model config file, and if you scan through it, all the variables missing from the import are present in the graph description.

So even though we are using an xception_65 checkpoint that was not trained, we are still trying to use the graph structure of the trained PASCAL xception_65 model.

I'm not sure how to resolve this situation. Somewhere in the deeplab codebase there must be a description of the model?

20_iteration_Chkpt.txt
Original_Chkpt.txt
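(For anyone who wants to reproduce this comparison: one way to dump the variable names stored in a checkpoint is TensorFlow's bundled inspect_checkpoint tool — a sketch using the checkpoint paths from above:)

python -m tensorflow.python.tools.inspect_checkpoint \
--file_name="./xception/model.ckpt"

python -m tensorflow.python.tools.inspect_checkpoint \
--file_name="./datasets/exp/train/model.ckpt-21"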

I am also facing the same problem:
OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key xception_65/entry_flow/block1/unit_1/xception_module/separable_conv1_depthwise/BatchNorm/beta not found in checkpoint

Kindly help me solve this problem.

The root cause is that during training you did not set atrous_rates = [6, 12, 18], while during evaluation you did, so the model was trained without the ASPP module that evaluation expects.

Regarding this kind of issue, please:

  1. Double-check that you have used the same flags during both training and evaluation.
  2. Check the scripts we have provided for reference (local_test.sh and local_test_mobilenetv2.sh).
  3. Remember to change the experiment folder, in case you are loading a checkpoint from a previous experiment.

Hope that helps.
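(To make the first point concrete for the commands in this issue: a training invocation that matches the evaluation flags would look roughly like the original train command with the atrous_rates and decoder_output_stride flags added — a sketch, not an official recipe:)

DATASET="./datasets/tfrecord"
TRAIN_LOGDIR="./datasets/exp/train"
CKPT="./xception/model.ckpt"

NUM_ITERATIONS=20
python ./train.py \
--logtostderr \
--train_split="train" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--train_crop_size=256 \
--train_crop_size=256 \
--train_batch_size=4 \
--training_number_of_steps="${NUM_ITERATIONS}" \
--tf_initial_checkpoint="${CKPT}" \
--fine_tune_batch_norm=true \
--train_logdir="${TRAIN_LOGDIR}" \
--dataset_dir="${DATASET}"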

Is this still an issue?

I'm facing the same problem now.
Is this issue resolved?

I ran into the same problem and have solved it now. I think it is caused by the implicit use of the default graph. To fix it, just change the call to "tf.get_default_graph()" into "tf.Graph()". It really works for me, and it may help solve yours.

Hi there,
We are checking to see if you still need help on this, as this seems to be a considerably old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing.
If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing it.

Hi!
Sorry, I can't give you a straight answer for your case, but I ran into a similar problem with cityscapes + mobilenet_v3_large_seg.

In the end, I fully solved my issue with the help of this link:
https://github.com/tensorflow/models/issues/7869

I mean you must be careful and pay extra attention to the parameters passed to the test, eval, and vis scripts. The problem is not in the model, the checkpoint, or the TF version, but in the scripts' parameters.

Hi @trialzuki, could you please share your commands for "cityscapes + mobilenet_v3_large_seg"? I am getting the same error even though I am using similar flags in both training and evaluation.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you.

Closing as stale. Please reopen if you'd like to work on this further.
