Models: [deeplab] Key aspp1_depthwise/BatchNorm/beta not found in checkpoint

Created on 13 Jun 2018 · 12 comments · Source: tensorflow/models

I am running into an issue when running eval on a custom dataset. When the checkpoint is loaded into the model, I get an error about parameters (Momentum, etc.) missing from the checkpoint file.

System information

  • What is the top-level directory of the model you are using: deeplab
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): YES
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux 3.10.0-693.11.6.el7.x86_64
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 1.8.0
  • Bazel version (if compiling from source):
  • CUDA/cuDNN version: cuda/9.0.176 cudnn/7.0
  • GPU model and memory: (not sure, running on university cluster)
  • Exact command to reproduce:

DATASET="./datasets/tfrecord"

TRAIN_LOGDIR="./datasets/exp/train"
EVAL_LOGDIR="./datasets/exp/val"

python ./eval.py \
--logtostderr \
--eval_split="val" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--eval_crop_size=256 \
--eval_crop_size=256 \
--checkpoint_dir="${TRAIN_LOGDIR}" \
--eval_logdir="${EVAL_LOGDIR}" \
--dataset_dir="${DATASET}" \
--max_number_of_evaluations=1

Describe the problem

I am using a custom dataset from a project. We have 960 jpg training images with corresponding png masks, plus 180 validation image/mask pairs. We have two classes, and all the png masks are converted to binary label images (checked in MATLAB; every mask contains only 0s and 1s).

I exported these images to TFRecord using the scripts in the datasets folder, although I had to hardcode the following to get the expected output:

FLAGS.image_format = "jpg"
FLAGS.label_format = "png"
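(Side note: the stock conversion script, build_voc2012_data.py, exposes these settings as command-line flags, so if a custom script keeps those flag definitions the same effect should be achievable without editing the code — a rough sketch, with placeholder paths:)

python ./datasets/build_voc2012_data.py \
--image_folder="PATH/TO/JPEG_IMAGES" \
--semantic_segmentation_folder="PATH/TO/PNG_MASKS" \
--list_folder="PATH/TO/SPLIT_LISTS" \
--image_format="jpg" \
--label_format="png" \
--output_dir="./datasets/tfrecord"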

I then trained the model using:

DATASET="./datasets/tfrecord"

TRAIN_LOGDIR="./datasets/exp/train"

CKPT="./xception/model.ckpt"

NUM_ITERATIONS=20
python ./train.py \
--logtostderr \
--train_split="train" \
--model_variant="xception_65" \
--output_stride=16 \
--train_crop_size=256 \
--train_crop_size=256 \
--train_batch_size=4 \
--training_number_of_steps="${NUM_ITERATIONS}" \
--tf_initial_checkpoint="${CKPT}" \
--fine_tune_batch_norm=true \
--train_logdir="${TRAIN_LOGDIR}" \
--dataset_dir="${DATASET}"

The training output is what I expected:

INFO:tensorflow:Training on train set
WARNING:tensorflow:From /cluster/home/kellerke/.local/lib64/python3.6/site-packages/tensorflow/python/ops/losses/losses_impl.py:731: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See tf.nn.softmax_cross_entropy_with_logits_v2.

INFO:tensorflow:Initializing model from path: ./xception_65/model.ckpt
WARNING:tensorflow:Variable image_pooling/weights missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable image_pooling/BatchNorm/gamma missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable image_pooling/BatchNorm/beta missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable image_pooling/BatchNorm/moving_mean missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable image_pooling/BatchNorm/moving_variance missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable aspp0/weights missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable aspp0/BatchNorm/gamma missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable aspp0/BatchNorm/beta missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable aspp0/BatchNorm/moving_mean missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable aspp0/BatchNorm/moving_variance missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable concat_projection/weights missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable concat_projection/BatchNorm/gamma missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable concat_projection/BatchNorm/beta missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable concat_projection/BatchNorm/moving_mean missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable concat_projection/BatchNorm/moving_variance missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable logits/semantic/weights missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable logits/semantic/biases missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable image_pooling/weights/Momentum missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable image_pooling/BatchNorm/gamma/Momentum missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable image_pooling/BatchNorm/beta/Momentum missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable aspp0/weights/Momentum missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable aspp0/BatchNorm/gamma/Momentum missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable aspp0/BatchNorm/beta/Momentum missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable concat_projection/weights/Momentum missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable concat_projection/BatchNorm/gamma/Momentum missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable concat_projection/BatchNorm/beta/Momentum missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable logits/semantic/weights/Momentum missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:Variable logits/semantic/biases/Momentum missing in checkpoint ./xception_65/model.ckpt
WARNING:tensorflow:From /cluster/home/kellerke/.local/lib64/python3.6/site-packages/tensorflow/contrib/slim/python/slim/learning.py:736: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2018-06-13 22:48:58.842532: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
INFO:tensorflow:Restoring parameters from ./xception_65/model.ckpt
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path ./datasets/exp/train/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 0.
INFO:tensorflow:global step 10: loss = 3.2193 (4.939 sec/step)
INFO:tensorflow:global step 20: loss = 3.1794 (4.912 sec/step)
INFO:tensorflow:Stopping Training.
INFO:tensorflow:Finished training! Saving model to disk.

The issue arises when I try to evaluate the model.

I added the following lines to the segmentation_dataset.py file:

_SOYBEAN_INFORMATION = DatasetDescriptor(
    splits_to_sizes={
        'train': 960,
        'val': 180,
    },
    num_classes=2,
    ignore_label=255,
)

_DATASETS_INFORMATION = {
    'cityscapes': _CITYSCAPES_INFORMATION,
    'pascal_voc_seg': _PASCAL_VOC_SEG_INFORMATION,
    'ade20k': _ADE20K_INFORMATION,
    'soybean': _SOYBEAN_INFORMATION,
}

And I changed the eval.py dataset setting here:

flags.DEFINE_string('dataset', 'soybean',
                    'Name of the segmentation dataset.')
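(An alternative that avoids editing the source: 'dataset' is an ordinary flag, so — assuming the stock flag definition shown above — it can simply be passed on the command line to eval.py and train.py. Abbreviated sketch; the remaining flags are the same as in the eval command below:)

python ./eval.py \
--dataset="soybean" \
--eval_split="val" \
--checkpoint_dir="${TRAIN_LOGDIR}" \
--eval_logdir="${EVAL_LOGDIR}" \
--dataset_dir="${DATASET}"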

When I run the evaluation code, the system first waits for a checkpoint and then fails once one is found:

DATASET="./datasets/tfrecord"

TRAIN_LOGDIR="./datasets/exp/train"
EVAL_LOGDIR="./datasets/exp/val"

CKPT="./xception/model.ckpt"

python ./eval.py \
--logtostderr \
--eval_split="val" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--eval_crop_size=256 \
--eval_crop_size=256 \
--checkpoint_dir="${TRAIN_LOGDIR}" \
--eval_logdir="${EVAL_LOGDIR}" \
--dataset_dir="${DATASET}" \
--max_number_of_evaluations=1

Issue

INFO:tensorflow:Evaluating on val set
INFO:tensorflow:Performing single-scale test.
INFO:tensorflow:Eval num images 180
INFO:tensorflow:Eval batch size 1 and num batch 180
INFO:tensorflow:Waiting for new checkpoint at ./datasets/exp/train
INFO:tensorflow:Found new checkpoint at ./datasets/exp/train/model.ckpt-21
WARNING:tensorflow:From /cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/contrib/training/python/training/evaluation.py:301: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
INFO:tensorflow:Graph was finalized.
2018-06-13 21:47:51.795975: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
INFO:tensorflow:Restoring parameters from ./datasets/exp/train/model.ckpt-21
2018-06-13 21:47:52.012550: W tensorflow/core/framework/op_kernel.cc:1318] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key aspp1_depthwise/BatchNorm/beta not found in checkpoint
Traceback (most recent call last):
File "./eval.py", line 176, in
tf.app.run()
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "./eval.py", line 169, in main
eval_interval_secs=FLAGS.eval_interval_secs)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/contrib/slim/python/slim/evaluation.py", line 301, in evaluation_loop
timeout=timeout)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/contrib/training/python/training/evaluation.py", line 445, in evaluate_repeatedly
session_creator=session_creator, hooks=hooks) as session:
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 816, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 539, in __init__
self._sess = _RecoverableSession(self._coordinated_creator)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1002, in __init__
_WrappedSession.__init__(self, self._create_session())
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1007, in _create_session
return self._sess_creator.create_session()
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 696, in create_session
self.tf_sess = self._session_creator.create_session()
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 467, in create_session
init_fn=self._scaffold.init_fn)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/session_manager.py", line 279, in prepare_session
config=config)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/session_manager.py", line 191, in _restore_checkpoint
saver.restore(sess, checkpoint_filename_with_path)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1802, in restore
{self.saver_def.filename_tensor_name: save_path})
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 900, in run
run_metadata_ptr)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1135, in _run
feed_dict_tensor, options, run_metadata)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1316, in _do_run
run_metadata)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1335, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key aspp1_depthwise/BatchNorm/beta not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

Caused by op u'save/RestoreV2', defined at:
File "./eval.py", line 176, in
tf.app.run()
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "./eval.py", line 169, in main
eval_interval_secs=FLAGS.eval_interval_secs)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/contrib/slim/python/slim/evaluation.py", line 301, in evaluation_loop
timeout=timeout)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/contrib/training/python/training/evaluation.py", line 445, in evaluate_repeatedly
session_creator=session_creator, hooks=hooks) as session:
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 816, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 539, in __init__
self._sess = _RecoverableSession(self._coordinated_creator)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1002, in __init__
_WrappedSession.__init__(self, self._create_session())
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1007, in _create_session
return self._sess_creator.create_session()
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 696, in create_session
self.tf_sess = self._session_creator.create_session()
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 458, in create_session
self._scaffold.finalize()
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 212, in finalize
self._saver = training_saver._get_saver_or_default() # pylint: disable=protected-access
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 910, in _get_saver_or_default
saver = Saver(sharded=True, allow_empty=True)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1338, in __init__
self.build()
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1347, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1384, in _build
build_save=build_save, build_restore=build_restore)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 829, in _build_internal
restore_sequentially, reshape)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 525, in _AddShardedRestoreOps
name="restore_shard"))
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 472, in _AddRestoreOps
restore_sequentially)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 886, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1463, in restore_v2
shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3392, in create_op
op_def=op_def)
File "/cluster/home/kellerke/models/research/deeplab/env/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1718, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

NotFoundError (see above for traceback): Key aspp1_depthwise/BatchNorm/beta not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]



All 12 comments

This is the same issue as #3992, and no final solution was decided upon there.

I exported the tensor list from the original checkpoint (xception_65) as well as from the checkpoint created after 20 iterations of training that used it as the initial checkpoint.

Neither of them mentions "aspp1_depthwise/BatchNorm/beta" or contains any reference to "aspp1".

If you look through the 'deeplabv3_pascal_train_aug_2018_01_04.tar.gz' trained model that is used during the local_test.sh setup, there is a pbtxt file downloaded as an initial model config file, and if you scan through it, all the variables missing from the import are present in the graph description.

So even though we are using an xception_65 checkpoint that was not trained, we are still trying to use the graph structure of the trained PASCAL xception_65 model.

I'm not sure how to resolve this situation. Somewhere in the deeplab codebase there must be a description of the model?

20_iteration_Chkpt.txt
Original_Chkpt.txt
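(For anyone who wants to reproduce this comparison: one way to dump the variable names stored in a checkpoint is TensorFlow's bundled inspect_checkpoint tool — a sketch using the checkpoint paths from above:)

python -m tensorflow.python.tools.inspect_checkpoint \
--file_name="./xception/model.ckpt"

python -m tensorflow.python.tools.inspect_checkpoint \
--file_name="./datasets/exp/train/model.ckpt-21"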

I am also facing the same problem:
OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key xception_65/entry_flow/block1/unit_1/xception_module/separable_conv1_depthwise/BatchNorm/beta not found in checkpoint

Kindly help me solve this problem.

The root cause is that during training you did not set atrous_rates = [6, 12, 18], while during evaluation you did, so the model was trained without the ASPP module that evaluation expects.

Regarding this kind of issue, please:

  1. Double-check that you have used the same flags during both training and evaluation.
  2. Check the scripts we have provided for reference (local_test.sh and local_test_mobilenetv2.sh).
  3. Remember to change the experiment folder, in case you are loading a checkpoint from a previous experiment.

Hope that helps.
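(To make the first point concrete for the commands in this issue: a training invocation that matches the evaluation flags would look roughly like the original train command with the atrous_rates and decoder_output_stride flags added — a sketch, not an official recipe:)

DATASET="./datasets/tfrecord"
TRAIN_LOGDIR="./datasets/exp/train"
CKPT="./xception/model.ckpt"

NUM_ITERATIONS=20
python ./train.py \
--logtostderr \
--train_split="train" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--train_crop_size=256 \
--train_crop_size=256 \
--train_batch_size=4 \
--training_number_of_steps="${NUM_ITERATIONS}" \
--tf_initial_checkpoint="${CKPT}" \
--fine_tune_batch_norm=true \
--train_logdir="${TRAIN_LOGDIR}" \
--dataset_dir="${DATASET}"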

Is this still an issue?

I'm facing the same problem now.
Is this issue resolved?

I ran into the same problem and have solved it now. I think it is caused by the implicit use of the default graph. To fix it, just change the call to "tf.get_default_graph()" into "tf.Graph()". It really works for me, and it may help solve yours.

Hi there,
We are checking to see if you still need help on this, as this seems to be a considerably old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing.
If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing it.

Hi!
Sorry, I can't give you a straight answer for your case, but I ran into a similar problem with cityscapes + mobilenet_v3_large_seg.

In the end, I fully solved my issue with the help of this link:
https://github.com/tensorflow/models/issues/7869

I mean you must be careful and pay extra attention to the parameters passed to the test, eval, and vis scripts. The problem is not in the model, the checkpoint, or the TF version, but in the scripts' parameters.

Hi @trialzuki, could you please share your commands for "cityscapes + mobilenet_v3_large_seg"? I am getting the same error even though I am using similar flags in both training and evaluation.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you.

Closing as stale. Please reopen if you'd like to work on this further.
