BERT: checkpoint restore error when fine-tuning BERT-Large (uncased_L-24_H-1024_A-16)

Created on 24 Jan 2019 · 2 Comments · Source: google-research/bert

I tested on CPU with uncased_L-24_H-1024_A-16 and got the following error, but there is no error when I use BERT-Base (uncased_L-12_H-768_A-12). The TF version is 1.11. Has anyone run into a similar issue?

Command to run:
python run_pretraining.py \
--input_file=../data/tmp/test.tfrecord \
--output_dir=../data/tmp/pretraining_output \
--do_train=True \
--do_eval=True \
--bert_config_file=$BERT_BASE_DIR/bert_config.json \
--init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
--train_batch_size=32 \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--num_train_steps=20 \
--num_warmup_steps=10 \
--learning_rate=2e-5

Log:
INFO:tensorflow:Restoring parameters from ../data/tmp/pretraining_output/model.ckpt-20
2019-01-24 18:27:58.869475: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key bert/encoder/layer_12/attention/output/LayerNorm/beta not found in checkpoint
INFO:tensorflow:Error recorded from evaluation_loop: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key bert/encoder/layer_12/attention/output/LayerNorm/beta not found in checkpoint
[[{{node save/RestoreV2}} = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

Most helpful comment

I had the same issue. The problem was that I had already trained a model and saved a checkpoint in the same output_dir, so it was trying to load those weights into a different architecture. In your case, try clearing ../data/tmp/pretraining_output and see if it works.

Thanks a lot, it works.
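
To confirm this diagnosis before deleting anything, one option is to count the encoder layers stored in the latest checkpoint in output_dir and compare that with num_hidden_layers in bert_config.json. The sketch below is only an illustration under assumptions: the helper names count_encoder_layers and stale_checkpoint_layers are hypothetical, and the variable-name pattern is inferred from the key in the log above. The TensorFlow calls used (tf.train.latest_checkpoint and tf.train.list_variables) should be available in TF 1.11, but this has not been run against that exact version.

```python
import re

# Pattern inferred from the log line above
# ("bert/encoder/layer_12/attention/output/LayerNorm/beta").
LAYER_RE = re.compile(r"^bert/encoder/layer_(\d+)/")


def count_encoder_layers(var_names):
    """Return how many distinct encoder layers appear in a list of variable names."""
    indices = {int(m.group(1)) for m in map(LAYER_RE.match, var_names) if m}
    return len(indices)


def stale_checkpoint_layers(output_dir):
    """Count encoder layers in the latest checkpoint in output_dir (hypothetical helper).

    Returns None when the directory holds no checkpoint.
    """
    import tensorflow as tf  # imported here so the pure helper above works without TF

    ckpt = tf.train.latest_checkpoint(output_dir)
    if ckpt is None:
        return None
    names = [name for name, _shape in tf.train.list_variables(ckpt)]
    return count_encoder_layers(names)
```

If stale_checkpoint_layers("../data/tmp/pretraining_output") reports 12 while bert_config.json for the large model has num_hidden_layers set to 24, the directory most likely still holds a checkpoint from an earlier BERT-Base run, which would match the "layer_12 ... not found in checkpoint" error. Clearing the directory or pointing --output_dir at a fresh one should then avoid the mismatch.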
