INFO:tensorflow: name = bert/encoder/layer_23/output/LayerNorm/gamma:0, shape = (1024,), INIT_FROM_CKPT
INFO:tensorflow: name = bert/pooler/dense/kernel:0, shape = (1024, 1024), INIT_FROM_CKPT
INFO:tensorflow: name = bert/pooler/dense/bias:0, shape = (1024,), INIT_FROM_CKPT
INFO:tensorflow: name = output_weights:0, shape = (2, 1024)
INFO:tensorflow: name = output_bias:0, shape = (2,)
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from output/model.ckpt-0
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into output/model.ckpt.
It gets stuck after this. Am I doing something wrong?
python run_classifier.py \
--task_name=MRPC \
--do_train=true \
--do_eval=true \
--data_dir=$GLUE_DIR/MRPC \
--vocab_file=$BERT_BASE_DIR/vocab.txt \
--bert_config_file=$BERT_BASE_DIR/bert_config.json \
--init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
--max_seq_length=128 \
--train_batch_size=32 \
--learning_rate=2e-5 \
--num_train_epochs=3.0 \
--output_dir=/tmp/mrpc_output/
That's pretty normal in my experience. I left it running overnight and checked on it the next day, and by then it had sorted itself out.
Refer to #212.
I'm pretty confident the lag is not caused by saving the checkpoint itself, even though the last log line says "Saving checkpoints...". I checked my bucket and found that the checkpoint gets saved within a minute. I also tried running with only 20 steps, and the run completes just fine. Maybe the fix here is in the logging: instead of logging "Saving..." before the save starts, it should log "Saved..." once the save is done. That way we're not misled into debugging the save. Extra logging to signal that the run is still alive would also be very helpful!
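To make the logging idea concrete, here is a minimal sketch in plain Python. It is not code from run_classifier.py or TensorFlow; the helper name `logged_step` and its parameters are hypothetical. It logs when a step starts, emits a "still running" message at a fixed interval, and logs a "finished" message only after the step has actually completed.

```python
import logging
import threading
import time
from contextlib import contextmanager

logger = logging.getLogger("train_progress")


@contextmanager
def logged_step(name, heartbeat_secs=60.0, log=logger.info):
    """Hypothetical helper: log start, periodic heartbeats, and completion."""
    done = threading.Event()
    start = time.monotonic()

    def heartbeat():
        # Event.wait returns False on timeout and True once `done` is set,
        # so this loop emits one message per interval until the step ends.
        while not done.wait(heartbeat_secs):
            log("%s: still running after %.0fs" % (name, time.monotonic() - start))

    log("%s: started" % name)
    worker = threading.Thread(target=heartbeat, daemon=True)
    worker.start()
    try:
        yield
    finally:
        # Always stop the heartbeat thread, even if the step raised.
        done.set()
        worker.join()
    # Reached only when the wrapped step completed without an exception.
    log("%s: finished in %.1fs" % (name, time.monotonic() - start))
```

Usage would look like `with logged_step("saving checkpoint"): ...` around whatever call is suspected of hanging. For the checkpoint message specifically, TensorFlow 1.x also has `tf.train.CheckpointSaverListener`, whose `after_save` method could log a "saved" line, but I have not tried wiring that into run_classifier.py.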