Models: BERT - AssertionError: Some objects had attributes which were not restored

Created on 8 Aug 2019 · 12 Comments · Source: tensorflow/models

System information

What is the top-level directory of the model you are using: bert
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Mint 19.1
TensorFlow installed from (source or binary): binary
TensorFlow version (use command below): 2.0.0-dev20190808 (latest tf-nightly-2.0-preview)
Bazel version (if compiling from source): N/A
CUDA/cuDNN version: N/A
GPU model and memory: N/A
Exact command to reproduce: python bert/scripts/run_training.py

Describe the problem

Pointing init_checkpoint at the pre-trained BERT checkpoint fails to restore some of the model's variables.

EDIT: Also, the pre-trained models were downloaded from the links in the BERT research repo.

Source code / logs

2019-08-08 12:22:05.326882: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-08-08 12:22:05.349384: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2112000000 Hz
2019-08-08 12:22:05.349992: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x557cdc5d2850 executing computations on platform Host. Devices:
2019-08-08 12:22:05.350014: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
W0808 12:22:05.351007 140572354307712 cross_device_ops.py:1209] There is non-GPU devices in `tf.distribute.Strategy`, not using nccl allreduce.
I0808 12:22:05.351552 140572354307712 run_training.py:212] Training using customized training loop TF 2.0 with distrubutedstrategy.
I0808 12:22:07.030467 140572354307712 training.py:237] Checkpoint file /home/jason/code/python/bert_poc/model_assets/current_model/bert_model.ckpt found and restoring from initial checkpoint for core model.
W0808 12:22:07.035190 140572354307712 deprecation.py:323] From /home/jason/.virtualenvs/bert_poc/lib/python3.7/site-packages/tensorflow_core/python/training/tracking/util.py:1243: NameBasedSaverStatus.__init__ (from tensorflow.python.training.tracking.util) is deprecated and will be removed in a future version.
Instructions for updating:
Restoring a name-based tf.train.Saver checkpoint using the object-based restore API. This mode uses global names to match variables, and so is somewhat fragile. It also adds new restore ops to the graph each time it is called when graph building. Prefer re-encoding training checkpoints in the object-based format: run save() on the object-based saver (the same one this message is coming from) and use that checkpoint in the future.
Traceback (most recent call last):
  File "bert/scripts/run_training.py", line 297, in <module>
    app.run(main)
  File "/home/jason/.virtualenvs/bert_poc/lib/python3.7/site-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/home/jason/.virtualenvs/bert_poc/lib/python3.7/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "bert/scripts/run_training.py", line 262, in main
    run_bert(strategy, input_meta_data)
  File "bert/scripts/run_training.py", line 229, in run_bert
    run_eagerly=FLAGS.run_eagerly
  File "bert/scripts/run_training.py", line 150, in run_customized_training
    run_eagerly=run_eagerly)
  File "/home/jason/code/python/bert_poc/bert/utils/models/training.py", line 239, in run_customized_training_loop
    checkpoint.restore(init_checkpoint).assert_consumed()
  File "/home/jason/.virtualenvs/bert_poc/lib/python3.7/site-packages/tensorflow_core/python/training/tracking/util.py", line 937, in assert_consumed
    (unused_attributes,))
AssertionError: Some objects had attributes which were not restored: {MirroredVariable:{
  0 /job:localhost/replica:0/task:0/device:CPU:0: <tf.Variable 'bert_model/word_embeddings/embeddings:0' shape=(30522, 768) dtype=float32, numpy=
array([[-0.00663621,  0.03823964, -0.01097937, ..., -0.0048779 ,
         0.00939153, -0.01687311],
       [ 0.01313792,  0.00559815,  0.00938186, ..., -0.0064949 ,
        -0.02506432, -0.02421579],
       [-0.01420996,  0.02229128, -0.01568796, ..., -0.01148546,
        -0.03540878,  0.03208613],
       ...,
       [ 0.00906128,  0.01069895,  0.00192638, ..., -0.02777208,
        -0.01992673, -0.01442556],
       [-0.00120432,  0.01587117, -0.00391243, ..., -0.00188635,
        -0.01371268,  0.00056052],
       [ 0.01843435,  0.00429037, -0.00945545, ...,  0.01043763,
        -0.01208154, -0.00506586]], dtype=float32)>
}: ['bert_model/word_embeddings/embeddings'], MirroredVariable:{
  0 /job:localhost/replica:0/task:0/device:CPU:0: <tf.Variable 'bert_model/embedding_postprocessor/type_embeddings:0' shape=(2, 768) dtype=float32, numpy=
array([[-0.02382992,  0.0284374 ,  0.03485439, ..., -0.00448464,
        -0.01138918,  0.02226768],
       [ 0.03432631,  0.00748088,  0.00935707, ..., -0.03541335,
         0.02503476, -0.01519339]], dtype=float32)>
}: ['bert_model/embedding_postprocessor/type_embeddings'], MirroredVariable:{
  0 /job:localhost/replica:0/task:0/device:CPU:0: <tf.Variable 'bert_model/embedding_postprocessor/position_embeddings:0' shape=(512, 768) dtype=float32, numpy=
array([[-0.0243349 , -0.01334215, -0.0246487 , ..., -0.02907648,
        -0.01122028, -0.00622248],
       [-0.00537879,  0.00461607,  0.00743089, ...,  0.01881546,
         0.00727143, -0.02386224],
       [-0.03852841, -0.03120534,  0.0227373 , ..., -0.00507259,
        -0.01455521, -0.01340351],
       ...,
       [-0.00368215, -0.0110319 , -0.00871077, ...,  0.00128342,
        -0.01734554,  0.02089777],
       [-0.00857777,  0.00298624,  0.00668371, ...,  0.0047777 ,
         0.02726714,  0.00596354],
       [-0.00322871, -0.01083235, -0.0029519 , ...,  0.01701725,
         0.00803378, -0.00552202]], dtype=float32)>
}: ['bert_model/embedding_postprocessor/position_embeddings'], MirroredVariable:{
  0 /job:localhost/replica:0/task:0/device:CPU:0: <tf.Variable 'bert_model/pooler_transform/kernel:0' shape=(768, 768) dtype=float32, numpy=
array([[-0.02066252,  0.00877059,  0.0030486 , ..., -0.01429707,
        -0.01191622, -0.00383769],
       [-0.00089017,  0.02977573,  0.01499144, ...,  0.0107305 ,
         0.01953771, -0.02430641],
       [ 0.00764155,  0.02311237,  0.01478866, ...,  0.00922052,
         0.02205642,  0.02981877],
       ...,
       [ 0.00451655, -0.01770904, -0.00652671, ..., -0.01666754,
        -0.00634123,  0.02065968],
       [-0.02414763,  0.03848261,  0.02823355, ..., -0.00049913,
         0.00823175,  0.02343463],
       [-0.00461377, -0.00193397, -0.02001752, ..., -0.03454453,
        -0.00390934, -0.03594256]], dtype=float32)>
}: ['bert_model/pooler_transform/kernel'], MirroredVariable:{
  0 /job:localhost/replica:0/task:0/device:CPU:0: <tf.Variable 'bert_model/pooler_transform/bias:0' shape=(768,) dtype=float32, numpy=
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0.], dtype=float32)>
}: ['bert_model/pooler_transform/bias'], MirroredVariable:{
  0 /job:localhost/replica:0/task:0/device:CPU:0: <tf.Variable 'bert_model/embedding_postprocessor/layer_norm/gamma:0' shape=(768,) dtype=float32, numpy=
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1.], dtype=float32)>
}: ['bert_model/embedding_postprocessor/layer_norm/gamma'], MirroredVariable:{
  0 /job:localhost/replica:0/task:0/device:CPU:0: <tf.Variable 'bert_model/embedding_postprocessor/layer_norm/beta:0' shape=(768,) dtype=float32, numpy=
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0.], dtype=float32)>
}: ['bert_model/embedding_postprocessor/layer_norm/beta'], MirroredVariable:{
  0 /job:localhost/replica:0/task:0/device:CPU:0: <tf.Variable 'save_counter:0' shape=() dtype=int64, numpy=0>
}: ['save_counter']}

Source

jmwoloso

All 12 comments

Hi, Sorry for the issue.
The TF1 name-based checkpoint downloaded from google-research/bert is not directly compatible with the TF2 BERT model here. tf.train.Checkpoint loads a name-based checkpoint by matching variable names, but the Keras BERT model's variable names do not all match the names in that checkpoint.
Here is the TF2 checkpoint we converted from the TF1 checkpoint (the tensor values are the same): https://github.com/tensorflow/models/blob/master/official/bert/benchmark/bert_benchmark.py#L38

We can publish the TF2 checkpoints soon.
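
To make the name mismatch concrete, here is a rough sketch in plain Python (no TensorFlow needed). The Keras-side names are copied from the AssertionError earlier in this thread. The TF1-side names are what I'd expect to find in the google-research/bert checkpoint and should be treated as an assumption until checked against the actual file. The helper `unmatched_names` is hypothetical and exists only for this illustration.

```python
# Names expected in the TF1 google-research/bert checkpoint.
# Assumption: written from that repo's naming scheme, not read from a file.
TF1_CHECKPOINT_NAMES = {
    "bert/embeddings/word_embeddings",
    "bert/embeddings/token_type_embeddings",
    "bert/embeddings/position_embeddings",
    "bert/embeddings/LayerNorm/gamma",
    "bert/embeddings/LayerNorm/beta",
    "bert/pooler/dense/kernel",
    "bert/pooler/dense/bias",
}

# Names the Keras model reported as not restored (from the traceback above).
KERAS_MODEL_NAMES = [
    "bert_model/word_embeddings/embeddings",
    "bert_model/embedding_postprocessor/type_embeddings",
    "bert_model/embedding_postprocessor/position_embeddings",
    "bert_model/pooler_transform/kernel",
    "bert_model/pooler_transform/bias",
    "bert_model/embedding_postprocessor/layer_norm/gamma",
    "bert_model/embedding_postprocessor/layer_norm/beta",
]


def unmatched_names(model_names, checkpoint_names):
    """Return the model variable names with no exact match in the checkpoint."""
    return [name for name in model_names if name not in checkpoint_names]


missing = unmatched_names(KERAS_MODEL_NAMES, TF1_CHECKPOINT_NAMES)
for name in missing:
    print("no checkpoint entry named:", name)
```

Under these assumed names every listed variable comes back unmatched, which is consistent with assert_consumed() complaining about each of them.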

saberkun on 9 Aug 2019

Thank you for the fast response @saberkun

Using the checkpoints you referenced results in the following traceback:

2019-08-09 12:32:27.485811: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-08-09 12:32:27.506873: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2112000000 Hz
2019-08-09 12:32:27.507516: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55b171acf9e0 executing computations on platform Host. Devices:
2019-08-09 12:32:27.507547: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
W0809 12:32:27.509049 140181893019264 cross_device_ops.py:1209] There is non-GPU devices in `tf.distribute.Strategy`, not using nccl allreduce.
I0809 12:32:27.509610 140181893019264 <input>:212] Training using customized training loop TF 2.0 with distrubutedstrategy.
2019-08-09 12:32:28.156549: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 93763584 exceeds 10% of system memory.
2019-08-09 12:32:28.264038: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 93763584 exceeds 10% of system memory.
2019-08-09 12:32:28.272990: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 93763584 exceeds 10% of system memory.
I0809 12:32:29.372153 140181893019264 training.py:237] Checkpoint file /home/jason/code/python/bert_poc/model_assets/gcs/bert_model.ckpt found and restoring from initial checkpoint for core model.
2019-08-09 12:32:29.381041: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 93763584 exceeds 10% of system memory.
Traceback (most recent call last):
  File "<input>", line 297, in <module>
  File "/home/jason/.virtualenvs/bert_poc/lib/python3.7/site-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/home/jason/.virtualenvs/bert_poc/lib/python3.7/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "<input>", line 262, in main
  File "<input>", line 229, in run_bert
  File "<input>", line 150, in run_customized_training
  File "/home/jason/code/python/bert_poc/bert/utils/models/training.py", line 239, in run_customized_training_loop
    checkpoint.restore(init_checkpoint).assert_consumed()
  File "/home/jason/.virtualenvs/bert_poc/lib/python3.7/site-packages/tensorflow_core/python/training/tracking/util.py", line 709, in assert_consumed
    .format(pretty_printer.node_names[node_id], node))
AssertionError: Unresolved object in checkpoint (root).model.layer_with_weights-0.encoder.layer0.attention_layer: children {
  node_id: 130
  local_name: "query_dense"
}
children {
  node_id: 131
  local_name: "key_dense"
}
children {
  node_id: 132
  local_name: "value_dense"
}
children {
  node_id: 133
  local_name: "attention_probs_dropout"
}

jmwoloso on 9 Aug 2019

This seems like it might apply https://github.com/tensorflow/tensorflow/issues/27937

I'll check out the checkpoint guide referenced in that issue and see if it helps in the meantime.

The gist of what I read, if I'm reading it correctly, is that I'll need to adjust the code in the training script to create those variables after loading the checkpoint. Is that accurate?

jmwoloso on 9 Aug 2019

Object-based checkpoints compatible with the training script have been released: https://github.com/tensorflow/models/blob/master/official/bert/README.md#pre-trained-models

Yeah, your understanding is correct for name-based checkpoints. In TF2, the checkpoint format changed: https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/train/Checkpoint
The new object-based format does not rely on string matching.

I converted the name-based checkpoints to object-based checkpoints, so you should be able to use them now. We also have a demo one hosted in a GCP storage bucket.
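
If it's unclear which format a given checkpoint is in, one way to check is to list its keys. This is only a sketch: it assumes that checkpoints written by tf.train.Checkpoint carry a `_CHECKPOINTABLE_OBJECT_GRAPH` entry while TF1 tf.train.Saver checkpoints do not. `is_object_based` is a hypothetical helper, and the tiny tf.Module is a stand-in for a real model so the snippet can run on its own.

```python
import os
import tempfile

import tensorflow as tf

# Key under which tf.train.Checkpoint stores the serialized object graph
# (assumption: absent from TF1 name-based Saver checkpoints).
OBJECT_GRAPH_KEY = "_CHECKPOINTABLE_OBJECT_GRAPH"


def is_object_based(checkpoint_path):
    """Return True if the checkpoint contains an object-graph entry."""
    names = [name for name, _shape in tf.train.list_variables(checkpoint_path)]
    return OBJECT_GRAPH_KEY in names


# Self-contained demo: write a tiny object-based checkpoint and inspect it.
module = tf.Module()
module.kernel = tf.Variable(tf.zeros([2, 2]))
demo_path = tf.train.Checkpoint(model=module).save(
    os.path.join(tempfile.mkdtemp(), "ckpt"))

demo_names = [name for name, _shape in tf.train.list_variables(demo_path)]
demo_is_object_based = is_object_based(demo_path)
print(demo_names)
print("object-based:", demo_is_object_based)
```

Pointing `is_object_based` at the path prefix of a downloaded checkpoint (for example the bert_model.ckpt prefix used in the logs above) should show which kind it is.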

saberkun on 14 Aug 2019

Thanks @saberkun. I'll check it out. Very much appreciated!

jmwoloso on 14 Aug 2019

@saberkun Is this a correct link for one of the checkpoints that you converted over that is TF 2.0-compatible?

jmwoloso on 11 Sep 2019

Yes, it is. Let me know if you have any trouble.
I think it should work, as we run daily PerfZero regression tests on GPU.

saberkun on 11 Sep 2019

@saberkun It seemed to work, though the model that was loaded had no variables attribute, but maybe that is the new paradigm in TF 2.0? Or maybe I had to do something else in addition to just calling checkpoint.restore(...) in order to get the variables to populate?

jmwoloso on 12 Sep 2019

input was the following:

init_checkpoint = "<path>/<to>/<ckpt>"
model = tf.keras.Model()
checkpoint = tf.train.Checkpoint(model=model)
checkpoint.restore(init_checkpoint)

and output was:

2019-09-11 13:53:24.252401: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-09-11 13:53:24.274923: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2112000000 Hz
2019-09-11 13:53:24.275791: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5612b806a0b0 executing computations on platform Host. Devices:
2019-09-11 13:53:24.275842: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
<tensorflow.python.training.tracking.util.CheckpointLoadStatus object at 0x7f66aed53048>

additionally:

>>> model.variables
[]

jmwoloso on 12 Sep 2019

init_checkpoint = "//"
model = modeling.get_bert_model() # This gives you a keras.Model
checkpoint = tf.train.Checkpoint(model=model)
checkpoint.restore(init_checkpoint).run_restore_ops() # This forces the restoration to happen.

First, I think "model = tf.keras.Model()" will not work, because the checkpoint does not store the model's architecture, only its values.
Second, to actually restore the values you need run_restore_ops(); otherwise the lazy behavior defers the restoration until the tensors are really used.
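
A small self-contained sketch of the first point, for anyone hitting the same empty `model.variables`. `TinyModel` is a hypothetical stand-in for the real BERT model, and the checkpoint is written on the fly so the snippet runs without any downloads. It restores the same checkpoint into a structurally matching model and into an empty container, then uses assert_consumed() on the returned status to see whether everything was matched. To my understanding, in eager mode variables that already exist are filled in as soon as restore() is called, so this sketch does not call run_restore_ops().

```python
import os
import tempfile

import tensorflow as tf


class TinyModel(tf.Module):
    """Hypothetical stand-in for the real model; holds a single variable."""

    def __init__(self, fill):
        super().__init__()
        self.kernel = tf.Variable(tf.fill([2, 3], float(fill)), name="kernel")


# Write an object-based checkpoint from a model whose weights are all 7.0.
save_path = tf.train.Checkpoint(model=TinyModel(7.0)).save(
    os.path.join(tempfile.mkdtemp(), "ckpt"))

# Restoring into a model with the same structure fills in its variables.
restored = TinyModel(0.0)
status = tf.train.Checkpoint(model=restored).restore(save_path)
status.assert_consumed()  # passes: every saved value found an owner

# Restoring into an empty container has nothing to attach the values to.
empty = tf.Module()
empty_status = tf.train.Checkpoint(model=empty).restore(save_path)
try:
    empty_status.assert_consumed()
    empty_restore_complete = True
except AssertionError:
    empty_restore_complete = False

print("restored kernel:", restored.kernel.numpy())
print("variables in empty container:", len(empty.variables))
print("empty restore complete:", empty_restore_complete)
```

The takeaway is that the model has to be constructed first (here with `TinyModel`, in the thread with `modeling.get_bert_model()`), and checking the status object is a cheap way to find out whether the restore actually matched anything.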

saberkun on 12 Sep 2019

Ahhh, ok thank you very much! I'll give the run_restore_ops() method a try and take a look at the get_bert_model function for model construction.

I really appreciate the guidance!

jmwoloso on 12 Sep 2019

@saberkun Hi, is there a script to convert a TF1.x checkpoint to a TF2.x-compatible one? My checkpoint was produced by a model trained on custom data, so it is not the same as your public TF1.x checkpoints.