Bert: How to get distributed checkpoints to reduce the size of model only for prediction

Created on 11 Nov 2018 · 9 Comments · Source: google-research/bert

lixinsu

Most helpful comment

@ZizhenWang I have exactly the same issue: I would like the model to be small when doing prediction. As stated by @jacobdevlin-google in #63, the weight file also contains the Adam momentum ('adam_m') and variance ('adam_v') variables. I found a solution elsewhere that excludes all Adam variables when re-saving the checkpoint:

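import tensorflow as tf  # TensorFlow 1.x API

sess = tf.Session()
# Rebuild the graph from the .meta file, then load the trained weights.
imported_meta = tf.train.import_meta_graph('./model.ckpt-322.meta')
imported_meta.restore(sess, './model.ckpt-322')
# Keep every variable except the Adam optimizer slots.
my_vars = []
for var in tf.global_variables():  # tf.all_variables() is deprecated
    if 'adam_v' not in var.name and 'adam_m' not in var.name:
        my_vars.append(var)
# Save only the filtered variables to a new, smaller checkpoint.
saver = tf.train.Saver(my_vars)
saver.save(sess, './model.ckpt')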

There are probably tidier solutions, but at least this one works for me: the size of the weight file drops from 1.3GB to 400MB.

ymcdull on 27 Nov 2018

👍14
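
A variant of the snippet above that reads the checkpoint directly, so no .meta file is required, might look like the sketch below. It is not from the thread. It assumes a TF1-style, name-based checkpoint and a TensorFlow version that ships `tensorflow.compat.v1`; the helper names `is_adam_slot` and `strip_adam_slots` are made up for illustration. TensorFlow is imported inside the function so the name filter can be tried on its own.

```python
def is_adam_slot(name):
    """True for the optimizer slots named in the thread (adam_m / adam_v)."""
    return "adam_m" in name or "adam_v" in name


def strip_adam_slots(src_ckpt, dst_ckpt):
    """Copy src_ckpt to dst_ckpt without the Adam slots (hypothetical helper)."""
    # Assumption: a TensorFlow release that provides the compat.v1 module.
    import tensorflow.compat.v1 as tf

    reader = tf.train.load_checkpoint(src_ckpt)
    with tf.Graph().as_default():
        kept = {}
        for name, _shape in tf.train.list_variables(src_ckpt):
            if not is_adam_slot(name):
                # Recreate the variable from its saved value; the dict key
                # fixes the name it is stored under in the new checkpoint.
                kept[name] = tf.Variable(reader.get_tensor(name))
        saver = tf.train.Saver(var_list=kept)
        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer())
            saver.save(sess, dst_ckpt, write_meta_graph=False)


# The filter alone, on a key taken from the error message in this thread:
print(is_adam_slot("bert/embeddings/LayerNorm/beta/adam_m"))
print(is_adam_slot("bert/embeddings/LayerNorm/beta"))

# Example call, using the paths from the snippet above:
# strip_adam_slots('./model.ckpt-322', './model.ckpt')
```

Writing the stripped checkpoint to a different directory than model_dir keeps the full checkpoint available for resuming training.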

All 9 comments

I'm not sure what this means. The BERT-Base model has about 110M parameters and is 440MB, which should fit comfortably on most devices.

jacobdevlin-google on 12 Nov 2018

@jacobdevlin-google Yes, the released model is small, but after running run_classifier.py I get a 1.2G model. How can I reduce its size to 400M?

ZizhenWang on 13 Nov 2018

@ZizhenWang Here is the reason we get a bigger model file: https://github.com/google-research/bert/issues/63

xwzhong on 14 Nov 2018
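
The sizes quoted in this thread are consistent with a rough back-of-the-envelope estimate, assuming float32 storage and two Adam slots (adam_m and adam_v) per parameter. The function name below is made up for illustration, and the figures are approximations, not measurements.

```python
BYTES_PER_FLOAT32 = 4


def checkpoint_size_mb(num_params, slots_per_param):
    """Approximate float32 checkpoint size in MB (1 MB = 10**6 bytes)."""
    copies = 1 + slots_per_param  # the weight itself plus its optimizer slots
    return num_params * BYTES_PER_FLOAT32 * copies / 10**6


bert_base_params = 110_000_000  # "about 110M parameters", as stated in the thread

inference_mb = checkpoint_size_mb(bert_base_params, slots_per_param=0)
training_mb = checkpoint_size_mb(bert_base_params, slots_per_param=2)  # adam_m + adam_v

print(f"weights only:         {inference_mb:.0f} MB")  # 440 MB
print(f"weights + Adam slots: {training_mb:.0f} MB")   # 1320 MB
```

The estimate lands near the 1.2G to 1.3GB files reported after fine-tuning and the roughly 400MB files reported after stripping.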

@ymcdull Good solution for stripping the Adam-related variables out of the ckpt file. The shrunken ckpt works well in inference mode (estimator.predict()). However, when I try to use it as the latest ckpt within the model_dir to resume training, it raises:

NotFoundError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key bert/embeddings/LayerNorm/beta/adam_m not found in checkpoint
         [[node save/RestoreV2 (defined at /home/xuanhua/zhangjinhe/berts/bert_recipes/recipes/recipes/ner/berts.py:937)  = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

Any idea?

longbowking on 16 Apr 2019

👍3

Hi @longbowking, in my understanding, if you want to continue training you will need the Adam-related variables, since they are part of the optimizer state. Stripping out the Adam variables is only useful when you want to serve the model without any further training.

ymcdull on 21 Apr 2019
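
To see ahead of time whether a checkpoint can resume training, one option is to list the optimizer keys the training graph would look for. The sketch below is an illustration, not part of the thread. It assumes every variable other than global_step has both an adam_m and an adam_v slot stored under `<variable name>/adam_m` and `<variable name>/adam_v`, which matches the key in the error above; `missing_adam_slots` is a hypothetical helper name.

```python
def missing_adam_slots(checkpoint_names):
    """Return the adam_m / adam_v keys that training would expect but not find."""
    present = set(checkpoint_names)
    # Assumption: everything except global_step and the slots themselves is a weight.
    weights = [
        name for name in present
        if not name.endswith(("/adam_m", "/adam_v")) and name != "global_step"
    ]
    expected = {f"{w}/{slot}" for w in weights for slot in ("adam_m", "adam_v")}
    return sorted(expected - present)


# Names as they would appear in a stripped checkpoint.
stripped = ["bert/embeddings/LayerNorm/beta", "global_step"]
print(missing_adam_slots(stripped))
```

For a real checkpoint, the name list can be obtained with `[name for name, _ in tf.train.list_variables(path)]`. An empty result suggests the optimizer state is still present.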

Your solution works perfectly in TF 1.x versions. But in TF 2.x I don't have a ckpt.meta file in my checkpoint folder, because of eager execution. Do you know how to do the same steps in tf 2.1x without a .meta file in the checkpoint folder?

divyag11 on 4 Feb 2020

Did you find a solution to this?

silpara on 28 May 2020

@ymcdull Using your code snippet, the model shrinks to 390MB, but when I reload the new small checkpoint and convert it to SavedModel format, I get the following error:
tensorflow.python.framework.errors_impl.FailedPreconditionError: Attempting to use uninitialized value opt/bert/embeddings/word_embeddings/Adam
I tried printing tf.global_variables() and still got Adam-related variables. Any solutions?