A pretrained BERT-large model's checkpoint file is about 1.3 GB, but after fine-tuning on a downstream task, the saved checkpoint file grows to 3.8 GB. Why does this happen?
I have the same problem with BERT base, whose checkpoint grows to ~1.3 GB.
The distributed checkpoints only include the actual model weights, but the checkpoints written during training also include the Adam momentum and variance variables for each weight variable. These are not actually part of the model, but they are needed to be able to pause and resume training in the middle. So the training checkpoints are 3x the size of the distributed checkpoints.
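As a rough sanity check of the 3x figure against the sizes reported in this thread, here is a minimal sketch. The function name `training_checkpoint_size_gb` is made up for illustration, and it assumes every weight has exactly two Adam slot variables of the same shape and dtype, ignoring small extras such as the global step.

```python
def training_checkpoint_size_gb(weights_gb, slots_per_weight=2):
    """Estimate a training checkpoint's size in GB.

    Assumption: each weight variable is stored once, plus one Adam
    momentum and one Adam variance variable of identical size.
    """
    return weights_gb * (1 + slots_per_weight)


# Sizes quoted in this thread (approximate).
for label, weights_gb in [("BERT large", 1.3), ("BERT base", 0.4)]:
    estimate = training_checkpoint_size_gb(weights_gb)
    print(f"{label}: weights {weights_gb:.1f} GB, training checkpoint about {estimate:.1f} GB")
```

The estimates land close to the 3.8 GB and ~1.3 GB files reported above, which is consistent with the explanation.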
Thank you for your advice. Could you tell me how to save only the model weights (without the momentum and variance variables), just like the pretrained model you provide?
@zhezhaoa I have a solution here: https://github.com/google-research/bert/issues/99
I guess there must be some better and tidier solutions, but at least this one works for me, and the size of the weight file drops from 1.3GB to 400MB.
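For readers who want a starting point, here is a minimal sketch of one way to drop the optimizer variables. It is not necessarily the approach described in the linked issue. It assumes the Adam slot variables are named `<weight name>/adam_m` and `<weight name>/adam_v`, that a `.meta` file sits next to the training checkpoint, and that TensorFlow's 1.x-style API is available through `tensorflow.compat.v1`. The helper names `is_optimizer_slot`, `select_model_variables`, and `strip_optimizer_slots` are made up for illustration.

```python
# Assumption: Adam slot variables end in these suffixes.
ADAM_SLOT_SUFFIXES = ("/adam_m", "/adam_v")


def is_optimizer_slot(var_name):
    """Return True if the variable name looks like an Adam slot variable."""
    return var_name.endswith(ADAM_SLOT_SUFFIXES)


def select_model_variables(var_names):
    """Keep only names that are not Adam slot variables, preserving order."""
    return [name for name in var_names if not is_optimizer_slot(name)]


def strip_optimizer_slots(src_ckpt, dst_ckpt):
    """Re-save a training checkpoint without the Adam slot variables.

    src_ckpt and dst_ckpt are checkpoint prefixes (hypothetical paths),
    for example "output/model.ckpt-1000" and "stripped/model.ckpt".
    """
    # Imported lazily so the helpers above work without TensorFlow installed.
    import tensorflow.compat.v1 as tf

    tf.disable_v2_behavior()
    with tf.Graph().as_default(), tf.Session() as sess:
        # Rebuild the training graph from the .meta file and load all values.
        restorer = tf.train.import_meta_graph(src_ckpt + ".meta", clear_devices=True)
        restorer.restore(sess, src_ckpt)
        keep = [v for v in tf.global_variables() if not is_optimizer_slot(v.op.name)]
        # Write only the kept variables; skip the .meta file to save space.
        tf.train.Saver(keep).save(sess, dst_ckpt, write_meta_graph=False)


names = [
    "bert/embeddings/word_embeddings",
    "bert/embeddings/word_embeddings/adam_m",
    "bert/embeddings/word_embeddings/adam_v",
    "global_step",
]
print(select_model_variables(names))
```

Note that this sketch keeps `global_step`, and that the stripped checkpoint can no longer be used to resume training with the original optimizer state.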