Hi pytorch-pretrained-BERT developers,
I have been using TensorFlow BERT since it came out, and recently I wanted to switch to PyTorch because it is a great library. To decide, I ran a series of tests comparing training performance between Google's TF BERT and your implementation. To my surprise, this implementation is a lot slower and can only fit a much smaller batch size before hitting an OOM error. I would really like to know whether this observation is correct, because I was hoping to transition to PyTorch.
Here is my setup:
Speed difference:
Memory:
Is there anything I could have done wrong?
Thank you very much!
Yes, this library is not made for training a model from scratch.
You should use one of the libraries I referred to here: https://github.com/huggingface/pytorch-pretrained-BERT/issues/543#issuecomment-491207121
I might give it a look one day but not in the short-term.
@thomwolf Thank you so much for the info! :)
Just to share, I quickly did a benchmark of XLM (this one fits my needs the most out of your three recommendations).
Sentences/s, for the specs I mentioned above (a rough sketch of how such throughput can be measured follows the table):
Batch size | Official TF BERT | HuggingFace PyTorch BERT | XLM PyTorch BERT
-- | -- | -- | --
128 over 1 GPU | 610 | 288 | 575
250 over 1 GPU | 647 | OOM | 625
500 over 1 GPU | 665 | OOM | 650
700 over 1 GPU | N/A | OOM | OOM
900 over 1 GPU | 667 | OOM | OOM
1000 over 1 GPU | OOM | OOM | OOM
128 over 4 GPUs | 889 (1.5x) | 779 (2.7x) | N/A
512 over 4 GPUs | 1522 (2.3x) | 1018 (3.?x) | N/A
1000 over 4 GPUs | 1798 (2.?x) | OOM | N/A
2000 over 4 GPUs | 1946 (2.?x) | OOM | N/A
3600 over 4 GPUs | 1991 (3.0x) | OOM | N/A
4000 over 4 GPUs | OOM | OOM | N/A
Note: I only spent two hours on XLM and I'm not sure I set the vocab to exactly the same size as the others, but the numbers should be in the same ballpark.
I haven't had a chance to benchmark multi-GPU XLM yet. But in general, it looks like:
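For anyone reproducing numbers like these, here is a rough, hedged sketch of how a sentences/s figure can be measured in PyTorch: time a fixed number of training steps on random token ids and divide by the elapsed time. This is not the script used to produce the table above; the model choice, batch size, and sequence length are placeholders.

```python
# Rough sketch of measuring training throughput in sentences/s.
# NOT the script used for the table above; the model, batch size,
# and sequence length here are illustrative placeholders.
import time
import torch
from torch.utils.data import DataLoader, TensorDataset
from pytorch_transformers import BertForMaskedLM, AdamW

device = torch.device("cuda")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").to(device)
optimizer = AdamW(model.parameters(), lr=1e-4)

batch_size, seq_len, n_steps = 128, 128, 50
input_ids = torch.randint(0, model.config.vocab_size, (batch_size * n_steps, seq_len))
dataloader = DataLoader(TensorDataset(input_ids), batch_size=batch_size)

model.train()
torch.cuda.synchronize()
start = time.time()
for (batch,) in dataloader:
    batch = batch.to(device)
    optimizer.zero_grad()
    # masked_lm_labels=batch gives a (meaningless) MLM loss, which is enough for timing
    loss = model(batch, masked_lm_labels=batch)[0]
    loss.backward()
    optimizer.step()
torch.cuda.synchronize()                      # wait for all GPU work to finish
print(f"{batch_size * n_steps / (time.time() - start):.1f} sentences/s")
```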
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi @thomwolf ,
I was trying to fine-tune pytorch-transformers' gpt2 (124M) on a V100 16GB GPU, but I am not able to fit a batch_size larger than 2. I am using a sequence length of 1024 tokens.
This might be evident from the comments above, but I am new to training NNs, so I wanted to confirm: does fine-tuning also run into OOM like training from scratch does? If so, is the only option for fine-tuning gpt2 to use the original TensorFlow implementation?
Thanks
Hi @SKRohit, with the GPT-2 model you can either fine-tune it with a batch size of 4 and a sequence of 512 tokens, or a batch size of 2 and a sequence of 1024 tokens, like what you've tried. We have had good results with a batch size of 4 and a sequence of 512 in our experiments.
If you want a bigger batch size, you can set up gradient accumulation, which would allow you to use much larger effective batch sizes. You can find an example of gradient accumulation applied to fine-tuning in our language model fine-tuning example.
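For reference, here is a minimal sketch of what gradient accumulation looks like in plain PyTorch, assuming a GPT2LMHeadModel and toy random data. It is illustrative only, not the repo's example script; the hyperparameters are placeholders.

```python
# Minimal gradient-accumulation sketch (illustrative only, not the repo's script;
# the toy random data and hyperparameters here are placeholders).
import torch
from torch.utils.data import DataLoader, TensorDataset
from pytorch_transformers import GPT2LMHeadModel, AdamW

accumulation_steps = 8                      # effective batch = 2 * 8 = 16 sequences

model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = AdamW(model.parameters(), lr=5e-5)

# toy data: random token ids, per-step batch size 2, sequence length 1024
input_ids = torch.randint(0, model.config.vocab_size, (16, 1024))
dataloader = DataLoader(TensorDataset(input_ids), batch_size=2)

model.train()
optimizer.zero_grad()
for step, (batch,) in enumerate(dataloader):
    loss = model(batch, labels=batch)[0]     # LM loss; labels are shifted internally
    (loss / accumulation_steps).backward()   # scale so accumulated grads average out
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                     # one update per effective batch
        optimizer.zero_grad()
```

The key point is that `backward()` runs every step, so gradients add up in `.grad`, while `optimizer.step()` runs only once per `accumulation_steps` micro-batches, giving the effect of a larger batch without the memory cost.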
Yes, @LysandreJik, I am using gradient accumulation. I found the maximum possible batch_size of 2 to be too small given this comment, so I asked to make sure there is no error in my code or any issue with my gcloud GPU.
Also, have you fine-tuned gpt2 architectures using mixed-precision (MP) training? Did you find any difference in the performance of an MP-trained gpt2 compared to one trained without MP? (A rough sketch of MP fine-tuning follows this comment.)
And I am referring to the fine-tuning script provided in the pytorch_transformers repo 👍.
Thanks.
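Not an official answer, but for anyone wondering what mixed-precision fine-tuning looks like in practice, here is a hedged sketch using NVIDIA apex (which, as far as I can tell, is what the example scripts use when `--fp16` is passed). The model, toy data, and hyperparameters are placeholders.

```python
# Hedged sketch of mixed-precision (fp16) fine-tuning with NVIDIA apex.
# Illustrative only; the model, toy data, and hyperparameters are placeholders.
import torch
from torch.utils.data import DataLoader, TensorDataset
from pytorch_transformers import GPT2LMHeadModel, AdamW
from apex import amp                                   # requires NVIDIA apex

device = torch.device("cuda")
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
optimizer = AdamW(model.parameters(), lr=5e-5)

# O1: most ops run in fp16, master weights are kept in fp32
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

input_ids = torch.randint(0, model.config.vocab_size, (8, 512))
dataloader = DataLoader(TensorDataset(input_ids), batch_size=4)

model.train()
for (batch,) in dataloader:
    batch = batch.to(device)
    optimizer.zero_grad()
    loss = model(batch, labels=batch)[0]
    with amp.scale_loss(loss, optimizer) as scaled_loss:   # dynamic loss scaling
        scaled_loss.backward()
    optimizer.step()
```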
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
mark
@thomwolf What is the bottleneck in HuggingFace transformers pretraining compared to TensorFlow and other PyTorch implementations?