Transformers: Slower and more memory hungry than the TensorFlow BERT?

Created on 3 Jul 2019 · 9 comments · Source: huggingface/transformers

Hi pytorch-pretrained-BERT developers,

I have been using the TensorFlow BERT since it came out, and recently I wanted to switch to PyTorch because it is a great library. To that end, I ran a number of tests comparing training speed and memory between Google's TF BERT and your implementation. To my surprise, your implementation is a lot slower and can only afford a small batch size before hitting an OOM error. I really want to know whether this observation is correct, because I was hoping to transition to PyTorch.

Here is my setup:

  1. Custom model size: 3 layers with a hidden dimension of 320.
  2. English uncased vocab.
  3. Sequence length fixed at a constant 125.
  4. Running on a Tesla P40.
  5. Running finetune_on_pregenerated.py.
  6. I changed finetune_on_pregenerated.py slightly to just initialize a blank model of my size (see the sketch after this list).
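
For reference, that change boils down to building a custom config and instantiating an untrained model from it. A minimal sketch, assuming the pytorch-pretrained-BERT API of that era; the head count and intermediate size below are assumptions, only the 3 layers and hidden size 320 come from the setup above:

```python
from pytorch_pretrained_bert import BertConfig, BertForPreTraining

# Blank (randomly initialized) BERT of the custom size described above.
# num_attention_heads and intermediate_size are assumptions; the head count
# just has to divide hidden_size evenly.
config = BertConfig(
    vocab_size_or_config_json_file=30522,  # bert-base-uncased vocab size
    hidden_size=320,
    num_hidden_layers=3,
    num_attention_heads=4,
    intermediate_size=1280,
    max_position_embeddings=512,
)
model = BertForPreTraining(config)  # no pretrained weights are loaded
```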

Speed difference:

  • TensorFlow: 809 sentences/s on 1 GPU.
  • TensorFlow: 2350 sentences/s on 4 GPUs.
  • PyTorch: 275 sentences/s on 1 GPU.
  • PyTorch: 991 sentences/s on 4 GPUs.

Memory:

  • My P40 has 22GB memory.
  • TensorFlow can run a batch size of 1000 or more (I didn't probe the upper limit).
  • PyTorch OOMs at a batch size of 250 or above; 125 is OK.
  • I ran 30 epochs on a test dataset of only 17 MB, so it shouldn't be a data-loading problem.

Is there anything I could have done wrong?

Thank you very much!

wontfix

All 9 comments

Yes, this library is not made for training a model from scratch.

You should use one of the libraries I referred to here: https://github.com/huggingface/pytorch-pretrained-BERT/issues/543#issuecomment-491207121

I might give it a look one day but not in the short-term.

@thomwolf Thank you so much for the info! :)

Just to share, I quickly did a benchmark of XLM (this one fits my needs the most out of your three recommendations).

Sentences/s (for the specs I mentioned above):

Batch size | Official TF BERT | HuggingFace PyTorch BERT | XLM PyTorch BERT
-- | -- | -- | --
128 over 1 GPU | 610 | 288 | 575
250 over 1 GPU | 647 | OOM | 625
500 over 1 GPU | 665 | OOM | 650
700 over 1 GPU | N/A | OOM | OOM
900 over 1 GPU | 667 | OOM | OOM
1000 over 1 GPU | OOM | OOM | OOM
128 over 4 GPUs | 889 (1.5x) | 779 (2.7x) | N/A
512 over 4 GPUs | 1522 (2.3x) | 1018 (3.?x) | N/A
1000 over 4 GPUs | 1798 (2.?x) | OOM | N/A
2000 over 4 GPUs | 1946 (2.?x) | OOM | N/A
3600 over 4 GPUs | 1991 (3.0x) | OOM | N/A
4000 over 4 GPUs | OOM | OOM | N/A

Note: I only spent two hours on XLM and am not sure I set the vocab to exactly the same size as the others, but the numbers should be in the same ballpark.

I haven't had a chance to benchmark multi-GPU XLM yet, but in general it looks like:

  1. The TensorFlow implementation uses memory more efficiently.
  2. PyTorch's multi-GPU scaling seems better (see the sketch after this list).
  3. PyTorch itself is not slower than TF.
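
On point 2, a minimal sketch of what the multi-GPU setup looks like on the PyTorch side, assuming the script wraps the model in torch.nn.DataParallel as the example scripts of that era did (an assumption); the scatter/gather it performs is one place where scaling and per-GPU memory behave differently from a single-GPU run:

```python
import torch
from torch import nn

# Toy stand-in for the 3-layer, 320-dim model benchmarked above.
model = nn.Sequential(nn.Linear(320, 320), nn.ReLU(), nn.Linear(320, 320))

if torch.cuda.is_available():
    model = model.cuda()
    if torch.cuda.device_count() > 1:
        # DataParallel splits each input batch across the visible GPUs and
        # gathers the outputs back on GPU 0, which adds communication
        # overhead and puts extra memory pressure on GPU 0.
        model = nn.DataParallel(model)

device = next(model.parameters()).device
batch = torch.randn(128, 320, device=device)  # one batch of 128 "sentences"
out = model(batch)  # forward pass is split across all visible GPUs
print(out.shape)
```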


This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Hi @thomwolf,
I was trying to fine-tune pytorch-transformers' GPT-2 (124M) on a V100 16GB GPU, but I am not able to fit more than a batch_size of 2. I am using a sequence length of 1024 tokens.

This might be evident from the comments above, but I am new to training NNs, so I wanted to confirm: does fine-tuning also cause OOM the way training from scratch does? If so, is the only option for fine-tuning GPT-2 to use the original TensorFlow implementation?

Thanks

Hi @SKRohit, with the GPT-2 model you can either fine-tune with a batch size of 4 and a sequence length of 512 tokens, or with a batch size of 2 and a sequence length of 1024 tokens, as you have tried. We have had good results with a batch size of 4 and a sequence length of 512 in our experiments.

If you want a bigger batch size, you can set up gradient accumulation, which allows much larger effective batch sizes. You can find an example of gradient accumulation applied to fine-tuning in our language model fine-tuning example.
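
For illustration, a minimal self-contained sketch of the gradient accumulation idea (toy model and loss, not the actual fine-tuning script): each micro-batch's loss is scaled and backpropagated, and the optimizer only steps every `accumulation_steps` micro-batches, so the effective batch size is the micro-batch size times `accumulation_steps`:

```python
import torch
from torch import nn

# Toy stand-ins so the sketch runs on its own; in practice these would be
# the GPT-2 model, its optimizer, and the real dataloader.
model = nn.Linear(10, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
micro_batches = [torch.randn(2, 10) for _ in range(32)]  # micro-batch size 2

accumulation_steps = 8  # effective batch size = 2 * 8 = 16
optimizer.zero_grad()
for step, batch in enumerate(micro_batches):
    loss = model(batch).pow(2).mean()        # placeholder loss
    (loss / accumulation_steps).backward()   # scale so accumulated grads average
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                     # one update per 8 micro-batches
        optimizer.zero_grad()
```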

Yes, @LysandreJik, I am using gradient accumulation. I found the maximum possible batch_size of 2 to be too small given this comment, so I asked to make sure there is no error in my code or any issue with my gcloud GPU.
Also, have you fine-tuned GPT-2 architectures using mixed-precision (MP) training? Did you find any difference in performance between a GPT-2 fine-tuned with MP and one without?
And I am referring to the fine-tuning script provided in the pytorch_transformers repo 👍.

Thanks.
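
On the mixed-precision question, a minimal sketch of how it was typically wired up at the time with NVIDIA apex's amp API (the toy model and the opt level are assumptions, not the repo's script):

```python
import torch
from torch import nn
from apex import amp  # NVIDIA apex must be installed separately

# Toy stand-ins; in practice this would be GPT-2 and its optimizer.
model = nn.Linear(10, 10).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# "O1": whitelisted ops run in fp16 while the weights stay in fp32.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

batch = torch.randn(4, 10, device="cuda")
loss = model(batch).pow(2).mean()  # placeholder loss
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()         # dynamic loss scaling avoids fp16 underflow
optimizer.step()
optimizer.zero_grad()
```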

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

mark

@thomwolf What is the bottleneck in HuggingFace transformers pretraining compared to TensorFlow and other PyTorch implementations?
