Transformers: Benchmarking Prediction Speed

Created on 18 Dec 2018 · 17 comments · Source: huggingface/transformers

In reference to the following tweet:

Would it be possible to do a benchmark on the speed of prediction? I was working with the TensorFlow version of BERT, but it uses the new Estimators and I'm struggling to find a straightforward way to benchmark it, since it all gets hidden in layers of the computation graph. I'd imagine PyTorch is more forgiving in this regard.

Labels: Discussion, wontfix

All 17 comments

Do you have a dataset in mind for the benchmark?
We can do a simple benchmark by timing the duration of evaluation on the SQuAD dev set for example.

Yes, that would be perfect! Ideally, it would exclude loading and setting up the model (something that the tf implementation literally does not allow for :P)

Hi Jade,

I did some benchmarking on a V100 GPU. You can check the script I used on the benchmark branch (mostly added timing to run_squad).

Here are the results:
[Figure: prediction_speed_bert_1 — prediction speed vs. max_seq_length, fp32 and fp16]

max_seq_length | fp32 | fp16
-- | -- | --
384 | 140 | 352
256 | 230 | 751
128 | 488 | 1600
64 | 1030 | 3663

I will take a look at an older K80 (without fp16 support) when I have time.
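For anyone who wants to reproduce this without digging into the branch, here is a minimal sketch of the kind of timing involved (not the actual benchmark script): the model is built and moved to the GPU before the clock starts, so only the prediction loop is measured. `model` and `eval_dataloader` are placeholders for whatever run_squad constructs.

import time
import torch

def benchmark_prediction(model, eval_dataloader, device="cuda"):
    # Setup (model loading, tokenization, feature conversion) is assumed to be done before this call.
    model.to(device)
    model.eval()
    n_examples = 0
    torch.cuda.synchronize()              # make sure setup work on the GPU has finished
    start = time.perf_counter()
    with torch.no_grad():
        for batch in eval_dataloader:
            input_ids, input_mask, segment_ids = (t.to(device) for t in batch[:3])
            # Assumes a BERT-style QA model accepting these keyword arguments.
            model(input_ids, token_type_ids=segment_ids, attention_mask=input_mask)
            n_examples += input_ids.size(0)
    torch.cuda.synchronize()              # wait for all queued GPU work before stopping the clock
    elapsed = time.perf_counter() - start
    return n_examples / elapsed           # prediction throughput in examples per second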

This is fantastic! Thank you so so so so much!

If you get a chance to do the K80, that would be brilliant. I'll try to run it when I get time. I'm currently doing a cost-versus-speed comparison just to get a feel.

You can run it like this for fp32 (just remove --do_train):

python run_squad.py \
  --bert_model bert-base-uncased \
  --do_predict \
  --do_lower_case \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --predict_batch_size 128 \
  --learning_rate 3e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/debug_squad/

And like this for fp16 (add --predict_fp16):

python run_squad.py \
  --bert_model bert-base-uncased \
  --do_predict \
  --predict_fp16 \
  --do_lower_case \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --predict_batch_size 128 \
  --learning_rate 3e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/debug_squad/

Adjust --predict_batch_size 128 to fill your GPU to at least around 50%, and adjust --max_seq_length 384 to test various sequence lengths. For small sequences (under 64 tokens) we should deactivate the windowing (related to --doc_stride, see the sketch below). I didn't take the time to do that, so the dataset reading didn't work (hence the missing data points).
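For context, here is a rough sketch of what that windowing does; it is a simplification of the span-splitting logic in run_squad, not the actual code. The document tokens are split into spans whose starting positions are doc_stride tokens apart, and with very short max sequence lengths there is almost no room left for the document after the question and special tokens, which is why those runs broke.

def split_into_spans(doc_tokens, max_tokens_per_span, doc_stride):
    # Consecutive spans start doc_stride tokens apart, so they overlap
    # whenever doc_stride < max_tokens_per_span.
    spans, start = [], 0
    while start < len(doc_tokens):
        span = doc_tokens[start:start + max_tokens_per_span]
        spans.append(span)
        if start + len(span) >= len(doc_tokens):
            break
        start += min(len(span), doc_stride)
    return spans

# A 300-token context with 128-token spans and a stride of 128 gives three
# back-to-back windows; a stride of 64 would give four overlapping ones.
print([len(s) for s in split_into_spans(list(range(300)), 128, 128)])   # [128, 128, 44]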

Fantastic. Tomorrow I'm going to run it for some smaller max sequence lengths (useful for my use case) and on some other GPUs: the Tesla M60 and then the K80.

Managed to replicate your results on the V100. :)

Also, I've done the experiments below for sequences of length 64 on different GPUs. I'll do the other sequence lengths when I get a chance.

|GPU | max_seq_length | fp32 | fp16 |
| -- | -- | -- | -- |
| Tesla M60 | 64 | 210 | N/A |
| Tesla K80 | 64 | 143 | N/A |

@thomwolf @jaderabbit Thank you for the experiments.

I think these results deserve more visibility, maybe a dedicated markdown page or a section in the README.md?

You are right, Gregory.
The README is getting too big, in my opinion.
I will try to set up a Sphinx/ReadTheDocs online doc later this month (feel free to start a PR if you have experience with this kind of thing).

I'm more or less new to sphinx but I would be happy to work on it with you.

Sure, if you want help that could definitely speed up the process.

The first step would be to create a new branch to work on, with a doc folder, and then generate the documentation in that folder using Sphinx.

Good introductions to sphinx and readthedoc are here: http://www.ericholscher.com/blog/2016/jul/1/sphinx-and-rtd-for-writers/
and here: https://docs.readthedocs.io/en/latest/intro/getting-started-with-sphinx.html

We will need to add some dependencies for the docs, but we should strive to keep them as light as possible.
Here is an example of a repo I've worked on recently (still a draft, but the doc is functional): https://github.com/huggingface/adversarialnlp

Hi @thomwolf ,
I am looking to deploy a pre-trained squad-bert model to make predictions in real-time.
Right now when I run:
python run_squad.py \
  --bert_model bert-base-uncased \
  --do_predict \
  --do_lower_case \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/test.json \
  --predict_batch_size 128 \
  --learning_rate 3e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/debug_squad/
it takes 22 seconds to generate the prediction. Is there a way to reduce the amount of time taken to less than a second?

The "test.json" has one context and 1 question on the same. It looks like this:
{ "data": [ { "title": "Arjun", "paragraphs": [ { "context": "Arjun died in 1920. The American Football Club (AFC) celebrated this death. Arjun now haunts NFC. He used to love playing football. But nobody liked him.", "qas": [ { "question": "When did Arjun die?", "id": "56be4db0acb8001400a502ed" } ] } ] } ] }

Please help me with this. I switched to the PyTorch implementation hoping that getting a saved model and making predictions with the saved model would be easier in PyTorch.

@apurvaasf Might be worth opening another ticket, since that's slightly different from this. It shouldn't be too hard to write your own code for deployment. The trick is to make sure it does all the loading once and just calls predict each time you need a prediction (see the sketch below).
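Something along these lines, as a hedged sketch of that load-once pattern; `encode_example` is a hypothetical helper standing in for the question/context-to-features conversion done in run_squad, and the class is illustrative rather than an API in this repo:

import torch

class QAPredictor:
    def __init__(self, model, tokenizer, device="cuda"):
        # All the expensive setup (loading weights, moving to the GPU) happens once, here.
        self.device = device
        self.model = model.to(device).eval()
        self.tokenizer = tokenizer

    def predict(self, question, context):
        # Only the cheap per-request work happens per call.
        # encode_example is a hypothetical stand-in for the SQuAD feature conversion.
        input_ids, segment_ids, input_mask = encode_example(self.tokenizer, question, context)
        with torch.no_grad():
            # Assumes a BERT-style QA model returning start/end logits.
            start_logits, end_logits = self.model(
                input_ids.to(self.device),
                token_type_ids=segment_ids.to(self.device),
                attention_mask=input_mask.to(self.device))
        return start_logits, end_logits

Wrapped behind a small web server, this keeps per-request latency down to roughly the forward pass instead of repeating the 20-odd seconds of startup for every query.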

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Hi @thomwolf, and thanks for the amazing implementation. I wonder what the inference speed is with a batch size of 512. It seems to take a lot of time to transfer data to the GPU (1000 ms for a batch size of 32), and I wonder if there is any quick speedup/fix. I am concerned with latency rather than throughput.

Have you found any solutions? I've run into the same problem.
The inference itself is fast, but it takes a lot of time to copy data to the GPU and copy the result back to the CPU for post-processing.

albanD commented on 25 Mar
Hi,

We use github issues only for bugs or feature requests.
Please use the forum to ask questions: https://discuss.pytorch.org/, as mentioned in the template you used.

Note that in your case, you are most likely missing torch.cuda.synchronize() when timing your GPU code, which makes the copy look much slower than it is because it has to wait for the rest of the queued work to finish.

Pytorch#35292
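To illustrate the point about torch.cuda.synchronize(), here is a minimal self-contained sketch (the shapes are arbitrary): because CUDA kernels launch asynchronously, a naively timed host-to-device copy on the default stream can absorb the cost of previously queued GPU work.

import time
import torch

# Queue some unrelated GPU work first (a stand-in for a forward pass).
a = torch.randn(4096, 4096, device="cuda")
b = a @ a                                  # the kernel launch returns immediately

x = torch.randn(32, 384, 768)              # roughly a batch of 32 BERT-base activations

t0 = time.perf_counter()
x_gpu = x.cuda()                           # has to wait for the matmul before the copy can run
print("naively timed copy: %.1f ms" % ((time.perf_counter() - t0) * 1000))

torch.cuda.synchronize()                   # drain pending GPU work before starting the clock
t0 = time.perf_counter()
y_gpu = x.cuda()
torch.cuda.synchronize()                   # wait for the copy itself to finish
print("copy alone: %.1f ms" % ((time.perf_counter() - t0) * 1000))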
