BERT: how is the performance at inference? If it costs a lot of time, can a smaller model reach a good result?

Created on 1 Nov 2018 · 10 Comments · Source: google-research/bert

gxdalu-yaya

👍2

Most helpful comment

@elyase @loretoparisi I optimized the feature extraction part and wrapped it into a standalone service. On a single Tesla M40 24GB with the BERT pretrained Chinese model (12 layers, max_seq_len=40), the speed is ~2000 samples/s.
Please check it out: https://github.com/hanxiao/bert-as-service

hanxiao on 15 Nov 2018

👍16 🎉6 ❤3

All 10 comments

We haven't benchmarked inference time since it depends on the hardware used (especially CPU vs. GPU vs. TPU), as well as a lot of other factors (batch size, sequence length, etc).

Our primary focus has been maximizing accuracy, with the idea that other techniques (such as Knowledge Distillation/semi-supervised learning) can be used when it's time to create a production model.

We show in Table 6 of the paper that smaller models do significantly worse. For example, on MultiNLI a 6-layer version of BERT loses about 3% absolute compared to the 12-layer version, and 5% compared to the 24-layer version. Of course, inference with the 6-layer version will be about 2x faster than with the 12-layer version.

I could upload some of the smaller models (e.g., the 3-layer and 6-layer models from the paper), but this would be English-only.

My suggestion would be to run speed benchmarks in TensorFlow (which you can approximate by just creating a new BertModel and a loop that feeds it fake data), and then train the largest model that runs in a time you find acceptable on the hardware you plan to use.
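One way to approximate the timing loop suggested above is sketched here. It is a minimal, framework-agnostic harness and not code from the BERT repository: `make_fake_batch`, `benchmark`, and `stand_in_forward` are hypothetical names, and the stand-in function has to be replaced by a real BertModel forward pass before the numbers mean anything.

```python
import time
from typing import Callable, List, Sequence


def make_fake_batch(batch_size: int, seq_len: int) -> List[List[int]]:
    """Build a batch of dummy token ids; the values are placeholders."""
    return [[1] * seq_len for _ in range(batch_size)]


def benchmark(forward: Callable[[Sequence[Sequence[int]]], object],
              batch_size: int, seq_len: int,
              warmup: int = 2, iters: int = 10) -> float:
    """Return throughput in samples per second for `forward` on fake data."""
    batch = make_fake_batch(batch_size, seq_len)
    for _ in range(warmup):
        forward(batch)  # warm-up calls are excluded from the timing
    start = time.perf_counter()
    for _ in range(iters):
        forward(batch)
    # Guard against a zero interval on very fast stand-in functions.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return batch_size * iters / elapsed


def stand_in_forward(batch: Sequence[Sequence[int]]) -> List[int]:
    """Hypothetical placeholder; swap in a real model forward pass."""
    return [sum(row) for row in batch]


if __name__ == "__main__":
    for bs in (1, 8, 32):
        rate = benchmark(stand_in_forward, batch_size=bs, seq_len=128)
        print(f"batch_size={bs}: {rate:.1f} samples/s")
```

When the forward pass runs on a GPU, the callable should block until the results are actually available, otherwise the timer may only measure how long it takes to queue the work.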

jacobdevlin-google on 1 Nov 2018

👍8

@jacobdevlin-google please upload these smaller models 👍. That would help us a lot in comparing inference times on our side. Thanks a lot.

loretoparisi on 1 Nov 2018

It looks like it is very slow for production. I did some informal benchmarks using the PyTorch version and the multilingual model (12 layers, seq_length=128) on a DGX-1:

| batch_size | CPU threads | samples/s |
| ------------- | ------------- | ------------- |
| 1 | 1 | 2.37 |
| 1 | 4 | 6.82 |
| 1 | 8 | 9.17 |
| 4 | 8 | 14.51 |
| 8 | 1 | 2.89 |
| 8 | 40 | 23.43 |

elyase on 6 Nov 2018

😕8

For extracting features, I'm getting 75 samples/s on a Tesla M40 24GB with the BERT pretrained Chinese model (12 layers, max_seq_len=200, batch_size=256).

Surely not fast enough for production.

hanxiao on 9 Nov 2018

😕3

@hanxiao that's amazing! can't wait to try it out.

loretoparisi on 15 Nov 2018

@hanxiao that looks great! What speed do you get with batch_size = 1?

elyase on 15 Nov 2018

👍1

@hanxiao, I just had the chance to try it myself. On a V100, running inference one sample at a time, I am getting 135 samples/s, which is a lot faster than what I got with the PyTorch version. It is not exactly the same test, as I was testing a classifier (so a couple more ops), but that still doesn't explain the big difference. On closer look, I think I made a mistake and didn't use torch.no_grad() in my benchmark. I'll have to repeat the tests and will post if I find something.

elyase on 16 Nov 2018

Hi there, I am trying to use the pretrained multilingual model. In the modeling function I pass is_training=False, so it should not take that much GPU memory. But while running the code, BERT takes a lot of memory. Last time I checked, I used a Google Cloud machine with 50 GB of RAM, 16 CPUs, and a 16 GB V100 GPU, and it failed. Is there anything I am doing wrong? Note that although the BERT model is not trainable, I feed the BERT representation into an LSTM, which is pretty small.
@hanxiao @loretoparisi
I tried to calculate the total number of parameters and got 182913846. Although I am not training BERT, it takes a lot of RAM and training does not proceed; it ended up freezing the OS.
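As a rough sanity check on the parameter count quoted above, the arithmetic below estimates how much memory the weights alone would occupy. It is only a sketch: the 4 bytes per parameter assumes float32 storage, which the comment does not state, and the helper names are hypothetical. Activations, the downstream LSTM, and any framework overhead are not included.

```python
NUM_PARAMS = 182_913_846  # parameter count reported in the comment above
BYTES_PER_PARAM = 4       # assumption: weights stored as float32


def weight_bytes(num_params: int, bytes_per_param: int = BYTES_PER_PARAM) -> int:
    """Memory needed to hold the weights only, in bytes."""
    return num_params * bytes_per_param


def to_gib(num_bytes: int) -> float:
    """Convert bytes to GiB (2**30 bytes)."""
    return num_bytes / 2**30


if __name__ == "__main__":
    total = weight_bytes(NUM_PARAMS)
    print(f"{total} bytes = {to_gib(total):.2f} GiB")
    # prints "731655384 bytes = 0.68 GiB"
```

Under that assumption the weights come to well under 1 GiB, so the parameter count by itself does not account for the memory use described; the remainder would have to come from parts this estimate leaves out.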