We haven't benchmarked inference time since it depends on the hardware used (especially CPU vs. GPU vs. TPU), as well as a lot of other factors (batch size, sequence length, etc).
Our primary focus has been maximizing accuracy, with the idea that other techniques (such as Knowledge Distillation/semi-supervised learning) can be used when it's time to create a production model.
We show in Table 6 of the paper that smaller models do significantly worse. For example, on MultiNLI a 6-layer version of BERT loses about 3% absolute compared to the 12-layer version, and 5% compared to the 24-layer version. Of course, inference with the 6-layer version will be about 2x faster than with the 12-layer version.
I could upload some of the smaller models (e.g., the 3-layer and 6-layer models from the paper), but this would be English-only.
My suggestion would be to run speed benchmarks in TensorFlow (which you can approximate by just creating a new BertModel and a loop that feeds it fake data), and then train the largest model that runs in a time you find acceptable on the hardware you plan to use.
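A minimal sketch of the kind of timing loop suggested above, written in plain Python so it runs without any framework. The names `measure_throughput`, `make_fake_batch`, and `placeholder_model` are hypothetical, as is the vocabulary size; `placeholder_model` only stands in for a real forward pass and would be replaced by a call into your own BertModel session.

```python
import random
import time


def make_fake_batch(batch_size, seq_length=128, vocab_size=30000):
    """Build a batch of random token ids (vocab_size is an arbitrary assumption)."""
    return [
        [random.randrange(vocab_size) for _ in range(seq_length)]
        for _ in range(batch_size)
    ]


def placeholder_model(batch):
    """Stand-in for a forward pass; swap in the real model call here."""
    return [sum(row) for row in batch]


def measure_throughput(run_batch, batch, n_iters=20, n_warmup=3):
    """Return samples per second for `run_batch` applied repeatedly to `batch`."""
    # Warm-up iterations are excluded from timing (graph setup, caches, etc.).
    for _ in range(n_warmup):
        run_batch(batch)
    start = time.perf_counter()
    for _ in range(n_iters):
        run_batch(batch)
    # Guard against a zero reading on very coarse clocks.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return len(batch) * n_iters / elapsed


if __name__ == "__main__":
    for batch_size in (1, 4, 8):
        batch = make_fake_batch(batch_size)
        rate = measure_throughput(placeholder_model, batch)
        print(f"batch_size={batch_size}: {rate:.1f} samples/s")
```

Running the same loop over the batch sizes and sequence lengths you expect in production, on the target hardware, gives numbers that can be compared across model sizes.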
@jacobdevlin-google please upload these smaller models 👍, that would help a lot in comparing inference time on our side. Thanks a lot.
It looks like it is very slow for production. I did some informal benchmarks using the PyTorch version and the multilingual model (12 layers, seq_length=128) on a DGX-1:
| batch_size | cpu threads | samples / s |
| ------------- | ------------- | ------------- |
| 1 | 1 | 2.37 |
| 1 | 4 | 6.82 |
| 1 | 8 | 9.17 |
| 4 | 8 | 14.51 |
| 8 | 1 | 2.89 |
| 8 | 40 | 23.43 |
For extracting features, I'm getting 75 samples/s on a Tesla M40 24GB with the pretrained Chinese BERT model (12 layers, max_seq_len=200, batch_size=256).
Surely not fast enough for production.
@elyase @loretoparisi I optimized the feature extraction part and wrapped it into a standalone service. On a single Tesla M40 24GB with the pretrained Chinese BERT model (12 layers, max_seq_len=40), the speed is ~2000 samples/s.
Please check it out: https://github.com/hanxiao/bert-as-service
@hanxiao that's amazing! can't wait to try it out.
@hanxiao that looks great! What speed do you get with batch_size = 1?
@hanxiao, I just had a chance to try it myself. On a V100, running inference one sample at a time, I am getting 135 samples/s, which is a lot faster than what I got with the PyTorch version. It is not exactly the same test, since I was testing a classifier (so a couple more ops), but that still doesn't explain the big difference. On closer look, I think I made a mistake and didn't use torch.no_grad() in my benchmark. I'll repeat the tests and post if I find something.
Hi there, I am trying to use the pretrained multilingual model. In the modeling function I pass is_training=False, so it didn't take that much GPU memory. But while running the code, BERT takes a lot of memory. The last time I checked, I used a Google Cloud machine with 50 GB RAM, 16 CPUs, and a 16 GB V100 GPU, and it failed. Is there anything I am doing wrong? Note that although the BERT model is not trainable, I feed the BERT representation to an LSTM, which is pretty small.
@hanxiao @loretoparisi
I tried to calculate the total number of parameters and got 182913846. Although I am not training BERT, it takes a lot of RAM and training does not proceed; it ended up freezing the OS.
Update: At first I was using tf1.5. After updating to tf1.12, the problem was solved.
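For scale, here is a rough back-of-the-envelope estimate of how much memory the weights alone would need for the parameter count quoted above. It assumes 32-bit floats (4 bytes per parameter) and ignores activations, optimizer state, and framework overhead; `weight_memory_bytes` is a hypothetical helper name.

```python
def weight_memory_bytes(num_params, bytes_per_param=4):
    """Memory for the weights only, assuming float32 storage by default."""
    return num_params * bytes_per_param


num_params = 182_913_846  # parameter count reported in the comment above
size_bytes = weight_memory_bytes(num_params)
print(f"{size_bytes} bytes = {size_bytes / 1024**3:.2f} GiB")
```

Under these assumptions the weights come to roughly 0.68 GiB, so the parameter count by itself would not account for exhausting tens of gigabytes of RAM.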