As per this tutorial in PyTorch, quantize_dynamic speeds up models (though it only supports Linear and LSTM layers as of now).
But I am unable to observe any speed-up for fairseq's Transformer, whereas Hugging Face's BERT gives a good speed-up.
Is there any suggestion, or am I missing something?
The model is passed to quantize_dynamic after make_generation_fast() in fairseq_cli/generate.py:
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
The example in the tutorial gives around a 2x speed-up irrespective of the number of threads.
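For what it's worth, when comparing float32 vs. int8 timings it can help to pin the intra-op thread count explicitly; a small sketch using standard PyTorch calls:

    import torch

    # Pin the thread count so float32 and int8 runs are compared like-for-like.
    torch.set_num_threads(1)
    print(torch.get_num_threads())  # verify the setting took effect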
Environment:
fairseq Version: 0.9.0
PyTorch Version: 1.0.4
OS: Linux
How you installed fairseq (pip, source): pip in a clean conda environment
Python version: 3.7.6
Any other relevant information: Intel MKL via pip and export LD_PRELOAD=../env/lib/libiomp5.so

What kind of model are you trying to quantize? It sounds like you're trying to speed up translation models, since you're modifying generate.py. A lot of the runtime here is due to overhead from beam search, since we're generating one token at a time.
BERT doesn't have this issue, since it gets the whole input all at once and produces the entire output all at once. Thus you should generally see bigger speedup with encoder-only models (e.g., BERT) than models with decoders (e.g., translation, language modeling).
Can you try with the fairseq RoBERTa implementation and see if you get a speedup?
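A minimal sketch of trying that, assuming the torch.hub entry point for fairseq RoBERTa (roberta.base is used here just as an example):

    import torch
    import torch.nn as nn

    # Load a pre-trained fairseq RoBERTa model via torch.hub.
    roberta = torch.hub.load('pytorch/fairseq', 'roberta.base')
    roberta.eval()

    # Same dynamic quantization call as for the translation model above.
    quantized_roberta = torch.quantization.quantize_dynamic(
        roberta, {nn.Linear}, dtype=torch.qint8
    )

    # Inspect the module tree to see which Linear layers were swapped before timing anything.
    print(quantized_roberta)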
Just want to follow up on this issue. In my understanding, generate.py by default loads an ensemble (transformers in a list), and you need to change each individual transformer IN PLACE to evaluate the quantized model. However, when I ran inference on a quantized model, I encountered an error in the forward method of fairseq/modules/multihead_attention.py:
torch.cat((self.q_proj.bias, self.k_proj.bias, self.v_proj.bias)),
TypeError: expected Tensor as element 0 in argument 0, but got method.
After checking the documentation for quantize_dynamic(), it seems the weights and biases of each quantized layer are wrapped and can only be accessed via LinearPackedParams._weight_bias(). So I changed multihead_attention.py to access the weights and biases that way. Then I got this error in torch/nn/functional.py:
output = input.matmul(weight.t())
RuntimeError: Could not run 'aten::mm' with arguments from the 'QuantizedCPUTensorId' backend. 'aten::mm' is only available for these backends: [CUDATensorId, SparseCPUTensorId, VariableTensorId, CPUTensorId, SparseCUDATensorId].
Does this mean quantization is not yet supported in Fairseq? Is there any workaround to run quantization simulation for Transformer models? Thanks in advance!
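For context, on a dynamically quantized Linear the weight and bias are exposed as methods rather than tensor attributes, which is why the torch.cat call above breaks. A small standalone illustration (toy module, not fairseq code):

    import torch
    import torch.nn as nn

    # Wrap the Linear in a container so quantize_dynamic swaps the child module.
    model = nn.Sequential(nn.Linear(4, 4))
    qmodel = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
    qlin = qmodel[0]

    print(type(qlin))       # torch.nn.quantized.dynamic.Linear
    print(type(qlin.bias))  # <class 'method'> -- no longer a Tensor attribute
    print(qlin.weight().dtype, qlin.bias())  # calling the methods returns the tensors
    # Depending on the torch version, qlin._packed_params._weight_bias() also
    # exposes (weight, bias), as mentioned above.

    # So float-module code such as torch.cat((q_proj.bias, k_proj.bias, v_proj.bias))
    # raises "expected Tensor as element 0 ... but got method" once the projections
    # are quantized.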
Though we generate one token at a time, the time for each token's decoder forward pass should still decrease (as it would for non-autoregressive seq2seq), and that should decrease the overall time, since Transformers are made up of self-attention and linear layers and we are also quantizing the linear layers.
@Edwardlzy I got the same error when I ran quantize_dynamic on RoBERTa.
But for a transformer model from here, when run on GPU:
RuntimeError: Could not run 'aten::quantize_per_tensor' with arguments from the 'CUDATensorId' backend. 'aten::quantize_per_tensor' is only available for these backends: [CPUTensorId, VariableTensorId]
And on CPU it runs fine without any error.
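That matches the error message: the int8 kernels behind dynamic quantization are CPU-only. A sketch of the ordering that works, assuming a generic fairseq model object:

    # Keep the model (and its inputs) on CPU: quantized ops such as
    # aten::quantize_per_tensor have no CUDA backend (see the error above).
    model = model.cpu().eval()
    quantized_model = torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )
    # With fairseq-generate, additionally pass --cpu so the batches stay on CPU too.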
@gvskalyan Thanks for the update. Do you mind sharing your code changes in detail? For example, in generate.py, did you make sure that your quantized model replaced the original one? And did you make any changes to other related files like multihead_attention.py and functional.py? I noticed that there was a new commit to Fairseq yesterday related to quantization. I tried the updated Fairseq and am still stuck on the same error: torch.cat((self.q_proj.bias, self.k_proj.bias, self.v_proj.bias)),
TypeError: expected Tensor as element 0 in argument 0, but got method.
It would be great if @myleott could let us know the right way to do quantization simulations in Fairseq. Thanks!
I have just added

    model = torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )

here, and checked p.numel(). Ideally the number of parameters should not differ - they should just get quantized - but maybe numel cannot count the packed parameters, which is an indication the model was quantized. The model printout shows DynamicQuantizedLinear everywhere. And I think the recent commit relating to quantization was intended to be compatible with / target the torch 1.5.0 release.
Since it is listed as experimental in the torch tutorials, maybe @driazati or Ning Dong would know better.
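A small way to confirm the swap is to look at module types rather than parameter counts (packed int8 weights are no longer registered as parameters); a sketch, assuming model is the quantized fairseq model:

    import torch.nn as nn
    import torch.nn.quantized.dynamic as nnqd

    # Count float vs. dynamically quantized Linear layers in the module tree.
    n_float = sum(isinstance(m, nn.Linear) for m in model.modules())
    n_quant = sum(isinstance(m, nnqd.Linear) for m in model.modules())
    print(f'nn.Linear: {n_float}, DynamicQuantizedLinear: {n_quant}')

    # Parameter counts drop after quantization because packed int8 weights
    # are not nn.Parameters, which explains the p.numel() observation above.
    print(sum(p.numel() for p in model.parameters()))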
@gvskalyan Thank you for the clarification. This is what I tried initially. When I ran evaluation on the "quantized" 8-bit model, I got exactly the same BLEU score as the float32 model. This may suggest that the model you just quantized does not replace the one in models: what happened is that you created the quantized model in a new variable called model, which was dropped after that iteration. If you replace the for model in models loop with for i in range(len(models)) and quantize models[i] (see the sketch below), I believe you will get the error I encountered.
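A minimal sketch of that in-place replacement (variable names follow the models list in generate.py; the surrounding loop in generate.py may look slightly different):

    # Quantize each ensemble member and put it back into the list, so that the
    # quantized modules are what generation actually runs.
    for i in range(len(models)):
        models[i] = torch.quantization.quantize_dynamic(
            models[i], {torch.nn.Linear}, dtype=torch.qint8
        )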
Quantization is for sure not supported in the current Fairseq version. I have only been able to run inference after rewriting functions, mainly in multihead_attention.py from Fairseq and functional.py in torch.nn. It would be great if someone from Fairseq could add support for quantization; I would be happy to help.
Yes, I can verify that it gives the same error you mentioned.
Sorry for overlooking that iteration part.
For the transformer model using the transformer.wmt14.en-fr dataset, the GEMMs (linear layers) only make up about 41% of the whole workload, so even with dynamic quantization it cannot reach as high a speed-up as BERT (where GEMM makes up 70%~80% of the workload and the int8 path gets about a 1.8x speed-up).
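As a back-of-the-envelope check, treating this as Amdahl's law with only the GEMM fraction accelerated (and assuming the int8 GEMMs themselves get roughly a 2x kernel-level speed-up, which is an illustrative number):

    # Amdahl-style estimate: only the GEMM share of the runtime benefits from int8.
    gemm_fraction = 0.41   # measured share of linear-layer time for this en-fr model
    gemm_speedup = 2.0     # assumed int8 kernel speed-up (illustrative)
    overall = 1.0 / ((1.0 - gemm_fraction) + gemm_fraction / gemm_speedup)
    print(f'estimated end-to-end speed-up: {overall:.2f}x')  # ~1.26x
    # With GEMM at ~75% of the runtime (BERT-like), the same formula gives ~1.6x.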
Pre-trained model: https://dl.fbaipublicfiles.com/fairseq/models/wmt14.en-fr.joined-dict.transformer.tar.bz2
dataset: https://dl.fbaipublicfiles.com/fairseq/data/wmt14.en-fr.joined-dict.newstest2014.tar.bz2
run: fairseq-generate data-bin/wmt14.en-fr.joined-dict.newstest2014 --path data-bin/wmt14.en-fr.joined-dict.transformer/model.pt --batch-size 128 --beam 5 --remove-bpe --cpu --quiet
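The per-op breakdown below looks like output from torch.autograd.profiler; a sketch of how such a table can be collected (the profiler API is standard PyTorch, while run_generation is a hypothetical placeholder for the fairseq-generate inference loop):

    import torch

    # Wrap the generation run in the autograd profiler and dump per-op CPU totals.
    with torch.autograd.profiler.profile() as prof:
        run_generation()  # hypothetical placeholder for the actual inference call
    print(prof.key_averages().table(sort_by='self_cpu_time_total'))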
Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg Number of Calls
----------------------------- --------------- --------------- --------------- --------------- --------------- ---------------
aten::mm 34.30% 83.842s 34.38% 84.041s 1.583ms 53092
aten::index_select 23.71% 57.961s 23.83% 58.261s 2.043ms 28516
aten::bmm 7.09% 17.332s 7.18% 17.556s 682.373us 25728
aten::_log_softmax 6.38% 15.589s 6.38% 15.602s 14.718ms 1060
aten::_cat 5.70% 13.941s 5.78% 14.125s 1.110ms 12722
aten::topk 5.40% 13.191s 10.86% 26.549s 6.333ms 4192
aten::nonzero 4.44% 10.842s 4.44% 10.842s 974.253us 11129
aten::add 1.24% 3.021s 1.28% 3.127s 72.651us 43036
aten::_softmax 0.95% 2.316s 0.97% 2.366s 183.941us 12864
aten::copy_ 0.86% 2.091s 0.86% 2.092s 2.981us 701615
aten::add_ 0.84% 2.042s 0.84% 2.042s 38.176us 53488
aten::eq 0.63% 1.547s 1.61% 3.940s 7.203us 546994
aten::uniform_ 0.62% 1.510s 0.62% 1.510s 5.698ms 265
aten::lt 0.61% 1.484s 1.97% 4.805s 8.880us 541132
aten::div_ 0.56% 1.374s 0.58% 1.414s 534.152us 2648
aten::to 0.53% 1.292s 1.04% 2.537s 5.341us 474982
aten::ne 0.52% 1.270s 1.05% 2.566s 156.480us 16396
aten::select 0.50% 1.214s 0.63% 1.542s 2.003us 769526
aten::view 0.50% 1.214s 0.50% 1.214s 2.360us 514185
aten::_local_scalar_dense 0.48% 1.181s 0.48% 1.181s 0.971us 1216111
aten::empty 0.38% 940.632ms 0.38% 940.632ms 1.033us 910608
aten::native_layer_norm 0.37% 907.179ms 0.41% 1.005s 51.896us 19368
aten::item 0.36% 881.843ms 0.84% 2.063s 1.696us 1216111
aten::threshold 0.30% 744.394ms 0.32% 780.607ms 120.020us 6504
aten::normal_ 0.28% 682.135ms 0.28% 682.135ms 341.067ms 2
aten::as_strided 0.24% 576.000ms 0.24% 576.000ms 0.556us 1035151
aten::resize_ 0.21% 502.591ms 0.21% 502.591ms 0.767us 655405
aten::matmul 0.19% 470.515ms 34.83% 85.133s 1.604ms 53092
aten::empty_strided 0.15% 374.766ms 0.15% 374.766ms 0.901us 416104
aten::masked_fill_ 0.15% 373.153ms 0.15% 373.153ms 57.373us 6504
aten::is_nonzero 0.14% 351.842ms 0.54% 1.325s 2.444us 542072
aten::transpose 0.14% 331.905ms 0.19% 464.422ms 4.293us 108184
aten::t 0.12% 291.702ms 0.19% 471.598ms 8.883us 53092
aten::mul_ 0.11% 265.367ms 0.19% 457.671ms 35.980us 12720
aten::slice 0.10% 245.073ms 0.13% 318.686ms 3.248us 98110
aten::unbind 0.09% 220.028ms 0.23% 559.496ms 46.578us 12012
aten::_unsafe_view 0.07% 179.628ms 0.14% 348.094ms 6.556us 53092
aten::index 0.07% 159.936ms 0.09% 214.620ms 22.426us 9570
aten::mul 0.06% 149.981ms 0.07% 167.732ms 61.848us 2712
aten::is_floating_point 0.04% 98.035ms 0.04% 98.035ms 0.282us 347256
aten::floor_divide 0.04% 96.478ms 0.05% 118.777ms 6.249us 19008
aten::softmax 0.04% 88.753ms 1.01% 2.477s 192.542us 12864
aten::layer_norm 0.04% 88.604ms 0.45% 1.101s 56.828us 19368
aten::unsqueeze 0.04% 85.932ms 0.04% 99.115ms 6.851us 14468
aten::contiguous 0.03% 79.695ms 0.08% 191.545ms 2.001us 95724
aten::_index_put_impl_ 0.03% 76.442ms 4.48% 10.951s 1.984ms 5519
aten::cat 0.03% 74.303ms 5.81% 14.200s 1.116ms 12722
aten::sum 0.03% 72.356ms 0.05% 114.064ms 35.512us 3212
aten::stride 0.03% 71.803ms 0.03% 71.803ms 0.312us 230025
aten::expand 0.02% 60.596ms 0.03% 72.795ms 4.571us 15924
aten::masked_fill 0.02% 56.244ms 0.33% 800.293ms 123.046us 6504
aten::narrow 0.02% 49.892ms 0.05% 121.161ms 4.868us 24890
aten::clone 0.02% 44.050ms 0.13% 328.826ms 34.588us 9507
aten::any 0.02% 43.576ms 0.02% 57.998ms 11.374us 5099
aten::gather 0.02% 41.513ms 0.02% 50.882ms 12.278us 4144
aten::relu 0.01% 31.306ms 0.33% 811.913ms 124.833us 6504
aten::type_as 0.01% 29.910ms 0.02% 50.471ms 3.302us 15284
aten::sort 0.01% 28.267ms 0.03% 64.748ms 21.561us 3003
aten::fill_ 0.01% 28.139ms 0.01% 28.139ms 2.796us 10063
aten::bitwise_not 0.01% 26.371ms 0.02% 55.272ms 6.669us 8288
aten::empty_like 0.01% 24.928ms 0.02% 49.340ms 6.284us 7852
aten::reshape 0.01% 23.228ms 0.02% 39.985ms 2.186us 18293
aten::arange 0.01% 22.469ms 0.02% 45.343ms 5.857us 7742
aten::masked_select 0.01% 21.942ms 0.03% 67.980ms 34.264us 1984
aten::fmod 0.01% 21.843ms 0.02% 47.664ms 22.483us 2120
aten::index_put_ 0.01% 19.417ms 4.49% 10.970s 1.988ms 5519
aten::bitwise_and 0.01% 16.534ms 0.01% 32.741ms 7.722us 4240
aten::gt 0.01% 15.082ms 0.01% 25.092ms 11.574us 2168
aten::sub_ 0.01% 12.618ms 0.01% 23.736ms 22.392us 1060
aten::log_softmax 0.00% 11.895ms 6.39% 15.615s 14.731ms 1060
aten::ge 0.00% 11.715ms 0.01% 18.908ms 9.126us 2072
aten::mean 0.00% 10.056ms 0.04% 105.505ms 99.533us 1060
aten::embedding 0.00% 7.471ms 0.05% 128.512ms 118.553us 1084
aten::all 0.00% 7.244ms 0.00% 10.563ms 10.196us 1036
aten::sub 0.00% 6.810ms 0.00% 8.088ms 8.754us 924
aten::_shape_as_tensor 0.00% 6.142ms 0.00% 10.460ms 9.650us 1084
aten::div 0.00% 5.818ms 0.00% 8.726ms 60.601us 144
detach_ 0.00% 5.345ms 0.00% 5.345ms 1.007us 5310
aten::detach_ 0.00% 4.205ms 0.00% 9.550ms 1.798us 5310
aten::__and__ 0.00% 3.779ms 0.01% 24.847ms 11.720us 2120
aten::sin 0.00% 1.880ms 0.00% 3.744ms 936.035us 4
aten::ones 0.00% 1.296ms 0.00% 2.932ms 7.405us 396
aten::zeros 0.00% 1.014ms 0.00% 3.510ms 16.248us 216
aten::cos 0.00% 674.841us 0.00% 1.343ms 335.792us 4
aten::zero_ 0.00% 590.303us 0.00% 2.296ms 9.334us 246
aten::repeat 0.00% 430.778us 0.00% 1.497ms 62.367us 24
aten::_cumsum 0.00% 404.558us 0.00% 544.045us 22.669us 24
detach 0.00% 370.571us 0.00% 370.571us 1.333us 278
aten::cumsum 0.00% 271.373us 0.00% 1.202ms 50.074us 24
aten::dropout 0.00% 265.072us 0.00% 265.072us 1.841us 144
aten::unfold 0.00% 241.803us 0.00% 361.766us 7.537us 48
aten::set_ 0.00% 226.966us 0.00% 226.966us 1.220us 186
aten::detach 0.00% 220.686us 0.00% 591.257us 2.127us 278
aten::exp 0.00% 215.900us 0.00% 418.325us 104.581us 4
aten::alias 0.00% 34.403us 0.00% 34.403us 1.433us 24
aten::expand_as 0.00% 28.514us 0.00% 72.917us 3.038us 24
----------------------------- --------------- --------------- --------------- --------------- --------------- ---------------
Self CPU time total: 244.442s