Hi,
Thank you for providing great documentation on quantization:
https://pytorch.org/tutorials/intermediate/dynamic_quantization_bert_tutorial.html
I am trying similar steps on an ALBERT PyTorch model: I converted "albert-base-v1" to a quantized one by applying dynamic quantization to the linear layers. At the inference stage (with the quantized model), I get the following error:
    w = (
        self.dense.weight.t()
        .view(self.num_attention_heads, self.attention_head_size, self.hidden_size)
        .to(context_layer.dtype)
    )
AttributeError: 'function' object has no attribute 't'
Any pointers on how to solve this error?
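For reference, the quantization step is essentially the one from the tutorial, applied to ALBERT (a minimal sketch; the checkpoint name stands in for whatever fine-tuned model you start from):

import torch
from transformers import AlbertForSequenceClassification

# Load the float ALBERT model and put it in eval mode.
model = AlbertForSequenceClassification.from_pretrained("albert-base-v1")
model.eval()

# Dynamically quantize only the nn.Linear layers to int8, as in the BERT tutorial.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)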
I know this doesn't directly answer the question, but I have been playing around with quantization of BERT, and everything is fine until I want to load the model into my notebook. The size of the model inflates back to over 400 MB from under 200 MB, and the accuracy takes a huge hit. I noticed this when I tried to load the quantized model in the notebook of the PyTorch tutorial as well. Have you been able to successfully load and use a quantized model in the first place?
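For what it's worth, the size usually inflates back when the saved state dict is loaded into a plain float model. Here is a sketch of a load path that keeps the model quantized, assuming the state dict was saved from the quantized model (the checkpoint and filename are placeholders):

import torch
from transformers import BertForSequenceClassification

# Rebuild the architecture, quantize it first, and only then load the saved (quantized) state dict.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")  # or your fine-tuned checkpoint
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
quantized_model.load_state_dict(torch.load("quantized_bert.pt"))
quantized_model.eval()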
I tested albert-base-v1 as well, since I can't get albert-base-v2 to work (I opened a separate issue), and I can confirm that I get the same error. It occurs when outputs = quantized_model(input_ids, labels=labels) is run.
@ElektrikSpark, I can evaluate with the quantized BERT model as shown in the documentation, although accuracy is lower than with the original BERT. After saving the quantized model, I tried loading it from the command line, but that is not working for me.
With ALBERT, the quantization step does not complete.
Any solution to this error?
Same issue here.
I found an issue while loading the quantized BERT model: the accuracy score decreases significantly. Does this mean we can't use quantized BERT in production? If so, I am not sure why this tutorial was provided.
I was able to work around this issue by using this:
model = torch.quantization.quantize_dynamic(
    big_model, {torch.nn.Bilinear}, dtype=torch.qint8
)
Notice I used Bilinear instead of Linear. Don't ask me why; I just saw someone do something similar while quantizing a GPT-2 model. (Most likely this only "works" because ALBERT contains no Bilinear modules, so nothing actually gets quantized and the original float model is returned unchanged.)
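One quick way to check whether that call actually quantized anything, where model is the result of the quantize_dynamic call above (a sketch):

import torch

# Count the dynamically quantized Linear modules. With {torch.nn.Bilinear} this prints 0 for
# ALBERT/BERT, which suggests the error disappears only because the model is still the float one.
n_quantized = sum(
    1 for m in model.modules()
    if isinstance(m, torch.nn.quantized.dynamic.Linear)
)
print("dynamically quantized Linear modules:", n_quantized)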
For those still looking for a workaround for this issue: you may try the following change to AlbertAttention.forward():
...
# Should find a better way to do this.
#
# Original code, which fails with the quantized model:
#
# w = (
#     self.dense.weight.t()
#     .view(self.num_attention_heads, self.attention_head_size, self.hidden_size)
#     .to(context_layer.dtype)
# )
# b = self.dense.bias.to(context_layer.dtype)
#
# Note that dequantize() is required because a quantized tensor with dtype torch.qint8 cannot be
# converted to torch.float32 by calling .to(context_layer.dtype).
#
# Unlike self.dense.weight(), self.dense.bias() returns a regular tensor, not a quantized tensor,
# so it only needs the dtype conversion.
w = (
    (self.dense.weight().t().dequantize() if callable(self.dense.weight) else self.dense.weight.t())
    .view(self.num_attention_heads, self.attention_head_size, self.hidden_size)
    .to(context_layer.dtype)
)
b = (self.dense.bias() if callable(self.dense.bias) else self.dense.bias).to(context_layer.dtype)
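After patching AlbertAttention.forward this way (either directly in transformers' modeling_albert.py or via a monkey patch), a quick smoke test of the quantized model might look like this (the tokenizer call and input text are just assumptions for illustration):

import torch
from transformers import AlbertForSequenceClassification, AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v1")
model = AlbertForSequenceClassification.from_pretrained("albert-base-v1")
model.eval()

quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# With the patched forward, this no longer raises "'function' object has no attribute 't'".
inputs = tokenizer("a quick smoke test", return_tensors="pt")
with torch.no_grad():
    outputs = quantized_model(**inputs)
print(outputs[0].shape)  # logits, e.g. torch.Size([1, 2])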