Hi,
Thank you for providing great documentation on quantization:
https://pytorch.org/tutorials/intermediate/dynamic_quantization_bert_tutorial.html
I am trying similar steps on an ALBERT PyTorch model: I converted "albert-base-v1" to a quantized one by applying dynamic quantization to the linear layers. At the inference stage (with the quantized model), I get the following error:
    w = (
        self.dense.weight.t()
        .view(self.num_attention_heads, self.attention_head_size, self.hidden_size)
        .to(context_layer.dtype)
    )
AttributeError: 'function' object has no attribute 't'
Any pointers on how to solve this error?
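For reference, the quantization step is essentially the one from the tutorial, applied to ALBERT (a minimal sketch; the checkpoint name stands in for whatever fine-tuned model you start from):

import torch
from transformers import AlbertForSequenceClassification

# Load the float ALBERT model and put it in eval mode.
model = AlbertForSequenceClassification.from_pretrained("albert-base-v1")
model.eval()

# Dynamically quantize only the nn.Linear layers to int8, as in the BERT tutorial.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)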
I know this doesn't directly answer the question, but I have been playing around with quantization of BERT, and everything is fine until I want to load the model into my notebook. The size of the model inflates back to over 400 MB from under 200 MB, and the accuracy takes a huge hit. I noticed this when I tried to load the quantized model in the notebook of the PyTorch tutorial as well. Have you been able to successfully load and use a quantized model in the first place?
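For what it's worth, the size usually inflates back when the saved state dict is loaded into a plain float model. Here is a sketch of a load path that keeps the model quantized, assuming the state dict was saved from the quantized model (the checkpoint and filename are placeholders):

import torch
from transformers import BertForSequenceClassification

# Rebuild the architecture, quantize it first, and only then load the saved (quantized) state dict.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")  # or your fine-tuned checkpoint
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
quantized_model.load_state_dict(torch.load("quantized_bert.pt"))
quantized_model.eval()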
I tested albert-base-v1 as well, since I can't get albert-base-v2 to work (I opened a separate issue), and I can confirm that I get the same error. It occurs when outputs = quantized_model(input_ids, labels=labels) is run.
@ElektrikSpark, I can evaluate with the quantized BERT model as shown in the documentation, although accuracy is lower than with the original BERT. After saving the quantized model, I tried loading it from the command line, but that is not working for me.
With ALBERT, the quantization step does not complete.
Any solution to this error?
Same issue here.
I found an issue while loading the quantized BERT model: the accuracy score decreases significantly. Does this mean we can't use quantized BERT in production? If so, I am not sure why this tutorial was provided.
I was able to work around this issue by using this:
model = torch.quantization.quantize_dynamic(
    big_model, {torch.nn.Bilinear}, dtype=torch.qint8
)
Notice I used Bilinear instead of Linear. Don't ask me why; I just saw someone do something similar while quantizing a GPT-2 model. (Most likely this only "works" because ALBERT contains no Bilinear modules, so nothing actually gets quantized and the original float model is returned unchanged.)
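One quick way to check whether that call actually quantized anything, where model is the result of the quantize_dynamic call above (a sketch):

import torch

# Count the dynamically quantized Linear modules. With {torch.nn.Bilinear} this prints 0 for
# ALBERT/BERT, which suggests the error disappears only because the model is still the float one.
n_quantized = sum(
    1 for m in model.modules()
    if isinstance(m, torch.nn.quantized.dynamic.Linear)
)
print("dynamically quantized Linear modules:", n_quantized)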
For those still looking for a workaround for this issue: you may try the following change to AlbertAttention.forward():
...
# Should find a better way to do this.
#
# Original code, which fails with the quantized model:
#
# w = (
#     self.dense.weight.t()
#     .view(self.num_attention_heads, self.attention_head_size, self.hidden_size)
#     .to(context_layer.dtype)
# )
# b = self.dense.bias.to(context_layer.dtype)
#
# Note that dequantize() is required because a quantized tensor with dtype torch.qint8 cannot be
# converted to torch.float32 by calling .to(context_layer.dtype).
#
# Unlike self.dense.weight(), self.dense.bias() returns a regular tensor, not a quantized tensor,
# so it only needs the dtype conversion.
w = (
    (self.dense.weight().t().dequantize() if callable(self.dense.weight) else self.dense.weight.t())
    .view(self.num_attention_heads, self.attention_head_size, self.hidden_size)
    .to(context_layer.dtype)
)
b = (self.dense.bias() if callable(self.dense.bias) else self.dense.bias).to(context_layer.dtype)
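After patching AlbertAttention.forward this way (either directly in transformers' modeling_albert.py or via a monkey patch), a quick smoke test of the quantized model might look like this (the tokenizer call and input text are just assumptions for illustration):

import torch
from transformers import AlbertForSequenceClassification, AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v1")
model = AlbertForSequenceClassification.from_pretrained("albert-base-v1")
model.eval()

quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# With the patched forward, this no longer raises "'function' object has no attribute 't'".
inputs = tokenizer("a quick smoke test", return_tensors="pt")
with torch.no_grad():
    outputs = quantized_model(**inputs)
print(outputs[0].shape)  # logits, e.g. torch.Size([1, 2])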