Hi, first of all, thank you for the great API. I'm using pytorch-transformers a lot; you're really doing a good job!
I recently fine-tuned XLNetForQuestionAnswering on SQuAD 1.1. The results look good, but the model takes ~2.0 seconds (on a MacBook Pro) to run a forward pass on a reasonably small facts/passage text.
I have done some weight pruning in the past (on a small network), and I was wondering if you have heard of any paper/idea for weight pruning in transformer-based networks such as BERT or XLNet?
Any other ideas to optimize the model's forward pass for inference? I'm thinking of putting these models in production, but ~1-2 seconds is still too high.
I'm willing to help and work on this issue, but it would be great if you could point me in the right direction on the best way to do this.
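For context, by weight pruning I mean simple magnitude pruning along these lines (just a hedged sketch with a made-up helper, not something from the library); note that zeroing weights alone won't speed up dense matmuls without sparse kernels or structured pruning:

```python
import torch

def magnitude_prune_(model, amount=0.3):
    # Zero out the `amount` fraction of smallest-magnitude weights, in place.
    with torch.no_grad():
        for name, param in model.named_parameters():
            if param.dim() < 2 or not name.endswith("weight"):
                continue
            k = int(param.numel() * amount)
            if k == 0:
                continue
            threshold = param.abs().flatten().kthvalue(k).values
            param.mul_((param.abs() > threshold).to(param.dtype))
```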
Currently the forward times of the trained BertForQuestionAnswering and XLNetForQuestionAnswering models are too high. I'm looking for options to reduce forward time on the QA task for both networks (results below, running on a MacBook Pro, 2.9 GHz Core i7, 16 GB RAM):
BertForQuestionAnswering: 1.48 s ± 52.4 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
XLNetForQuestionAnswering: 2.14 s ± 45.5 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
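For reference, the "per loop" format above comes from `%timeit -r 3 -n 1` in a notebook; outside IPython an equivalent measurement looks roughly like this (checkpoint name and inputs are placeholders, not my actual fine-tuned model):

```python
import timeit
import torch
from pytorch_transformers import BertTokenizer, BertForQuestionAnswering

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")  # placeholder checkpoint
model.eval()

question = "Who maintains pytorch-transformers?"
passage = "The pytorch-transformers library is maintained by Hugging Face."
input_ids = torch.tensor([tokenizer.encode(question + " [SEP] " + passage)])

def forward_pass():
    with torch.no_grad():
        return model(input_ids)

print(timeit.repeat(forward_pass, number=1, repeat=3))  # seconds per forward pass
```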
I'm interested in this as well. I've seen similar inference times of nearly 1.5 seconds running a fine-tuned BERT classification model on TF Serving and would like to improve it without paying for a GPU.
I'm not associated with the following work, but found the paper interesting:
"tranformers.zip: Compressing Transformers with Pruning and Quantization"
http://web.stanford.edu/class/cs224n/reports/custom/15763707.pdf
The open-source code corresponding to the paper above has been published in a branch of OpenNMT here:
https://github.com/robeld/ERNIE
I think we could speed up XLNet significantly by refactoring the code (ported from the original TensorFlow implementation) to use embeddings instead of multiplying static matrices with one-hot vectors, as is currently done in several places. We could also reduce the use of torch.einsum and replace it with matrix multiplications. We'll experiment with that in the coming months.
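For example, wherever the port multiplies a one-hot tensor by a weight matrix, a plain embedding lookup gives the same result without materializing the one-hot intermediate (a minimal sketch with toy sizes, not the actual XLNet code):

```python
import torch
import torch.nn.functional as F

num_ids, hidden = 1000, 64
emb = torch.nn.Embedding(num_ids, hidden)
ids = torch.randint(0, num_ids, (4, 16))

# one-hot times weight matrix, the pattern used in several places in the ported code
one_hot = F.one_hot(ids, num_ids).to(emb.weight.dtype)
out_matmul = one_hot @ emb.weight          # (4, 16, hidden)

# direct index lookup: same values, no (batch, seq, num_ids) intermediate
out_lookup = emb(ids)

assert torch.allclose(out_matmul, out_lookup, atol=1e-6)
```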
Might even just dropping in opt_einsum as a substitute for torch.einsum be an easy speedup?
I'm doing some time profiling here, and it looks like the bottleneck is in the forward loop of the transformer. In this case my overall forward pass for XLNetForQuestionAnswering takes 2.5 s ± 310 ms per loop (mean ± std. dev. of 3 runs, 1 loop each). Please see below a breakdown for each forward step (in seconds). The largest chunk of the time, ~2.33 seconds, is spent in the code chunk shown further down. I will start doing some optimizations on XLNetRelativeAttention and XLNetFeedForward to see what happens.
Causal attention mask: 7e-05
Data mask: 3e-05
Word Embedding: 0.00073
Segment Embedding: 5e-05
___ Pos encoding - 1: 0.0099
___ Pos encoding - 2: 0.00012
**___ Pos encoding - 3: 2.33072**
Positional encoding: 2.34084
Prepare output: 0.00025
Transformer time: 2.3420751094818115
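For context, the per-stage numbers above come from crude wall-clock instrumentation along these lines (the helper name is made up; I simply wrapped each block of XLNetModel.forward with it):

```python
import time
import torch

def timed(label, fn, *args, **kwargs):
    # Run `fn` once and print the wall-clock time; crude, but enough to locate the bottleneck.
    start = time.perf_counter()
    with torch.no_grad():
        out = fn(*args, **kwargs)
    print(f"{label}: {time.perf_counter() - start:.5f}")
    return out

# e.g. word_emb_k = timed("Word Embedding", self.word_embedding, input_ids)
```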
___ Pos encoding - 3 - Code chunk
```python
new_mems = ()
if mems is None:
    mems = [None] * len(self.layer)

attentions = []
hidden_states = []
for i, layer_module in enumerate(self.layer):
    # cache new mems
    new_mems = new_mems + (self.cache_mem(output_h, mems[i]),)
    if self.output_hidden_states:
        hidden_states.append((output_h, output_g) if output_g is not None else output_h)

    outputs = layer_module(output_h, output_g, attn_mask_h=non_tgt_mask, attn_mask_g=attn_mask,
                           r=pos_emb, seg_mat=seg_mat, mems=mems[i], target_mapping=target_mapping,
                           head_mask=head_mask[i])
    output_h, output_g = outputs[:2]
    if self.output_attentions:
        attentions.append(outputs[2])
```
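As a first experiment I'll try swapping some of the einsums in XLNetRelativeAttention for plain matmuls. A quick sketch of the kind of equivalence I mean, with toy sizes and a subscript pattern that mirrors the `ibnd`-style layout used there (not a patch against the actual code):

```python
import torch

qlen, klen, bsz, n_head, d_head = 16, 16, 2, 12, 64
q = torch.randn(qlen, bsz, n_head, d_head)
k = torch.randn(klen, bsz, n_head, d_head)

# einsum form of an attention-score contraction
ref = torch.einsum("ibnd,jbnd->ijbn", q, k)

# equivalent with permutes and a batched matmul
out = (q.permute(1, 2, 0, 3) @ k.permute(1, 2, 3, 0)).permute(2, 3, 0, 1)

assert torch.allclose(ref, out, atol=1e-4)
```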
@MiroFurtado it looks like torch.einsum is already as optimized as opt_einsum; see attached an example of multiplying 1024x1024 matrices using torch.einsum, torch.matmul, np.einsum and opt_einsum. It looks like np.einsum is in fact not optimized after all.
I modified the code to use opt_einsum's contract and it actually took ~3x longer! 5.79 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Einsum Comparison - Torch Einsum, Matmul, Numpy, Opt Contract
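The attached comparison was along these lines (a rough sketch of the benchmark, not the exact notebook):

```python
import timeit
import numpy as np
import torch
from opt_einsum import contract

a = torch.randn(1024, 1024)
b = torch.randn(1024, 1024)
an, bn = a.numpy(), b.numpy()

candidates = {
    "torch.einsum": lambda: torch.einsum("ij,jk->ik", a, b),
    "torch.matmul": lambda: torch.matmul(a, b),
    "opt_einsum.contract": lambda: contract("ij,jk->ik", a, b),
    "np.einsum": lambda: np.einsum("ij,jk->ik", an, bn),
}

for name, fn in candidates.items():
    best = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    print(f"{name}: {best * 1e3:.2f} ms per call")
```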
Just FYI, a relevant blog post about this topic; I will investigate: https://blog.rasa.com/compressing-bert-for-faster-prediction-2/
More related information, freshly released: https://ai.facebook.com/blog/making-transformer-networks-simpler-and-more-efficient/?refid=52&__tn__=*s-R
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.