Could you please consider implementing "Accelerating Neural Transformer via an Average Attention Network" (AAN)?
A similar feature is implemented in Marian NMT.
The paper claims that this would accelerate Transformer decoding.
Thanks.
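For context, a minimal sketch (not fairseq's or Marian's actual API) of the core AAN idea: the decoder's self-attention is replaced by a cumulative average over previous positions, so incremental decoding only needs a running sum as state, O(1) per step instead of O(t).

```python
import numpy as np

def average_attention(x):
    """x: (seq_len, d_model). Returns cumulative averages g_t = mean(x[:t+1])."""
    t = np.arange(1, x.shape[0] + 1)[:, None]
    return np.cumsum(x, axis=0) / t

def step(state_sum, t, x_t):
    """Incremental decoding step: keep only a running sum as state."""
    state_sum = state_sum + x_t
    return state_sum, state_sum / t  # new state, g_t

# The incremental path matches the full cumulative-average computation:
x = np.random.rand(5, 8)
g = average_attention(x)
s = np.zeros(8)
for i in range(5):
    s, g_i = step(s, i + 1, x[i])
    assert np.allclose(g_i, g[i])
```

This is what makes AAN decoding fast: no growing key/value cache is attended over at each step.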
You could experiment with our models from "Pay Less Attention" (https://openreview.net/pdf?id=SkVhlh09tX), which we measured to be faster than the AAN. Pretrained models and commands to reproduce are here: https://github.com/pytorch/fairseq/tree/master/examples/pay_less_attention_paper
Hi @huihuifan,
The comparison between AAN and "Pay Less Attention" is very impressive, but I notice that the parameter count of the AAN model is significantly larger than that of the baseline Transformer (260M vs. 210M in Table 3).
This may indicate that your AAN implementation also includes the FFN sub-layer, which introduces many parameters and is computationally heavy. As noted in the original implementation, this sub-layer can be safely removed without loss of quality.
Could you check whether this sub-layer is included in your implementation? Also, combining the "Pay Less Attention" encoder with the AAN decoder might make the model even faster.
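A back-of-the-envelope check of that parameter gap (a sketch assuming transformer-big dimensions, d_model=1024, d_ff=4096, 6 decoder layers; the exact fairseq config may differ):

```python
# Extra parameters if each AAN decoder layer keeps its own FFN sub-layer.
d_model, d_ff, layers = 1024, 4096, 6

# Each FFN sub-layer: Linear(d_model, d_ff) + Linear(d_ff, d_model), with biases.
ffn_params_per_layer = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
total = ffn_params_per_layer * layers
print(f"{total / 1e6:.1f}M")  # ~50M, close to the 260M - 210M gap in Table 3
```

So the extra FFN sub-layers alone would roughly account for the observed difference.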
Hi @bzhangGo, thank you for pointing that out in the readme! I didn't see that note before. I will check it out.
This is solved by https://github.com/pytorch/translate/pull/576.