Could you please consider implementing "Accelerating Neural Transformer via an Average Attention Network" (AAN)?
A similar feature is implemented in Marian NMT.
The paper claims that this would accelerate Transformer decoding.
Thanks.
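For context, a minimal sketch (not fairseq's or Marian's actual API) of the core AAN idea: the decoder's self-attention is replaced by a cumulative average over previous positions, so incremental decoding only needs a running sum as state, O(1) per step instead of O(t).

```python
import numpy as np

def average_attention(x):
    """x: (seq_len, d_model). Returns cumulative averages g_t = mean(x[:t+1])."""
    t = np.arange(1, x.shape[0] + 1)[:, None]
    return np.cumsum(x, axis=0) / t

def step(state_sum, t, x_t):
    """Incremental decoding step: keep only a running sum as state."""
    state_sum = state_sum + x_t
    return state_sum, state_sum / t  # new state, g_t

# The incremental path matches the full cumulative-average computation:
x = np.random.rand(5, 8)
g = average_attention(x)
s = np.zeros(8)
for i in range(5):
    s, g_i = step(s, i + 1, x[i])
    assert np.allclose(g_i, g[i])
```

This is what makes AAN decoding fast: no growing key/value cache is attended over at each step.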
You could experiment with our models from "Pay Less Attention" (https://openreview.net/pdf?id=SkVhlh09tX), which we measured to be faster than the AAN. Pretrained models and commands to reproduce are here: https://github.com/pytorch/fairseq/tree/master/examples/pay_less_attention_paper
Hi @huihuifan,
The comparison between AAN and "Pay Less Attention" is very impressive, but I notice that the parameter count of the AAN model is significantly larger than that of the baseline Transformer (260M vs. 210M in Table 3).
This may indicate that your AAN implementation also includes the FFN sub-layer, which introduces many parameters and is computationally heavy. As noted in the original implementation, this sub-layer can be safely removed without loss of quality.
Could you check whether this sub-layer is included in your implementation? Also, combining the "Pay Less Attention" encoder with the AAN decoder might make the model even faster.
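A back-of-the-envelope check of that parameter gap (a sketch assuming transformer-big dimensions, d_model=1024, d_ff=4096, 6 decoder layers; the exact fairseq config may differ):

```python
# Extra parameters if each AAN decoder layer keeps its own FFN sub-layer.
d_model, d_ff, layers = 1024, 4096, 6

# Each FFN sub-layer: Linear(d_model, d_ff) + Linear(d_ff, d_model), with biases.
ffn_params_per_layer = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
total = ffn_params_per_layer * layers
print(f"{total / 1e6:.1f}M")  # ~50M, close to the 260M - 210M gap in Table 3
```

So the extra FFN sub-layers alone would roughly account for the observed difference.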
Hi @bzhangGo, thank you for pointing that out in the readme! I didn't see that note before. I will check it out.
This is solved by https://github.com/pytorch/translate/pull/576.