Could you please implement Self-Attention with Relative Position Representations?
It was done in tensor2tensor.
Relative position representations outperform the original Transformer by about 1 BLEU.
Thanks
This is currently not on our roadmap. We welcome any contributions by pull request! Note that we have some models trained in fairseq that have quite strong BLEU results. For example, this paper: https://openreview.net/pdf?id=SkVhlh09tX
The models are linked here: https://github.com/pytorch/fairseq/tree/master/examples/pay_less_attention_paper
FWIW, we actually experimented internally with relative positional embeddings in fairseq and found them to be only marginally better and quite a bit slower, so we never pushed it public. cc @alexeib: any interest in pushing your branch public?
Thanks for your reply :)
I've implemented relative positional embeddings roughly as a mimic of what T2T does, but got no improvement. In T2T it actually performed better by about 0.5 BLEU on my own dataset. Training was slower too, but I think that's expected, since there's additional computation.
I think relative positional embeddings are more an optional addition to the Transformer than a new architecture: they only change the attention dot product to add an extra embedding term, and they drop the absolute positional embeddings. Maybe we could add a flag to control whether to use them?
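For concreteness, here is a rough single-head sketch of the mechanism I mean (Shaw et al., 2018): the attention logits get an extra term from learned relative-position key embeddings, and the outputs get an analogous term from value embeddings. Names like `RelativeSelfAttention` and `max_rel_pos` are just illustrative, not the T2T or fairseq API, and a real implementation would be multi-head:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativeSelfAttention(nn.Module):
    """Single-head self-attention with relative position representations
    (a sketch in the spirit of Shaw et al., 2018; not the T2T code)."""

    def __init__(self, d_model, max_rel_pos=16):
        super().__init__()
        self.d_model = d_model
        self.max_rel_pos = max_rel_pos
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Learned embeddings for clipped relative distances in
        # [-max_rel_pos, max_rel_pos]: one table for keys, one for values.
        self.rel_k = nn.Embedding(2 * max_rel_pos + 1, d_model)
        self.rel_v = nn.Embedding(2 * max_rel_pos + 1, d_model)

    def forward(self, x):                      # x: (batch, seq, d_model)
        b, n, d = x.shape
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)

        # Clipped relative distances j - i, shifted to be non-negative
        # so they can index the embedding tables: shape (seq, seq).
        pos = torch.arange(n, device=x.device)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_rel_pos,
                                                  self.max_rel_pos)
        rel = rel + self.max_rel_pos
        a_k = self.rel_k(rel)                  # (seq, seq, d_model)
        a_v = self.rel_v(rel)                  # (seq, seq, d_model)

        # e_ij = q_i . (k_j + a^K_ij) / sqrt(d): the usual dot product
        # plus the extra relative-position term; no absolute positions.
        scores = torch.matmul(q, k.transpose(-2, -1))
        scores = scores + torch.einsum('bnd,nmd->bnm', q, a_k)
        attn = F.softmax(scores / d ** 0.5, dim=-1)

        # z_i = sum_j attn_ij * (v_j + a^V_ij)
        out = torch.matmul(attn, v)
        out = out + torch.einsum('bnm,nmd->bnd', attn, a_v)
        return out
```

The `einsum` terms are the only change relative to vanilla scaled dot-product attention, which is why a flag toggling them on top of the existing Transformer seems feasible.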
@myleott @alexeib @gxzks could you share your implementations of the relative positional embeddings? They might be useful in other scenarios, even if they haven't been thoroughly tested against the latest fairseq version.