Could you please implement Self-Attention with Relative Position Representations?
It was done in tensor2tensor.
Relative position representations outperform the original Transformer by about 1 BLEU.
Thanks
This is currently not on our roadmap. We welcome any contributions by pull request! Note that we have some models trained in fairseq that have quite strong BLEU results. For example, this paper: https://openreview.net/pdf?id=SkVhlh09tX
The models are linked here: https://github.com/pytorch/fairseq/tree/master/examples/pay_less_attention_paper
FWIW, we actually experimented internally with relative positional embeddings in fairseq and found them to be only marginally better and quite a bit slower, so we never pushed it public. cc @alexeib: any interest in pushing your branch public?
Thanks for your reply :)
I've implemented relative positional embeddings roughly as a mimic of what T2T does, but got no improvement. In T2T it actually performed better by about 0.5 BLEU on my own dataset. Training was slower too, but I think that's expected, since there's additional computation.
I think relative positional embeddings are more an optional addition to the Transformer than a new architecture: they only change the attention dot product to add an extra embedding term, and they drop the absolute positional embeddings. Maybe we could add a flag to control whether to use them?
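For concreteness, here is a rough single-head sketch of the mechanism I mean (Shaw et al., 2018): the attention logits get an extra term from learned relative-position key embeddings, and the outputs get an analogous term from value embeddings. Names like `RelativeSelfAttention` and `max_rel_pos` are just illustrative, not the T2T or fairseq API, and a real implementation would be multi-head:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativeSelfAttention(nn.Module):
    """Single-head self-attention with relative position representations
    (a sketch in the spirit of Shaw et al., 2018; not the T2T code)."""

    def __init__(self, d_model, max_rel_pos=16):
        super().__init__()
        self.d_model = d_model
        self.max_rel_pos = max_rel_pos
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Learned embeddings for clipped relative distances in
        # [-max_rel_pos, max_rel_pos]: one table for keys, one for values.
        self.rel_k = nn.Embedding(2 * max_rel_pos + 1, d_model)
        self.rel_v = nn.Embedding(2 * max_rel_pos + 1, d_model)

    def forward(self, x):                      # x: (batch, seq, d_model)
        b, n, d = x.shape
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)

        # Clipped relative distances j - i, shifted to be non-negative
        # so they can index the embedding tables: shape (seq, seq).
        pos = torch.arange(n, device=x.device)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_rel_pos,
                                                  self.max_rel_pos)
        rel = rel + self.max_rel_pos
        a_k = self.rel_k(rel)                  # (seq, seq, d_model)
        a_v = self.rel_v(rel)                  # (seq, seq, d_model)

        # e_ij = q_i . (k_j + a^K_ij) / sqrt(d): the usual dot product
        # plus the extra relative-position term; no absolute positions.
        scores = torch.matmul(q, k.transpose(-2, -1))
        scores = scores + torch.einsum('bnd,nmd->bnm', q, a_k)
        attn = F.softmax(scores / d ** 0.5, dim=-1)

        # z_i = sum_j attn_ij * (v_j + a^V_ij)
        out = torch.matmul(attn, v)
        out = out + torch.einsum('bnm,nmd->bnd', attn, a_v)
        return out
```

The `einsum` terms are the only change relative to vanilla scaled dot-product attention, which is why a flag toggling them on top of the existing Transformer seems feasible.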
@myleott @alexeib @gxzks could you share your implementations of the relative positional embeddings? They might be useful in other scenarios, even if they haven't been thoroughly tested against the latest fairseq version.