The "Attention Is All You Need" paper (https://arxiv.org/pdf/1706.03762.pdf) uses fixed sinusoidal positional embeddings instead of learned ones.
The authors report that learned and sinusoidal embeddings perform similarly, and that sinusoidal embeddings have the advantage of extrapolating positional information to sequence lengths longer than those seen during training.
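For reference, the paper's encoding interleaves sines and cosines with geometrically increasing wavelengths; a minimal NumPy sketch (function name and shapes are mine, not fairseq's):

```python
import numpy as np

def sinusoidal_embeddings(num_positions, dim):
    """Fixed sinusoidal positional embeddings, per "Attention Is All You Need":
    even channels get sin(pos / 10000^(2i/dim)), odd channels the matching cos."""
    positions = np.arange(num_positions)[:, None]                       # (num_positions, 1)
    inv_freq = np.exp(np.arange(0, dim, 2) * -(np.log(10000.0) / dim))  # (dim/2,)
    emb = np.zeros((num_positions, dim))
    emb[:, 0::2] = np.sin(positions * inv_freq)
    emb[:, 1::2] = np.cos(positions * inv_freq)
    return emb
```

Because the table is a fixed function of position, rows for positions beyond the training range can be computed on the fly, which is where the extrapolation claim comes from.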
Has anyone tried this in fairseq? If not, on what tasks should it be evaluated before replacing the learned embeddings with sinusoidal ones?
Anyone?
Here's a gist for sinusoidal positional embeddings: https://gist.github.com/myleott/051b909422df94d6cf91767b8e8e22a6. You should be able to use it as a drop-in replacement for LearnedPositionalEmbeddings. We haven't tested it with fconv yet, but if you do please report back!
@myleott Thanks for the implementation.
I'm running a grammar error correction task. Will drop in and see how it works, and report back soon.
@myleott
Tried it as a drop-in replacement for my task; the gradients shoot up too high.
```
epoch 001: 1000 / 40587 loss=570820.218, ppl=inf, wps=5199, ups=9.0, wpb=574, bsz=32, num_updates=1001, lr=0.25, gnorm=5697814810.512, clip=100%, oom=0, sample_size=574.154
```
The model is converging. Will report the final performance vs. learned embeddings for my task when it finishes.
And there is a problem with:

```python
def max_positions(self):
    """Maximum number of supported positions."""
    return int(1e5)  # an arbitrary large number
```

This max_positions is used to stop the decoder when the stop token is not generated. Returning an arbitrarily large number means decoding continues wastefully.
Every module can specify its own max_positions, and we use the minimum such value as an upper bound on the generation length. If all modules report large "max" values, you can constrain the maximum output length with the --max-len-a and --max-len-b options to generate.py:
https://github.com/facebookresearch/fairseq-py/blob/66ee3df9071ea17b32f2ccc40d6dd092a82a551f/fairseq/options.py#L207-L212
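For what it's worth, my reading of those options is that the cap is roughly `a * src_len + b`, further limited by the models' max_positions; a toy sketch (function name and defaults are illustrative, not fairseq's exact code):

```python
def max_generation_length(src_len, max_len_a, max_len_b, model_max_positions=int(1e5)):
    """Upper bound on decoding length: a*src_len + b, capped by the smallest
    max_positions across models (an illustrative sketch of generate.py's cap)."""
    return min(int(max_len_a * src_len + max_len_b), model_max_positions)
```

So even with max_positions returning a huge number, passing sensible --max-len-a/--max-len-b keeps decoding bounded.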
Grammar corrector (Seq2Seq), evaluated with m2scorer on the CoNLL-14 task:

| Positional embedding | Precision | Recall | F_0.5 |
|---|---|---|---|
| Sinusoidal | 0.5882 | 0.2081 | 0.4309 |
| Learned | 0.5966 | 0.2451 | 0.4636 |
There is a significant drop with sinusoidal embeddings. It could be due to improper normalization, given the high gradient norm throughout training, but I'm not sure.
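One guess at the normalization issue: in the paper, token embeddings are multiplied by sqrt(d_model) before the positional encodings are added, so the two terms have comparable magnitude. A toy sketch of that scaling (all names here are illustrative, not fairseq code):

```python
import numpy as np

def add_positional(token_embs, pos_embs, d_model):
    """Scale token embeddings by sqrt(d_model) before adding positional
    encodings, as in the paper; without this, unit-magnitude sinusoids can
    dominate small learned token embeddings (one guess at the blow-up)."""
    return token_embs * np.sqrt(d_model) + pos_embs
```

If the drop-in replacement skipped this scaling, that might explain the huge gradient norms, though I haven't verified it.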
I don't have the time or compute to pursue this further right now, but I will try without positional embeddings and report back in the future.
This is now in master, along with a full transformer implementation :)