Fairseq: Sinusoidal position embeddings

Created on 8 Mar 2018 · 8 comments · Source: pytorch/fairseq

The "Attention Is All You Need" paper (https://arxiv.org/pdf/1706.03762.pdf) uses fixed sinusoidal positional embeddings instead of learned ones.

They claim that learned and sinusoidal embeddings work similarly. They also claim that sinusoidal embeddings have the advantage of extrapolating positional information to sequence lengths longer than those seen in training.

Has anyone tried it in fairseq? If not, on what tasks should it ideally be evaluated before replacing the learned embeddings with sinusoidal ones?


All 8 comments

Anyone?

Here's a gist for sinusoidal positional embeddings: https://gist.github.com/myleott/051b909422df94d6cf91767b8e8e22a6. You should be able to use it as a drop-in replacement for LearnedPositionalEmbeddings. We haven't tested it with fconv yet, but if you do please report back!
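
For reference, here is a minimal sketch of that construction (not the gist's exact code; the function name and arguments are illustrative). It builds the sin/cos table from the paper, concatenating the sine and cosine halves rather than interleaving them, which yields the same set of values:

    import math
    import torch

    def sinusoidal_embeddings(num_positions, embedding_dim, padding_idx=None):
        # Geometrically spaced inverse frequencies from 1 down to 1/10000.
        half_dim = embedding_dim // 2
        inv_freq = torch.exp(
            torch.arange(half_dim, dtype=torch.float)
            * -(math.log(10000.0) / (half_dim - 1))
        )
        # Outer product: one row per position, one column per frequency.
        angles = torch.arange(num_positions, dtype=torch.float).unsqueeze(1) * inv_freq.unsqueeze(0)
        emb = torch.cat([torch.sin(angles), torch.cos(angles)], dim=1)
        if embedding_dim % 2 == 1:
            # Zero-pad the last dimension when embedding_dim is odd.
            emb = torch.cat([emb, torch.zeros(num_positions, 1)], dim=1)
        if padding_idx is not None:
            emb[padding_idx, :] = 0  # padding symbol gets an all-zero vector
        return emb  # shape: (num_positions, embedding_dim)

Because the table is a fixed function of position, it can be kept as a non-trainable buffer and rebuilt on the fly for longer inputs, which is what allows extrapolation beyond the lengths seen in training.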

@myleott Thanks for the implementation.

I'm running a grammar error correction task. Will drop in and see how it works, and report back soon.

@myleott

Tried it as a drop-in replacement for my task; the gradients shoot up far too high.

    epoch 001: 1000 / 40587 loss=570820.218, ppl=inf, wps=5199, ups=9.0, wpb=574, bsz=32, num_updates=1001, lr=0.25, gnorm=5697814810.512, clip=100%, oom=0, sample_size=574.154

The model is converging, though. I will report the final performance vs. learned embeddings for my task when it finishes.

And there is a problem with:

    def max_positions(self):
        """Maximum number of supported positions."""
        return int(1e5) # an arbitrary large number

This max_positions is used to stop the decoder if the stop token is not generated. Having an arbitrarily large number means it continues decoding wastefully.

Every module can specify its own max_positions and we use the minimum such value as an upper-bound on the generation length. If all modules have large "max" values, then you can constrain the max output length with the --max-len-a and --max-len-b options to generate.py:
https://github.com/facebookresearch/fairseq-py/blob/66ee3df9071ea17b32f2ccc40d6dd092a82a551f/fairseq/options.py#L207-L212
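
For example, a hypothetical invocation (dataset and checkpoint paths made up) that caps the output at roughly 1.2 × source length + 10 tokens:

    python generate.py data-bin/my-task \
        --path checkpoints/checkpoint_best.pt \
        --max-len-a 1.2 --max-len-b 10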

Sinusoidal Grammar Corrector (Seq2Seq) - Evaluated using m2scorer on the CoNLL-2014 task

Precision   : 0.5882
Recall      : 0.2081
F_0.5       : 0.4309

Learned Positional Embedding Grammar Corrector (Seq2Seq) - Evaluated using m2scorer on the CoNLL-2014 task

Precision   : 0.5966
Recall      : 0.2451
F_0.5       : 0.4636

There is a significant drop with sinusoidal embeddings. It could be due to improper normalization, given the high gradient norm throughout training, but I'm not sure.

I don't have the time or compute to pursue it further right now, but I will try without positional embeddings and report back in the future.

This is now in master, along with a full transformer implementation :)
